# Calculate tf-idf

## How to Access This Feature

### From + (plus) Button

You can access it from the 'Add' (Plus) button: select "Text Mining..." -> "Calculate TF-IDF". ![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNE7gJvG3uj_fk4va%2Fdo_tfidf_add.png?generation=1586795486385799\&alt=media)

## How to Use?

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNE7i_sKCTiNIbBDZ%2Fdo_tfidf_param.png?generation=1586795486264072\&alt=media)

* Select a column as document id - The column that identifies each document. If you ran [do\_tokenize](https://docs.exploratory.io/main/do_tokenize) beforehand, this can be document\_id.
* Select a column that has tokenized text - The column that contains the tokens. This is the "token" column if the text was tokenized by the [do\_tokenize](https://docs.exploratory.io/main/do_tokenize) function.
* TF Weight (Optional) - The default is "raw". How to weight the term frequency.
  * "raw" is the number of times a term appears in a document.
  * "binary" indicates whether a term appears in a document: 1 if it does, 0 if it does not.
  * "log\_scale" is `1+log(count of a term in a document)`.
* IDF Log Scale Function (Optional) - The default is log. This function dampens the growth of the idf value. Idf is calculated by `log_scale_function((the total number of documents)/(the number of documents which have the token))` and measures how rare the token is in the set of documents. It might be worth trying log2 or log10: for the same ratio, log2 produces larger idf values and log10 produces smaller ones.
* Normalization (Optional) - The default is l2. How to normalize the tfidf vector of each document.
  * "l2" scales the tfidf vector of a document so that its Euclidean length (L2 norm) becomes 1.
  * "l1" scales the tfidf vector of a document so that its Manhattan length (L1 norm, the sum of its values) becomes 1.
  * FALSE leaves the result unnormalized.
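
To make the options above concrete, here is a minimal Python sketch of the calculation they describe. It is not the implementation behind this feature: the function name `tfidf` and its parameters are hypothetical, the tf and idf values are assumed to be combined by multiplication, and the `log` in "log\_scale" is assumed to be the natural logarithm.

```python
import math
from collections import Counter, defaultdict


def tfidf(rows, tf_weight="raw", idf_log=math.log, norm="l2"):
    """Illustrative tf-idf over (document_id, token) rows.

    Hypothetical helper: one input row per token occurrence, as a
    tokenized table would provide. Returns {document_id: {token: weight}}.
    """
    # Count how often each token occurs in each document.
    counts = defaultdict(Counter)
    for doc_id, token in rows:
        counts[doc_id][token] += 1

    # Count how many documents contain each token.
    n_docs = len(counts)
    doc_freq = Counter()
    for token_counts in counts.values():
        doc_freq.update(token_counts.keys())

    result = {}
    for doc_id, token_counts in counts.items():
        weights = {}
        for token, count in token_counts.items():
            if tf_weight == "raw":
                tf = count
            elif tf_weight == "binary":
                tf = 1  # the token exists in this document
            elif tf_weight == "log_scale":
                tf = 1 + math.log(count)  # assumption: natural log
            else:
                raise ValueError(f"unknown tf_weight: {tf_weight!r}")
            idf = idf_log(n_docs / doc_freq[token])
            weights[token] = tf * idf  # assumption: tf and idf are multiplied

        # Optional per-document normalization of the tf-idf vector.
        if norm == "l2":
            size = math.sqrt(sum(w * w for w in weights.values()))
        elif norm == "l1":
            size = sum(abs(w) for w in weights.values())
        else:
            size = 0.0  # norm=False: keep the values as they are
        if size > 0:
            weights = {t: w / size for t, w in weights.items()}
        result[doc_id] = weights
    return result


rows = [
    ("d1", "apple"), ("d1", "apple"), ("d1", "banana"),
    ("d2", "banana"), ("d2", "cherry"),
]
print(tfidf(rows))                                  # defaults: raw, log, l2
print(tfidf(rows, idf_log=math.log2, norm=False))   # log2 idf, unnormalized
```

In this sample, "banana" appears in every document, so its idf (and therefore its tf-idf) is 0 regardless of the chosen log function, while tokens unique to one document keep a positive weight.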
