# Calculate tf-idf

You can access it from the "Add" (Plus) button: "Text Mining..." -> "Calculate TF-IDF".

- Select a column as document id - The column that identifies each document. If you ran do_tokenize beforehand, this can be document_id.
- Select a column that has tokenized text - The column that contains the tokens. This is the "token" column if the text was tokenized with the do_tokenize function.
- TF Weight (Optional) - The default is "raw".
  - "raw" is the count of a term in a document.
  - "binary" indicates whether the term exists in the document: 1 if it does, 0 if it does not.
  - "log_scale" is `1+log(count of a term in a document)`.

- IDF Log Scale Function (Optional) - The default is log. This function dampens the growth of the idf value. Idf is calculated by `log_scale_function((the total number of documents)/(the number of documents which have the token))`. It measures how rare the token is in the set of documents. It might be worth trying log2 or log10: compared with the default, log2 makes the value grow faster and log10 makes it grow more slowly.
- Normalization (Optional) - The default is l2. How to normalize the tfidf vector.
  - "l2" normalizes so that the Euclidean length (L2 norm) of the tfidf vector for a document becomes 1.
  - "l1" normalizes so that the Manhattan length (sum of values, L1 norm) of the tfidf vector for a document becomes 1.
  - FALSE doesn't normalize the result.
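
To make the options above concrete, here is a minimal Python sketch of the calculation as this page describes it. It is an illustration only, not the tool's actual implementation: the names `tf_weight` and `calc_tfidf` are hypothetical, the input is assumed to be one `(document_id, token)` pair per token occurrence, and the default log is assumed to be the natural logarithm.

```python
import math
from collections import Counter, defaultdict


def tf_weight(count, mode="raw"):
    """Term-frequency weight for one term in one document (hypothetical helper)."""
    if mode == "raw":
        return float(count)              # count of the term in the document
    if mode == "binary":
        return 1.0 if count > 0 else 0.0  # 1 if the term exists, 0 if not
    if mode == "log_scale":
        return 1.0 + math.log(count)     # 1 + log(count)
    raise ValueError(f"unknown TF weight: {mode!r}")


def calc_tfidf(rows, tf_mode="raw", idf_log=math.log, norm="l2"):
    """Return {document_id: {token: tfidf}} from (document_id, token) pairs.

    idf_log stands in for the IDF log scale function (math.log, math.log2,
    math.log10). norm is "l2", "l1", or False for no normalization.
    """
    # Count how often each token occurs in each document.
    counts = defaultdict(Counter)
    for doc_id, token in rows:
        counts[doc_id][token] += 1

    # Number of documents that contain each token.
    n_docs = len(counts)
    doc_freq = Counter()
    for token_counts in counts.values():
        doc_freq.update(token_counts.keys())

    result = {}
    for doc_id, token_counts in counts.items():
        vec = {
            tok: tf_weight(c, tf_mode) * idf_log(n_docs / doc_freq[tok])
            for tok, c in token_counts.items()
        }
        if norm == "l2":
            length = math.sqrt(sum(v * v for v in vec.values()))
        elif norm == "l1":
            length = sum(abs(v) for v in vec.values())
        else:
            length = 0.0  # norm=False: leave the vector as-is
        if length > 0:
            vec = {tok: v / length for tok, v in vec.items()}
        result[doc_id] = vec
    return result


# Example input: one row per token, as a tokenizer might produce.
rows = [
    ("d1", "apple"), ("d1", "apple"), ("d1", "pie"),
    ("d2", "apple"), ("d2", "tart"),
]
for doc_id, vec in calc_tfidf(rows, idf_log=math.log2, norm=False).items():
    print(doc_id, vec)
```

In this example "apple" appears in every document, so its idf is the log of 1, which is 0, and its tfidf is 0 regardless of the TF weight. Tokens that appear in only one document get the largest idf.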
