Tokenize Text
Last updated
There are two ways to access this step. One is from the 'Add' (Plus) button.
The other is from a column header menu.
Column to Tokenize - Set the text column you want to split or tokenize.
Tokenize By - Select the unit of token from the following. The default is "Words".
Words
Sentences
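The difference between the two units can be sketched in a few lines of Python. This is only an illustration using simple regular expressions, not the tokenizer this step actually uses; the function name `tokenize` and its `by` argument are hypothetical.

```python
import re

def tokenize(text, by="words"):
    """Split text into word or sentence tokens (simplified illustration)."""
    if by == "words":
        # \w+ matches runs of letters, digits, and underscores.
        return re.findall(r"\w+", text)
    if by == "sentences":
        # Split on whitespace that follows ., !, or ?
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]
    raise ValueError(f"unknown unit: {by}")

text = "I like tea. Do you like coffee?"
word_tokens = tokenize(text, by="words")
sentence_tokens = tokenize(text, by="sentences")
print(word_tokens)
print(sentence_tokens)
```

Real tokenizers handle abbreviations, contractions, and languages without spaces, which this sketch does not.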
Keep Other Columns - Whether the other existing columns should remain in the output. The default is FALSE.
Keep Original Column - Whether the input column should be kept in the output. The default is No, which removes it.
With Sentence ID - Whether the sentence ID should be included in the output. The default is Yes.
Remove Stopwords - Whether stopwords should be removed from the output. The default is Yes.
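To picture the shape of the output, here is a rough Python sketch that turns one row of text into one row per token. It is an assumption-based illustration, not this step's implementation: the function `tokenize_rows`, its arguments, and the `sentence_id` column name are all hypothetical.

```python
import re

def tokenize_rows(rows, column, keep_other_columns=False, with_sentence_id=True):
    """Return one output row per word token (simplified illustration)."""
    out = []
    for row in rows:
        # Naive sentence split: whitespace that follows ., !, or ?
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", row[column].strip()) if s]
        for sentence_id, sentence in enumerate(sentences, start=1):
            for token in re.findall(r"\w+", sentence):
                # Carry over the other columns only when requested.
                if keep_other_columns:
                    new_row = {k: v for k, v in row.items() if k != column}
                else:
                    new_row = {}
                if with_sentence_id:
                    new_row["sentence_id"] = sentence_id  # hypothetical column name
                new_row["token"] = token.lower()
                out.append(new_row)
    return out

rows = [{"id": 1, "text": "Hi there. Bye."}]
result = tokenize_rows(rows, "text", keep_other_columns=True)
for r in result:
    print(r)
```

With the sentence ID included, tokens can later be grouped back by the sentence they came from.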
Language for Stopwords - By default it is automatically selected based on the content of the text.
Additional Stopwords - Words to be added to the default set of stopwords.
Exclude from Stopwords - Words to be excluded from the default set of stopwords.
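The way the three stopword settings combine can be sketched as set operations. This is a minimal Python illustration under assumptions: the tiny `DEFAULT_STOPWORDS` set and the function `remove_stopwords` are hypothetical, and the real default list depends on the selected language.

```python
# Tiny illustrative default list; a real stopword list is much longer.
DEFAULT_STOPWORDS = {"a", "an", "the", "is", "of", "and", "not"}

def remove_stopwords(tokens, additional=(), exclude=()):
    """Drop tokens found in (default + additional) - exclude."""
    stopwords = (DEFAULT_STOPWORDS | set(additional)) - set(exclude)
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["the", "service", "is", "not", "good", "really"]
# Add "really" to the stopwords and keep "not", which matters for sentiment.
filtered = remove_stopwords(tokens, additional=["really"], exclude=["not"])
print(filtered)
```

Excluding a word such as "not" from the stopwords is useful when negation changes the meaning of the text.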
Words To Be Treated As One Word - If a word or phrase that should be treated as one token is separated into multiple tokens, it can be fixed by specifying the word/phrase here.
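One possible way to keep a phrase together is to match the listed phrases before falling back to single words. The Python sketch below shows that idea only; it is not how this step is implemented, and `tokenize_with_phrases` is a hypothetical name.

```python
import re

def tokenize_with_phrases(text, phrases):
    """Tokenize into words, keeping each listed phrase as a single token."""
    # Try longer phrases first so "new york city" wins over "new york".
    ordered = sorted(phrases, key=len, reverse=True)
    if not ordered:
        return re.findall(r"\w+", text)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # A phrase is tried first at each position; otherwise match one word.
    pattern = rf"\b(?:{alternatives})\b|\w+"
    return re.findall(pattern, text, flags=re.IGNORECASE)

phrase_tokens = tokenize_with_phrases("I moved to New York last year", ["new york"])
print(phrase_tokens)
```

Without the phrase list, "New York" would be split into the two tokens "New" and "York".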
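Remove Punctuations - Whether punctuation should be removed from the output.
Remove Numbers - Whether numbers should be removed from the output.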
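As a rough picture of what these two options do, the Python sketch below drops tokens that consist only of punctuation or only of digits. The function `drop_tokens` and its arguments are hypothetical, and the exact rules this step applies may differ.

```python
import re

def drop_tokens(tokens, remove_punctuation=True, remove_numbers=True):
    """Filter out punctuation-only and number-only tokens (illustration)."""
    kept = []
    for t in tokens:
        # Tokens made up entirely of non-word, non-space characters.
        if remove_punctuation and re.fullmatch(r"[^\w\s]+", t):
            continue
        # Digits, optionally with . or , separators such as 1,000 or 3.14.
        if remove_numbers and re.fullmatch(r"\d+(?:[.,]\d+)*", t):
            continue
        kept.append(t)
    return kept

raw = ["Sales", "rose", "12", "%", "in", "2023", "!"]
cleaned = drop_tokens(raw)
print(cleaned)
```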
Clean Up Twitter Data - Whether to remove hashtags (starting with #) and mentions (starting with @). The default is No.
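The cleanup can be approximated with one regular expression. This Python sketch is an illustration under assumptions, not the step's actual rule; `clean_twitter` is a hypothetical name.

```python
import re

def clean_twitter(text):
    """Remove hashtags and mentions, then collapse extra whitespace."""
    # (?<!\w) keeps addresses like user@example.com from being treated as mentions.
    without_tags = re.sub(r"(?<!\w)[#@]\w+", "", text)
    return " ".join(without_tags.split())

cleaned_tweet = clean_twitter("Thanks @alice for the demo! #dataviz #rstats")
print(cleaned_tweet)
```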
Remove Hiragana Only Words - Short Japanese words written only in Hiragana are often meaningless. Select an option here to treat them all as stopwords.
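A Hiragana-only check can be sketched with the Unicode Hiragana block (U+3041 to U+309F). The Python example below is an illustration only: the `max_length` threshold is an assumption, since the available options are not listed here, and `remove_hiragana_only` is a hypothetical name.

```python
import re

# Matches strings made up entirely of characters in the Hiragana block.
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309F]+")

def remove_hiragana_only(tokens, max_length=2):
    """Drop tokens that are Hiragana-only and at most max_length characters."""
    return [
        t for t in tokens
        if not (HIRAGANA_ONLY.fullmatch(t) and len(t) <= max_length)
    ]

jp_tokens = ["東京", "に", "行き", "ます", "ラーメン"]
jp_filtered = remove_hiragana_only(jp_tokens)
print(jp_filtered)
```

Tokens that contain Kanji or Katakana, such as "行き" or "ラーメン", are kept because they are not Hiragana-only.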
Column Name for Output Data - The default is "token". Set a column name for the new column to store the tokenized values.
Format for Output Data - Select the letter case of the output tokens from the following. The default is Lowercase.
Lowercase
Titlecase
Uppercase
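The three formats correspond to common string case conversions. A minimal Python sketch, with the hypothetical helper `format_token`, shows the effect on a single token:

```python
def format_token(token, fmt="lowercase"):
    """Convert a token to the requested letter case (illustration)."""
    if fmt == "lowercase":
        return token.lower()
    if fmt == "titlecase":
        return token.title()
    if fmt == "uppercase":
        return token.upper()
    raise ValueError(f"unknown format: {fmt}")

formatted = [format_token("dATA", f) for f in ("lowercase", "titlecase", "uppercase")]
print(formatted)
```

Converting every token to one case means that "Data", "data", and "DATA" are counted as the same word.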