# Tokenize Text

## How to Access This Feature

### From + (plus) Button

There are two ways to access this feature. One is from the 'Add' (Plus) button.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2Fsync%2F75a2cdc7fc13b00fdce5bae450e3ff910e8d35a2.png?generation=1631253694158017\&alt=media)

The other is from a column header menu.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2Fsync%2F377241753d8e1fcf6db420bff90bab3de88f2588.png?generation=1631253694160552\&alt=media)

## Parameters

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2Fsync%2Fd17770067b11edf03df88afc17b87c8482e02626.png?generation=1631253694147257\&alt=media)

* Column to Tokenize - Set the text column you want to split or tokenize.
* Tokenize By - The unit of a token. The default is "words". Select from:
  * Words
  * Sentences
* Keep Other Columns - Whether the other existing columns should remain in the output. The default is FALSE.
* Keep Original Column - Whether the input column should be kept in the output. The default is No.
* With Sentence ID - Whether the sentence ID should be included in the output. The default is Yes.
* Remove Stopwords - Whether to remove stopwords (common words such as "the" or "is"). The default is Yes.
* Language for Stopwords - By default, the language is selected automatically based on the content of the text.
* Additional Stopwords - Words to be added to the default set of stopwords.
* Exclude from Stopwords - Words to be excluded from the default set of stopwords.
* Words To Be Treated As One Word - If a word or phrase that should be a single token gets split into multiple tokens, specify it here to keep it together.
* Remove Punctuations
* Remove Numbers
* Clean Up Twitter Data - Whether to remove hashtags (starting with #) and mentions (starting with @). The default is No.
* Remove Hiragana Only Words - Select an option here to treat short Japanese words written only in Hiragana, which are often not meaningful, as stopwords.
* Column Name for Output Data - The name of the new column that stores the tokenized values. The default is "token".
* Format for Output Data - The letter case of the output tokens. The default is lowercase.
  * Lowercase
  * Titlecase
  * Uppercase
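
To make the effect of these parameters more concrete, here is a minimal Python sketch of what such a tokenization step could look like. It is not the implementation behind this feature. The function `tokenize`, its argument names, and the tiny `DEFAULT_STOPWORDS` set are all hypothetical and chosen only to mirror the parameters listed above (Tokenize By, With Sentence ID, Remove Stopwords, Additional Stopwords, Exclude from Stopwords, Words To Be Treated As One Word, Remove Punctuations, Remove Numbers, Format for Output Data).

```python
import re

# Tiny illustrative stopword set (an assumption, not the product's default list).
DEFAULT_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "it", "has"}


def tokenize(text, by="words", remove_stopwords=True,
             additional_stopwords=(), exclude_from_stopwords=(),
             one_word_phrases=(), remove_punctuation=True,
             remove_numbers=True, output_case="lower"):
    """Return a list of (sentence_id, token) pairs (hypothetical helper)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if by == "sentences":
        return list(enumerate(sentences, start=1))

    # Start from the default set, add extra words, then drop excluded ones.
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in additional_stopwords}
    stopwords -= {w.lower() for w in exclude_from_stopwords}

    rows = []
    for sentence_id, sentence in enumerate(sentences, start=1):
        # Protect phrases that must stay together by joining them with "_".
        for phrase in one_word_phrases:
            sentence = re.sub(re.escape(phrase),
                              lambda m: m.group(0).replace(" ", "_"),
                              sentence, flags=re.IGNORECASE)
        for token in sentence.split():
            if remove_punctuation:
                token = re.sub(r"[^\w\s]", "", token)
            if not token:
                continue
            if remove_numbers and token.isdigit():
                continue
            if remove_stopwords and token.lower() in stopwords:
                continue
            # Restore the space inside protected phrases.
            token = token.replace("_", " ")
            if output_case == "lower":
                token = token.lower()
            elif output_case == "title":
                token = token.title()
            elif output_case == "upper":
                token = token.upper()
            rows.append((sentence_id, token))
    return rows


text = "The city of New York is big. It has 8 million people!"
print(tokenize(text, one_word_phrases=["New York"]))
# [(1, 'city'), (1, 'new york'), (1, 'big'), (2, 'million'), (2, 'people')]
```

Each output pair corresponds to one row of the result: the first element plays the role of the sentence ID and the second the value stored in the output column. Calling the sketch with `by="sentences"` returns one row per sentence instead of one row per word.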
