Tokenize Text

How to Access This Feature

From + (plus) Button

There are two ways to access. One is to access from 'Add' (Plus) button.

Another way is to access from a column header menu.

How to Use?

  • Which column has text data to tokenize? - Set a column of which you want to split the text or tokenize.

  • How do you want to tokenize? (Optional) - The default is "words". Select the unit of token from

    • "By words"

    • "By characters"

    • "By sentences"

    • "By lines"

    • "By paragraphs"

    • "By regular expression"

  • Keep Other Columns (Optional) - The default is FALSE. Whether existing columns should remain.

  • Return Result in (Optional) - The default is TRUE. Whether output should be lower cased.

  • Keep Original Column (Optional) - The default is TRUE. Whether input column should be removed.

  • Output Column Name (Optional) - The default is "token". Set a column name for the new column to store the tokenized values.

  • Generate Document ID (Optional) - The default is TRUE. Whether output should contain original document id and sentence id in each document.