Text Clustering with Topic Model (LDA)

Clusters documents by the topics they discuss, using a topic model. The topic model algorithm is LDA (Latent Dirichlet Allocation).

This is a fuzzy clustering: the topic model gives each document a ratio for every topic, rather than labeling the document with a single topic.

Input Data

This Analytics View comes in two types. "Topic Model (Text Data)" handles raw text data that has not been tokenized. "Topic Model (Tokenized Data)" handles text data that has already been tokenized.

Input data for "Topic Model (Text Data)" should contain the following columns.

  • Text Column - The column of the text data. Each row is treated as a document.

  • Category (Optional) - Category to compare with the clustering result.

Input data for "Topic Model (Tokenized Data)" should contain the following columns.

  • Words - The tokenized words.

  • Group By - The ID of the document the word belongs to.

  • Category (Optional) - Category to compare with the clustering result.
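The relationship between the two input shapes can be sketched as follows. This is a minimal illustration in Python, not Exploratory's own tokenizer: it splits on whitespace only, and the function name `to_tokenized_rows` and the keys `document_id` and `word` are hypothetical stand-ins for the Group By and Words columns.

```python
def to_tokenized_rows(documents):
    """Turn one-row-per-document text into one-row-per-word records.

    `documents` maps a document ID to its raw text. Each output record
    carries a word plus the ID of the document it came from, which is the
    role played by the Words and Group By columns.
    """
    rows = []
    for doc_id, text in documents.items():
        # Naive whitespace tokenizer (assumption, for illustration only).
        for word in text.lower().split():
            rows.append({"document_id": doc_id, "word": word})
    return rows


docs = {1: "Great battery life", 2: "Battery died quickly"}
tokenized = to_tokenized_rows(docs)
```

Each of the two documents yields three records, so `tokenized` holds six rows in total, every one tagged with its document ID.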

Properties

  • Sample Data Size - Number of rows to sample before running the analysis.

  • Topic Model (LDA)

    • Number of Topics

    • Random Seed

  • Text Tokenization

    • Remove Stopwords - Default is Yes.

    • Language for Stopwords - By default it is automatically selected based on the content of the text.

    • Additional Stopword Dictionary - Dictionary of the words to be added to the default set of stopwords.

    • Additional Stopwords - Words to be added to the default set of stopwords and the words in the stopword dictionary.

    • Exclude from Stopwords - Words to be excluded from the default set of stopwords.

    • Compound Word Dictionary - Dictionary of the compound words. If a word or phrase that should be treated as one token is separated into multiple tokens, it can be fixed by creating a dictionary and adding the word/phrase to it.

    • Additional Compound Words - Compound words to be added to the words in the compound word dictionary.

    • Remove Punctuations

    • Remove Numbers

    • Remove Alphabets

    • Remove URLs

    • Clean Up Twitter Data - Whether to remove hashtags (starting with #) and mentions (starting with @). The default is No.

    • Remove Hiragana Only Words - Select an option here to treat short Japanese Hiragana-only words, which are often meaningless, as stopwords.

  • Top Words - This section is about "Top Words" View.

    • Number of Words to Show - Number of top words to show in the bar charts for each topic.

  • Documents - This section is about "Documents" View.

    • Number of Documents to Show - Number of documents to show for each topic.

    • Baseline Probability for Word Highlight - A word in a document is highlighted if its occurrence probability from the topic in the document is higher than this threshold.

  • Data (Words)

    • Number of Words to Show - Number of words to show in the word-topic matrix in the "Data (Words)" View.
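To make the Text Tokenization properties above more concrete, here is a rough Python sketch of how such options might combine. It is an assumption-laden illustration, not Exploratory's implementation: the function `clean_tokens`, its parameter names, the tiny `DEFAULT_STOPWORDS` set, and the regular expressions are all hypothetical, and dictionary-based options and compound words are left out.

```python
import re
import string

# Tiny illustrative stopword set (assumption); real defaults are much larger.
DEFAULT_STOPWORDS = {"the", "a", "is", "to", "and"}


def clean_tokens(text, remove_stopwords=True, additional_stopwords=(),
                 exclude_from_stopwords=(), remove_punctuation=True,
                 remove_numbers=True, remove_urls=True, clean_twitter=False):
    """Tokenize `text` on whitespace after applying the selected clean-ups."""
    # Exclusions apply to the default set; additional stopwords are added on top.
    stopwords = (DEFAULT_STOPWORDS - set(exclude_from_stopwords)) | set(additional_stopwords)
    if remove_urls:
        text = re.sub(r"https?://\S+", " ", text)
    if clean_twitter:
        # Drop hashtags (#...) and mentions (@...) entirely.
        text = re.sub(r"[#@]\w+", " ", text)
    if remove_punctuation:
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        text = text.translate(table)
    tokens = text.lower().split()
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


sample = "The battery is great! See https://example.com #win 42"
default_tokens = clean_tokens(sample)
twitter_tokens = clean_tokens(sample, clean_twitter=True)
```

With the defaults, the hashtag survives as the plain word `win` because only its `#` is stripped as punctuation; turning on `clean_twitter` removes the whole hashtag before tokenizing.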

How to Use This Feature

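  1. Under Analytics view, select "Topic Model (Text Data)" or "Topic Model (Tokenized Data)" for Analytics Type.

  2. Select a column for Text Column. For "Topic Model (Tokenized Data)", select columns for Words and Group By instead.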

  3. Click the Run button to run the analytics.

  4. Select each view type (explained below) to see the details of the analysis.

"Summary" View

"Summary" View shows the number of documents for each topic. Because the topic model gives each document a ratio for every topic rather than a single label, each document is counted under the topic that has the highest ratio for that document.
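The counting rule described above can be sketched in a few lines of Python. The topic ratios below are made-up illustrative values, and the function name `count_by_dominant_topic` is hypothetical; this only shows the idea of counting each document under its highest-ratio topic.

```python
from collections import Counter


def count_by_dominant_topic(doc_topic_ratios):
    """Count documents by the topic with the highest ratio in each document."""
    dominant = [max(ratios, key=ratios.get) for ratios in doc_topic_ratios]
    return Counter(dominant)


# One dict of topic ratios per document (illustrative values, not real output).
doc_topic_ratios = [
    {"Topic 1": 0.70, "Topic 2": 0.20, "Topic 3": 0.10},
    {"Topic 1": 0.15, "Topic 2": 0.60, "Topic 3": 0.25},
    {"Topic 1": 0.55, "Topic 2": 0.05, "Topic 3": 0.40},
]
summary = count_by_dominant_topic(doc_topic_ratios)
```

Here two documents fall under Topic 1 and one under Topic 2. Topic 3 gets no document even though the third document is 40% about it, which is the information a single-label count leaves out.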

"Top Words" View

"Top Words" View shows a bar chart for each topic that shows the words with the highest probability of occurrence.
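As a small sketch of the selection behind such a chart, the Python snippet below picks the highest-probability words for each topic. The probabilities are invented for illustration, and `top_words` and its parameter `n` (standing in for Number of Words to Show) are hypothetical names.

```python
def top_words(word_probs_by_topic, n):
    """Return the n words with the highest occurrence probability per topic."""
    return {
        topic: sorted(probs, key=probs.get, reverse=True)[:n]
        for topic, probs in word_probs_by_topic.items()
    }


# Occurrence probability of each word within each topic (illustrative values).
word_probs_by_topic = {
    "Topic 1": {"battery": 0.30, "charge": 0.25, "screen": 0.05, "price": 0.02},
    "Topic 2": {"price": 0.40, "shipping": 0.20, "battery": 0.03, "screen": 0.01},
}
top = top_words(word_probs_by_topic, n=2)
```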

"Top Words by Category" View

"Top Words by Category" View appears only when the optional Category column is specified. It shows percent stacked bar charts of how the occurrences of each topic's top words are distributed across the document categories.

"Documents" View

"Documents" View appears only with the "Topic Model (Text Data)" Analytics View. It shows typical documents for each topic along with the topic ratios of each document. The documents shown here are the ones that have the highest ratio of the topic among all the documents.
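The selection of typical documents can be sketched as a ranking by topic ratio. This Python snippet assumes a simple mapping from document ID to topic ratios with invented values; `typical_documents` and its parameter `n` (standing in for Number of Documents to Show) are hypothetical names.

```python
def typical_documents(doc_topic_ratios, topic, n):
    """Return the IDs of the n documents with the highest ratio of `topic`."""
    ranked = sorted(
        doc_topic_ratios.items(),
        key=lambda item: item[1][topic],
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]


# Topic ratios per document (illustrative values, not real output).
doc_topic_ratios = {
    "doc_a": {"Topic 1": 0.70, "Topic 2": 0.30},
    "doc_b": {"Topic 1": 0.20, "Topic 2": 0.80},
    "doc_c": {"Topic 1": 0.90, "Topic 2": 0.10},
}
topic1_docs = typical_documents(doc_topic_ratios, "Topic 1", n=2)
topic2_docs = typical_documents(doc_topic_ratios, "Topic 2", n=1)
```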

"Category (Ratio)" View

"Category (Ratio)" View appears only when the optional Category column is specified. It shows a percent stacked bar chart that shows the ratio of the document's category for each topic.

"Category (Path)" View

"Category (Path)" View appears only when the optional Category column is specified. It shows a parallel categories diagram that shows the ratio of the document's category for each topic.

"Data (Full)" View

"Data (Full)" View shows the data frame with each row representing a document, with additional columns for ratios of topics for each document.

"Data (Words)" View

"Data (Words)" View shows the word-topic matrix, which gives the occurrence probability of each word in each topic. The words are sorted so that the top words of each topic appear close to each other.
