# Distance

## Introduction

Calculate distances of each of the pairs of the variables or categories. This can be used to find similarities too. The closer the distance, the more similar they are.

## How to Access?

There are two ways to access. One is to access from 'Add' (Plus) button.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNF0Pv64WqyUzWgF2%2Fdist_add.png?generation=1586795488463333\&alt=media)

Another way is to access from a column header menu.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNF0RTvQf3fOraWfQ%2Fdist_cols.png?generation=1586795488523145\&alt=media)

## How to Use?

### Calculate Distances Among Multiple Columns (Variables)

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNF0T_Nak_0Axead2%2Fdist_cols_dialog.png?generation=1586795488538753\&alt=media)

#### Column Selection

There are many ways to select columns. You can choose

* Select Column Names - Listing up columns selecting one by one
* Range of Column Position - Select columns between columns chosen as Start and End
* Starts with - Select columns whose names start with a certain text
* Ends with - Select columns whose names end with a certain text
* Contains - Select columns whose names contain a certain text.
* Matches Regular Expression - Select columns whose names contain a certain text.
* Range of Suffix (X1, X2...) - Select columns names with prefix and numbers.
* Everything - All columns.
* All Numeric Columns - All numeric columns.

### Calculate Distances Among Categories

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNF0VYAf56CEtPOa9%2Fdist_skv_dialog.png?generation=1586795488427396\&alt=media)

#### Column Selection

Category, dimension and measure are like this.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNDaNPjridRfO5eqE%2Fskv_origin.png?generation=1586795484528788\&alt=media)

Category column is a column that has categories. They are parameterized by measures with the dimensions.

![](https://2850417076-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4HLCK3olgduYoe3RVS%2F-M4oMvCUDQwHTJ0eWi_f%2F-M4oNDaSq8iW09ffLg4h%2Fskv_expand.png?generation=1586795484462447\&alt=media)

In this case, distances of airline carriers are calculated based on count of flight. Think that each carrier is represented as a vector with of flight count in each week day and distances of them are calculated.

If there are duplicated values or missing values for a cell, they will be aggregated by "Aggregate with" or filled by "Fill with".

### Parameters

* Distance Method (Optional) - Type of how to calculate distance. The default is "Euclidean".
  * Euclidean - The most common measurement of distance. sqrt(sum((x - y)^2))
  * Maximum - Maximum difference between elements of two vectors.
  * Manhattan - Sum of difference in each elements of two vectors.
  * Canberra - Manhattan distance weighted by the elements. sum(|x\_i - y\_i| / |x\_i + y\_i|)
  * Binary - non-zero elements are regarded as 1 and zero elements are regarded as 0. The distance is the number of elements only one is 1 divided by numbers of them at least one is 1.
  * Minkowski - This is generalized case of euclidean and manhattan. If p=2, it becomes euclidean and if p=1, it becomes manhattan. sum((x - y)^p)^(1/p)
* Keep Only Unique Pairs (Optional) - The default is FALSE. Whether the pair of output should be unique. If this is TRUE, a pair appears only once but if it's FALSE, a pair appears twice in swapped order. If you want to filter the pairs by names, it's better to be FALSE.
* Keep Diagonal Pairs (Optional) - The default is FALSE. Whether the output should contain the similarity of documents with itself.
* Output Type (Optional) - Type of output.
  * "Distance values for each pair" - Returns a pair of categories and their distance in each row.
  * "Position values in X/Y coordinates" - Returns coordinates for plot after dimension reduction by [MDS](https://github.com/exploratory-io/book/blob/master/mds.html).

Take a look at the [reference document](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html) for the 'dist' function from base R for more details on the parameters.
