Clustering
Learning setup
Clustering can be done per instance (grouping the data points of each person) or per person (comparing whole datasets to find groups of similar people).
Distance metrics:
- Euclidean distance: length of the straight line between two points (the hypotenuse of a triangle)
- Manhattan distance: sum of the distances along each dimension (the two other sides of the triangle, added together)
- Minkowski distance is a generalized form of both: with parameter p, p = 1 gives Manhattan and p = 2 gives Euclidean (see the sketch after this list)
- important to consider scaling of the data, otherwise attributes with large ranges dominate the distance
- these assume numeric values; for other attribute types you can use Gower's similarity:
- dichotomous (present or not): 1 if both present, 0 otherwise
- categorical: 1 if both are the same, 0 otherwise
- numerical: scaled absolute difference, s = 1 - |x_i - x_j| / R, where R is the range of the attribute
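A minimal sketch of these metrics, assuming plain Python lists/arrays; the helper names and the toy attribute layout are my own, not from the notes:

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def gower_similarity(x, y, kinds, ranges):
    """Gower's similarity, averaged over the attributes.

    kinds[k]  -- 'dich' (present/absent), 'cat' (categorical), or 'num'
    ranges[k] -- value range of attribute k (used for 'num' only)
    """
    s = []
    for k, kind in enumerate(kinds):
        if kind == "dich":      # 1 only if present in both
            s.append(1.0 if x[k] and y[k] else 0.0)
        elif kind == "cat":     # 1 only if both are the same
            s.append(1.0 if x[k] == y[k] else 0.0)
        else:                   # numerical: 1 - |x - y| / range
            s.append(1.0 - abs(x[k] - y[k]) / ranges[k])
    return sum(s) / len(s)

print(minkowski([0, 0], [3, 4], p=2))   # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], p=1))   # 7.0 (Manhattan)
print(gower_similarity([1, "red", 0.2], [1, "blue", 0.5],
                       kinds=["dich", "cat", "num"],
                       ranges=[None, None, 1.0]))
```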
Distance metrics for datasets:
- non-temporal
- summarize the values of each attribute over the entire dataset into a single value (mean, min, max, ...) and compare those
- estimate the parameters of a distribution per attribute (e.g. a normal distribution) and compare the estimated parameters
- compare the distributions of values for attribute i with a statistical test (e.g. Kolmogorov-Smirnov); take 1 - p as the distance metric (sketched after this list)
- temporal
- raw-data based
- simplest case: assume an equal number of points and compute the Euclidean distance on a per-point basis
- if the time series are shifted in time, use a lag to compensate (applied across all attributes); choose the shift with the highest cross-correlation coefficient (sketched after this list)
- if the series differ in frequency (e.g. people walking at different paces), use dynamic time warping (DTW), which pairs points non-linearly (sketched after this list)
- pairing constraints: time order must be preserved (monotonicity), and the first and last points must be matched
- feature-based: extract features from each series and compute distances on those
- model-based
- fit a time series model (e.g. ARIMA) to each series and use the fitted parameters as the basis for the distance
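A sketch of the Kolmogorov-Smirnov based dataset distance using scipy; the synthetic data stands in for one attribute's values in two datasets:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance(values_a, values_b):
    """1 - p of a two-sample KS test: similar distributions -> high p -> small distance."""
    return 1.0 - ks_2samp(values_a, values_b).pvalue

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)   # attribute i in dataset A
b = rng.normal(0.5, 1.0, 200)   # attribute i in dataset B
print(ks_distance(a, b))
```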
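A sketch of lag selection via cross-correlation, assuming two single-attribute series of equal length; `best_lag` is my own helper, not an established API:

```python
import numpy as np

def best_lag(x, y, max_lag):
    """Return the shift of y (in samples) with the highest correlation coefficient."""
    best, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # overlap the two series at this shift
        xs, ys = (x[lag:], y[:len(y) - lag]) if lag >= 0 else (x[:lag], y[-lag:])
        r = np.corrcoef(xs, ys)[0, 1]
        if r > best_r:
            best, best_r = lag, r
    return best, best_r

t = np.linspace(0, 4 * np.pi, 200)
x = np.sin(t)
y = np.sin(t - 0.5)             # same signal, shifted in time
print(best_lag(x, y, max_lag=30))
```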
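A minimal dynamic-programming DTW sketch (my own implementation): the step pattern enforces monotonicity, and starting at the first pair and ending at the last pair gives the boundary condition:

```python
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0                        # boundary: start at the first pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stay on y[j-1]
                                 cost[i, j - 1],      # stay on x[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]                       # boundary: end at the last pair

slow = np.sin(np.linspace(0, 2 * np.pi, 80))   # same motion at different paces
fast = np.sin(np.linspace(0, 2 * np.pi, 50))
print(dtw_distance(slow, fast))
```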
Clustering approaches
Options:
- k-means: choose k, start with k random cluster centers, assign each point to the closest center, recompute each center as the mean of its points, and repeat until the assignments stop changing (sketched after this list)
- performance metric: silhouette, which compares a point's average distance within its own cluster to its distance to the nearest other cluster (ranges from -1 to 1, higher is better)
- k-medoids: use actual points (medoids) as cluster centers instead of artificial means
- hierarchical:
- divisive: start with one big cluster, split in each step
- find the dissimilarity of each point and move the most dissimilar one out of the cluster
- select the cluster with the largest diameter to split next
- agglomerative: start with one cluster per point, merge
- merge using single linkage (minimum distance between points of the two clusters), complete linkage (maximum distance), group average, or Ward's criterion (minimize the increase in within-cluster variance); see the sketch after this list
- subspace: cluster within subspaces of the feature space; with a huge number of features, distances in the full space become uninformative (everything ends up roughly equidistant)
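A sketch of k-means plus the silhouette metric using scikit-learn; the blob data and k = 3 are arbitrary choices for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# silhouette in [-1, 1]: within-cluster distance vs. distance to the
# nearest other cluster; closer to 1 means tighter, better-separated clusters
print(silhouette_score(X, km.labels_))
```

Running this for several values of k and keeping the one with the best silhouette is a common way to choose k.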
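And a sketch of agglomerative clustering with the four linkage criteria, using scipy's hierarchy module on toy two-cluster data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # merge history, one row per merge
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```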