Feature engineering

Creating useful features in different domains.

Time domain

Numerical

goal: summarize the values of numerical attribute i within a time window
assume a temporal ordering $x_{1}^{i}, \dots, x_{N}^{i}$
select a window size λ
for each time point t, select the values in the window $[x_{t-\lambda}^{i}, \dots, x_{t}^{i}]$
compute the new feature value for time point t by aggregating over those values (e.g. mean, minimum, maximum, standard deviation)
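
The steps above can be sketched as follows. This is a minimal illustration, assuming the series is a plain Python list and the mean is the default aggregation; the name `window_features` is hypothetical, and leaving time points without a full window as `None` is one possible choice among several.

```python
from statistics import mean


def window_features(values, lam, aggregate=mean):
    """Aggregate values[t - lam .. t] for each time point t.

    Time points without a full window (t < lam) get None (an assumption).
    """
    features = []
    for t in range(len(values)):
        if t < lam:
            features.append(None)  # not enough history yet
        else:
            window = values[t - lam : t + 1]  # lam + 1 values, including t
            features.append(aggregate(window))
    return features


heart_rate = [60, 62, 64, 70, 80, 78]
print(window_features(heart_rate, lam=2, aggregate=max))
# [None, None, 64, 70, 80, 80]
```

Any function that maps a list of numbers to one number (for example `min`, `max`, or `statistics.stdev`) can be passed as `aggregate`.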

Categorical

generate patterns combining categorical values over time
consider a window size λ
consider different relationships: succession “(b)” (one value occurs before another), co-occurrence “(c)” (values occur at the same time point)
support: the fraction of all time points at which the pattern occurs
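
A small sketch of how support could be computed for the two relationships. It assumes each time point is a dict mapping attribute names to categorical values, and that succession means the first value occurred within the previous λ time points; both are assumptions made for illustration, and all function names are hypothetical.

```python
def occurs(rows, t, pair):
    """True if the (attribute, value) pair holds at time point t."""
    attribute, value = pair
    return rows[t].get(attribute) == value


def cooccurs(rows, t, first, second):
    """Co-occurrence (c): both pairs hold at the same time point t."""
    return occurs(rows, t, first) and occurs(rows, t, second)


def succeeds(rows, t, first, second, lam):
    """Succession (b): `second` holds at t and `first` held at an
    earlier time point within the previous lam time points."""
    if not occurs(rows, t, second):
        return False
    start = max(0, t - lam)
    return any(occurs(rows, s, first) for s in range(start, t))


def support(rows, matches):
    """Fraction of all time points at which the pattern matches."""
    hits = sum(1 for t in range(len(rows)) if matches(t))
    return hits / len(rows)


rows = [
    {"activity": "walking", "location": "outside"},
    {"activity": "running", "location": "outside"},
    {"activity": "sitting", "location": "home"},
    {"activity": "running", "location": "outside"},
]
walk = ("activity", "walking")
run = ("activity", "running")
outside = ("location", "outside")

print(support(rows, lambda t: cooccurs(rows, t, run, outside)))      # 0.5
print(support(rows, lambda t: succeeds(rows, t, walk, run, lam=1)))  # 0.25
```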

Mixed

Derive categories from the numerical values and apply the categorical approach:

ranges (low, normal, high)
temporal relations (increasing, decreasing)
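
Both kinds of categories can be derived with a few lines of code. The sketch below assumes fixed thresholds for the ranges and compares each value with its predecessor for the temporal relation; the thresholds, the function names, and the extra "stable" category for equal values are illustrative assumptions.

```python
def to_range(value, low, high):
    """Map a numerical value to 'low', 'normal', or 'high'."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def to_trend(previous, current):
    """Map two consecutive values to a temporal relation."""
    if current > previous:
        return "increasing"
    if current < previous:
        return "decreasing"
    return "stable"  # assumption: the notes only name increasing/decreasing


heart_rate = [55, 70, 110, 95]
ranges = [to_range(v, low=60, high=100) for v in heart_rate]
trends = [to_trend(a, b) for a, b in zip(heart_rate, heart_rate[1:])]
print(ranges)  # ['low', 'normal', 'high', 'normal']
print(trends)  # ['increasing', 'increasing', 'decreasing']
```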

Pattern generation

focus only on patterns with sufficient support
start with patterns consisting of a single attribute-value pair with sufficient support, then extend these to larger patterns
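
A minimal sketch of this idea, restricted to co-occurrence patterns for brevity. It assumes each time point is a dict of attribute-value pairs and that larger patterns are built by combining frequent single-pair patterns; the extension step and the names `support` and `frequent_patterns` are assumptions for illustration.

```python
from itertools import combinations


def support(rows, pattern):
    """Fraction of time points at which all pairs in the pattern hold."""
    hits = sum(1 for row in rows if all(row.get(a) == v for a, v in pattern))
    return hits / len(rows)


def frequent_patterns(rows, min_support):
    # 1-patterns: every distinct (attribute, value) pair in the data
    pairs = sorted({(a, v) for row in rows for a, v in row.items()})
    frequent_1 = [(p,) for p in pairs if support(rows, (p,)) >= min_support]
    # 2-patterns: combine only the 1-patterns that were frequent enough
    candidates = [a + b for a, b in combinations(frequent_1, 2)]
    frequent_2 = [c for c in candidates if support(rows, c) >= min_support]
    return frequent_1 + frequent_2


rows = [
    {"activity": "walking", "location": "outside"},
    {"activity": "running", "location": "outside"},
    {"activity": "sitting", "location": "home"},
    {"activity": "running", "location": "outside"},
]
for pattern in frequent_patterns(rows, min_support=0.5):
    print(pattern, support(rows, pattern))
```

Because a combined pattern can never occur more often than its parts, discarding infrequent single pairs first keeps the number of candidates small.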

Frequency domain

Consider the series of values within a window of size λ. Apply a Fourier transform to find which frequencies are present in the window: the signal is decomposed into sinusoids with different periods, each a multiple of a base frequency (the lowest frequency for which one complete sinusoid period fits in the window).

Derive feature values from the result, such as the frequency with the highest amplitude, and normalize them.
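
To illustrate, here is a sketch that computes a discrete Fourier transform of one window directly from its definition and picks the frequency with the highest amplitude. It assumes evenly spaced samples; dividing by the window length is one possible normalization, not necessarily the one intended above, and the function names are hypothetical. In practice a library FFT routine would be used instead of this O(n²) loop.

```python
import cmath
import math


def amplitudes(window):
    """Amplitude per frequency index k (k full cycles per window), k = 0 .. n//2."""
    n = len(window)
    result = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)
        )
        result.append(abs(coefficient) / n)  # scaled by window length
    return result


def dominant_frequency(window):
    """Index k >= 1 with the highest amplitude (k = 0 is the mean, so it is skipped)."""
    amps = amplitudes(window)
    return max(range(1, len(amps)), key=lambda k: amps[k])


# A sinusoid that completes 3 periods within a window of 16 samples
window = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
print(dominant_frequency(window))  # 3
```

Here k = 1 corresponds to the base frequency: exactly one sinusoid period per window.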

Unstructured data - text

Perform a number of preprocessing steps:

tokenization: split the text into sentences and words
lowercasing: convert everything to lower case
stemming: reduce each word to its stem, so that all variations of a word map to the same stem
stop word removal: drop words like ‘the’ that are not predictive
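
The four steps can be chained as in the sketch below. The stop word list is a tiny illustrative sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemming algorithm; both are assumptions, as are the function names.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}  # illustrative only
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def crude_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, lowercased, non-stop words per sentence."""
    # tokenization into sentences
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        # lowercasing and tokenization into words
        words = re.findall(r"[a-z']+", sentence.lower())
        # stemming
        stems = [crude_stem(w) for w in words]
        # stop word removal
        result.append([w for w in stems if w not in STOP_WORDS])
    return result


print(preprocess("The dogs are walking. The dog walked!"))
# [['dog', 'walk'], ['dog', 'walk']]
```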

Approaches:

bag of words: count occurrences of n-grams (n consecutive words)
TF-IDF: word frequencies, weighted so that words occurring in few documents count more
topic modeling: assume the text covers a number of topics, each associated with certain words, and use the topics as features instead of the words
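
The first two approaches can be sketched as follows, assuming documents have already been preprocessed into lists of tokens. The TF-IDF weight used here, raw count times log(N / document frequency), is one common variant among several; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens, n=1):
    """Count the occurrences of each n-gram."""
    return Counter(ngrams(tokens, n))


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


docs = [["dog", "walk", "park"], ["dog", "sleep"], ["dog", "walk"]]
print(bag_of_words(["dog", "walk", "dog", "walk"], n=2))
print(tf_idf(docs)[0])
```

In this example "dog" occurs in every document and gets weight 0, while "park" occurs in only one document and gets the highest weight.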

Machine Learning for the Quantified Self
