Machine Learning


Models for sequential data

Sequences

They consist of numbers or symbols:

- numeric: 1-dimensional, e.g. a stock price over time. Can be n-dimensional.
- symbolic (categorical): 1-dimensional, like an English sequence of words/characters. Can be n-dimensional, with multiple categorical features per timestamp (like sheet music).

We could have one sequence per instance and try to classify the sequences (like email spam/not spam), or the whole dataset is one sequence and the instances are ordered in time.

single sequence feature extraction:

Major key: think about the real-world use case. E.g. if we want to predict future values, the training data shouldn't contain things that happen later than the test data.

You can do walk-forward validation if the target labels have a meaningful ordering in time:

(figure: walk-forward validation)
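As a sketch of how this looks in code (the `walk_forward_splits` helper and its parameters are my own hypothetical names, not from the course; scikit-learn's `TimeSeriesSplit` provides something similar):

```python
# Hypothetical sketch of walk-forward validation: train only on data that
# precedes the evaluation window, then slide the window forward in time.
import numpy as np

def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield np.arange(0, train_end), np.arange(train_end, test_end)

# Example: 100 time-ordered instances, 4 folds, at least 20 training points.
for train_idx, test_idx in walk_forward_splits(100, 4, 20):
    print(train_idx[[0, -1]], test_idx[[0, -1]])
```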

When modelling probability, break the sequence into its tokens, like words in a sentence. Each token is modeled as a random variable (not independent of the others). So you end up with a joint distribution P(W₄, W₃, W₂, W₁) (with a huge number of parameters if modelled directly).

Can apply chain rule of probability:

$\begin{aligned} P(W_4, W_3, W_2, W_1) &= P(W_4, W_3, W_2 | W_1) P(W_1) \\ &= P(W_4, W_3 | W_2, W_1) P(W_2 | W_1) P(W_1) \\ &= P(W_4 | W_3, W_2, W_1) P(W_3 | W_2, W_1) P(W_2 | W_1) P(W_1) \end{aligned}$

I.e., we can rewrite the probability of a sentence as a product of the probability of each word, conditioned on its history. With log probabilities, you get a sum: $\log{P(\text{sentence})} = \sum_{\text{word}} \log{P(\text{word} | \text{words before it})}$
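A minimal sketch of that decomposition (the `cond_prob` argument is a stand-in for whatever model supplies $P(\text{word} | \text{history})$; the function name is my own):

```python
import math

def sentence_log_prob(words, cond_prob):
    """Chain rule: log P(sentence) = sum over words of log P(word | words before it).

    `cond_prob(word, history)` is assumed to return P(word | history) > 0.
    """
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(cond_prob(w, tuple(words[:i])))
    return total
```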

Markov models

Markov assumption: limit the amount of memory for previous tokens. e.g. retain a max of 2 words. The “order” is the number of words retained in the conditional.

For example, if the conditional is $P(x | \text{i, will, graduate, in, a, decade})$ and it's a third-order model, the Markov assumption keeps only the three most recent words: $P(x | \text{in, a, decade})$.

With Markov assumption and chain rule, can model sequence as limited-memory conditional probabilities. These can be estimated from a corpus (huge piece of text).

For example, to estimate the probability of the word 'prize' given "won a", count how often "won a prize" occurs in the corpus as a proportion of the total occurrences of "won a":

$P(\text{prize} | \text{won, a}) \approx \frac{\text{# won a prize}}{\text{# won a}}$

The word snippets are “n-grams”. Three words is a trigram, two words is a bigram. I guess one word is just a gram. And maybe 1000 words would be a kilogram.
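As a toy sketch of this counting estimate (the `trigram_model` helper is hypothetical, and a real model would smooth the counts to handle unseen n-grams):

```python
from collections import Counter

def trigram_model(tokens):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(w3, w1, w2):
        if bigrams[(w1, w2)] == 0:
            return 0.0          # unseen context; real models smooth this
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    return prob

corpus = "he won a prize and she won a medal and he won a prize".split()
p = trigram_model(corpus)
print(p("prize", "won", "a"))   # count("won a prize") / count("won a") = 2/3
```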

Sequential sampling: start with a small seed of words, then repeatedly sample the next word according to its probability given the previous words.
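A sketch of sequential sampling with the same trigram-counting idea as above (names and the stopping rule are my own assumptions):

```python
import random
from collections import Counter, defaultdict

def sample_text(tokens, seed, length):
    """Start from a two-word seed, repeatedly sample the next word ~ P(w | previous two)."""
    nexts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        nexts[(w1, w2)][w3] += 1
    out = list(seed)
    for _ in range(length):
        counts = nexts.get(tuple(out[-2:]))
        if not counts:
            break                                  # context never seen; stop
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

corpus = "he won a prize and she won a medal and he won a prize".split()
print(sample_text(corpus, seed=("won", "a"), length=5))
```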

Embedding models

Model an object $x$ by an embedding vector $e_x$. Similarities between these vectors represent similarities between the objects: for words, distances and directions in the embedding space reflect semantic meaning.

Distributional hypothesis: words that often occur in the same contexts tend to have similar meanings.

1-hot vector: represent each word as an atomic object, a vector that is all zeros except for a single 1 at that word's index. All one-hot vectors are equidistant, so they carry no information about similarity between words.

Word2Vec: learns word embeddings by training a shallow network to predict a word from its context (CBOW) or the context from the word (skip-gram); the trained weights are used as the embedding vectors.
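A minimal sketch of training such embeddings in practice, assuming the gensim library (library choice, toy corpus, and hyperparameters are mine, not the course's):

```python
# Assumes gensim 4.x is installed; hyperparameters are arbitrary toy values.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)
vec = model.wv["cat"]                        # the embedding vector e_cat
print(model.wv.most_similar("cat", topn=2))  # nearby words in embedding space
```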

Recurrent neural networks

Neural network with cycles in it (used for sequences).

Can be used for:

Example: a fully connected network with the input x extended by three extra nodes, into which the hidden layer from the previous timestep is copied:

(figure: fully connected network with the hidden layer copied back into three extra input nodes)
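A minimal numpy sketch of that idea, i.e. one step of an Elman-style RNN (the `rnn_step` name, shapes, and toy dimensions are my own assumptions):

```python
import numpy as np

def rnn_step(x, h_prev, W, b, V, c):
    """One step: new hidden state from [x; h_prev], output from the hidden state."""
    z = np.concatenate([x, h_prev])        # input extended with the copied hidden layer
    h = np.tanh(W @ z + b)                 # new hidden state
    y = V @ h + c                          # output at this timestep
    return y, h

# Toy dimensions: 4 input features, hidden size 3 (the "three extra nodes").
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 7)), np.zeros(3)
V, c = rng.normal(size=(2, 3)), np.zeros(2)
h = np.zeros(3)
for x in rng.normal(size=(5, 4)):          # a sequence of 5 timesteps
    y, h = rnn_step(x, h, W, b, V, c)
```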

Visual shorthand:

(figures: visual shorthand for the recurrent connection)

Training RNNs: unroll the recurrence over the timesteps and apply ordinary backpropagation (backpropagation through time), with the weights shared between all copies.

Basic RNNs work well, but they don't learn to remember information for a long time. You can't have a long-term memory for everything; you need to be selective. In order to remember things long term, you need to forget a lot of other stuff (such is life).

LSTMs

“Long short-term memory”. Selective forgetting and remembering, controlled by learnable “gates”. Side note, from now on I’m not “studying”, but I’m “selectively forgetting and remembering”.

*[tanh]: sigmoid rescaled so its outputs are between -1 and 1

The gating mechanism takes two input vectors, which are combined with sigmoid and tanh activations. It produces an additive value: we want to figure out how much of the input to add to some other vector. The tanh maps the input into the range (-1, 1), which limits the effect of the addition; the sigmoid acts as a selection vector, with values near 0 blocking an element and values near 1 letting it through.

(figure: the LSTM gating mechanism)

The basic operation of the LSTM is the "cell". There are two recurrent connections between cells: the current output y, and the cell state C.
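A hedged numpy sketch of one cell step, following the standard LSTM equations rather than any course-specific notation (weight layout, names, and toy sizes are my own; y and C are the two recurrent values mentioned above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, C_prev, W, b):
    """One LSTM cell step. W maps [x; y_prev] to the four gate pre-activations.

    Returns the two recurrent values: the output y and the cell state C.
    """
    z = np.concatenate([x, y_prev])
    f, i, o, g = np.split(W @ z + b, 4)                  # forget, input, output gates + candidate
    C = sigmoid(f) * C_prev + sigmoid(i) * np.tanh(g)    # selectively forget old state, add new input
    y = sigmoid(o) * np.tanh(C)                          # selectively expose the cell state as output
    return y, C

# Toy sizes: input 4, hidden 3.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(12, 7)), np.zeros(12)
y, C = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):
    y, C = lstm_step(x, y, C, W, b)
```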

I don’t yet know how much detail we need to know about this, so I’ll fill it in later based on exam questions.

The prof’s summary: “incredibly powerful language models. Tricky to train, very opaque.” Yep, opaque and complicated, indeed.