1 feature x: $f_{w,b}(x) = wx + b$
2 features x1, x2: $f_{w_1,w_2, b}(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$
Generally,
$ \begin{aligned} f_{w, b}(x) &= w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b \\ &= w^T x + b \end{aligned} $, where $w^T x = \sum_{i} w_i x_i = \|w\| \|x\| \cos{\alpha}$ ($\alpha$ is the angle between $w$ and $x$)
with $w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}$ (the weight vector) and $x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$ (the feature vector)
Define a loss function, then search the model space for the model which minimizes it.
$\text{loss}_{x,y}(p) = \frac{1}{n} \sum_j (f_p (x_j) - y_j)^2$
The residuals $f_p(x_j) - y_j$ measure how far each prediction is from the true value.
Why square? It makes every term positive, and it penalizes large residuals (outliers) more heavily.
$\hat p = \argmin_p \text{loss}_{x,y}(p)$
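A minimal numpy sketch of the linear model and this squared-error loss (the function and variable names are my own):

```python
import numpy as np

def f(w, b, x):
    """Linear model: f_{w,b}(x) = w^T x + b."""
    return x @ w + b

def loss(w, b, X, y):
    """Mean squared error over the dataset (X: n x d, y: n)."""
    residuals = f(w, b, X) - y
    return np.mean(residuals ** 2)

# Example: 2 features, 3 instances
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
print(loss(np.array([1.0, 0.5]), 0.0, X, y))
```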
To escape local minima: add randomness, add multiple models
To converge faster: combine known good models (breeding), inspect the local neighbourhood
Random search is simple: you only need to be able to compute the loss function and to sample points close to the current one.
Start with a random point p in model space.
loop:
  pick random point p' close to p
  if loss(p') < loss(p):
    p <- p'
You need to define what ‘close to’ means though.
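For illustration, a runnable sketch of this random search on the linear model, where 'close to' is taken to mean adding small Gaussian noise to the parameters (an assumption, since the notes leave the choice open):

```python
import numpy as np

def random_search(X, y, steps=10000, scale=0.1, rng=np.random.default_rng(0)):
    """Random search over p = (w_1..w_d, b) for the squared-error loss."""
    n, d = X.shape
    p = rng.standard_normal(d + 1)
    def loss(p):
        return np.mean((X @ p[:d] + p[d] - y) ** 2)
    for _ in range(steps):
        p_new = p + rng.normal(scale=scale, size=p.shape)  # a point "close to" p
        if loss(p_new) < loss(p):                          # only accept improvements
            p = p_new
    return p

# Usage with the toy data from before:
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
print(random_search(X, y))
```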
Convexity: the property of having a single minimum; formally, for any two points on the function's graph, the line segment between them lies above (or on) the graph.
The issue with random search is it can get stuck in a local minimum. In many situations, local minima are fine, we don’t always need an algorithm for a guaranteed global minimum.
In discrete model spaces (which have a more graph-like structure), you need to figure out a transition function.
‘Improved’ random search: also accept a point with a *worse* loss every now and then, with some small probability, so the search can escape local minima (essentially simulated annealing).

pick random point p in model space
loop:
  pick random point p' close to p
  if loss(p') < loss(p):
    p <- p'
  else:
    with some small probability: p <- p'
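A sketch of that idea, assuming the improvement is a simulated-annealing-style acceptance rule; the acceptance probability and cooling schedule below are my own choices:

```python
import numpy as np

def annealed_search(loss, p0, steps=10000, scale=0.1, temp=1.0,
                    rng=np.random.default_rng(0)):
    """Random search that sometimes accepts worse points (simulated annealing)."""
    p = p0
    for t in range(steps):
        p_new = p + rng.normal(scale=scale, size=p.shape)
        diff = loss(p_new) - loss(p)
        # Always accept improvements; accept worse points with a probability
        # that shrinks as the loss difference grows and as the search cools down.
        T = temp * (1 - t / steps) + 1e-9
        if diff < 0 or rng.random() < np.exp(-diff / T):
            p = p_new
    return p
```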
Can also do these searches in parallel, or even parallel with some communication between searches.
Population methods, e.g. evolutionary algorithms (code sketch after this pseudocode):

start with a population of k models
loop:
  rank the population by loss
  remove the half with the worst loss
  "breed" a new population of k models
  optionally, add a little noise to each child
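A minimal evolutionary-search sketch, given some loss function over a parameter vector; "breeding" here means averaging two random parents, which is one common choice and an assumption on my part:

```python
import numpy as np

def evolve(loss, k=50, dim=3, steps=200, noise=0.05,
           rng=np.random.default_rng(0)):
    """Simple evolutionary search: rank, cull the worst half, breed, mutate."""
    pop = rng.standard_normal((k, dim))
    for _ in range(steps):
        pop = pop[np.argsort([loss(p) for p in pop])]  # rank by loss
        parents = pop[: k // 2]                        # keep the best half
        # "Breed": average two random parents per child, then add a little noise.
        i = rng.integers(0, len(parents), size=(k, 2))
        pop = (parents[i[:, 0]] + parents[i[:, 1]]) / 2
        pop += rng.normal(scale=noise, size=pop.shape)
    return pop[np.argmin([loss(p) for p in pop])]
```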
Coming closer to gradient descent (sketch in code below):

pick random point p in model space
loop:
  pick k random points {p_i} close to p
  p' <- argmin_{p_i} loss(p_i)
  if loss(p') < loss(p):
    p <- p'
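A compact sketch of this best-of-k variant: sample k nearby candidates each step and move to the best one only if it improves the loss:

```python
import numpy as np

def best_of_k_search(loss, p0, k=10, steps=2000, scale=0.1,
                     rng=np.random.default_rng(0)):
    """Each step: sample k nearby candidates, move to the best if it improves."""
    p = p0
    for _ in range(steps):
        candidates = p + rng.normal(scale=scale, size=(k,) + p.shape)
        best = min(candidates, key=loss)
        if loss(best) < loss(p):
            p = best
    return p
```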
Good, but doesn’t help with global/local minima.
For a function of one variable, the slope of the tangent line is the derivative. In higher dimensions, the gradient is the analog of the slope: it defines the tangent plane/hyperplane.
Gradient: $\nabla f(x, y) = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})$
Tangent hyperplane: $g(x) = \nabla f(p)^T x + c$
This gives the best linear approximation of f at the point p.
The direction of steepest ascent:
$ \begin{aligned} g(x) &= w^T x \\ &= \|w\| \|x\| \cos{\alpha} \end{aligned} $; restricting to unit vectors, $\|x\| = 1$, this becomes $\|w\| \cos{\alpha}$.
This is maximised when $\cos{\alpha} = 1$, i.e. $\alpha = 0$, so $x$ points in the same direction as $w = \nabla f(p)$. Hence the gradient is the direction of steepest ascent.
pick a random point p in model space
loop:
  p <- p - \eta \nabla loss(p)
Usually set η (step size, learning rate) between 0.0001 and 0.1.
Work out the partial derivatives of the loss function once (symbolically), then evaluate them at the current point on every step.
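A sketch of gradient descent on the least-squares loss, with the partial derivatives worked out analytically (the same toy data as before):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, steps=5000):
    """Gradient descent on loss(w, b) = mean((Xw + b - y)^2)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        r = X @ w + b - y                 # residuals
        grad_w = (2 / n) * (X.T @ r)      # d loss / d w
        grad_b = (2 / n) * np.sum(r)      # d loss / d b
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
print(gradient_descent(X, y))
```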
Cons: only works for continuous model spaces with a differentiable (smooth) loss; only converges to a local minimum.
Pros: simple, cheap per step, and works very well in practice.
If the model is linear, you don’t actually need to search, you could just set partial derivatives equal to zero and solve.
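For linear least squares that closed-form solution can be computed with a standard solver, e.g. `np.linalg.lstsq`; appending a column of ones makes the last coefficient play the role of b:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])

# Append a constant-1 column so the last coefficient acts as the bias b.
X1 = np.hstack([X, np.ones((len(X), 1))])
params, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = params[:-1], params[-1]
print(w, b)
```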
The loss function used for training doesn't have to be the same as the evaluation function you actually care about, because the evaluation function may not be smooth. Switching to the least-squares loss gives a smooth function, and then gradient descent can be applied.
Neural networks: learn a feature extractor together with the classifier.
A neuron has inputs (dendrites) and one output (axon). The simplified version for computers is the 'perceptron': a weighted sum of the inputs plus a bias, passed through a nonlinearity.
Nonlinearity:
sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$
ReLU $r(x) = \begin{cases} x &\text{if } x > 0 \\ 0 &\text{otherwise} \end{cases}$
Feedforward network: a multilayer perceptron – hidden layer(s) between input and output layers
Every edge has a weight, and the network learns by adapting the weights.
It trains both feature extractor and linear model at once.
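A minimal forward pass for a one-hidden-layer feedforward network, where the hidden layer plays the role of the learned feature extractor and the output layer is the linear model; the layer sizes here are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, b1, W2, b2):
    """Hidden features = sigmoid(W1 x + b1); output = W2 features + b2."""
    h = sigmoid(W1 @ x + b1)   # learned feature extractor
    return W2 @ h + b2         # linear model on top of the features

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), np.zeros(4)   # 2 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)   # 4 hidden units -> 1 output
print(forward(np.array([1.0, 2.0]), W1, b1, W2, b2))
```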
Binary classification: a single output unit through a sigmoid gives the probability of the positive class.
Multiclass: one output per class, passed through a softmax to turn the outputs into class probabilities.
Stochastic gradient descent: compute the gradient on a single example (or a small batch) instead of the whole dataset, then take a step. Much cheaper per step, and the added noise can help escape local minima; see the sketch below.
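A sketch of SGD on the least-squares loss, using one randomly chosen example per step:

```python
import numpy as np

def sgd(X, y, eta=0.01, steps=20000, rng=np.random.default_rng(0)):
    """Stochastic gradient descent: one example per gradient step."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        j = rng.integers(n)               # pick one random example
        r = X[j] @ w + b - y[j]           # its residual
        w -= eta * 2 * r * X[j]           # gradient of (f(x_j) - y_j)^2 w.r.t. w
        b -= eta * 2 * r                  # gradient w.r.t. b
    return w, b
```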
For complex models, symbolic and numeric methods for computing the gradient are expensive. Use backpropagation:
For a feedforward network, backpropagation computes the derivative of the loss with respect to each weight by applying the chain rule backwards through the layers; a sketch follows.
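A sketch of backpropagation for a one-hidden-layer network with a squared-error loss on a single example, applying the chain rule backwards from the output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Return the gradients of (output - y)^2 w.r.t. all weights and biases."""
    # Forward pass, keeping the intermediate values.
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    out = W2 @ h + b2                      # network output
    # Backward pass: chain rule from the loss back to each parameter.
    d_out = 2 * (out - y)                  # dL/d_out
    dW2 = np.outer(d_out, h)               # dL/dW2
    db2 = d_out                            # dL/db2
    d_h = W2.T @ d_out                     # dL/dh
    d_z1 = d_h * h * (1 - h)               # dL/dz1 (sigmoid derivative)
    dW1 = np.outer(d_z1, x)                # dL/dW1
    db1 = d_z1                             # dL/db1
    return dW1, db1, dW2, db2
```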
Uses a kernel to expand the feature space
Margin: the distance from the decision boundary (line) to the nearest positive and negative points; we want the boundary for which this is as big as possible.
Support vectors: the points that the margin just touches
The support vector machine tries to find this line.
Objective of SVM: maximize the margin, i.e. minimize $\|w\|$ subject to every point being on the correct side of the margin, $y_i (w^T x_i + b) \geq 1$.
For the loss, two options: keep it as a constrained optimization problem (hard margin), or use the hinge loss $\max(0, 1 - y_i(w^T x_i + b))$ plus a penalty on $\|w\|$ (soft margin), which can be minimized with gradient descent.
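A sketch of the soft-margin option, the hinge loss plus an L2 penalty on w; the constant C trading off the two terms is illustrative:

```python
import numpy as np

def svm_loss(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: ||w||^2 + C * mean hinge loss.
    Labels y are assumed to be in {-1, +1}."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return w @ w + C * np.mean(hinge)
```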