Intelligent Systems

Table of Contents

Machine Learning

design of a learning element is affected by:

offline learning: learn based on some data, then apply results to situation

feedback types (get input, make decision, learn based on feedback):

image

Learning problems

Classification

Regression

Ranking

Collaborative filtering

Clustering

Methodology

Data

labeled instances (like spam, not spam)

randomly allocate to data sets.

features: attribute-value pairs characterizing each x

Experimentation

experimentation cycle:

  1. select hypothesis, tune hyperparameters on held-out/validation set
  2. compute accuracy on the test set (never “peek” at the test set itself)
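the split sizes below (70/15/15) are an assumption, but here's a quick sketch of randomly allocating labeled instances to the three sets:

```python
# Sketch: randomly allocate labeled instances (like spam / not spam)
# to training / validation / test sets. The 70/15/15 split is made up.
import random

instances = [(f"doc{i}", "spam" if i % 3 == 0 else "ham") for i in range(100)]
random.seed(42)
random.shuffle(instances)

train = instances[:70]
validation = instances[70:85]   # tune hyperparameters here
test = instances[85:]           # never peek until the final evaluation

print(len(train), len(validation), len(test))   # 70 15 15
```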

Evaluation

accuracy – fraction of instances predicted correctly

Cross validation:

Cross validation diagram

create a confusion matrix (TODO: there’s a diagram for this but not in the old slides)
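a small sketch of both evaluation ideas on made-up spam labels — accuracy as the fraction predicted correctly, and a confusion matrix as counts of (true, predicted) pairs:

```python
# Sketch: accuracy and a 2x2 confusion matrix for spam/ham labels.
# The predictions are made up for illustration.
from collections import Counter

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# accuracy: fraction of instances predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# confusion matrix: count of each (true label, predicted label) pair
confusion = Counter(zip(y_true, y_pred))

print(accuracy)                    # 4 of 6 correct
print(confusion[("spam", "ham")])  # spam misclassified as ham: 1
```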

Machine Learning Steps:

  1. choose the features
  2. choose the model class
  3. choose a search method

Choose the features

create a feature space: features on the x-y axes, each point is an individual data instance, and classification would be a color scheme.

Feature space graph

Inductive learning method

Occam’s razor: “one should not increase, beyond what is necessary, the number of entities required to explain anything” basically, choose the simplest option.

Classifying with naive Bayes

Binary classification

calculation for Bayesian classifier: $P(C|F_1,…,F_n)$

using Bayes’ theorem:

$P(C|F_1,…F_n)=\frac{P(C)P(F_1,…,F_n|C)}{P(F_1,…,F_n)}$

rewrite the numerator of the equation:

$P(C)P(F_1,…,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,…,F_n | C, F_1, F_2, F_3)$

that uses the chain rule. but it’s too computationally expensive. so naively assume conditional independence:

$P(F_i | C, F_j) = P(F_i | C)$

This simplifies the formula to:

$P(C)P(F_1,…,F_n \mid C) = P(C) \prod_{i=1}^{n} P(F_i \mid C)$

image

Laplace smoothing helps with really small probabilities (it keeps unseen feature counts from zeroing out the whole product). naive Bayes often works well in practice, sometimes even when the independence assumptions are badly violated. classification is about predicting the correct class label, not about accurately estimating probabilities.
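a toy sketch of the simplified formula with Laplace smoothing — the spam/ham documents are made up, and the classifier just picks the class with the larger $P(C)\prod_i P(F_i \mid C)$:

```python
# Naive Bayes sketch with Laplace smoothing (toy spam data, made up).
# Score each class by P(C) * prod_i P(F_i | C); smoothing adds 1 to
# every word count so unseen words don't make the product zero.
from collections import Counter

docs = [(["win", "money", "now"], "spam"),
        (["win", "prize"],        "spam"),
        (["meeting", "tomorrow"], "ham"),
        (["money", "meeting"],    "ham")]

vocab = {w for words, _ in docs for w in words}
word_counts = {c: Counter() for c in ("spam", "ham")}
class_counts = Counter()
for words, c in docs:
    class_counts[c] += 1
    word_counts[c].update(words)

def posterior(words, c):
    total = sum(word_counts[c].values())
    p = class_counts[c] / sum(class_counts.values())      # prior P(C)
    for w in words:
        # Laplace smoothing: +1 in the numerator, +|vocab| below
        p *= (word_counts[c][w] + 1) / (total + len(vocab))
    return p

scores = {c: posterior(["win", "money"], c) for c in ("spam", "ham")}
print(max(scores, key=scores.get))   # spam
```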

Classifying with K-nearest neighbor

k nearest neighbor classification: find the k most similar cases, copy the majority label

e.g. to classify an unlabeled document, find the most similar document and copy its label:

K-nearest neighbor example

the k helps get rid of noise which would lead to misclassification
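a minimal sketch of the idea — the 2-D points and labels are made up, distance is Euclidean, and the k nearest neighbors vote:

```python
# k-NN sketch: sort labeled instances by Euclidean distance to the
# query point, then copy the majority label of the k nearest ones.
from collections import Counter
import math

labeled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
           ((4.0, 4.0), "B"), ((4.2, 4.1), "B"), ((3.9, 3.8), "B")]

def classify(x, k=3):
    by_dist = sorted(labeled, key=lambda p: math.dist(x, p[0]))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]   # majority label among the k

print(classify((4.0, 3.9)))   # B
print(classify((1.1, 0.9)))   # A
```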

Linear classifier

linear classifier: come up with a line that divides feature space, use that for prediction.

works for stuff like $x \lor y$, but not if we want $x \oplus y$ or other things that are not linearly separable.
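a minimal perceptron sketch makes the separability point concrete (the training loop, learning rate, and epoch count are all assumptions): it learns OR fine, but no amount of training finds a line for XOR:

```python
# Minimal perceptron sketch: a linear classifier predicts 1 if
# w.x + b > 0. OR is linearly separable; XOR is not, so the final
# weights can never classify all four XOR cases correctly.
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train(data, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)          # perceptron update rule
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w, b = train(OR)
print(all(predict(w, b, x) == y for x, y in OR))    # True
w, b = train(XOR)
print(all(predict(w, b, x) == y for x, y in XOR))   # False
```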

you can build a design matrix of all the different features you want to include. here’s an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:

Design matrix

if you go to more dimensions (so more features in a design matrix), you need hyperplanes.

hyperplanes sound fancy af but a hyperplane is just the higher-dimensional version of a line: a flat surface one dimension lower than the space it sits in. for example, this is how you would use a plane in three dimensions to separate points:

3D hyperplane example

Support vector machine

k(x,y): “distance” function between instances x and y

SVMs create a linear separating hyperplane

they can embed that hyperplane into a higher dimensional domain using the kernel trick – so long as k is a kernel, you don’t have to explicitly compute the design matrix (that drastically reduces the computation)

try to achieve a good margin – a large separation from the line for both classes (so reduce the blur between two classes, make a clear distinction).

watch out for over-fitting – you might end up with the model being trained extremely specifically to your data.

so we have a lot of ways to put a line on a graph. but how do you choose the right one?

Regression

train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a new value based on a function.

assign a value for each of the points in the feature space.

evaluating regression – what’s a good model for a regression?

you use an error function (the difference between predicted and real values, square to avoid negatives):

$error(p) = \sum_i (f_p (x_i) - y_i)^2$

Error graph

in this example, each of the possible lines is represented by two parameters (s the slope, b the y-intercept), in the left graph. those parameters can be plotted on 2D axes, on the right graph.

Feature and model space

Then you can take those points in the right graph, and plot their respective error rates (from the error function above) on the z axis, so you end up with a 3D graph – an error surface:

Error surface

now the point is to find the one with the lowest error (the lowest point in the surface, colored green). and that’s what calculus is for, specifically differentiation.

if you’ve never done calculus, it’s not easy to explain, but basically taking the derivative means you’re looking for the slope of a certain function at some point in that function. if you set the derivative to zero, you’re looking for the point where the slope is zero – specifically the minimum or maximum of a function.

quick example: if you have a function $y = x^2 + 2$, there’s a minimum at x = 0. you may know that by intuition, but you can also take the first derivative ($y’ = 2x$), set $y’$ equal to zero, and solve for x – you’ll get the same result. it’s trivial with a simple function, but once you get into cubic and quartic functions, you can’t really do it by intuition.
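you can sanity-check that example numerically — a central difference approximates the slope, and it agrees with $y’ = 2x$ (the helper function here is just for illustration):

```python
# Numeric check of the example: f(x) = x^2 + 2 has derivative
# f'(x) = 2x, so the slope is 6 at x = 3 and 0 at the minimum x = 0.
def f(x):
    return x**2 + 2

def numeric_slope(f, x, h=1e-6):
    # central difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(numeric_slope(f, 3.0)))   # 6, matching 2*3
print(round(numeric_slope(f, 0.0)))   # 0 at the minimum
```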

so a derivative $f’(x)$ of a function $f(x)$ gives the slope of $f(x)$ at any point in the domain (where $f(x)$ has a value $y$).

the rules for taking derivatives (don’t need to memorize these, they will be provided on the exam):

Derivative rules

the problem: setting the derivative to zero and solving analytically only works for simple models, so for complicated error surfaces you need a search method instead.

Gradient descent

you use a variant of the hill climbing algorithm

the steps:

  1. pick random parameters s (slope), b (intercept)
  2. repeat:
     1. compute the gradient
     2. move a small step in the opposite direction

but there may be multiple zero points – local optima. there’s no guarantee of convergence.
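the steps above can be sketched for fitting a line $y = sx + b$ with the squared-error function from earlier (the data points and learning rate are made up):

```python
# Gradient descent sketch: fit y = s*x + b by minimizing
# error(s, b) = sum_i (s*x_i + b - y_i)^2. Data is generated from
# y = 2x + 1, so the parameters should converge near s=2, b=1.
import random

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

random.seed(0)
s, b = random.random(), random.random()   # step 1: random parameters
lr = 0.01                                 # small step size

for _ in range(5000):                     # step 2: repeat
    # partial derivatives of the squared error w.r.t. s and b
    ds = sum(2 * (s * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (s * x + b - y)     for x, y in zip(xs, ys))
    s -= lr * ds                          # move against the gradient
    b -= lr * db

print(round(s, 2), round(b, 2))           # close to 2.0 and 1.0
```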

Neural Networks

Overview

the difference between original machine learning and deep learning:

image

the original perceptron:

but chaining perceptrons doesn’t make them more interesting: the composed function collapses to a single linear one

image

So introduce an activation function instead of using a standard linear calculation. This puts the values in a certain range, and now the diagram looks like this:

image
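a tiny sketch of a single neuron with an activation function (the weights and inputs are made up; the sigmoid is one common choice):

```python
# One artificial neuron: weighted sum plus bias, then a nonlinearity.
# Without the sigmoid, stacking layers of these stays linear.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))   # squashes any value into (0, 1)

def neuron(weights, bias, inputs):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# z = 1.0*2.0 + (-2.0)*1.0 + 0.5 = 0.5, then sigmoid(0.5)
print(round(neuron([1.0, -2.0], 0.5, [2.0, 1.0]), 4))   # 0.6225
```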

Then you can build a feed-forward network with hidden layers between input and output:

image

Training neural networks

Loss function determines how close a given network is to being correct for an input. If you plot neural networks on x-y, and the amount of loss on z axis, you end up with a loss surface. Then you want to find a point in that surface where the loss is the lowest.

image image

You can find the low point in the error surface with gradient descent (differentiation).

Stochastic gradient descent:

  1. pick random weights w
  2. loop:
     * $w = w - r \nabla loss(w)$

where r is the learning rate. make sure to pick a good learning rate: too high (like 0.1) and it overshoots and ends up inaccurate; too low and training takes a long time.

so in summary:

how do you calculate the gradient for complex data?

backpropagation: if the system is a composition of modules, the gradient is the product of the gradient of each module with respect to its arguments

image

for a given input x, you do a forward pass (calculating the value of each module), then fill those values into the gradient calculation during the backward pass.

so like this, with x = 0:

image

neural networks are just a way of thinking about differentiable computation.
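a minimal sketch of that idea on an assumed two-module composition, $f(x) = \sin(x^2)$ — the gradient is the product of each module’s local gradient:

```python
# Backpropagation sketch on f(x) = sin(x^2), a composition of two
# modules. Forward pass computes each module's value; backward pass
# multiplies the local gradients (chain rule).
import math

def forward(x):
    a = x * x            # module 1: square
    y = math.sin(a)      # module 2: sine
    return a, y

def backward(x, a):
    dy_da = math.cos(a)  # local gradient of module 2 w.r.t. its input
    da_dx = 2 * x        # local gradient of module 1 w.r.t. x
    return dy_da * da_dx # chain rule: product of local gradients

a, y = forward(2.0)
grad = backward(2.0, a)
print(round(grad, 4))    # cos(4) * 4 ≈ -2.6146
```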

Autoencoders: a NN architecture

bottleneck architecture:

image

input should be as close as possible to the output. but it must pass through a small representation

autoencoders map data to a latent representation
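the bottleneck shape can be sketched with plain matrices (the sizes and random weights are assumptions; an actual autoencoder would train these so the reconstruction matches the input):

```python
# Bottleneck sketch: a 4-dimensional input is forced through a
# 2-dimensional latent representation, then mapped back to 4 dims.
# Weights here are random and untrained; only the shapes matter.
import random

random.seed(1)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

enc = rand_matrix(2, 4)   # encoder: 4 -> 2 (the small representation)
dec = rand_matrix(4, 2)   # decoder: 2 -> 4

x = [0.5, -1.0, 0.25, 2.0]
latent = matvec(enc, x)   # the latent representation
recon = matvec(dec, latent)

print(len(latent), len(recon))   # 2 4
```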

Trying it out

The promise of depth

if you have many dimensions, the gradient isn’t as easy to detect anymore. there’s a huge number of parameters.

solutions: