Never judge your performance on the training data (or you’ll fail the course and life).
The proportion of training to test data is not important; the absolute size of the test data is. Aim for at least 500 examples in the test data (ideally 10 000 or more).
e.g. k-nearest neighbours, which classifies a point based on the classification of its k nearest neighbours.
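As a minimal sketch of the point above (the dataset and k = 5 are arbitrary choices): fit k-NN on the training split only and judge it on the held-out test split.

```python
# Minimal sketch: fit k-NN on the training split, judge it only on the test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)            # classify by the 5 nearest neighbours
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))   # never report the training accuracy
```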
Don’t re-use test data:
For temporal data, you’ll probably want to keep the data ordered by time.
Which hyperparameters to try?
You still split your data, but in every run a different slice becomes the validation data. Then you average the results for the final result.
If it’s temporal data, you might want to do walk-forward validation, where you always expand your data slices forward in time.
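Both ideas, sketched with scikit-learn’s splitters (the model and data are placeholders; KFold and TimeSeriesSplit are the standard classes):

```python
# Minimal sketch: plain k-fold cross-validation vs. walk-forward splits for temporal data.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 4)            # toy data, purely illustrative
y = np.random.randint(0, 2, 100)
model = KNeighborsClassifier(n_neighbors=5)

# Every run a different slice is the validation data; average the scores at the end.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("cross-validated accuracy:", scores.mean())

# Walk-forward: the training slice only ever expands forward in time (data must be time-ordered).
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(f"train on [0..{train_idx[-1]}], validate on [{val_idx[0]}..{val_idx[-1]}]")
```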
It depends, just like in every class so far:
Metrics for a single classifier.
The margins give four totals: the actual number of each class present in the data, and the number of each class predicted by the classifier.
Also for a single classifier.
You can then calculate rates:
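For reference, the standard rates computed from the confusion-matrix counts TP, FP, TN, FN (textbook definitions, not specific to these notes):

$$\text{TPR (recall)} = \frac{TP}{TP+FN} \qquad \text{FPR} = \frac{FP}{FP+TN} \qquad \text{precision} = \frac{TP}{TP+FP} \qquad \text{accuracy} = \frac{TP+TN}{TP+FP+TN+FN}$$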
ROC (receiver-operating characteristics) space: plot the true positive rate against the false positive rate. The best classifier is in the top left corner.
Ranking classifier: also gives score of how negative/positive a point is.
Coverage matrix: shows what happens to TPR and FPR if we move the threshold from right to left (more or less identical to ROC space).
If we draw a line between two classifiers in ROC space, we can create a classifier for every point on that line by picking the output of one of the two at random: e.g. with 50/50 probability, we end up halfway between them. The area under the curve of classifiers we can create this way (the “convex hull”) is a good indication of quality: the bigger this area, the more useful classifiers we can achieve. This is a good way to compare classifiers under class or cost imbalance, when we’re unsure of our preferences.
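A sketch of tracing out the ROC curve of a ranking classifier and computing the area under it with scikit-learn (the labels and scores are made up; note this is the area under the raw curve, not the convex hull):

```python
# Minimal sketch: ROC curve and AUC for a classifier that outputs ranking scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # actual classes
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])   # how "positive" each point looks

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # sweep the threshold over all scores
print("FPR:", fpr)
print("TPR:", tpr)
print("area under the ROC curve:", roc_auc_score(y_true, y_scores))
```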
Loss function: mean squared errors ($\frac{1}{n} \sum_i (\hat{y_i} - y_i)^2$)
Evaluation function: root mean squared error ($\sqrt{\frac{1}{n} \sum_i (\hat{y_i} - y_i)^2}$)
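Both written out with numpy (toy numbers, purely illustrative):

```python
# Minimal sketch: MSE as the loss, RMSE as the matching evaluation function.
import numpy as np

y     = np.array([3.0, 5.0, 2.5, 7.0])   # true values
y_hat = np.array([2.5, 5.0, 4.0, 8.0])   # predictions

mse  = np.mean((y_hat - y) ** 2)
rmse = np.sqrt(mse)                       # back in the units of the target
print("MSE:", mse, "RMSE:", rmse)
```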
Bias: distance from true MSE (which is unknown) to the optimum MSE.
Variance: spread of different experiments’ MSE around the true MSE
specifically for k-NN regression: increasing k increases bias and decreases variance
Dartboard example:
Statistics tries to answer: can observed results be attributed to real characteristics of the models, or are they observed by chance?
If you see error bars, the author has to indicate what they mean – there’s no convention.
Standard deviation: measure of spread, variance
Standard error, confidence interval: measure of confidence
If the population distribution is normal, the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$ (since $\sigma$ is estimated from the sample, confidence intervals for the mean use the t distribution).
Re confidence intervals: the correct phrasing is “if we repeat the experiment many times, computing the confidence interval each time, the true mean would be inside the interval in 95% of those experiments”
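A sketch of that computation with scipy (the run results are made up; `stats.sem` and `stats.t.interval` are the relevant functions):

```python
# Minimal sketch: standard error and a 95% t-based confidence interval for the mean.
import numpy as np
from scipy import stats

runs = np.array([0.71, 0.74, 0.69, 0.73, 0.72, 0.75])   # e.g. accuracies of repeated experiments

mean = runs.mean()
sem  = stats.sem(runs)   # standard error of the mean: s / sqrt(n)
low, high = stats.t.interval(0.95, df=len(runs) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```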
Use statistics in ML to show confidence and spread.
Answer to question “what is the best ML method/model in general?”
Theorem: “any two optimization algorithms are equivalent when their performance is averaged across all possible problems”
i.e. you can’t say shit in general.
A few outs:
Principle: there is no single best learning method; whether an algorithm is good depends on the domain
Inductive bias: the aspects of a learning algorithm which, implicitly or explicitly, make it suitable for certain problems and unsuitable for others.
Simplest way - remove features for which values are missing. Maybe they’re not important, probably, hopefully.
Or remove instances (rows) with missing data. The problem: if the data wasn’t corrupted uniformly, removing rows with missing values changes the data distribution (for example, if certain people refuse to answer certain questions).
Generally, think about the real-world use case – can you also expect missing data there?
Guessing the missing data (“imputation”):
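The simplest form, sketched with scikit-learn’s SimpleImputer (the data and the mean strategy are just an example):

```python
# Minimal sketch: fill missing values with the column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

imputer = SimpleImputer(strategy="mean")   # "median" or "most_frequent" also work
print(imputer.fit_transform(X))            # NaNs replaced by the column means
```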
Are they mistakes?:
Can we expect them in production?
Watch out for MSE: it’s based on the assumption of normally distributed noise. If you get data with big outliers, it fucks up.
Even if the data is a table, you shouldn’t just use the columns as features as-is. Some algorithms work only on numeric features, some only on categorical ones, some on both.
Converting between categorical and numeric features:
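For example (hypothetical column names), one-hot encoding turns a categorical feature into several 0/1 numeric features, and binning goes the other way:

```python
# Minimal sketch: categorical -> numeric via one-hot encoding, numeric -> categorical via binning.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green"],
                   "size_cm": [10.2, 7.5, 9.9, 12.0]})

# Categorical -> numeric: one 0/1 column per category value.
print(pd.get_dummies(df, columns=["colour"]))

# Numeric -> categorical: cut the numeric range into labelled bins.
df["size_bin"] = pd.cut(df["size_cm"], bins=[0, 8, 11, 100], labels=["small", "medium", "large"])
print(df)
```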
Expanding features: adding extra features derived from existing ones (can improve performance). For example, when the data doesn’t fit a line but does fit a curve, you can add a derived feature x². If we don’t have any intuition for which extra features to add, just add all cross products, or apply functions like sin/log.
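A sketch of both variants: appending x² by hand when you suspect a quadratic relationship, or letting PolynomialFeatures add all cross products (the data is made up):

```python
# Minimal sketch: feature expansion by hand (x^2) and automatically (all degree-2 terms).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 5.0]])

# Hand-picked expansion: append the square of the first feature.
X_quad = np.hstack([X, X[:, [0]] ** 2])

# No intuition? Add all squares and cross products of degree 2.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly.shape)   # (3, 5): x1, x2, x1^2, x1*x2, x2^2
```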
Create a uniform scale (the three options below are sketched in code after the whitening steps):
Fit to [0,1]. Scales the data linearly: the smallest point becomes 0, the largest becomes 1: $x \leftarrow \frac{x - x_{min}}{x_{max} - x_{min}}$
Fit to a 1D standard normal distribution. Rescale the data so the mean becomes 0 and the standard deviation becomes 1, i.e. make it look like the data came from a standard normal distribution: $x \leftarrow \frac{x - \mu}{\sigma}$
Fit to multivariate standard normal distribution. If the data is correlated, you don’t end up with a spherical shape after normalising/standardising. So you have to choose a different basis (coordinate system) for the points.
Back to linear algebra - choose a basis $$B = \begin{bmatrix} c & d \end{bmatrix} = \begin{bmatrix} 1.26 & -0.3 \\ 0.9 & 0.5 \end{bmatrix}$$
Then, to convert a point from standard coordinates into this basis, multiply by $B^{-1}$; to convert coordinates in this basis back to the standard basis, multiply by $B$ (since the columns of $B$ are the new basis vectors).
Since inverting a matrix is computationally expensive, prefer orthonormal bases (the basis vectors are orthogonal to each other and have length 1), because then $B^{-1} = B^T$.
Steps:
So whitening means we choose new basis vectors for a coordinate system where the features are not correlated, and variance is 1 in every direction.
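A sketch of all three rescalings on the same toy data; the whitening basis here comes from an eigendecomposition of the covariance matrix, which is one of several valid ways to choose B:

```python
# Minimal sketch: normalisation to [0,1], standardisation, and whitening.
import numpy as np

X = np.random.multivariate_normal([2.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

# Fit to [0,1]: per feature, the smallest value becomes 0 and the largest becomes 1.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Fit to a 1D standard normal per feature: zero mean, unit standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Whitening: choose an orthonormal basis in which the features are uncorrelated,
# then rescale each direction to variance 1.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))   # columns of eigvecs = new basis
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)
print(np.cov(X_white, rowvar=False).round(2))                 # approximately the identity matrix
```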
Opposite of feature expansion - reducing the number of features in the data by deriving new features from the old ones, hopefully without losing essential information.
Good for efficiency, reducing variance of model performance, and visualisation.
Principal component analysis (PCA): whitening with some extra properties. After applying it, you throw away all but the first k dimensions and get a very good projection of the data down to k dimensions.
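A sketch with scikit-learn’s PCA; keeping k = 2 components is an arbitrary choice:

```python
# Minimal sketch: project the data down to its first k principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 4 original features

pca = PCA(n_components=2)               # keep only the first k = 2 dimensions
X_reduced = pca.fit_transform(X)        # centre, rotate to the principal axes, drop the rest
print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # variance preserved by each kept component
```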