Machine Learning

Tree models & ensembles

Tree models

Decision trees (categorical)

Decision trees work on both numerical and categorical features.

Standard decision tree learning algorithm (ID3, C4.5):

The algorithm in steps:

  1. start with a single unlabeled leaf
  2. loop until there are no unlabeled leaves:
    • for each unlabeled leaf l with segment S:
      • if the stop condition holds, label l with the majority class of S
        • stop when: all inputs are the same (or no features are left to split on), or all outputs are the same (all instances have the same class)
      • else split l on the feature F with the highest information gain $I_S(F)$

With categorical features, it doesn’t make sense to split on the same feature twice.
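As a rough illustration, a minimal ID3-style sketch in Python; the function names (`build_tree`, `information_gain`) and the dict-per-row representation are my own choices, not from the course or any library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, feature):
    # I_S(F): entropy of the segment minus the weighted entropy of its splits.
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    # rows: list of dicts mapping feature name -> categorical value.
    # Stop: all outputs the same, or no features left -> leaf with the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    children = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],   # never reuse a categorical feature
        )
    return (best, children)
```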

Regression trees (numeric)

For numeric features, split at a numeric threshold t.

Of course there’s a trade-off: more complicated decision trees lead to overfitting.

Pruning - for every split, ask whether the tree classifies better with or without the split (on validation data)

Using validation data: the test set is only for final testing, the validation set is for hyperparameter selection. If you want to control the search itself (e.g. deciding which splits to prune), split off part of the training data and use it as ‘validation’.
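A tiny sketch of such a split, assuming scikit-learn and dummy data (the split sizes are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)   # dummy data
# Hold out the test set first; it is only touched for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
```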

Label a leaf with its single remaining value, or with the mean of the values in its segment.

Instead of entropy, use variance reduction: $I_{S}(F) = \mathrm{Var}(S) - \sum_{i} \frac{|S_i|}{|S|} \mathrm{Var}(S_i)$
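A small sketch of picking the threshold t for one numeric feature by maximising this variance reduction; the function name and the use of NumPy are my own assumptions:

```python
import numpy as np

def best_threshold(x, y):
    """Return (threshold, gain) maximising variance reduction for one numeric feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent_var = np.var(y)
    best_t, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        weighted = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
        gain = parent_var - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```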

Generalization hierarchy

(figure: generalization hierarchy)

Ensembling methods

Bias and variance: bias is systematic error from a model that is too simple or makes the wrong assumptions (underfitting); variance is error from sensitivity to the particular training sample (overfitting).

(figure: bias and variance)

Real-life example:

Bootstrapping (from methodology 2): resample the dataset with replacement to create new datasets of the same size as the original.

Ensembling is: combining multiple models into a single, stronger model.

Bagging reduces variance, boosting reduces bias.

Bagging

Bagging: bootstrap aggregating. Train one model on each bootstrapped dataset and combine their predictions (e.g. by majority vote or averaging).

Random forest: bagging with decision trees (usually also subsampling the features considered at each split)
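A minimal bagging sketch, assuming NumPy arrays, integer class labels and scikit-learn decision trees as base models; the function names are illustrative. With feature subsampling inside each tree, this would become a random forest:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                 # bootstrap: n rows, with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])     # shape (n_models, n_samples)
    # Majority vote per sample (assumes non-negative integer class labels).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```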

Boosting

Train some classifier $m_0$, then iteratively train more classifiers: increase the weights of instances misclassified by the current classifier, and train the next classifier on the reweighted data.

Weighted loss function: $loss(\theta) = \sum_{i} w_{i} (f_{\theta}(x_i) - t_i)^2$

Or resample the data according to the weights: an instance’s weight determines how likely it is to end up in the resampled dataset.
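A tiny sketch of both options, assuming NumPy arrays; the function names are mine:

```python
import numpy as np

def weighted_squared_loss(pred, targets, weights):
    # loss(theta) = sum_i w_i * (f_theta(x_i) - t_i)^2
    return np.sum(weights * (pred - targets) ** 2)

def resample_by_weights(X, y, weights, seed=0):
    # Instance i is drawn with probability w_i / sum(w); same dataset size, with replacement.
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]
```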

Boosting works even if each model classifies only slightly better than random.

AdaBoost (binary classifiers)

TODO: think of a better way to explain this for the exam

Each model fits a reweighted dataset, so each model effectively minimises its own weighted loss.
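A sketch of discrete AdaBoost with decision stumps as weak learners, assuming labels in $\{-1, +1\}$; the names and the stop-at-error-0.5 rule follow the standard formulation rather than the course’s exact notation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(X)
    w = np.full(n, 1 / n)                       # start with uniform instance weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)  # weighted error of this round's model
        if err >= 0.5:                          # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified instances
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted vote of the weak learners.
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```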

Gradient boosting

Boosting for regression models.

Fit the next model to the residuals of the current ensemble.

Each model fits the (pseudo-)residuals of the ensemble so far; the ensemble optimises a global loss, even if the individual models don’t optimise a well-defined loss of their own.
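A sketch of gradient boosting for regression with squared loss, where the pseudo-residuals are simply the residuals (targets minus the current ensemble prediction); tree depth and learning rate are arbitrary choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1):
    base = np.mean(y)                       # m0: a constant prediction
    pred = np.full(len(y), base)
    models = []
    for _ in range(n_rounds):
        residuals = y - pred                # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)        # add the new model to the ensemble
        models.append(tree)
    return base, models

def gradient_boost_predict(base, models, X, lr=0.1):
    return base + lr * sum(m.predict(X) for m in models)
```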

Stacking

When you want to combine some number of existing models into a single model. Simply compute outputs of the models for every point in the dataset, and add them to the dataset as a column. Then, train a new model (‘combiner’) on this extended data (usually a logistic regression). If NNs are used for the ensemble, the whole thing turns into a big NN.