Decision trees work on both numerical and categorical features.
Standard decision tree learning algorithms (ID3, C4.5):
The algorithm in steps: start with all the data at the root; pick the feature (and threshold, for numeric features) whose split gives the highest information gain; split the data accordingly and recurse on each subset; stop when a node is pure (or another stopping criterion is met) and label the leaf with the majority class.
With categorical features, it doesn’t make sense to split on the same feature twice.
For numeric features, split at a numeric threshold t (unlike a categorical feature, the same numeric feature can be split again further down the tree, at a different threshold).
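A minimal sketch of the split-selection step (my own illustration, not the lecture’s code; all function names are made up): compute the information gain of a candidate split, and for a numeric feature try thresholds midway between consecutive sorted values.

```python
# Sketch of split selection by information gain (illustrative only).
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

def best_numeric_threshold(feature, labels):
    """Try thresholds halfway between consecutive sorted feature values."""
    values = np.sort(np.unique(feature))
    best_t, best_gain = None, 0.0
    for t in (values[:-1] + values[1:]) / 2:
        gain = information_gain(labels, feature <= t)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Example: a single numeric feature with a clean cut at 1.5
x = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_numeric_threshold(x, y))  # threshold 1.5, gain 1.0
```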
Of course there’s a trade-off: complicated decision trees lead to overfitting.
Pruning - for every split, ask whether the tree classifies better with or without the split (on validation data)
Using validation data: the test set is only for final evaluation, the validation set is for hyperparameter selection. If you want to control a search (e.g. decide where to prune), split the training data and use part of it as the ‘validation’ set.
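A rough sketch of validation-based (reduced-error) pruning, assuming a toy `Node` class with `feature`, `threshold`, `left`, `right` and a majority-class `label`; all names here are hypothetical, not the lecture’s code.

```python
# Reduced-error pruning on a toy tree structure (illustrative only).
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.label = label                       # majority class of the training data at this node

def predict(node, x):
    if node.left is None:                        # leaf node
        return node.label
    child = node.left if x[node.feature] <= node.threshold else node.right
    return predict(child, x)

def accuracy(root, X, y):
    return np.mean([predict(root, x) == t for x, t in zip(X, y)])

def prune(node, X_val, y_val, root):
    """Bottom-up: collapse a split into a leaf unless the split improves validation accuracy."""
    if node.left is None:
        return
    prune(node.left, X_val, y_val, root)
    prune(node.right, X_val, y_val, root)
    saved = (node.left, node.right)
    node.left = node.right = None                # temporarily turn the node into a leaf
    without_split = accuracy(root, X_val, y_val)
    node.left, node.right = saved                # restore and measure with the split
    with_split = accuracy(root, X_val, y_val)
    if without_split >= with_split:              # keep the split only if it actually helps
        node.left = node.right = None

# Usage: prune(tree, X_val, y_val, root=tree), where (X_val, y_val) is held-out training data.
```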
Regression trees: label each leaf with its single remaining element, or with the mean of the target values in that leaf.
Instead of entropy, use $I_{S}(V) = Var(S) - \sum_{i} \frac{| S_i |}{|S|} Var(S_i)$
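A tiny numeric check of this variance-reduction criterion (illustrative, my own example):

```python
# Variance reduction I_S(V) = Var(S) - sum_i |S_i|/|S| * Var(S_i), on a toy example.
import numpy as np

def variance_reduction(targets, masks):
    """targets: target values in S; masks: boolean masks selecting the subsets S_i."""
    n = len(targets)
    weighted_child_var = sum(mask.sum() / n * np.var(targets[mask]) for mask in masks)
    return np.var(targets) - weighted_child_var

t = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
left = np.array([True, True, True, False, False, False])
print(variance_reduction(t, [left, ~left]))   # ~4.0: the split cleanly separates low from high targets
```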
Bias and variance:
Real-life example:
Bootstrapping (from methodology 2): resample the dataset with replacement to get multiple new datasets of the same size.
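A minimal sketch of bootstrap resampling with NumPy (illustrative):

```python
# Bootstrap resampling: draw n rows with replacement to get a new dataset of the same size.
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)            # toy dataset with 10 rows
y = np.arange(10)

idx = rng.integers(0, len(X), size=len(X))  # indices sampled with replacement
X_boot, y_boot = X[idx], y[idx]             # some rows repeat, some are left out (~37% on average)
```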
Ensembling is: combining several models into a single, stronger model.
Bagging reduces variance, boosting reduces bias.
Bagging: bootstrap aggregating. Train one model on each bootstrapped dataset and combine their predictions (majority vote for classification, average for regression).
Random forest: bagging with decision trees, where each split additionally considers only a random subset of the features.
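A bagging sketch using scikit-learn decision trees (illustrative; `bagging_fit`/`bagging_predict` are made-up names). A random forest would additionally restrict each split to a random feature subset (the `max_features` option, or just use `RandomForestClassifier`).

```python
# Bagging sketch: one decision tree per bootstrap sample, prediction by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])      # shape (n_models, n_samples)
    # majority vote per sample (assumes integer class labels)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```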
Boosting: train some classifier m0, then iteratively train more classifiers. Increase the weights of the instances misclassified by the current classifier, and train the next classifier on the reweighted data.
Weighted loss function: $loss(\theta) = \sum_{i} w_{i} (f_{\theta}(x_i) - t_i)^2$
Or resample the data according to the weights: the weight determines how likely an instance is to end up in the resampled dataset.
Boosting works even if each model classifies only slightly better than random.
Each model fits a reweighted dataset, so each model optimises its own reweighted loss rather than one shared global loss.
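A boosting-by-reweighting sketch (illustrative; the weight-update factor is a placeholder, not the exact AdaBoost formula):

```python
# Generic boosting-by-reweighting loop (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(X, y, n_models=10, upweight=2.0):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # start with uniform instance weights
    models = []
    for _ in range(n_models):
        m = DecisionTreeClassifier(max_depth=1)   # weak learner ("slightly better than random")
        m.fit(X, y, sample_weight=w)              # weighted loss; alternatively resample X with probabilities w
        wrong = m.predict(X) != y
        w[wrong] *= upweight                      # increase weights of misclassified instances
        w /= w.sum()                              # renormalise
        models.append(m)
    return models
```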
Boosting for regression models (gradient boosting):
Fit the next model to the residuals of the current ensemble.
Each model fits the (pseudo-)residuals of the current ensemble, and the ensemble as a whole optimises a global loss, even though the individual models don’t optimise a well-defined loss of their own.
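A residual-fitting sketch for regression with squared loss (illustrative; function names are made up):

```python
# Gradient-boosting-style regression: each new tree fits the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, n_models=50, lr=0.1):
    pred = np.zeros(len(y))                     # current ensemble prediction
    models = []
    for _ in range(n_models):
        residuals = y - pred                    # (pseudo-)residuals of the squared-error loss
        m = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred += lr * m.predict(X)               # the ensemble keeps improving the global squared loss
        models.append(m)
    return models

def boost_predict(models, X, lr=0.1):
    return lr * sum(m.predict(X) for m in models)
```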
Stacking: combine a number of existing models into a single model. Compute the outputs of the models for every point in the dataset and add them to the dataset as extra columns. Then train a new model (the ‘combiner’, usually a logistic regression) on this extended data. If the ensemble members are neural networks, the whole thing turns into one big NN.
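A stacking sketch with a logistic-regression combiner (illustrative; assumes the base models are already fitted and expose a `predict` method):

```python
# Stacking: add base-model outputs as extra feature columns, train a combiner on the extended data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_fit(base_models, X, y):
    extra = np.column_stack([m.predict(X) for m in base_models])  # one new column per base model
    X_ext = np.hstack([X, extra])                                  # extended dataset
    # (in practice, held-out / out-of-fold predictions are used here to avoid overfitting the combiner)
    return LogisticRegression().fit(X_ext, y)

def stack_predict(combiner, base_models, X):
    extra = np.column_stack([m.predict(X) for m in base_models])
    return combiner.predict(np.hstack([X, extra]))
```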