What even is probability?
Definitions:
Identities:
Maximum likelihood estimation: $\hat{\theta} = \argmax_{\theta} P(X | \theta)$
Fitting a normal distribution:
$$\begin{aligned} \hat{\mu}, \hat{\sigma} &= \argmax_{\mu, \sigma} P(X^1, X^2, \dots | \mu, \sigma) \\ &= \argmax_{\mu, \sigma} \prod_i N(X^i | \mu, \sigma) \end{aligned}$$
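As a sanity check, a minimal numpy sketch of this fit (the data array `X` below is made up): the maximum-likelihood estimates come out as the sample mean and the (biased) sample standard deviation.

```python
import numpy as np

# Made-up 1D data, assumed to come from a normal distribution.
X = np.array([2.1, 1.9, 3.2, 2.8, 2.4, 1.7, 2.9])

# Maximum-likelihood estimates for a normal distribution:
mu_hat = X.mean()             # sample mean
sigma_hat = X.std(ddof=0)     # biased sample std (the ML estimate, not the n-1 version)

# Log-likelihood of the data under the fitted model.
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma_hat**2)
                 - (X - mu_hat)**2 / (2 * sigma_hat**2))
print(mu_hat, sigma_hat, log_lik)
```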
Probabilistic classifiers return a probability over all classes, given features.
This is a generative classifier – learn P(X|Y) and P(Y), then apply Bayes’ rule.
Choose the class y that maximises P(y|x), the probability of the class given the data, then expand using Bayes’ rule. The denominator doesn’t affect which class gets the highest probability, so it suffices to fit models for P(x|y) and P(y) and maximise the quantity P(x|y)P(y).
$$\begin{aligned} c(x) &= \argmax_{y \in \{pos,neg\}}P(y|x) \\ &= \argmax_{y \in \{pos,neg\}}\frac{P(x|y) P(y)}{P(x)} \\ &= \argmax_{y \in \{pos,neg\}}P(x|y)P(y) \end{aligned}$$
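A rough sketch of this generative recipe, assuming one Gaussian per class for P(x|y); the toy 1D data is invented.

```python
import numpy as np

# Toy 1D training data per class (made up).
X_pos = np.array([2.0, 2.5, 3.1, 2.8])
X_neg = np.array([-1.0, -0.3, 0.2, -0.8])

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Fit P(x|y) with one Gaussian per class, and P(y) with class frequencies.
mu_p, sd_p = X_pos.mean(), X_pos.std()
mu_n, sd_n = X_neg.mean(), X_neg.std()
prior_p = len(X_pos) / (len(X_pos) + len(X_neg))
prior_n = 1 - prior_p

def classify(x):
    # argmax over classes of P(x|y) P(y); the denominator P(x) is ignored.
    score_p = normal_pdf(x, mu_p, sd_p) * prior_p
    score_n = normal_pdf(x, mu_n, sd_n) * prior_n
    return "pos" if score_p > score_n else "neg"

print(classify(2.2), classify(-0.5))
```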
Bayes classifier:
Naive Bayes:
Laplace smoothing:
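A minimal sketch, assuming binary (Bernoulli) word features, of naive Bayes with Laplace (add-one) smoothing; the tiny dataset is invented.

```python
import numpy as np

# Invented binary feature matrix (rows = documents, columns = word present/absent)
# and labels (1 = pos, 0 = neg).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

def fit_naive_bayes(X, y, alpha=1.0):
    """Bernoulli naive Bayes with Laplace (add-alpha) smoothing."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            # P(feature_j = 1 | class c), smoothed so no estimate is ever 0 or 1.
            "theta": (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha),
        }
    return params

def predict(params, x):
    # Naive assumption: features are independent given the class,
    # so log P(x|y) is a sum of per-feature log probabilities.
    scores = {}
    for c, p in params.items():
        log_lik = np.sum(x * np.log(p["theta"]) + (1 - x) * np.log(1 - p["theta"]))
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)

model = fit_naive_bayes(X, y)
print(predict(model, np.array([1, 0, 0])))
```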
A discriminative classifier: learn function for P(Y|X) directly.
The logistic sigmoid: $\sigma(t) = \frac{1}{1+e^{-t}} = \frac{e^t}{1+e^t}$
Classifier: compute linear function, apply logistic sigmoid to result $c(x) = \sigma(w \cdot x + b)$
Loss function: log loss ($-\log{P(class |features)}$)
$$\begin{aligned} \argmax_q \prod_{C,x}q_x(C) &= \argmax_{q}\log{\prod_{C,x}q_x(C)} \\ &= \argmin_{q}-\log{\prod_{C,x} q_x (C)} \\ &= \argmin_q \sum_{C,x} - \log{q_x (C)} \\ &= \argmin_q - \sum_{x \in X_P} \log{q_x (P)} - \sum_{x \in X_N} \log{q_x (N)} \end{aligned}$$
where:
Problem: if the classes are linearly well separable, there are many suitable classifiers, and logistic regression has no reason to prefer one over another.
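A minimal gradient-descent sketch of this discriminative classifier, minimising the log loss above; the toy data, learning rate and iteration count are invented.

```python
import numpy as np

# Toy 2D data: two classes (1 = pos, 0 = neg); everything here is invented.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, size=(50, 2)),
               rng.normal(-1, 1, size=(50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    q = sigmoid(X @ w + b)                       # q_x(pos) for every point
    # Log loss: -sum log q_x(P) over positives, -sum log q_x(N) over negatives.
    loss = -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))
    # Gradient of the log loss w.r.t. the linear parameters.
    grad_w = X.T @ (q - y)
    grad_b = np.sum(q - y)
    w -= lr * grad_w / len(X)
    b -= lr * grad_b / len(X)

print(loss, w, b)
```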
The relation between encoding information and probability theory.
Prefix-free trees assign a prefix-free code to a set of outcomes. The benefit is that no delimiters are necessary in the bit/codeword string.
Arithmetic coding: if we allow L(x) (the length of the code for x) to take non-integer values, we can equate codes with probability distributions.
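A small sketch using a Huffman-style construction (one way to build a prefix-free code); for the invented dyadic distribution below, the codeword lengths L(x) match $-\log_2{p(x)}$ exactly.

```python
import heapq
from math import log2

# Invented distribution over four outcomes.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Build a Huffman tree: repeatedly merge the two least probable nodes,
# prepending a bit to the codes of everything under each merged node.
heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)
    p2, _, codes2 = heapq.heappop(heap)
    merged = {s: "0" + c for s, c in codes1.items()}
    merged.update({s: "1" + c for s, c in codes2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
codes = heap[0][2]

for sym, prob in p.items():
    print(sym, codes[sym], len(codes[sym]), -log2(prob))
```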
Entropy of distribution: expected codelength of an element sampled from that distribution.
$$\begin{aligned} H(p) &= E_p L(x) \\ &= \sum_{x \in X} p(x)L(x) \\ &= - \sum_{x \in X} p(x) \log{p(x)} \end{aligned}$$
Cross entropy: expected codelength if we use q, but data comes from p.
$H(p, q) = E_p L^q(x) = - \sum_{x \in X} p(x) \log{q(x)}$
Kullback-Leibler divergence: expected difference in codelength between p and q; in other words, the difference in expected codelength.
$KL(p,q) = H(p,q) - H(p) = - \sum_{x \in X} p(x) \log{\frac{q(x)}{p(x)}}$
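A few lines of numpy checking these identities on an invented pair of discrete distributions (logs in base 2, so the units are bits).

```python
import numpy as np

# Two invented distributions over the same four outcomes.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

H_p = -np.sum(p * np.log2(p))            # entropy of p
H_pq = -np.sum(p * np.log2(q))           # cross entropy: code with q, data from p
KL_pq = -np.sum(p * np.log2(q / p))      # KL divergence

print(H_p, H_pq, KL_pq, H_pq - H_p)      # KL(p, q) == H(p, q) - H(p)
```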
The maximum-likelihood model is the one under which the observed data has the highest probability; it selects the model that is most suitable given the observed data.
(Log) likelihood: what we maximise to fit a probability model
Loss: what we minimise to find a machine learning model
Has a mean μ and standard deviation σ.
Not a probability function, but a probability density function: only intervals have a probability, so to find a probability you integrate the probability density function over the interval.
Definition: $N(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp{[ -\frac{1}{2\sigma^2} (x-\mu)^2 ]}$
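A quick numerical check of the interval point above (the parameters and interval are arbitrary): integrating the density over $[\mu - \sigma, \mu + \sigma]$ recovers the familiar ≈ 0.68.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu, sigma = 0.0, 1.0
xs = np.linspace(mu - sigma, mu + sigma, 10_001)
dx = xs[1] - xs[0]
# Approximate the integral of the density over [mu - sigma, mu + sigma].
prob = np.sum(normal_pdf(xs, mu, sigma)) * dx
print(prob)   # roughly 0.68
```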
Maximum likelihood for the mean:
$\begin{aligned} \argmax_{\theta} \ln{P(X | \theta)} &= \argmax_{\theta} \ln{\prod_{x \in X} p(x|\theta)} \\ &= \argmax_{\theta} \sum_{x}{\ln{p(x|\theta)}} &&\text{(a product inside a log is a sum outside the log)}\\ &= \argmax_{\mu, \sigma} \sum_{x}\ln{\bigg( \frac{1}{\sqrt{2\pi\sigma^2}} \exp \Big[ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big] \bigg)} &&\text{(fill in the formula)}\\ &= \argmax_{\mu, \sigma} \sum_{x} \Big[ \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \Big] \\ \frac{\partial \ln P(X|\theta)}{\partial \mu} &= \sum_{x} \frac{\partial \big[ \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \big]}{\partial \mu} &&\text{(take the derivative because we want to maximise)}\\ &= -\frac{1}{2\sigma^2} \sum_{x} \frac{\partial (x-\mu)^2}{\partial \mu} \\ &= \frac{1}{\sigma^2} \sum_{x} (x-\mu) \\ \frac{1}{\sigma^2} \sum_{x} (x-\mu) &= 0 &&\text{(because the max/min is where the derivative is 0)} \\ \sum_{x} (x-\mu) &= 0 \\ -\mu n + \sum_{x} x &= 0 \\ \mu &= \frac{1}{n} \sum_{x} x &&\text{(i.e. the arithmetic mean)} \end{aligned}$
The implication is that the maximum-likelihood estimator for the mean of a normal distribution is the mean of the data.
For a regression $y = x^{T} w + b + E$, where $E \sim N(0, \sigma)$
If we want to maximise the likelihood of the parameters of the line, given some data:
$\begin{aligned} \argmax_{w,b} P(Y|X,w,b) &= \argmax_{w,b} \ln{\prod_{i} N(y_i | x_{i}^{T} w + b, \sigma)} \\ &= \argmax_{w,b} \sum_{i} \ln{\bigg( \frac{1}{\sqrt{2\pi\sigma^2}} \exp \Big[ -\frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2 \Big] \bigg)} &&\text{(just fill in the formula)}\\ &= \argmax_{w,b} -\sum_{i} \frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2 &&\text{(because the constant term doesn’t affect the argmax)}\\ &= \argmax_{w,b} -\frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2 &&\text{(because the stdev doesn’t impact the result)}\\ &= \argmin_{w,b} \frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2 &&\text{(which is the least-squares function)} \end{aligned}$
So that’s why least squares implicitly assumes normally distributed noise.
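A quick numeric illustration on synthetic data (all parameters invented): generate $y = x^{T} w + b + E$ with Gaussian noise and recover $w, b$ with closed-form least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from the assumed model y = x^T w + b + E, with E ~ N(0, sigma).
true_w, true_b, sigma = np.array([2.0, -1.0]), 0.5, 0.3
X = rng.normal(size=(200, 2))
y = X @ true_w + true_b + rng.normal(0, sigma, size=200)

# Least squares (= maximum likelihood under Gaussian noise), in closed form.
X_aug = np.hstack([X, np.ones((len(X), 1))])       # absorb b as an extra column
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = params[:2], params[2]
print(w_hat, b_hat)
```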
The formula: $N(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma |}} \exp \Big[ -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \Big]$
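A direct transcription of the formula into numpy (the test point and parameters are arbitrary).

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of the multivariate normal N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # Solve Sigma z = diff instead of explicitly inverting Sigma.
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.3, -0.2]), mu, Sigma))
```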
Basically, combine Gaussians to represent more complex shapes.
Example with three components:
Maximum likelihood: $\argmax_{w_{i}, \mu_{i}, \Sigma_{i}} \sum_{x} {\ln \sum_{i}{ w_{i} N(x | \mu_{i}, \Sigma_{i})}}$
Finding the maximum likelihood is hard if there are hidden (unobserved) variables that affect the variables that are in the dataset, for example the component assignments in a mixture model (you don’t know which component each point came from). The approach below can be used to fit any hidden-variable model.
Key insight: we can’t optimise θ and z at the same time, but given some θ we can compute P(z|x), and given z we can optimise θ.
Intuition:
Definition of “responsibility”: $r_{x}^{i} = \frac{P(z=i | x)}{\sum_{j} P(z=j | x)}$
Model parameters, given responsibilities:
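Putting the E-step (computing responsibilities) and M-step (re-estimating the parameters) together, a compact sketch of this procedure for a 1D two-component Gaussian mixture; all data, initial values and the iteration count are invented, and no convergence check is done.

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented 1D data drawn from two Gaussians.
X = np.concatenate([rng.normal(-2, 0.7, 150), rng.normal(3, 1.2, 100)])

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Initial guesses for the component weights, means, and stdevs.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[n, i] = P(z = i | x_n) under the current parameters.
    dens = np.stack([w[i] * normal_pdf(X, mu[i], sigma[i]) for i in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-fit each component to the data, weighted by its responsibilities.
    Nk = r.sum(axis=0)
    w = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (X[:, None] - mu)**2).sum(axis=0) / Nk)

print(w, mu, sigma)
```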