using the example of movie recommendations for users.
Recommendation using only explicit info:
embedding models:
to make a prediction, define that dot product of user vector and movie vector should be high if user would like the movie. this is a choice, but it’s a logical one.
multiplying UT with M gives matrix of rating predictions for every user/movie pair. so we want to take rating matrix R, and decompose it as product of two factors (“matrix factorization/decomposition”)
You get an optimisation problem: choose U and M st UT M ≈ R.
$\begin{aligned} \argmin_{U,M} &= | R - U^T M |{F}^{2} \ &= \sum{i,j} (R_{ij} - \lbrack U^T M \rbrack_{ij})^2 \ &= \sum_{i,j} (R_{ij} - u_{i}{T} m_j)2 \end{aligned}$
but, R is not complete, for most user/movie pairs we don’t know the rating. so, sometimes it’s better to only optimise for known ratings:
$\begin{aligned} \argmin_{U,M} \sum_{i,j \in R_{\text{known}}} (R_{ij} - u_{i}{T} m_j)2 \end{aligned}$
Alternating least squares - alternative to gradient descent
Stochastic gradient descent is usually more practical.
If we only have positive ratings, we have two options:
For user/movie biases, use an additive scalar, which is learned along with embeddings. One for each user, one for each movie, and one general bias over all ratings.
Make the biases and embeddings time dependent. e.g. cut time into a small number of chunks, learn separate embedding for each chunk.
When a user/movie is added to the database, we have no ratings, so matrix factorization can’t build an embedding. Have to rely on implicit feedback and side information.
Using implicit “likes” (e.g. movies watched but not rated, movies browsed, movies hovered over…)
Using side info (age, login, browser resolution…)
graph convolutional network:
Link prediction: assume graph is incomplete, try to predict which nodes should be linked. We can treat this like a matrix factorization problem. GCN: perform convolutions on embeddings, multiply them out to generate predictions, compare to training data, and backpropagate loss
Node classification: for each node, try to predict a label. With node embeddings, we can use a regular classifier, but how do we get good embeddings? GCN for classification: ensure embedding size of last layer is equal to number of classes, apply softmax activation to embeddings, interpret them as probabilities over classes
GCN issues:
inductive vs transductive learning: in transduction, learning is allowed to see features of all data, but labels of only training data. embedding models only support transductive learning; if we don’t know objects until after training, won’t have embedding vectors. like with graph models, we need to know the whole graph.
to evaluate matrix factorization, give training alg all users and movies, but withhold some ratings. same for links in a graph. for node classification, give the alg the whole graph, and table linking node ids to labels, but withhold some labels.