In today's post, we will explain a certain algorithm for matrix factorization models for recommender systems which goes by the name *Alternating Least Squares* (there are others, for example based on stochastic gradient descent). We will go through the basic ALS algorithm, as well as how one can modify it to incorporate user and item biases.

We will also go through the ALS algorithm for implicit feedback, and then explain how to modify *that* to incorporate user and item biases. The basic ALS model and the version from implicit feedback are discussed in many places (both online and in freely available research papers), but we aren't aware of any good source for *implicit ALS with biases*... hence, this post.

Recommendation engines are becoming ubiquitous in the modern consumer landscape. From Netflix's personalized movie recommendations to Amazon's raft of suggested items (New For Your, Recommendations for You in the Kindle Store, etc...), to Spotify's wildly popular personalized Discover Weekly Playlist, recommendation engines have become an essential tool for promoting content discovery and extending consumer engagement.

There has been a proliferation of algorithms in the data science community upon which recommender systems are built, but they all follow the basic central dogma that similar users like similar items. The hard part, of course, is deciding exactly what "similar" means and how to measure it. For Netflix, similar users are users who rate movies similarly. For Spotify, similar users are users who listen to the same kinds of music (a little thought will convince you that these two ideas, while similar, are not really the same).

The simplest type of recommender system, called a *collaborative filter*, is essentially a literal interpretation of "similar users like similar things" Goldberg1992. Once a metric for similarity is decided on, one just looks at the top rated items of the $k$ nearest neighbors (with respect to the similarity metric), filters out those items which the user of interest has already seen/consumed, and *voila*: personalized recommendations.

Collaborative filters have the advantages of being relatively easy to implement and relatively easy to interpret (e.g. we recommended movie A to user 1 because users 2, 3, and 4 all rated it highly and have similar ratings profiles to user 1). This being the Age of Data Science, though, it is no longer enough to have a reasonable explanation. We can actually measure how effectively a recommender system predicts the ratings of users on unseen items. Thus began Netflix's famous quest for better recommender systems, leading to the famous million dollar prize and eventual solution Bell2008 (by a team of experts working for several years).

Although the recommender system that won the Netflix prize was never put into production (it was a carefully tuned mix of many statistical models, and the marginal gain obtained by implementing all of the models was not worth the technical price compared to simply implementing a few of the best ones), the data science community, and the wider commercial world, learned without a doubt that, while collaborative filters are explainable and easily implementable, one can do significantly better if one is willing to flex a bit of mathematical muscle.

There were essentially two types of recommender systems in the final solution to the Netflix prize, and they have become the bread and butter of the current state of the art: matrix factorization models and restricted Bolzmann machines (RBMs). We will discuss matrix factorization models in this post.

We will be interested in two refinements of the basic matrix factorization model for recommendations: using implicit feedback, and using user and item biases.

It was realized early on, even for collaborative filters, that recommender systems work a lot better if one accounts for user and item biases.

Suppose, for example, that we are trying to predict movie ratings and

*Bob*is grouchy and rates items with an average of 2 stars.*Alice*is chipper and cheery and rates things with an average of 4 stars.

Clearly 3 stars from *Bob* is very different than 3 stars from *Alice*. If we keep track of the average movie rating for each user (the *user bias*), and just try to predict the difference from that average, the recommender system works *much* better.

Similarly, suppose

*Snakes on a Plane*gets an average rating (over all users) of 1 star.*The Shawshank Redemption*gets an average rating of 5 stars.

Then a 3 star rating is very different for *Snakes on a Plane* than for *The Shawshank Redemption*. If we keep track of the average user rating for each movie (the *item bias*), and just try to predict the difference from that average, the recommender system also works *much* better.

Obviously, one should keep track of both user and item biases. (We will explain the technical details below.)

One of the challenges of recommender systems in the wider commercial world is that one rarely has explicit ratings data (Netflix strongly encourages users to rate movies and leverage the tech built around those ratings).

Instead, user-item interactions like clicks, purchases, song listens (or fast-forwards), etc, are used as a proxy to indicate a preference or distaste for a particular item (possibly very weakly). If a user listens to a song only once, maybe it indicates that they didn't like the song. However, maybe they were busy or weren't paying attention and actually would like to hear the song again. If a user listens to a song 100 times, it is a safe bet that they like the song. But it is hard to draw a line between unknown preference and known preference.

This kind of indirect information about user-item preferences is known as *implicit feedback*, and many smart people have thought long and hard about how to deal with it. We won't delve much into the modeling of implicit feedback in today's post. It turns out to be relatively effective to model a user's preference on a scale of 0 (bad) to 1 (good) instead of modeling the user's raw number of interactions (or whatever number is actually being measured).

We can interpret the number of user-item interations (song listens, for example) as a measure of our confidence in our model's prediction for the user's preference of the item.

Below, we'll step through the details of how a matrix factorization model can be used to deal with implicit feedback. The basic algorithm we discuss is from the seminal paper Hu2008.

The most basic matrix factorization model for recommender systems models the rating $\hat{r}$ a user `u`

would give to an item `i`

by

where $x_u^T = (x_u^1, x_u^2, \dots, x_u^N)$ is a vector associated to the user, and $y_i^T = (y_i^1, y_i^2, \dots, y_i^N)$ is a vector associated to the item. The dimension of the vectors is the *rank* of the model, and the components are called *factors*.

We can collect the user-item ratings into a matrix $\widehat{R}=(\hat{r}_{ui})$. First collect the user vectors into a matrix
$$X^T = \left(\begin{matrix} \vdots & \vdots & \cdots & \vdots \\
x_{u_1} & x_{u_2} & \cdots & x_{u_{n_\text{users}}} \\
\vdots & \vdots & \cdots & \vdots \end{matrix}\right),$$

and the item vectors into a matrix
$$Y^T = \left(\begin{matrix} \vdots & \vdots & \cdots & \vdots \\
y_{i_1} & y_{i_2} & \cdots & y_{i_{n_\text{items}}} \\
\vdots & \vdots & \cdots & \vdots \end{matrix}\right),$$

then we can express the above model as

Of course, there is no reason that the "true" user-item matrix $R = (r_{ui})$ should have rank $N$, and indeed, we won't even know the "true" user-item matrix. This model assumes that it can be approximated by a rank $N$ factorization:

$$ R \sim \widehat{R} = XY^T.$$In other words, we want to find an $n_\text{users}\times N$ matrix $X$ and an $n_\text{items}\times N$ matrix $Y$ such that $\widehat{R}:=XY^T$ approximates $R$.

** Remark** We follow the matrix conventions of Hu2008, and machine learning in general. Although the user- and item-factor vectors $x_u$ and $y_i$ (and vectors in general) are regarded as column vectors, they are arranged as rows into the composite matrices $X$ and $Y$.

One popular method for finding the matrices $X$ and $Y$ given partial information about $R$ is known as *alternating least squares*. The idea is to find the parameters $x^j_{u}$ and $y^j_{i}$ (the entries of the matrices $X$ and $Y$) which minimize the $L^2$ cost function

The constant $\lambda$ is called the regularization parameter and essentially penalizes the components of the matrices $X$ and $Y$ if they get too large (in magnitude). This is important for numerical stability (and some kind of regularization is almost always used). It has an even more important effect in this setting, though:

**Fundamental Observation**: If we hold the item vectors $Y$ fixed, $C$ is a *quadratic* function of the components of $X$. Similarly, if we hold the user vectors $X$ fixed, $C$ is a quadratic function of the components of $Y$.

So in order to minimize $C$, one could try the following:

- Hold the user vectors fixed and
*solve*the quadratic equation for the $y^j_{i}$'s. This will (probably) not be the global minimum of $C$ since we haven't touched half of the variables (the $x^j_{u}$'s), but we have at least decreased $C$. - Hold the item vectors fixed and
*solve*the quadratic equation for the $x^j_{u}$'s. - Repeat.

Some remarks are in order.

- Since $C$ is a convex function, and each step of the above algorithm is a minimization, the process must converge at some point.
- There are other methods for minimizing such a convex cost function, for example, gradient descent (or stochastic gradient descent, or batch SGD, etc...). One important difference here is that at each step of our algorithm, we find the
*exact*minimum; we don't take small steps in a downward direction. So if we do step 1), a second application of step 1) will have no effect: we're already at the absolute minimum of $C$ with $Y$ held fixed. It also means that a single iteration of the above algorithm generally moves much further than an iteration of a gradient descent algorithm; i.e. we need fewer iterations for convergence.

The above algorithm is called (for obvious reasons) *alternating least squares*.

A bit of linear algebra yields the following algorithm (see here, for example):

- Initialize the user vectors $X$ somehow (e.g. randomly).
- For each item
`i`

, let $r_i$ be the vector of ratings of that item (it will have`n_users`

components; one for each user). Compute $$y_i = \left(X^T X + \lambda I\right)^{-1} X^T r_i$$ for each item`i`

. ($I$ is the $N\times N$ identity matrix.) - For each user
`u`

, let $r_u$ be the vector of ratings of that user (it will have`n_items`

components; one for each item). Compute $$x_u = \left(Y^T Y + \lambda I\right)^{-1} Y^T r_u$$ for each user`u`

. - Repeat 2), 3) until desired level of convergence is achieved.

** Remark.** It is almost always faster to solve the linear system $\left(X^T X + \lambda I\right)y_i = X^T r_i$ instead of inverting the matrix $\left(X^T X + \lambda I\right)$ as written in the algorithm (similarly for the user vectors).

As we mentioned above, it turns out that most recommender systems perform better if user and item biases are taken into account.

Suppose we have a ratings system that allows each user to rate each item on a scale of 1 to 5 stars. Suppose we have two users: `Alice`

, who rates items with an average of `4`

stars, and `Bob`

, whose average rating is `1.5`

stars. If `Bob`

rates some new item with `3`

stars, it means something *very* different than if `Alice`

rates the same item with `3`

stars (`Bob`

*really* liked the new item, `Alice`

didn't). The difference is what we call *user bias*.

For an explicit-ratings model (as discussed above, when we have explicit ratings for user-item pairs), one way to account for user bias is to model the actual user-item ratings as

$$r_{ui} \sim \beta_u + \hat{r}_{ui},$$where $\beta_u$ is the *user bias* of user `u`

and $\hat{r}_{ui}$ is the model of whatever is left over; for example, $\hat{r}_{ui} = x^T_uy_i$ if we use the matrix factorization model discussed above. We can account for item bias in a similar way:

here, $\gamma_i$ is the *item bias* of item `i`

.

** Remark**. If instead of matrix factorization we used a traditional collaborative filter, we would define $\hat{r}_{ui}$ via $k$-nearest neighbors using some similarity metric (Pearson is a standard choice), $\beta_u$ is the average rating of user

`u`

over all items, and $\gamma_i$ is the average rating of item `i`

over all users.User and item biases can be directly incorporated into the ALS algorithm. We model the user-item ratings matrix as

$$r_{ui} \sim \beta_u + \gamma_i + x^T_u y_i$$and minimize the cost function

$$ C^\text{biased} = \sum_{u,i\in\text{observed ratings}} (r_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_{u} \left( \|x_u\|^2 + \beta_u^2\right) + \sum_{i} \left(\|y_i\|^2 + \gamma_i^2\right) \right).$$Again, because of the regularization, we can hold the user variables fixed and solve for the minimum in the item variables. Then we can hold the item variables fixed and solve for the minimum in the user variables.

The biases $\beta_u$ and $\gamma_i$ appear in $C^\text{biased}$ without any coefficients, whereas all the other parameters appear at least once with some coefficient. So one approach would be to solve for the biases separately from the other parameters in each step.

It is easier, though, to leverage the work we've already done. We can rewrite our cost function at each step so that it looks like an unbiased model, and then just use the same formulas we found above for the unbiased ALS. The trick is to define new vectors that include the biases as components in the right way.

Let $\beta$ be the vector of user biases (with `n_users`

components) and $\gamma$ the vector of item biases (with `n_items`

components). Here is the algorithm:

- Initialize user vectors randomly and set all biases to zero (or initialize them randomly, it doesn't matter very much).
- For each item
`i`

, define three new vectors : $$r^\beta_i := r_i - \beta$$ with components $r^\beta_{ui}:= r_{ui}-\beta_u$ (notice that both vectors have`n_users`

components), $$\tilde{x}^T_u := (1, x^T_u),$$ and $$\tilde{y}^T_i := (\gamma_i, y^T_i).$$ Then $C^\text{biased} = \sum \left(r^\beta_{ui} - \widetilde{x}_u^T\widetilde{y}_i \right)^2 + \lambda\left( \sum_u \|\widetilde{x}_u\|^2 + \sum_i \|\widetilde{y}_i\|^2\right)$. Hence, we find that the item bias and vector can be computed as $$ \widetilde{y}_i := \left(\begin{matrix} \gamma_i \\ y_i\end{matrix}\right) = \left(\widetilde{X}^T \widetilde{X} + \lambda I\right)^{-1} \widetilde{X}^T r^\beta_i.$$ ($I$ is now the $(N+1)\times(N+1)$ identity matrix, and $\tilde{X}$ and $\tilde{Y}$ are matrices whose columns are the vectors $\tilde{x}_u$ and $\tilde{y}_i$ as usual.) - Now, for each user
`u`

, define three new vectors: $$r^\gamma_u := r_u - \gamma,$$ $$\widetilde{y}^T_i := (1, y_i),$$ and $$\widetilde{x}^T_u := (\beta_u, x_u).$$ The user bias and vector can be computed as $$ \widetilde{x}_u := \left(\begin{matrix} \beta_u \\ x_u\end{matrix}\right) = \left(\widetilde{Y}^T \widetilde{Y} + \lambda I\right)^{-1} \widetilde{Y}^T r^\gamma_u.$$ - Repeat 2), 3) until convergence.

For many real-world user-item interactions there are no explicit rating data available. However, there is often nontrivial information about the interactions, e.g. clicks, listens/watches, purchases, etc. Such indirect "ratings" information about user-item interactions is known as **implicit feedback**. Modeling implicit feedback is a difficult but important problem. There are several ways to use the ALS matrix factorization to approach such a model. We present here a standard solution, presented (without bias corrections) in Hu2008.

The basic approach is to forget about modeling the implicit feedback directly. Rather, we want to understand whether user `u`

has a preference or not for item `i`

using a simple boolean variable which we denote by $p_{ui}.$ The number of clicks, listens, views, etc, will be interpreted as our confidence in our model.

Following the fundamental idea of matrix factorization models, we try to find a user vector $x_u$ for each user `u`

and an item vector $y_i$ for each item `i`

so that
$$p_{ui} \sim x^T_u y_i.$$
It is important to note that we never actually observe $p_{ui}$! (This is a very different situation than the explicit feedack models, as discussed above, where $r_{ui}$ is the observer rating.)

Let's assume the our implicit feedback is a positive integer (number of clicks, number of views, number of listens, etc). That is, $$r_{ui} = \text{# of times user }\mathtt{u}\text{ interacted with item }\mathtt{i}.$$

How do we go about finding the vectors $x_u$ and $y_i$ given some implicit feedback $\{r_{ui}\}$? If a user has interacted with an item, we have reason to believe that $p_{ui}=1$. The more they have interacted with that item, the stronger our belief in their preference.

To define our model, set
$$ p_{ui} = \begin{cases} 1 & \text{if }r_{ui}>0\\ 0 & \text{if }r_{ui}=0.\end{cases}$$
We try to minimize the following cost function:
$$ C_\text{implicit} := \sum_{u,i\in\text{observed interactions}} c_{ui}\left(p_{ui} - x^T_u y_i \right)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right),$$
where $c_{ui}$ is our **confidence** in $p_{ui}.$ That is, the more a user has interacted with an item, the more we penalize our model for incorrectly predicting $p_{ui}.$

If a user has never interacted with an item, it is possible that $p_{ui}=1$ and the user has just never encountered the item. It is also possible they are actively avoiding the item. To deal with this ambiguity, it is common to define the confidence by $$c_{ui}:=1 + \alpha r_{ui},$$ where $\alpha$ is a parameter of the model that needs to be tuned to the dataset. There is some empirical evidence that setting it to the sparsity ratio (the ratio of nonzero entries to zero entries) works well; missing entries are often interpreted as slightly negative, and so one could interpret $\alpha$ as balancing positive and negative interactions. See Johnson2014, for example. Another possibility that works well (and makes the model more robust against power users -- those users who interact with items orders of magnitude more often than the average user), which is also mentioned in Hu2008, is to set $$c_{ui} = 1 + \alpha\log(1+r_{ui}/\epsilon),$$ where $\epsilon$ is yet another data-dependent parameter of the model.

The basic idea of the ALS algorithm for matrix factorization applies to minimizing $C_\text{implicit}$: if we hold the user (resp. item) vectors fixed, then $C_\text{implicit}$ depends quadratically on the item (resp. user) vectors. That means we can do the same thing: hold the user vectors fixed and solve for the minimum in the item variables, then hold the item vectors fixed and solve for the minimum in the user variables, and repeat...

Of course, our cost function $C_\text{implicit}$ is slightly different, so the actual steps involving solving for the minimum look a bit different. The algorithm is as follows.

- Initialize user vectors.
- For each item
`i`

, let $p_i$ be the vector whose components are $p_{ui}$ (for fixed`i`

, there will be`n_users`

components), let $C^i$ be the diagonal matrix with $c_{ui}$ for fixed $u$ along the diagonal, and let $d_i = C^i p_i$ (the reason for defining $d_i$ separately will be explained in the remarks about the`numpy`

implementation below). Compute $$y_i = \left(X^TC^iX+\lambda I\right)^{-1}X^Td_i.$$ - For each user
`u`

, let $p_u$ be the vector whose components are $p_{ui}$ (for fixed`u`

; there will be`n_items`

components), let $c_{u}$ be the vector whose components are $c_{ui}$, let $C^u$ be the diagonal matrix with $c_u$ along the diagonal, and let $d_u=C^u p_u$. Compute $$x_u = \left(Y^TC^uY + \lambda I\right)^{-1}Y^Td_u.$$ - Repeat 2), 3) until convergence.

*Remarks*

- As for standard (explicit) matrix factorization models, it is (almost) always better to solve $\left(X^TC^iX+\lambda I\right) y_i = X^Td_i$ than to invert the matrix $\left(X^TC^iX+\lambda I\right)$.
- Since $c_{ui} = 1 + \alpha r_{ui}$, which we can write in terms of matrices $C:=(c_{ui})$ and $R:=(r_{ui})$ as $C = 1 + \alpha R$, the computation in step two can be rewritten as
$$y_i = \left(X^TX + \lambda I + \alpha X^TR^iX\right)^{-1}X^Td_i.$$
The term $X^TX+\lambda I$ is independent of
`i`

, and can be computed once and reused at each step of the algorithm (similarly for $Y^TY+\lambda I$ in step 3).

It turns out that the matrix multiplication $X^TC^iX$ and the matrix-vector product $X^Td_i$ are both easier to implement in `numpy`

than to express in standard linear algebra notation. A direct implementation of the algorithm is of course possible, but a small bit of thought improves performance vastly.

`numpy`

Implementation NotesA direct implementation of the matrix product $X^TC^iX$ would be something like

```
Ci = np.diag(c[i])
XTCiX = np.dot(X, np.dot(Ci, np.dot(X.T)).
```

But the matrix product `np.dot(X, Ci)`

is just multiplying the rows of $X$, in order, by the diagonal elements of $C^i$, which is achieved very efficiently by simple `numpy`

broadcasting. So we can achieve the same matrix product via

```
XTCiX = np.dot(c[i] * X.T, X).
```

Moreover, since $d_i = p_i + \alpha r_i$ (which follows from $p_{ui} r_{ui} = r_{ui}$), we can rewrite the algorithm in terms of just the scaled raw counts $\alpha R$ and the preference $p$ as $$y_i = \left(X^TX+\lambda I + \alpha X^TR^iX\right)^{-1}X^T(p_i + \alpha r_i).$$

Hence, we can completely avoid constructing the diagonal matrices $C^i$ and $C^u$.

Finally, when `n_users`

and `n_items`

are large, $R$ (and hence $p$) will generally be sparse matrices. The matrix $\left(X^TX+\lambda I + \alpha X^TR^iX\right)$ is, however, $N\times N$ and since the number of factors is generally small (on the order of 10), there is no great savings by using a sparse implementation of this matrix. On the other hand, only the user vectors corresponding to nonzero entries of $r_i$ (and $p_i$) are needed in the computation of $X^TR^iX$ and of $X^T(p_i + \alpha r_i)$, so if the algorithm is parallelized, it is *very* useful to only ship those vectors to the workers doing the computation for $y_i$ (this is, for example, how the Spark implementation works). Our implementation is not parallelized (for the sake of clarity), so we don't need to worry about shipping data to workers.

As with the standard matrix factorization model, matrix factorization models for implicit feedback tend to work better if user and item biases are accounted for; we introduce parameters $\beta_u$ and $\gamma_i$ as before, and try to fit a model $$p_{ui} \sim \beta_u + \gamma_i + \hat{p}_{ui},$$ where $\hat{p}_{ui} = x^T_u y_i$ as above, by minimizing the cost function $$C_{implicit}^{biased} := \sum_{u,i\in\text{observed interactions}} c_{ui}\left(p_{ui} - \beta_u - \gamma_i - x^T_u y_i \right)^2 + \lambda \left(\sum_u (\|x_u\|^2 + \beta_u^2) + \sum_i (\|y_i\|^2 + \gamma_i^2)\right).$$

The interpretation of the biases in this case is slightly different:

- User Bias $\beta_u$: A higher bias $\beta_u$ pushes the values of all preferences $p_{ui}$ up
*for all items*; that is, a user with a high bias likes a large variety of items; conversely, a low user bias means the user only likes a small selection of items. - Item Bias $\gamma_i$: A higher item bias $\gamma_i$ pushes the preferences $p_{ui}$ up
*for all users*; that is, the item is more universally beloved (or mainstream); conversely, a low item bias might indicate a niche item.

Again, the principle of the ALS algorithm applies with appropriate modifications for the new cost function $C^\text{biased}_\text{implicit}$. The algorithm is:

- Initialize user vectors and biases.
- Define vectors $$p^\beta_i:=p_i - \beta,$$ $$\tilde{x}_u^T = (1, x^T_u),$$ and $$\tilde{y}_i^T = (\gamma_i, y^T_i).$$ Then compute $$\tilde{y}_i = \left(\widetilde{X}^TC^i\widetilde{X} + \lambda I\right)^{-1}\widetilde{X}^Td^\beta_i,$$ where we have defined $d^\beta_i=C^ip^\beta_i$ as in the previous implicit ALS algorithm.
- Define vectors $$p^\gamma_u:=p_u - \gamma,$$ $$\tilde{y}_i^T = (1, y^T_i),$$ and $$\tilde{x}_u^T = (\beta_u, x^T_u).$$ Then compute $$\widetilde{x}_u = \left(\widetilde{Y}^TC^u\widetilde{Y} + \lambda I\right)^{-1}\widetilde{Y}^Td^\gamma_u,$$ where $d^\gamma_u=C^up^\gamma_u$.
- Repeat 2) and 3)

The same remarks apply as for biased ALS and implicit ALS.

D. Goldberg, D. Nichols, B. M. Oki, D. Terry,
*Using Collaborative Filtering to Weave an Information Tapestry*,
Comm. ACM **35** (1992) pp 61 -- 70.

R. M. Bell, Y. Koren, C. Volinsky,
*The BellKor 2008 Solution to the Netflix Prize*

Y. Hu, Y. Koren, C. Volinsky
*Collaborative Filtering for Implicit Feedback Datasets*
8th IEEE International Conference on Data Mining, pp. 263 -- 272.

C. Johnson
*Logistic Matrix Factorization for Implicit Feedback Data*