A Bayesian Approach toward Active Learning for Collaborative Filtering

Rong Jin
Department of Computer Science
Michigan State University
rong@cse.cmu.edu

Luo Si
Language Technology Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15232
lsi@cs.cmu.edu
Abstract

Collaborative filtering is a useful technique for exploiting the preference patterns of a group of users to predict the utility of items for the active user. In general, the performance of collaborative filtering depends on the number of rated examples given by the active user: the more rated examples the active user provides, the more accurate the predicted ratings will be. Active learning provides an effective way to acquire the most informative rated examples from active users. Previous work on active learning for collaborative filtering considers only the expected loss function based on the estimated model, which can be misleading when the estimated model is inaccurate. This paper takes one step further by taking into account the posterior distribution of the estimated model, which results in a more robust active learning algorithm. Empirical studies with datasets of movie ratings show that when the number of ratings from the active user is restricted to be small, active learning methods based only on the estimated model do not perform well, while the active learning method using the model distribution achieves substantially better performance.

1. Introduction

The rapid growth of information on the Internet demands intelligent information agents that can sift through all the available information and find what is most valuable to us. Collaborative filtering exploits the preference patterns of a group of users to predict the utility of items for an active user. Compared to content-based filtering approaches, collaborative filtering systems have advantages in environments where the contents of items are not available, due either to privacy issues or to the fact that contents are difficult for a computer to analyze (e.g., music and videos). One of the key issues in collaborative filtering is to identify the group of users who share interests similar to those of the active user. Usually, the similarity between users is measured based on their ratings over the same set of items. Therefore, to accurately identify users that share interests similar to the active user's, a reasonably large number of ratings from the active user is usually required. However, few users are willing to provide ratings for a large number of items.

Active learning methods provide a solution to this problem by acquiring from an active user the ratings that are most useful in determining his/her interests. Instead of randomly selecting an item for soliciting a rating from the active user, most active learning methods select items to maximize the expected reduction in a predefined loss function. Commonly used loss functions include the entropy of the model distribution and the prediction error. In the paper by Yu et al. (2003), the expected reduction in the entropy of the model distribution is used to select the most informative item for the active user. Boutilier et al. (2003) applies the metric of expected value of utility to find the most informative item for soliciting a rating, which is essentially to find the item that leads to the most significant change in the highest expected ratings.

One problem with the previous work on active learning for collaborative filtering is that the computation of the expected loss is based only on the estimated model. This can be dangerous when the number of rated examples given by the active user is small, since the estimated model is then usually far from accurate. A better strategy for active learning is to take the model uncertainty into account by averaging the expected loss function over the posterior distribution of models. With the full Bayesian treatment, we will be able to avoid the problem caused by the large variance in the model distribution. Many studies have been done on active learning that take the model uncertainty into account. The method of query by committee (Seung et al., 1992;
Freund et al., 1997) simulates the posterior distribution of models by constructing an ensemble of models, and the example with the largest uncertainty in prediction is selected for the user's feedback. In the work by Tong and Koller (2000), a full Bayesian analysis of active learning for parameter estimation in Bayesian networks is used, which takes the model uncertainty into account in computing the loss function. In this paper, we will apply the full Bayesian analysis to active learning for collaborative filtering. In particular, in order to simplify the computation, we approximate the posterior distribution of the model with a simple Dirichlet distribution, which leads to an analytic expression for the expected loss function.

The rest of the paper is arranged as follows: Section 2 describes the related work in both collaborative filtering and active learning. Section 3 discusses the proposed active learning algorithm for collaborative filtering. The experiments are explained and discussed in Section 4. Section 5 concludes this work and discusses future work.

2. Related Work

In this section, we will first briefly discuss the previous work on collaborative filtering, followed by the previous work on active learning. The previous work on active learning for collaborative filtering will be discussed at the end of this section.

2.1 Previous Work on Collaborative Filtering

Most collaborative filtering methods fall into two categories: memory-based algorithms and model-based algorithms (Breese et al., 1998). Memory-based algorithms store the rating examples of users in a training database. In the prediction phase, they predict the ratings of an active user based on the corresponding ratings of the users in the training database that are similar to the active user. In contrast, model-based algorithms construct models that well explain the rating examples from the training database and apply the estimated model to predict the ratings for active users. Both types of approaches have been shown to be effective for collaborative filtering. In this subsection, we focus on the model-based collaborative filtering approaches, including the Aspect Model (AM), Personality Diagnosis (PD), and the Flexible Mixture Model (FMM).

For the convenience of discussion, we will first introduce the notation. Let the items be denoted by X = {x_1, x_2, ..., x_M}, the users by Y = {y_1, y_2, ..., y_N}, and the range of ratings by {1, ..., R}. A tuple (x, y, r) means that rating r is assigned to item x by user y. Let X(y) denote the set of items rated by user y, and let R_y(x) stand for the rating of item x by user y.

The aspect model is a probabilistic latent space model, which models individual preferences as a convex combination of preference factors (Hofmann & Puzicha, 1999; Hofmann, 2003). A latent class variable z ∈ Z = {z_1, z_2, ..., z_K} is associated with each pair of a user and an item. The aspect model assumes that users and items are independent of each other given the latent class variable. Thus, the probability for each observation tuple (x, y, r) is calculated as follows:

    p(r | x, y) = Σ_{z∈Z} p(r | z, x) p(z | y)                                          (1)

where p(z|y) stands for the likelihood for user y to be in class z, and p(r|z,x) stands for the likelihood of assigning item x with rating r by users in class z. In order to achieve better performance, the ratings of each user are normalized to a normal distribution with zero mean and unit variance (Hofmann, 2003). The parameter p(r|z,x) is approximated as a Gaussian distribution N(μ_z, σ_z) and p(z|y) as a multinomial distribution.

The personality diagnosis approach treats each user in the training database as an individual model. To predict the rating of the active user on certain items, we first compute the likelihood for the active user to be in the 'model' of each training user. The ratings from the training users on the same items are then weighted by the computed likelihoods, and the weighted average is used as the estimate of the ratings for the active user. By assuming that the observed rating of the active user y' on an item x is drawn from an independent normal distribution with the true rating R^True_{y'}(x) as its mean, we have

    p(R_{y'}(x) | R^True_{y'}(x)) ∝ exp( −(R_{y'}(x) − R^True_{y'}(x))² / 2σ² )          (2)

where the standard deviation σ is set to the constant 1 in our experiments. The probability for the active user y' to be in the model of any user y in the training database can then be written as:

    p(y' | y) ∝ Π_{x∈X(y')} exp( −(R_y(x) − R_{y'}(x))² / 2σ² )                          (3)

Finally, the likelihood for the active user y' to rate an unseen item x as r is computed as:

    p(R_{y'}(x) = r) ∝ Σ_y p(y' | y) exp( −(R_y(x) − r)² / 2σ² )                         (4)

Previous empirical studies have shown that the personality diagnosis method is able to outperform several other approaches for collaborative filtering (Pennock et al., 2000).

The Flexible Mixture Model introduces two sets of hidden variables {z_x, z_y}, with z_x for the class of items and z_y for the class of users (Si and Jin, 2003; Jin et al., 2003). Similar to the aspect model, the probability for each observed tuple (x, y, r) is factorized into a sum over the different classes for items and users, i.e.,
    p(x, y, r) = Σ_{z_x, z_y} p(z_x) p(z_y) p(x | z_x) p(y | z_y) p(r | z_x, z_y)       (5)

where the parameter p(y|z_y) stands for the likelihood for user y to be in class z_y, p(x|z_x) stands for the likelihood for item x to be in class z_x, and p(r|z_x, z_y) stands for the likelihood for users in class z_y to rate items in class z_x as r. All the parameters are estimated using the Expectation Maximization (EM) algorithm (Dempster et al., 1977). The multiple-cause vector quantization (MCVQ) model (Boutilier and Zemel, 2003) uses a similar idea for collaborative filtering.

2.2 Previous Work on Active Learning

The goal of active learning is to learn the correct model using only a small number of labeled examples. The general approach is to find the example from the pool of unlabeled data that gives the largest reduction in the expected loss function. The loss functions used by most active learning methods can be categorized into two groups: loss functions based on model uncertainty and loss functions based on prediction errors. For the first type, the goal is to achieve the largest reduction in the space of hypotheses. One commonly used loss function is the entropy of the model distribution. Methods within this category usually select the example for which the model has the largest uncertainty in predicting the label (Seung et al., 1992; Freund et al., 1997; Abe and Mamitsuka, 1998; Campbell et al., 2000; Tong and Koller, 2002). The second type of loss function involves the prediction errors. The largest reduction in the volume of the hypothesis space is not necessarily effective in cutting the prediction error: Freund et al. (1997) showed an example in which the largest reduction in the space of hypotheses does not bring the optimal improvement in the reduction of prediction errors. Empirical studies in Bayesian networks (Tong and Koller, 2000) and text categorization (Roy and McCallum, 2001) have shown that a loss function that directly targets the prediction error is able to achieve better performance than a loss function based only on the model uncertainty. The commonly used loss function within this category is the entropy of the distribution of predicted labels.

In addition to the choice of loss function, how to estimate the expected loss is another important issue for active learning. Many active learning methods compute the expected loss based only on the currently estimated model, without taking the model uncertainty into account. Even though this simple strategy works fine for many applications, it can be very misleading, particularly when the estimated model is far from the true model. As an example, consider learning a classification model for the data distribution in Figure 1, where spheres represent data points of one class and stars represent data points of the other class. The four labeled examples are highlighted by the line-shaded ellipses.

[Figure 1: A learning scenario when the estimated model distribution is incorrect. The vertical line corresponds to the correct decision boundary and the horizontal line corresponds to the estimated decision boundary. Axes: f1 (vertical), f2 (horizontal).]

Based on these four training examples, the most likely decision boundary is the horizontal line (i.e., the dashed line), while the true decision boundary is the vertical line (i.e., the dotted line). If we rely only on the estimated model for computing the expected loss, the examples that will be selected for the user's feedback are most likely from the dot-shaded areas, which will not be very effective in adjusting the estimated decision boundary (i.e., the horizontal line) toward the correct decision boundary (i.e., the vertical line). On the other hand, if we use the model distribution for estimating the expected loss, we will be able to adjust the decision boundary more effectively, since decision boundaries other than the estimated one (i.e., the horizontal line) have been considered in the computation of the expected loss.

There have been several studies on active learning that use the posterior distribution of models for estimating the expected loss. The query by committee approach simulates the model distribution by sampling a set of models from the posterior distribution. In the work on active learning for parameter estimation in Bayesian networks, a Dirichlet distribution for the parameters is used to estimate the change in the entropy function. However, the general difficulty with the full Bayesian analysis for active learning is the computational complexity. For complicated models, it is usually rather difficult to obtain the model posterior distribution, which leads to sampling approaches such as Markov Chain Monte Carlo (MCMC) and Gibbs sampling. In this paper, we follow an idea similar to the variational approach (Jordan et al., 1999). Instead of accurately computing the posteriors, we approximate the posterior distribution with an analytic solution and apply it to estimate the expected loss. Compared to the sampling approaches, this approach substantially simplifies the computation by avoiding
generating a large number of models and calculating the loss function values of all the generated models.

2.3 Previous Work on Active Learning for Collaborative Filtering

There have been only a few studies on active learning for collaborative filtering. In the paper by Yu et al. (2003), a method similar to Personality Diagnosis (PD) is used for collaborative filtering. Each example is selected for the user's feedback in order to reduce the entropy of the like-mindedness distribution p(y' | y) in Equation (3). In the paper by Boutilier et al. (2003), a method similar to the Flexible Mixture Model (FMM), named 'multiple-cause vector quantization', is used for collaborative filtering. Unlike most other collaborative filtering research, which tries to minimize the prediction error, this paper is concerned only with the items that are strongly recommended by the system. The loss function is based on the expected value of information (EVOI), which is computed based on the currently estimated model. One problem with these two approaches is that the expected loss is computed based only on a single model, namely the currently estimated model. This is especially problematic when the number of rated examples given by the active user is small while the number of parameters to be estimated is large. In the experiments, we will show that selection based on the expected loss function with only the currently estimated model can be even worse than simple random selection.

3. A Bayesian Approach Toward Active Learning for Collaborative Filtering

In this section, we will first discuss the approximate analytic solution for the posterior distribution. Then, the approximated posterior distribution will be used to estimate the expected loss.

3.1 Approximating the Model Posterior Distribution

For the sake of simplicity, we will focus on active learning of the aspect model for collaborative filtering. However, the discussion in this section can easily be extended to other models for collaborative filtering.

As described in Equation (1), each conditional likelihood p(r|x,y) is decomposed as the sum Σ_z p(r|z,x) p(z|y). For the active user y', the most important task is to determine his/her user type, or p(z|y'). Let θ stand for the parameters θ = {θ_z = p(z|y')}_z. Usually, the parameters θ are estimated through maximum likelihood estimation, i.e.,

    θ* = argmax_{θ∈Θ} Π_i p(r_i | x_i, y'; θ)
       = argmax_{θ∈Θ} Π_i Σ_{z∈Z} p(r_i | z, x_i) θ_z                                   (6)

where (x_i, r_i) ranges over the items rated by the active user. Then the posterior distribution of the model, p(θ | D(y')), where D(y') includes all the rated items given by the active user y', can be written as:

    p(θ | D(y')) ∝ p(D(y') | θ) p(θ) ≈ Π_i Σ_{z∈Z} p(r_i | z, x_i) θ_z                  (7)

In the above, a uniform prior distribution is applied and simply ignored. The posterior in (7) is rather difficult to use for estimating the expected loss, due to the multiple products and the lack of normalization.

To approximate the posterior distribution, we consider the expansion of the posterior function around its maximum point. Let θ* stand for the maximal parameters obtained from the EM algorithm by maximizing Equation (6). Consider the log ratio of p(θ | D(y')) to p(θ* | D(y')), which can be written as:

    log [ p(θ | D(y')) / p(θ* | D(y')) ]
        = Σ_i log [ Σ_z p(r_i | x_i, z) θ_z / Σ_z p(r_i | x_i, z) θ*_z ]
        ≥ Σ_i Σ_z [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ] log (θ_z / θ*_z)
        = Σ_z (α_z − 1) log (θ_z / θ*_z)                                                (8)

where the variable α_z is defined as

    α_z = Σ_i [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ] + 1            (9)

Therefore, according to Equation (8), the approximated posterior distribution p(θ | D(y')) is a Dirichlet with hyperparameters α_z defined in Equation (9), or

    p(θ | D(y')) ≈ [ Γ(α*) / Π_z Γ(α_z) ] Π_z θ_z^{α_z − 1}                             (10)

where α* = Σ_z α_z. Furthermore, it is easy to see that

    (α_z − 1) / (α_{z'} − 1) = θ*_z / θ*_{z'}                                           (11)
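To make the construction above concrete, the following is a minimal pure-Python sketch of the aspect model's rating prediction (Equation (1)) and of the Dirichlet hyperparameters in Equation (9). The discretized rating table `p_r_zx[z][x][r]` and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def predict_rating_dist(theta, p_r_zx, x, ratings):
    """p(r|x,y') = sum_z p(r|z,x) * theta_z  (Equation 1)."""
    return [sum(p_r_zx[z][x][r] * theta[z] for z in range(len(theta)))
            for r in ratings]

def dirichlet_hyperparams(rated, p_r_zx, theta_star):
    """Hyperparameters alpha_z of the approximate Dirichlet posterior
    (Equation 9): each rated example contributes its class responsibility
    under the current model, plus 1 from the uniform prior."""
    K = len(theta_star)
    alpha = [1.0] * K                          # the "+1" term in Equation (9)
    for x, r in rated:
        denom = sum(p_r_zx[zp][x][r] * theta_star[zp] for zp in range(K))
        for z in range(K):
            alpha[z] += p_r_zx[z][x][r] * theta_star[z] / denom
    return alpha

# Toy sketch: K=2 user classes, 2 items, 2 rating values.
p_r_zx = [[[0.6, 0.4], [0.2, 0.8]],   # class z1: p(r|z1,x) for items x1, x2
          [[0.3, 0.7], [0.5, 0.5]]]   # class z2
theta_star = [0.25, 0.75]             # ML estimate of p(z|y') from EM
alpha = dirichlet_hyperparams([(0, 1), (1, 0)], p_r_zx, theta_star)
# each example's responsibilities sum to 1, so sum(alpha) = K + #examples
```

Note that the responsibilities added to each α_z are exactly the E-step quantities of the EM algorithm, which is why the Dirichlet mode coincides with θ* (Equation (11)).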
The relation in Equation (11) holds because the optimal solution obtained by the EM algorithm is actually a fixed point of the following equation:

    θ*_z = Σ_i [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ]               (12)

Based on the property in Equation (11), the optimal point derived from the approximated Dirichlet distribution is exactly θ*, which is the optimal solution found by the EM algorithm.

3.2 Estimating the Expected Loss

Similar to many other active learning works (Tong and Koller, 2000; Seung et al., 1992; Yu et al., 2003), we use the entropy of the model as the loss function. Many previous studies of active learning try to find the example that directly minimizes the entropy of the model parameters under the currently estimated model. In the case of the aspect model, this can be formulated as the following optimization problem:

    x* = argmin_{x∈X} ⟨ −Σ_z θ_{z|x,r} log θ_{z|x,r} ⟩_{p(r|x,θ)}                       (13)

where θ denotes the currently estimated parameters and θ_{z|x,r} denotes the optimal parameters estimated using one additional rated example, i.e., with example x rated as r. The basic idea of Equation (13) is to find the example x such that the expected entropy of the distribution θ is minimized. There are two problems with this simple strategy. The first problem comes from the fact that the expected entropy is computed based only on the currently estimated model: as indicated in Equation (13), the expectation is evaluated over the distribution p(r|x,θ). As already discussed in Section 2, this simple strategy can be misleading, particularly when the currently estimated model is far from the true model. The second problem with the strategy stated in Equation (13) comes from the characteristics of collaborative filtering. Recall that the parameter θ_z stands for the likelihood for the active user to be in user class z. What the optimization in Equation (13) tries to achieve is to find the most efficient way to reduce the uncertainty in the type of user class for the active user. However, according to previous studies in collaborative filtering (Si and Jin, 2003), most users are of multiple types instead of a single type. Therefore, we would expect the distribution θ to have multiple modes, which is clearly inconsistent with the optimization goal in Equation (13). Based on the above analysis, the optimization problem in Equation (13) is not appropriate for active learning of collaborative filtering.

In the ideal case, if the true model θ^true were given, the optimal strategy in selecting an example would be:

    x* = argmin_{x∈X} ⟨ −Σ_z θ^true_z log θ_{z|x,r} ⟩_{p(r|x,θ^true)}                   (14)

The goal of the optimization in Equation (14) is to find an example such that, by adding it to the pool of rated examples, the updated model θ_{|x,r} is adjusted toward the true model θ^true. Unfortunately, the true model is unknown. One way to approximate the optimization goal in Equation (14) is to replace θ^true with the expectation over the posterior distribution p(θ | D(y')). Therefore, Equation (14) is transformed into the following optimization problem:

    x* = argmin_{x∈X} ⟨ −Σ_z θ'_z log θ_{z|x,r} ⟩_{p(r|x,θ') p(θ'|D(y'))}               (15)

Now the key issue is how to compute the integration efficiently. In our case, the integration can in fact be computed analytically as:

    ⟨ Σ_z θ'_z log θ_{z|x,r} ⟩_{p(r|x,θ') p(θ'|D(y'))}
        = Σ_r ∫ p(θ' | D(y')) p(r | x, θ') Σ_j θ'_j log θ_{j|r,x} dθ'
        = Σ_r Σ_i Σ_j p(r | x, i) ⟨ θ'_i θ'_j ⟩_{p(θ'|D(y'))} log θ_{j|r,x}
        = Σ_r [ Σ_j α_j log θ_{j|r,x} Σ_i α_i p(r | x, i) ] / ( α* (α* + 1) )
          + Σ_r [ Σ_j α_j p(r | x, j) log θ_{j|r,x} ] / ( α* (α* + 1) )                 (16)

The details of the derivation, which makes use of the digamma function Ψ, can be found in the appendix. With the analytic solution for the integration in (16), we can simply go through each item in the set and find the one that minimizes the loss function specified in Equation (15).

Finally, the computation involves the estimation of the updated model θ_{|r,x}. The standard way to obtain the updated model is simply to rerun the full EM algorithm with one additional rated example (i.e., example x rated as category r). This computation can be extremely expensive, since we need an updated model for every item and every possible rating category.
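The selection rule of Equations (15)-(16) can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' code: `p_r_zx[z][x][r]` is a discretized rating model, `alpha` comes from Equation (9), and `update_theta` is a single fixed-point step of the fast update in Equation (18) below, used here as a stand-in for the fully updated model θ_{|x,r}.

```python
import math

def update_theta(theta_star, alpha, p_r_zx, x, r):
    """One fixed-point step of the fast posterior-mode update (Equation 18),
    approximating the updated model theta_{|x,r} after item x is rated r."""
    a_star = sum(alpha)
    K = len(theta_star)
    num = [p_r_zx[z][x][r] * theta_star[z] + alpha[z] - 1.0 for z in range(K)]
    den = a_star + sum(p_r_zx[zp][x][r] * theta_star[zp] - 1.0 for zp in range(K))
    return [n / den for n in num]        # sums to 1 by construction

def expected_loss(x, alpha, theta_star, p_r_zx, ratings):
    """Negative of the analytic integral in Equation (16); smaller is better."""
    a_star = sum(alpha)
    K = len(alpha)
    total = 0.0
    for r in ratings:
        theta_upd = update_theta(theta_star, alpha, p_r_zx, x, r)
        log_t = [math.log(max(t, 1e-12)) for t in theta_upd]
        s1 = (sum(alpha[j] * log_t[j] for j in range(K))
              * sum(alpha[i] * p_r_zx[i][x][r] for i in range(K)))
        s2 = sum(alpha[j] * p_r_zx[j][x][r] * log_t[j] for j in range(K))
        total += (s1 + s2) / (a_star * (a_star + 1.0))
    return -total

def select_item(candidates, alpha, theta_star, p_r_zx, ratings):
    """Pick the item minimizing the expected loss (Equation 15)."""
    return min(candidates,
               key=lambda x: expected_loss(x, alpha, theta_star, p_r_zx, ratings))
```

Because the Dirichlet moments are available in closed form, the loop over candidate items involves no sampling at all, which is the computational point of the approximation.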
A more efficient way to compute the updated model is to avoid the direct optimization of Equation (6). Instead, we use the following approximation:

    θ* = argmax_{θ∈Θ} p(r | x, θ) Π_{x'∈X(y')} p(r | x', y'; θ)
       = argmax_{θ∈Θ} p(r | x, θ) p(θ | D(y'))
       ≈ argmax_{θ∈Θ} p(r | x, θ) Dirichlet({α_z})
       ≈ argmax_{θ∈Θ} [ Σ_{z'} p(r | x, z') θ_{z'} ] Π_z θ_z^{α_z − 1}                  (17)

Equation (17) can again be optimized using the EM algorithm, with the updating equation:

    θ_z = [ p(r | x, z) θ_z + α_z − 1 ] / [ α* + Σ_{z'} ( p(r | x, z') θ_{z'} − 1 ) ]    (18)

The advantage of the updating equation in (18) over the more general EM updating equation in (12) is that Equation (18) relies only on the rating r for item x, while Equation (12) has to go through all the items that have been rated by the active user. Therefore, the computational cost of (18) is significantly less than that of Equation (12).

In summary, in this section we have proposed a Bayesian approach for active learning of collaborative filtering, which uses the model posterior distribution to compute the estimate of the loss function. To simplify the computation, we approximate the model distribution with a Dirichlet distribution; as a result, the expected loss can be calculated analytically. Furthermore, we also propose an efficient way to compute the updated model by approximating the prior with a Dirichlet distribution. For easy reference, we call this method the 'Bayesian method'.

4. Experiments

In this section, we present experimental results in order to address the following two questions:

1) Is the proposed active learning algorithm effective for collaborative filtering? In the experiments, we compare the proposed algorithm to the method of randomly acquiring examples from the active user.

2) How important is the full Bayesian treatment? In this paper, we emphasize the importance of taking into account the model uncertainty using the posterior distribution. To illustrate this point, we compare the proposed algorithm to two commonly used active learning methods that are based only on the estimated model, without utilizing the model distribution, and see which one is more effective for collaborative filtering. The details of these two active learning methods will be discussed later.

4.1 Experiment Design

Two datasets of movie ratings are used in our experiments, i.e., 'MovieRating'[1] and 'EachMovie'[2]. For 'EachMovie', we extracted a subset of 2,000 users with more than 40 ratings. The details of these two datasets are listed in Table 1.

[1] http://www.cs.usyd.edu.au/~irena/movie_data.zip
[2] http://research.compaq.com/SRC/eachmovie

Table 1: Characteristics of MovieRating and EachMovie.

                                 MovieRating    EachMovie
    Number of Users              500            2000
    Number of Items              1000           1682
    Avg. # of Rated Items/User   87.7           129.6
    Number of Ratings            5              6

For MovieRating, we took the first 200 users for training and all the others for testing. For EachMovie, the first 400 users were used for training. For each test user, we randomly selected three items with their ratings as the starting seed for the active learning algorithm to build the initial model. Furthermore, twenty rated items were reserved for each test user for evaluating the performance. Finally, the active learning algorithm is given the chance to ask for ratings of five different items, and the performance of the active learning algorithms is evaluated after each feedback. For simplicity, in this paper we assume that the active user will always be able to rate the items presented by the active learning algorithm. Of course, as pointed out by Yu et al. (2003), this is not a realistic assumption, because there are some items that the active user may never have seen before, and it is therefore impossible for him to rate those items. However, in this paper we will mainly focus on the behavior of active learning algorithms for collaborative filtering and leave this issue for future work.

The aspect model is used for collaborative filtering, as already described in Section 2. The number of user classes used in the experiments is 5 for the MovieRating dataset and 10 for the EachMovie dataset.

The evaluation metric used in our experiments is the mean absolute error (MAE), which is the average absolute deviation of the predicted ratings from the actual ratings on the items the test user has actually voted on:

    MAE = (1 / L_Test) Σ_l | r_(l) − R̂_{y_(l)}(x_(l)) |                                 (19)

where L_Test is the number of test ratings.
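As a small sketch, the MAE of Equation (19) over a held-out set of (actual, predicted) rating pairs can be computed as follows (the toy numbers are illustrative only):

```python
def mean_absolute_error(test_pairs):
    """MAE over (actual, predicted) rating pairs (Equation 19)."""
    return sum(abs(a - p) for a, p in test_pairs) / len(test_pairs)

# e.g. three reserved test ratings with absolute errors 1, 0 and 2:
mae = mean_absolute_error([(4, 3), (2, 2), (5, 3)])
# mae = (1 + 0 + 2) / 3 = 1.0
```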
In this experiment, we consider three different baseline models for active learning of collaborative filtering:

1) Random Selection: This method simply selects one item at random out of the pool of items for the user's feedback. We refer to this simple method as the 'random method'.

2) Model Entropy based Sample Selection: This is the method described at the beginning of Section 3.2. The idea is to find the item that is able to reduce the entropy of the distribution of the active user over the different user classes (Equation (13)). As mentioned before, the problems with this simple selection strategy are twofold: it computes the expected loss based only on the currently estimated model, and it conflicts with the intuition that each user can belong to multiple user classes. This simple method shares many similarities with the proposed method, except that it does not consider the model distribution. By comparing it to the proposed active learning method for collaborative filtering, we will be able to see whether the idea of using the model distribution for computing the expected loss is worthwhile for collaborative filtering. For easy reference, we call this method the 'model entropy method'.

3) Prediction Entropy based Sample Selection: Unlike the previous method, which is concerned with the uncertainty in the distribution of user classes, this method focuses on the uncertainty in the prediction of ratings for items. It selects the item that is able to reduce the entropy of the prediction most effectively. Formally, the selection criterion can be formulated as the following optimization problem:

    x* = argmin_{x∈X} ⟨ −Σ_{x'} p(r | x', θ_{|x,r}) log p(r | x', θ_{|x,r}) ⟩_{p(r|x,θ)}    (20)

where the notation θ_{|x,r} stands for the model updated using the extra rated example (x, y, r). Similar to the previous method, this approach uses the estimated model for computing the expected loss. However, unlike the previous approach, which tries to make the distribution of user types for the active user skewed, this approach targets the prediction distribution. Therefore, this method does not go against the intuition that each user is of multiple types. We refer to this method as the 'prediction entropy method'.

4.2 Results and Discussion

The results for the proposed active learning algorithm together with the three baseline models for the databases 'MovieRating' and 'EachMovie' are presented in Figures 2 and 3, respectively.

[Figure 2: MAE results for four different active learning algorithms for collaborative filtering, namely the random method, the Bayesian method, the model entropy method, and the prediction entropy method, plotted against the number of feedbacks (1-6); MAE ranges roughly from 0.79 to 0.91. The database is 'MovieRating' and the number of training users is 200. The smaller the MAE, the better the performance.]

The first thing we notice from these two diagrams is that both the 'model entropy method' and the 'prediction entropy method' perform consistently worse than the simple 'random method'. This phenomenon can be explained by the fact that the initial number of rated items given by the active user is only three, which is relatively small given that the number of parameters to determine is 10. Therefore, using the estimated model to compute the expected loss will be inaccurate for finding the most informative item to rate. Furthermore, comparing the results for the 'model entropy method' to those for the 'prediction entropy method', we see that the model entropy method usually performs even worse. This can be explained by the fact that, by minimizing the entropy of the distribution of user classes for the active user, the model entropy method tries to narrow the active user down to a single user type. However, most users are of multiple types. Therefore, it is inappropriate to apply the model entropy method to active learning for collaborative filtering. On the other hand, the prediction entropy method does not have this defect, because it focuses on minimizing the uncertainty in the prediction instead of the uncertainty in the distribution of user classes.

The second observation from Figures 2 and 3 is that the proposed active learning method performs better than any of the three baseline models on both 'MovieRating' and 'EachMovie'. The most important difference between the proposed method and the other methods for active learning is that the proposed method takes the model distribution into account, which makes it robust even when only three rated items are given by the active user.
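For comparison with the Bayesian sketch given earlier, the prediction entropy criterion of Equation (20) can be sketched as follows. This is an illustrative sketch: `p_r_zx[z][x][r]` is a discretized rating model as before, and `update_model` is a hypothetical helper returning the updated θ_{|x,r} (e.g., via Equation (18)).

```python
import math

def prediction_entropy_score(x, items, theta_hat, p_r_zx, update_model, ratings):
    """Expected entropy of the predicted ratings after querying item x
    (Equation 20); the item with the smallest score is selected."""
    K = len(theta_hat)

    def p_rating(xp, theta, r):          # p(r|x',theta) as in Equation (1)
        return sum(p_r_zx[z][xp][r] * theta[z] for z in range(K))

    score = 0.0
    for r in ratings:                    # expectation over p(r|x,theta_hat)
        w = p_rating(x, theta_hat, r)
        theta_upd = update_model(x, r)
        entropy = 0.0
        for xp in items:                 # prediction entropy under theta_{|x,r}
            for rp in ratings:
                p = p_rating(xp, theta_upd, rp)
                if p > 0.0:
                    entropy -= p * math.log(p)
        score += w * entropy
    return score
```

Note that, unlike the Bayesian method, this criterion is evaluated under the single currently estimated model θ, which is exactly the limitation discussed above.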
Figure 1: MAE results for four different active learning algorithms for collaborative filtering, namely the random method, the Bayesian method, the model entropy method, and the prediction entropy method. The database is 'MovieRating' and the number of training users is 200. The smaller the MAE, the better the performance.

[Figure 2: corresponding MAE plot for the 'EachMovie' database; y-axis: MAE (1.00 to 1.18), x-axis: number of feedbacks (1 to 6).]
5. Conclusion

In this paper, we proposed a full Bayesian treatment of active learning for collaborative filtering. Different from previous studies of active learning for collaborative filtering, this method takes into account the model distribution when computing the expected loss. In order to alleviate the computational complexity, we approximate the model posterior with a simple Dirichlet distribution. As a result, the estimated loss can be computed analytically. Furthermore, we propose using the approximated prior distribution to simplify the computation of the updated model, which is required for every item and every possible rating category. As the current Bayesian version of active learning for collaborative filtering focuses only on the model quality, we will extend this work to the prediction error, which is usually more informative according to previous studies of active learning.
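To make the "computed analytically" claim in the conclusion concrete: for a Dirichlet-distributed multinomial p ~ Dir(α) with A = Σ_k α_k, the expected entropy has the standard closed form E[H(p)] = ψ(A+1) − Σ_k (α_k/A) ψ(α_k+1), where ψ is the digamma function. The sketch below is an illustration, not the paper's implementation; the helper names are hypothetical, and digamma is approximated numerically from log-gamma to stay dependency-light. It checks the closed form against a Monte Carlo estimate.

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    # Numerical digamma via a central difference of log-gamma (illustration only).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def expected_entropy_dirichlet(alpha):
    """E_{p ~ Dir(alpha)}[H(p)] = psi(A + 1) - sum_k (alpha_k / A) * psi(alpha_k + 1)."""
    A = sum(alpha)
    return digamma(A + 1) - sum(a / A * digamma(a + 1) for a in alpha)

# Monte Carlo sanity check of the closed form.
rng = np.random.default_rng(1)
alpha = [2.0, 3.0, 5.0]
samples = rng.dirichlet(alpha, size=200_000)
mc_estimate = float(-(samples * np.log(samples)).sum(axis=1).mean())
closed_form = expected_entropy_dirichlet(alpha)
```

Because the expected loss reduces to a handful of digamma evaluations, no sampling over the model posterior is needed at query-selection time, which is what makes the Bayesian criterion tractable for every candidate item and rating category.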