SlideShare una empresa de Scribd logo
1 de 9
A Bayesian Approach toward Active Learning for Collaborative Filtering



                                     Rong Jin                                Luo Si

                                                                 Language Technology Institute
                        Department of Computer Science            School of Computer Science
                           Michigan State University              Carnegie Mellon University
                              rong@cse.cmu.edu                       Pittsburgh, PA 15232
                                                                        lsi@cs.cmu.edu



                         Abstract                             the fact that contents are difficult for a computer to
                                                              analyze (e.g. music and videos). One of the key issues in
     Collaborative filtering is a useful technique for        the collaborative filtering is to identify the group of users
     exploiting the preference patterns of a group of         who share the similar interests as the active user. Usually,
     users to predict the utility of items for the active     the similarity between users are measured based on their
     user. In general, the performance of collaborative       ratings over the same set of items. Therefore, to
     filtering depends on the number of rated                 accurately identify users that share similar interests as the
     examples given by the active user. The more the          active user, a reasonably large number of ratings from the
     number of rated examples given by the active             active user are usually required. However, few users are
     user, the more accurate the predicted ratings will       willing to provide ratings for a large amount of items.
     be. Active learning provides an effective way to         Active learning methods provide a solution to this
     acquire the most informative rated examples              problem by acquiring the ratings from an active user that
     from active users. Previous work on active               are most useful in determining his/her interests. Instead of
     learning for collaborative filtering only considers      randomly selecting an item for soliciting the rating from
     the expected loss function based on the estimated        the active user, for most active learning methods, items
     model, which can be misleading when the                  are selected to maximize the expected reduction in the
     estimated model is inaccurate. This paper takes          predefined loss function. The commonly used loss
     one step further by taking into account of the           functions include the entropy of model distribution and
     posterior distribution of the estimated model,           the prediction error. In the paper by Yu et. al. (2003), the
     which results in more robust active learning             expected reduction in the entropy of the model
     algorithm. Empirical studies with datasets of            distribution is used to select the most informative item for
     movie ratings show that when the number of               the active user. Boutilier et. al. (2003) applies the metric
     ratings from the active user is restricted to be         of expected value of utility to find the most informative
     small, active learning methods only based on the         item for soliciting the rating, which is essentially to find
     estimated model don’t perform well while the             the item that leads to the most significant change in the
     active learning method using the model                   highest expected ratings.
     distribution achieves substantially better
     performance.                                             One problem with the previous work on the active
                                                              learning for collaborative filtering is that computation of
                                                              expected loss is based only on the estimated model. This
1.   Introduction                                             can be dangerous when the number of rated examples
                                                              given by the active user is small and as a result the
The rapid growth of the information on the Internet           estimated model is usually far from being accurate. A
demands intelligent information agent that can sift           better strategy for active learning is to take into account of
through all the available information and find out the        the model uncertainty by averaging the expected loss
most valuable to us. Collaborative filtering exploits the     function over the posterior distribution of models. With
preference patterns of a group of users to predict the        the full Bayesian treatment, we will be able to avoid the
utility of items for an active user. Compared to content-     problem caused by the large variance in the model
based filtering approaches, collaborative filtering systems   distribution. Many studies have been done on the active
have advantages in the environments where the contents        learning to take into account of the model uncertainty.
of items are not available due to either a privacy issue or   The method of query by committee (Seung et. al., 1992;
Freud et. al., 1996) simulates the posterior distribution of       preference factors (Hofmann & Puzicha 1999; Hofmann,
models by constructing an ensemble of models and the               2003). The latent class variable z ∈ Z = {z1 , z 2 ,....., z K }
example with the largest uncertainty in prediction is              is associated with each pair of a user and an item. The
selected for user’s feedback. In the work by Tong and              aspect model assumes that users and items are
Koller (2000), a full Bayesian analysis of the active              independent from each other given the latent class
learning for parameter estimation in Bayesian Networks is          variable. Thus, the probability for each observation tuple
used, which takes into account of the model uncertainty in          ( x, y, r ) is calculated as follows:
computing the loss function. In this paper, we will apply
the full Bayesian analysis to the active learning for
collaborative filtering. Particularly, in order to simplify
                                                                              p (r | x, y ) =     ∑ p (r | z , x ) p ( z | y )
                                                                                                  z∈Z
                                                                                                                                                           (1)

the computation, we approximate the posterior                      where p(z|y) stands for the likelihood for user y to be in
distribution of model with a simple Dirichlet distribution,        class z and p(r|z,x) stands for the likelihood of assigning
which leads to an analytic expression for the expected             item x with rating r by users in class z. In order to achieve
loss function.                                                     better performance, the ratings of each user are
                                                                   normalized to be a normal distribution with zero mean
The rest of the paper is arranged as follows: Section 2
                                                                   and variance as 1 (Hofmann, 2003). The parameter p(r|
describes the related work in both collaborative filtering
                                                                   z,x) is approximated as a Gaussian distribution N ( µ z , σ z )
and active learning. Section 3 discusses the proposed
                                                                   and p(z|y) as a multinomial distribution.
active learning algorithm for collaborative filtering. The
experiments are explained and discussed in Section 4.              Personality diagnosis approach treats each user in the
Section 5 concludes this work and the future work.                 training database as an individual model. To predicate the
                                                                   rating of the active user on certain items, we first compute
                                                                   the likelihood for the active user to be in the ‘model’ of
2.    Related Work                                                 each training use. The ratings from the training user on
In this section, we will first briefly discuss the previous        the same items are then weighted by the computed
work on collaborative filtering, followed by the previous          likelihood. The weighted average is used as the estimation
work on active filtering. The previous work on active              of ratings for the active user. By assuming that the
learning for collaborative filtering will be discussed at the      observed rating for the active user y’ on an item x is
end of this section.                                               assumed to be drawn from an independent normal
                                                                                                                         True
                                                                   distribution with the mean as the true rating as R y ' ( x) ,
2.1    Previous Work on Collaborative Filtering                    we have

Most collaborative filtering methods fall into two                                                                − ( R y ' ( x ) − RTrue ( x ))2 2σ 2     (2)
                                                                     p ( R y ' ( x) | R True ( x)) ∝ e
                                                                                        y'
                                                                                                                                     y'

categories: Memory-based algorithms and Model-based
algorithms (Breese et al. 1998). Memory-based                      where the standard deviation σ is set to be constant 1 in
algorithms store rating examples of users in a training            our experiments. The probability for the active user y’ to
database. In the predicating phase, they predict the ratings       be in the model of any user y in the training database can
of an active user based on the corresponding ratings of the        be written as:
users in the training database that are similar to the active
                                                                                               ∏e
                                                                                                           − ( Ry ( x ) − R y ' ( x )) 2 2σ 2
user. In contrast, model-based algorithms construct                          p( y'| y) ∝                                                                   (3)
models that well explain the rating examples from the                                        x∈X ( y t )
training database and apply the estimated model to predict         Finally, the likelihood for the active user y’ to rate an
the ratings for active users. Both types of approaches             unseen item x as r is computed as:
have been shown to be effective for collaborative
filtering. In this subsection, we focus on the model-based
collaborative filtering approaches, including the Aspect
                                                                     p( R y ' ( x) = r ) ∝   ∑ p( R
                                                                                              y
                                                                                                             y'    | R y )e
                                                                                                                              − ( R y ( x ) − r ) 2 2σ 2
                                                                                                                                                           (4)
Model (AM), the Personality Diagnosis (PD) and the
Flexible Mixture Model (FMM).                                      Previous empirical studies have shown that the
                                                                   personality diagnosis method is able to outperform
For the convenience of discussion, we will first introduce         several other approaches for collaborative filtering
the      annotation.            Let     items     denoted     by   (Pennock et al., 2000).
 X = {x1 , x 2 ,......, x M } ,      users      denoted       by
Y = { y1 , y 2 ,......, y N } , and the range of ratings denoted   Flexible Mixture Model introduces two sets of hidden
by {1,..., R} . A tuple ( x, y, r ) means that rating r is         variables {z x , z y } , with z x for the class of items and z y
assigned to item x by user y . Let X ( y ) denote the set          for the class of users (Si and Jin, 2003; Jin et. al., 2003).
of items rated by user y, and R y (x) stand for and the            Similar to aspect model, the probability for each observed
rating of item x by user y, respectively.                          tuple ( x, y, r ) is factorized into a sum over different
                                                                   classes for items and users, i.e.,
Aspect model is a probabilistic latent space model, which
models individual preferences as a convex combination of


2
p ( x, y , r ) =    ∑ p( z x ) P( z y ) P( x | z x ) P( y | z y ) P(r | z x , z y )   (5)                                        Correct Decision Boundary
                    zx , z y
                                                                                                  f1
where parameter p(y|zy) stands for the likelihood for user




                                                                                                    Estimated Decision Boundary
y to be in class zy, p(x|zx) stands for the likelihood for item
x to be in class zx, and p(r|zx, zy) stands for the likelihood
for users in class zy to rate items in class zx as r. All the
parameters are estimated using Expectation Maximization
algorithm (EM) (Dumpster et. al., 1976). The multiple-
cause vector quantization (MCVQ) model (Boutilier and
Zemel, 2003) uses the similar idea for collaborative
filtering.

2.2        Previous Work on Active Learning
The goal of active learning is to learn the correct model
using only a small number of labeled examples. The                                                                                                            f2
general approach is to find the example from the pool of
unlabeled data that gives the largest reduction in the                                        Figure 1: A learning scenario when the estimated
expected loss function. The loss functions used by most                                       model distribution is incorrect. The vertical line
active learning methods can be categorized into two                                           corresponds t the correct decision boundary and the
groups: the loss functions based on model uncertainty and                                     horizontal line corresponds to the estimated decision
the loss function based on prediction errors. For the first                                   boundary.
type of loss functions, the goal is to achieve the largest
reduction ratio in the space of hypothesis. One commonly                                     examples, the most likely decision boundary is the
used loss function is the entropy of the model distribution.                                 horizontal line (i.e., the dash line) while the true decision
Methods within this category usually select the example                                      boundary is a vertical line (i.e., the dot line). If we only
for which the model has the largest uncertainty in                                           rely on the estimated model for computing the expected
predicting the label (Seung et. al., 1992; Freud et. al,                                     loss, the examples that will be selected for user’s
1997; Abe and Mamitsuka, 1998; Campbell et. al., 2000;                                       feedback are most likely from the dot-shaded areas, which
Tong and Koller, 2002). The second type of loss functions                                    will not be very effective in adjusting the estimated
involved the prediction errors. The largest reduction in the                                 decision boundary (i.e. the horizontal line) to the correct
volume of hypothesis space may not necessary be                                              decision boundary (i.e. the vertical. On the other hand, if
effective in cutting the prediction error. Freud et. al.                                     we can use the model distribution for estimating the
showed an example in (1997) , in which the largest                                           expected loss, we will be able to adjust the decision
reduction in the space of hypothesis does not bring the                                      boundary more effectively since decision boundaries
optimal improvement to the reduction of prediction errors.                                   other than the estimated one (i.e. horizontal line) have
Empirical studies in Bayesian Network (Tong and Koller,                                      been considered in the computation of expected loss.
2000) and text categorization (Roy and MaCallum, 2001)                                       There have been several studies on active learning that
have shown that using the loss function that directly                                        use the posterior distribution of models for estimating the
targets on the prediction error is able to achieve better                                    expected loss. The query by committee approach
performance than the loss function that is only based on                                     simulates the model distribution by sampling a set of
the model uncertainty. The commonly used loss function                                       models out of the posterior distribution. In the work of
within this category is the entropy of the distribution of                                   active learning for parameter estimation in Bayesian
predicted labels.                                                                            Network, a Dirichlet distribution for the parameters is
                                                                                             used to estimate the change in the entropy function.
In addition to the choice of loss function, how to estimate                                  However, the general difficulty with the full Bayesian
the expected loss is another important issue for active                                      analysis for active learning is the computational
learning. Many active learning methods compute the                                           complexity. For complicated models, usually it is rather
expected loss only based on the currently estimated model                                    difficult to obtain the model posterior distribution, which
without taking into account of the model uncertainty.                                        leads to the sampling approaches such as Markov Chain
Even though this simple strategy works fine for many                                         Mote Carlo (MCMC) and Gibbs sampling. In this paper,
applications, it can be very misleading, particularly when                                   we follow an idea similar to the variational approach
the estimated model is far from the true model. As an                                        (Jordan et al., 1999). Instead of accurately computing the
example, considering learning a classification model for                                     posteriors, we approximate the posterior distribution with
the data distribution in Figure 1, where spheres represent                                   an analytic solution and apply it to estimate the expected
data points of one class and stars represent data points of                                  loss. Compared to the sampling approaches, this approach
the other class. The four labeled examples are highlighted                                   substantially simplifies the computation by avoiding
by the line-shaded ellipsis. Based on these four training



3
generating a large number of models and calculating the
loss function values of all the generated models.
                                                                           θ* = arg max
                                                                                     θ∈Θ
                                                                                                   ∏       p (r | x, y ';θ )
                                                                                              x∈X ( y ')
                                                                                                                                 (6)
2.3    Previous Work on Active               Learning      for                = arg max
                                                                                     θ∈Θ
                                                                                                   ∏ ∑ p(r | z, x)θ z
                                                                                              x∈X ( y ') z∈Z
       Collaborative Filtering
There have been only few studies on active learning for                   Then, the posterior distribution of model,
collaborative filtering. In the paper by Kai et al (2003), a               p (θ | D( y ')) where D( y ') includes all the rated
method similar to Personality Diagnosis (PD) is used for                  items given the active user y’, can be written as:
collaborative filtering. Each example is selected for user’s
feedback in order to reduce the entropy of the like-                         p (θ | D( y ')) ∝ p ( D( y ') | θ) p (θ)
mindness distribution or p ( y | y ') in Equation (3)). In the
paper by Boutilier et al (2003), a method similar to the
                                                                                 ≈     ∏ ∑ p(r | z, x)θ z                        (7)
                                                                                     x∈X ( y ') z∈Z
Flexible Mixture Model (FMM) named ‘multiple-cause
vector quantification’ is used for collaborative filtering.               In above, a uniform prior distribution is applied
Unlike most other collaborative filtering research, which                 and is simply ignored. The posterior in (7) is
try to minimize the prediction error, this paper only                     rather difficult to be used for estimating the
concerns with the items that are strongly recommended by                  expected loss due to the multiple products and lack
the system. The loss function is based on the expected                    of normalization.
value of information (EVOI), which is computed based on
the currently estimated model. One problem with these
two approaches is that, the expected loss is computed only                To approximate the posterior distribution, we will
based on a single model, namely the currently estimated                   consider the expansion of the posterior function
model. This is especially important, when the number of                   around the maximum point. Let θ* stand for the
rated examples given by the active user is small and                      maximal parameters that are obtained from EM
meanwhile the number of parameters to be estimated is                     algorithm by maximizing Equation (6). Let’s
large. In the experiments, we will show that the selection                consider the ratio of log p (θ | D( y ')) with respect
based on the expected loss function with only the                         to log p (θ* | D( y ')) , which can be written as:
currently estimated model can be even worse than simple
random selection.                                                       p (θ | S )         ∑ p(ri | xi , z )θz
                                                                  log                = ∑log z
                                                                        p (θ * | S )   i   ∑ p(ri | xi , z )θz*
                                                                                                       z
3.    A Bayesian Approach Toward                      Active
                                                                                                     p (ri | xi , z )θz*
                                                                                                                            θ
      Learning for Collaborative Filtering                                           ≥ ∑∑                                log z   (8)
                                                                                         i     z    ∑ p(ri | xi , z )θz*' θz*
                                                                                                      z'
                                                                                                                     '


       In this section, we will first discuss the                                                                θz
       approximated analytic solution for the posterior                              = ∑(a z −1) log
                                                                                                                 θz*
       distribution. Then, the approximated posterior                                    z

       distribution will be used to estimate the expected
       loss.                                                              where the variable α z is defined as

                                                                                              p( ri | xi , z )θ z*
3.1    Approximate the Model Posterior Distribution                            αz = ∑                                +1          (9)
                                                                                        i    ∑ p(ri | xi , z ')θ z*'
                                                                                              z'
       For the sake of simplicity, we will focus on the
       active learning of aspect model for collaborative                  Therefore, according to Equation (8), the
       filtering. However, the discussion in this section                 approximated posterior distribution p(θ | D( y '))
       can be easily extended to other models for                         is a Dirichlet with hyper parameters α z defined in
       collaborative filtering.                                           Equation (9), or

       As described in Equation (1), each conditional                                          Γ(α * )                           (10
                                                                             p (θ | D) ≈                 ∏ z θ α z −1
       likelihood p (r | x, y ) is decomposed as the sum of                                   ∏z Γ(α z )                          )
        ∑ z p(r | z, x) p( z | y) . For the active user y’, the
       most important task is to determine its user type,
                                                                          where α = ∑ z α z . Furthermore, it is easy to see
                                                                                 *
       or p ( z | y ') . Let θ stands for the parameter space
       θ = {θ z = p ( z | x)}z . Usually, parameters θ are                                           *
                                                                                             αz −1 θz                            (11
       estimated through the maximum likelihood                                                    = *                            )
       estimation, i.e.                                                                      αz' −1 θz'




4
The above relation is true because the optimal                                  optimization problem in Equation (13) is not
      solution obtained by EM algorithm is actual fixed                               appropriate for active learning of collaborative
      point of the following equation:                                                filtering.

                       p( ri | xi , z )θ z*                                           In the ideal case, if the true model θtrue is given,
           θ z* = ∑                                                       (12
                  i   ∑ p(ri | xi , z ')θ z*'                              )          the optimal strategy in selecting example should
                       z'                                                             be:
      Based on the property in Equation (11), we have                                                                                                                                       (14
      the optimal point derived from the approximated                                x* = arg min −
                                                                                                     x∈X
                                                                                                                      ∑θ     z
                                                                                                                                   true
                                                                                                                                   z      log θ z|x ,r
                                                                                                                                                                p ( r|x ,θtrue )             )
      Dirichlet distribution is exactly θ* , which is the
      optimal solution found by EM algorithm.                                         The goal of the optimization in Equation (14) is to
                                                                                      find an example such that by adding the example
3.2   Estimate the Expected Loss                                                      to the pool of rated examples, the updated model
                                                                                       θ| x ,r is adjusted toward the true model θtrue .
      Similar to many other active learning work (Tong                                Unfortunately, the true model is unknown. One
      and Koller, 2000; Seung et al, 1992; Kai et al,                                 way to approximate the optimization goal in
      2003), we use the entropy of the model function as                              Equation (14) is to replace the θtrue with the
      the loss function. Many previous studies of active                              expectation over the posterior distribution
      learning try to find the example that directly                                   p (θ | D( y ')) . Therefore, Equation (14) will be
      minimizes the entropy of the model parameters                                   transformed into the following optimization
      with currently estimated model. In the case of                                  problem:
      aspect model, it can be formulated as the following
                                                                                                                                                                                            (15
      optimization problem:                                                      x* = arg min −
                                                                                              x∈X
                                                                                                                   ∑θ   z
                                                                                                                               '
                                                                                                                               z   log θ z|x ,r
                                                                                                                                                       p ( r| x ,θ ')   p ( θ '|D ( y ))
                                                                                                                                                                                             )
                                                                          (13
      x = arg min −
       *

            x∈X
                      ∑θ    z   z| x , r   log θ z|x ,r
                                                          p ( r| x ,θ )    )          Now, the key issue is how to compute the
                                                                                      integration efficiently. In effect, in our case the
      where θ denotes the currently estimated                                         integration can be computed analytically as:
      parameters and θ z|x ,r denotes the optimal
      parameter that are estimated using one additional
      rated example, i.e., example x is rated as r. The
                                                                                    ∑θz
                                                                                              '
                                                                                              z   log θ z| x ,r
                                                                                                                     p ( r | x ,θ ' )   p (θ ' | D ( y ))
      basic idea of Equation (13) is that, we would like
      to find the example x such that the expected                              = ∑ r ∫ p(θ ' | D( y )) p(r | x, θ ' )∑ j θ 'j log θ j|r , x dθ '
      entropy of distribution θ is minimized. There are
      two problems with this simple strategy. The first                         = ∑ r ∑ i ∑ j p( r | x, i )θi'θ 'j log θ j|r , x                                                                  (16
                                                                                                                                                                        p (θ ' | D ( y ))
      problem comes from the fact that the expected                                                                                                                                                )
      entropy is computed only based on the currently                                             ∑       a j log θ j|r , x ∑ i ai p(r | x, i )
      estimated model. As indicated in Equation (13),
                                                                                =    ∑    r
                                                                                                      j

                                                                                                                     a* (a* + 1)
      the expectation is evaluated over the distribution
       p(r | x, θ) . As already discussed in Section 2, this
                                                                                    + ∑r
                                                                                                  ∑       j
                                                                                                              a j log θ j|r , x p(r | x, j )
      simple strategy could be misleading particularly                                                            a* (a* + 1)
      when the currently estimated model is far from the
      true model. The second problem with the strategy                                where the function Ψ is the digamma function.
      stated in Equation (13) comes from the                                          The details of the derivation can be found in the
      characteristics of collaborative filtering. Recall                              appendix. With the analytic solution for the
      that the parameter θ z stand for the likelihood for                             integration in (16), we can simply go through each
      the active user to be in the user class z, what the                             item in the set and find the one that minimizes the
      optimization in Equation (13) tries to achieve is to                            loss function specified in Equation (14).
      find the most efficient way to reduce the
      uncertainty in the type of user class for the active
                                                                                      Finally, the computation involves the estimation of
      user. However, according to previous studies in
                                                                                      the updated model θ|r , x . The standard way to
      collaborative filtering (Si and Jin, 2003), most
                                                                                      obtain the updated model is simply to rerun the
      users should be of multiple types, instead of single
                                                                                      full EM algorithm with one more additional rated
      types. Therefore, we would expect to the
                                                                                      example (i.e., example x is rated as category r).
      distribution θ be of multiple modes. Apparently,
                                                                                      This computation can be extremely expensive
      this is inconsistent with the optimization goal in
                                                                                      since we will have an updated model for every
      Equation (13). Based on the above analysis, the
                                                                                      item and every possible rating category. A more


5
efficient way to compute the updated model is to                        Table 1: Characteristics of MovieRating and EachMovie.
       avoid the directly optimization of Equation (6).                                                    MovieRating      EachMovie
       Instead, we use the following approximation,                              Number of Users                           500                    2000
                                                                                 Number of Items                          1000                    1682
     θ* = arg max p (r | x, θ)
             θ∈Θ
                                    ∏           p (r | x ', y '; θ)
                                                                             Avg. # of rated Items/User                   87.7                    129.6
                                 x '∈X ( y ')
                                                                                Number of Ratings                           5                       6
        = arg max p (r | x, θ) p (θ | D( y '))
             θ∈Θ                                                      (17)
        ≈ arg max p (r | x, θ) Dirichlet ({α z })
             θ∈Θ                                                                       utilizing the model distribution and see which one is
                                                                                       more effective for collaborative filtering. The details
        ≈ arg max ∑ p (r | x, z ')θ z ' ∏             α
                                                    θ z z −1
             θ∈Θ
                                                                                       of these two active learning methods will be
                       z'                       z
                                                                                       discussed later.
       The above Equation (17) can be again optimized
       using EM algorithm with updating equation as:                           4.1        Experiment Design

                   p (ri | xi , z )θ z + α z − 1                               Two datasets of movie ratings are used in our
          θz =                                                        (18
                                                                               experiments, i.e., ‘MovieRating’1 and ‘EachMovie’2. For
                 α + ∑ ( p(ri | xi , z ')θ z ' − 1)
                   *
                                                                       )
                            z'
                                                                               ‘EachMovie’, we extracted a subset of 2,000 users with
                                                                               more than 40 ratings. The details of these two datasets are
       The advantage of the updating equation in (18)
                                                                               listed in Table 1. For MovieRating, we took the first 200
       versus the more general EM updating equation in
       (12) is that Equation (18) only relies on the rating                    users for training and all the others for testing. For
       r for item x while Equation (12) has to go through                      EachMovie, the first 400 users were used for training. For
       all the items that have been rated by the active                        each test user, we randomly selected three items with their
       user. Therefore, the computation cost of (18) is                        ratings as the starting seed for the active learning
       significantly less than that of Equation (12).                          algorithm to build the initial model. Furthermore, twenty
                                                                               rated items were reserved for each test user for evaluating
       In summary, in this section, we propose a                               the performance. Finally the active learning algorithm
       Bayesian approach for active learning of                                will have chance to ask for ratings of five different items
       collaborative filtering, which uses the model                           and the performance of the active learning algorithms is
       posterior distribution to compute the estimation of                     evaluated for each feedback. For simplicity, in this paper,
       loss function. To simplify the computation, we                          we assume that the active user will always be able to rate
       approximate the mode distribution with a Dirichlet                      the items presented by the active learning. Of course, as
       distribution. As a result, the expected loss can be                     pointed out in paper (Kai et al, 2003), this is not a correct
       calculated analytically. Furthermore, we also                           assumption because there are some items that the active
       propose an efficient way to compute the updated                         user might never see before and therefore it is impossible
       model by approximating the prior with a Dirichlet                       for him to rate those items. However, in this paper, we
       distribution. For easy reference, we call this                          will mainly focus on the behavior of active learning
       method ‘Bayesian method’.                                               algorithms for collaborative filtering and leave this issue
                                                                               for future work.
                                                                               The aspect model is used for collaborative filtering as
4.   Experiments
                                                                               already described in the Section 2. The number of user
 In this section, we will present experiment results in                        classes used in the experiment is 5 for MovieRating
order to address the following two questions:                                  dataset and 10 for EachMovie datasets.
1) Whether the proposed active learning algorithm is                           The evaluation metric used in our experiments was the
   effective for collaborative filtering? In the                               mean absolute error (MAE), which is the average absolute
   experiment, we will compare the proposed algorithm                          deviation of the predicted ratings to the actual ratings on
   to the method of randomly acquiring examples from                           items the test user has actually voted.
   the active user.                                                                                     1                           ^
                                                                                              MAE =            ∑ | r(l ) − R y( l ) ( x(l ) ) |           (19)
2) How important is the full Bayesian treatment? In this                                               LTest    l
   paper, we emphasize the importance of taking into                           where LTest is the number of the test ratings.
   account the model uncertainty using the posterior
   distribution. To illustrate this point, in this
   experiment, we will compare the proposed algorithm
                                                                               1
   to two commonly used active learning methods that                               http://www.cs.usyd.edu.au/~irena/movie_data.zip
   are only based on the estimated model without                               2
                                                                                   http://research.compaq.com/SRC/eachmovie



6
In this experiment, we consider three different baseline                                          given by the active user is only three, which is relatively
models for active learning of collaborative filtering:                                            small given the number of parameters to determine is 10.
                                                                                                  Therefore, using the estimated model to compute the
1) Random Selection: This method simply randomly                                                  expected loss will be inaccurate for finding the most
selects one item out of the pool of items for user’s                                              informative item to rate. Furthermore, comparing the
feedback. We refer this simple method as ‘random                                                  results for the ‘model entropy method’ to the results for
method’.                                                                                          the ‘prediction entropy method’, we see that the model
2) Model Entropy based Sample Selection. This is the                                              entropy method usually performs even worse. This can be
method that has been described at the beginning of                                                explained by the fact that the by minimize thing entropy
Section 3.2. The idea is to find the item that is able to                                         of the distribution of the user classes for the active user,
reduce the entropy of the distribution of the active user                                         the model entropy method tries to narrow down a single
being in different user classes (in Equation (13)). As                                            user type for the active user. However, most users are of
aforementioned, the problems with this simple selection                                           multiple types. Therefore, it is inappropriate to apply the
strategy are in two aspects: computing the expected loss
only based on the currently estimated model and being                                                                         MovieRating
conflict with the intuition that each user can be of
multiple user classes. This simple method shares many                                                         0.91
similarities as the proposed method except without                                                                                          random
considering the model distribution. By comparing it to the                                                                                  Bayesian
proposed active learning method for collaborative                                                            0.88                           model
filtering, we will be able to see if the idea of using model
                                                                                                                                            prediction
distribution for computing expected loss is worthwhile for
collaborative filtering. For easy reference, we will call                                              MAE   0.85
this method ‘model entropy method’.
3) Prediction Entropy based Sample Selection. Unlike the                                                     0.82
previous method, which concerns with the uncertainty in
the distribution of user classes, this method is focused on
the uncertainty in the prediction of ratings for items. It                                                   0.79
selects the item that is able to reduce the entropy of the                                                           1    2       3    4    5        6
prediction most effectively. Formally, the selection                                                                      num ber of feedback
criterion can be formulated as the following optimization
problem:
                                                                                                  Figure 1: MAE results for four different active learning
                                                                                                  algorithms for collaborative filtering, namely the random
 x = arg min −
    *

        x∈X
                 ∑ p(r | x ', θ
                 x'
                                  | x ,r   ) log p ( r | x ', θ| x ,r )                    (20)
                                                                                                  method, the Bayesian method, the model entropy method,
                                                                          p ( r | x ,θ )
                                                                                                  and the prediction entropy method. The database is
where notation θ|r , x stands for the updated model using
                                                                                                  ‘MovieRating’ and the number of training user is 200. The
the extra rated example (r,x,y). Similar to the previous
                                                                                                  smaller MAE the better performance
method, this approach uses the estimated model for
computing the expected loss. However, unlike the                                                  model entropy method to the active learning for
previous approach which tries to make the distribution of                                         collaborative filtering. On the other hand, the prediction
user types for the active user skew, this approach targets                                        entropy method doesn’t have this defect because it
on the prediction distribution. Therefore, this method is                                         focuses on minimizing the uncertainty in the prediction
not against the intuition that each user is of multiple                                           instead of the uncertainty in the distribution of user
types. We will refer this method as ‘prediction entropy                                           classes.
method’.
                                                                                                  The second observation from Figure 1 and 2 is that the
                                                                                                  proposed active learning method performs better than any
4.2     Results and Discussion
                                                                                                  of the three based line models for both ‘MovieRating’ and
The results for the proposed active learning algorithm                                            ‘EachMovie’. The most important difference between the
together with the three baseline models for database                                              proposed method and the other methods for active
‘MovieRating’ and ‘EachMovie’ are presented in Figure 1                                           learning is that the proposed method takes into account
and 2, respectively.                                                                              the model distribution, which makes it robust when even
                                                                                                  there is only three rated items given by the active user.
The first thing noticed from this two diagrams is that,
both the ‘model entropy method’ and the ‘prediction
entropy method’ performs consistently worse than the                                              5.         Conclusion
simply ‘random method’. This phenomenon can be
explained by the fact that the initial number of rated items



7
Freund, Y. Seung, H. S., Shamir, E. and Tishby. N.
                         Each Movie                              (1997) Selective sampling using the query by committee
                                                                 algorithm. Machine Learning
       1.18
                                                                Hofmann, T., & Puzicha, J. (1999). Latent class models
       1.15
                                                                 for collaborative filtering. In Proceedings of
                                                                 International  Joint   Conference     on     Artificial
       1.12                                                      Intelligence.
                                                                Hofmann, T. (2003). Gaussian latent semantic models for
       1.09
                                                                 collaborative filtering. In Proceedings of the 26th
       1.06
                                                                 Annual Int’l ACM SIGIR Conference on Research and
                                                                 Development in Information Retrieval.
       1.03                                                     Jin, R., Si, L., Zhai., C.X. and Callan J. (2003).
                                                                  Collaborative filtering with decoupled models for
         1
                                                                  preferences and ratings. In Proceedings of the 12th
              1      2      3      4      5      6
                                                                  International Conference on Information and
                    num ber of feedbacks                          Knowledge Management.

Figure 1: MAE results for four different active learning        Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul,
                                                                  L. K. (1999). An introduction to variational methods for
algorithms for collaborative filtering, namely the random
                                                                  graphical models. In M. I. Jordan (Ed.), Learning in
method, the Bayesian method, the model entropy method,
                                                                  graphical Models, Cambridge: MIT Press.
and the prediction entropy method. The database is
‘MovieRating’ and the number of training user is 200. The       Roy, N. and McCallum A. (2001). Toward Optimal
smaller MAE the better performance                               Active Learning through Sampling Estimation of Error
                                                                 Reduction. In Proceedings of the Eighteenth
 In this paper, we proposed a full Bayesian treatment of         International Conference on Machine Learning.
 active learning for collaborative filtering. Different from
 previous studies of active learning for collaborative          Pennock, D. M., Horvitz, E., Lawrence, S., and Giles, C.
 filtering, this method takes into account the model             L. (2000). Collaborative filtering by personality
 distribution when computing the expected loss. In order to      diagosis: a hybrid memory- and model-based approach.
 alleviate the computation complexity, we approximate the        In Proceedings of the Sixteenth Conference on
 model posterior with a simple Dirichlet distribution. As a      Uncertainty in Artificial Intelligence.
 result, the estimated loss can be computed analytically.       Seung, H. S., Opper, M. and Sompolinsky, H. (1992).
 Furthermore, we propose using the approximated prior            Query by committee. In Computational Learing Theory,
 distribution to simplify the computation of updated             pages 287—294.
 model, which is required for every item and every
 possible rating category. As the current Bayesian version      Si, L. and Jin, R. (2003). Flexible mixture model for
 of active learning for collaborative filtering only focus on     collaborative filtering. In Proceedings of the Twentieth
 the model quality, we will extend this work to the               International Conference on Machine Learning.
 prediction error, which usually is more informative
                                                                Tong, S. and Koller, D. (2000). Active Learning for
 according to the previous studies of active learning.
                                                                 Parameter Estimation in Bayesian Networks. In
                                                                 Advances in Neural Information Processing Systems.
                                                                Yu, K., Schwaighofer, A., Tresp, V., Ma, W.-Y. and
 References                                                      Zhang, H. J. (2003). Collaborative ensemble learning:
 Breese, J. S., Heckerman, D., Kadie C., (1998). Empirical       combining collaborative and content-based Information
  Analysis of Predictive Algorthms for Collaborative             filtering via hierarchical Bayes. In Proceedings of
  Filtering.     In the Proceeding of the Fourteenth             Nineteenth International Conference on Uncertainty in
  Conference on Uncertainty in Artificial Intelligence.          Artificial Intelligence.

 Boutilier, C., Zemel, R. S., and Marlin, B. (2003). Active     Campbell, C., Cistianini, N., and Smola, A. (2000). Query
  collaborative filtering. In the Proceedings of Nineteenth      Learning with Large Margin Classifiers. In Proceedings
  Conference on Uncertainty in Artificial Intelligence.          of 17th International Conference on Machine Learning.

 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977).          Abe, N. and H. Mamitsuka (1998). Query learning
  Maximum likelihood from incomplete data via the EM             strategies using boosting and bagging. In Proceedings
  algorithm. Journal of the Royal Statistical Society, B39:      of the Fifteenth International Conference on Machine
  1-38.                                                          Learning.




 8
9

Más contenido relacionado

La actualidad más candente

CrossScenarioTransferPersonReidentification_finalManuscript
CrossScenarioTransferPersonReidentification_finalManuscriptCrossScenarioTransferPersonReidentification_finalManuscript
CrossScenarioTransferPersonReidentification_finalManuscript
Xiaojuan (Kathleen) WANG
 
Binary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan ChenBinary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan Chen
Xuan Chen
 
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Sherin Mathews
 
Current Approaches in Search Result Diversification
Current Approaches in Search Result DiversificationCurrent Approaches in Search Result Diversification
Current Approaches in Search Result Diversification
Mario Sangiorgio
 
Smartocracy Hicss2007
Smartocracy Hicss2007Smartocracy Hicss2007
Smartocracy Hicss2007
mjc1
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
sscdotopen
 

La actualidad más candente (20)

228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
CrossScenarioTransferPersonReidentification_finalManuscript
CrossScenarioTransferPersonReidentification_finalManuscriptCrossScenarioTransferPersonReidentification_finalManuscript
CrossScenarioTransferPersonReidentification_finalManuscript
 
Binary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan ChenBinary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan Chen
 
Modeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector QuantizationModeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector Quantization
 
Mixed Language Based Offline Handwritten Character Recognition Using First St...
Mixed Language Based Offline Handwritten Character Recognition Using First St...Mixed Language Based Offline Handwritten Character Recognition Using First St...
Mixed Language Based Offline Handwritten Character Recognition Using First St...
 
Chap17 additional topics in sampling
Chap17 additional topics in samplingChap17 additional topics in sampling
Chap17 additional topics in sampling
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
 
A new similarity measurement based on hellinger distance for collaborating fi...
A new similarity measurement based on hellinger distance for collaborating fi...A new similarity measurement based on hellinger distance for collaborating fi...
A new similarity measurement based on hellinger distance for collaborating fi...
 
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
 
Paper id 71201913
Paper id 71201913Paper id 71201913
Paper id 71201913
 
IJET-V3I1P1
IJET-V3I1P1IJET-V3I1P1
IJET-V3I1P1
 
Current Approaches in Search Result Diversification
Current Approaches in Search Result DiversificationCurrent Approaches in Search Result Diversification
Current Approaches in Search Result Diversification
 
IRJET- Deep Neural Network based Mechanism to Compute Depression in Socia...
IRJET-  	  Deep Neural Network based Mechanism to Compute Depression in Socia...IRJET-  	  Deep Neural Network based Mechanism to Compute Depression in Socia...
IRJET- Deep Neural Network based Mechanism to Compute Depression in Socia...
 
Smartocracy Hicss2007
Smartocracy Hicss2007Smartocracy Hicss2007
Smartocracy Hicss2007
 
Ct35535539
Ct35535539Ct35535539
Ct35535539
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Utilitas
UtilitasUtilitas
Utilitas
 
AN GROUP BEHAVIOR MOBILITY MODEL FOR OPPORTUNISTIC NETWORKS
AN GROUP BEHAVIOR MOBILITY MODEL FOR OPPORTUNISTIC NETWORKS AN GROUP BEHAVIOR MOBILITY MODEL FOR OPPORTUNISTIC NETWORKS
AN GROUP BEHAVIOR MOBILITY MODEL FOR OPPORTUNISTIC NETWORKS
 
Optimistic decision making using an
Optimistic decision making using anOptimistic decision making using an
Optimistic decision making using an
 
Intent-Aware Temporal Query Modeling for Keyword Suggestion
Intent-Aware Temporal Query Modeling for Keyword SuggestionIntent-Aware Temporal Query Modeling for Keyword Suggestion
Intent-Aware Temporal Query Modeling for Keyword Suggestion
 

Destacado

Application specific Programming of the Texas Instruments ...
Application specific Programming of the Texas Instruments  ...Application specific Programming of the Texas Instruments  ...
Application specific Programming of the Texas Instruments ...
butest
 
utdallas.edu
utdallas.eduutdallas.edu
utdallas.edu
butest
 
final report.doc
final report.docfinal report.doc
final report.doc
butest
 
Newsletter
NewsletterNewsletter
Newsletter
butest
 
ARDA-Insider-BAA03-0..
ARDA-Insider-BAA03-0..ARDA-Insider-BAA03-0..
ARDA-Insider-BAA03-0..
butest
 
Mahesh Joshi
Mahesh JoshiMahesh Joshi
Mahesh Joshi
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 

Destacado (9)

Application specific Programming of the Texas Instruments ...
Application specific Programming of the Texas Instruments  ...Application specific Programming of the Texas Instruments  ...
Application specific Programming of the Texas Instruments ...
 
utdallas.edu
utdallas.eduutdallas.edu
utdallas.edu
 
final report.doc
final report.docfinal report.doc
final report.doc
 
paper
paperpaper
paper
 
Newsletter
NewsletterNewsletter
Newsletter
 
ARDA-Insider-BAA03-0..
ARDA-Insider-BAA03-0..ARDA-Insider-BAA03-0..
ARDA-Insider-BAA03-0..
 
Mahesh Joshi
Mahesh JoshiMahesh Joshi
Mahesh Joshi
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 

Similar a uai2004_V1.doc.doc.doc

Main single agent machine learning algorithms
Main single agent machine learning algorithmsMain single agent machine learning algorithms
Main single agent machine learning algorithms
butest
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
IJERA Editor
 
Download
DownloadDownload
Download
butest
 
Download
DownloadDownload
Download
butest
 
CSE545_Porject
CSE545_PorjectCSE545_Porject
CSE545_Porject
han li
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Kishor Datta Gupta
 

Similar a uai2004_V1.doc.doc.doc (20)

A Collaborative Recommender System Based On Probabilistic Inference From Fuzz...
A Collaborative Recommender System Based On Probabilistic Inference From Fuzz...A Collaborative Recommender System Based On Probabilistic Inference From Fuzz...
A Collaborative Recommender System Based On Probabilistic Inference From Fuzz...
 
Df32676679
Df32676679Df32676679
Df32676679
 
Df32676679
Df32676679Df32676679
Df32676679
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
At4102337341
At4102337341At4102337341
At4102337341
 
Main single agent machine learning algorithms
Main single agent machine learning algorithmsMain single agent machine learning algorithms
Main single agent machine learning algorithms
 
Exploration exploitation trade off in mobile context-aware recommender systems
Exploration  exploitation trade off in mobile context-aware recommender systemsExploration  exploitation trade off in mobile context-aware recommender systems
Exploration exploitation trade off in mobile context-aware recommender systems
 
A Modified KS-test for Feature Selection
A Modified KS-test for Feature SelectionA Modified KS-test for Feature Selection
A Modified KS-test for Feature Selection
 
Costomization of recommendation system using collaborative filtering algorith...
Costomization of recommendation system using collaborative filtering algorith...Costomization of recommendation system using collaborative filtering algorith...
Costomization of recommendation system using collaborative filtering algorith...
 
Assessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-ComputingAssessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-Computing
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
 
A Solution To The Random Assignment Problem On The Full Preference Domain
A Solution To The Random Assignment Problem On The Full Preference DomainA Solution To The Random Assignment Problem On The Full Preference Domain
A Solution To The Random Assignment Problem On The Full Preference Domain
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
 
Kjartjo-lokaverkefni
Kjartjo-lokaverkefniKjartjo-lokaverkefni
Kjartjo-lokaverkefni
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
A Novel Collaborative Filtering Algorithm by Bit Mining Frequent Itemsets
A Novel Collaborative Filtering Algorithm by Bit Mining Frequent ItemsetsA Novel Collaborative Filtering Algorithm by Bit Mining Frequent Itemsets
A Novel Collaborative Filtering Algorithm by Bit Mining Frequent Itemsets
 
CSE545_Porject
CSE545_PorjectCSE545_Porject
CSE545_Porject
 
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
 

Más de butest

LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 
Download
DownloadDownload
Download
butest
 
resume.doc
resume.docresume.doc
resume.doc
butest
 

Más de butest (20)

LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 
Download
DownloadDownload
Download
 
resume.doc
resume.docresume.doc
resume.doc
 

uai2004_V1.doc.doc.doc

  • 1. A Bayesian Approach toward Active Learning for Collaborative Filtering Rong Jin Luo Si Language Technology Institute Department of Computer Science School of Computer Science Michigan State University Carnegie Mellon University rong@cse.cmu.edu Pittsburgh, PA 15232 lsi@cs.cmu.edu Abstract the fact that contents are difficult for a computer to analyze (e.g. music and videos). One of the key issues in Collaborative filtering is a useful technique for the collaborative filtering is to identify the group of users exploiting the preference patterns of a group of who share the similar interests as the active user. Usually, users to predict the utility of items for the active the similarity between users are measured based on their user. In general, the performance of collaborative ratings over the same set of items. Therefore, to filtering depends on the number of rated accurately identify users that share similar interests as the examples given by the active user. The more the active user, a reasonably large number of ratings from the number of rated examples given by the active active user are usually required. However, few users are user, the more accurate the predicted ratings will willing to provide ratings for a large amount of items. be. Active learning provides an effective way to Active learning methods provide a solution to this acquire the most informative rated examples problem by acquiring the ratings from an active user that from active users. Previous work on active are most useful in determining his/her interests. Instead of learning for collaborative filtering only considers randomly selecting an item for soliciting the rating from the expected loss function based on the estimated the active user, for most active learning methods, items model, which can be misleading when the are selected to maximize the expected reduction in the estimated model is inaccurate. This paper takes predefined loss function. The commonly used loss one step further by taking into account of the functions include the entropy of model distribution and posterior distribution of the estimated model, the prediction error. In the paper by Yu et. al. (2003), the which results in more robust active learning expected reduction in the entropy of the model algorithm. Empirical studies with datasets of distribution is used to select the most informative item for movie ratings show that when the number of the active user. Boutilier et. al. (2003) applies the metric ratings from the active user is restricted to be of expected value of utility to find the most informative small, active learning methods only based on the item for soliciting the rating, which is essentially to find estimated model don’t perform well while the the item that leads to the most significant change in the active learning method using the model highest expected ratings. distribution achieves substantially better performance. One problem with the previous work on the active learning for collaborative filtering is that computation of expected loss is based only on the estimated model. This 1. Introduction can be dangerous when the number of rated examples given by the active user is small and as a result the The rapid growth of the information on the Internet estimated model is usually far from being accurate. A demands intelligent information agent that can sift better strategy for active learning is to take into account of through all the available information and find out the the model uncertainty by averaging the expected loss most valuable to us. Collaborative filtering exploits the function over the posterior distribution of models. With preference patterns of a group of users to predict the the full Bayesian treatment, we will be able to avoid the utility of items for an active user. Compared to content- problem caused by the large variance in the model based filtering approaches, collaborative filtering systems distribution. Many studies have been done on the active have advantages in the environments where the contents learning to take into account of the model uncertainty. of items are not available due to either a privacy issue or The method of query by committee (Seung et. al., 1992;
  • 2. Freud et. al., 1996) simulates the posterior distribution of preference factors (Hofmann & Puzicha 1999; Hofmann, models by constructing an ensemble of models and the 2003). The latent class variable z ∈ Z = {z1 , z 2 ,....., z K } example with the largest uncertainty in prediction is is associated with each pair of a user and an item. The selected for user’s feedback. In the work by Tong and aspect model assumes that users and items are Koller (2000), a full Bayesian analysis of the active independent from each other given the latent class learning for parameter estimation in Bayesian Networks is variable. Thus, the probability for each observation tuple used, which takes into account of the model uncertainty in ( x, y, r ) is calculated as follows: computing the loss function. In this paper, we will apply the full Bayesian analysis to the active learning for collaborative filtering. Particularly, in order to simplify p (r | x, y ) = ∑ p (r | z , x ) p ( z | y ) z∈Z (1) the computation, we approximate the posterior where p(z|y) stands for the likelihood for user y to be in distribution of model with a simple Dirichlet distribution, class z and p(r|z,x) stands for the likelihood of assigning which leads to an analytic expression for the expected item x with rating r by users in class z. In order to achieve loss function. better performance, the ratings of each user are normalized to be a normal distribution with zero mean The rest of the paper is arranged as follows: Section 2 and variance as 1 (Hofmann, 2003). The parameter p(r| describes the related work in both collaborative filtering z,x) is approximated as a Gaussian distribution N ( µ z , σ z ) and active learning. Section 3 discusses the proposed and p(z|y) as a multinomial distribution. active learning algorithm for collaborative filtering. The experiments are explained and discussed in Section 4. Personality diagnosis approach treats each user in the Section 5 concludes this work and the future work. training database as an individual model. To predicate the rating of the active user on certain items, we first compute the likelihood for the active user to be in the ‘model’ of 2. Related Work each training use. The ratings from the training user on In this section, we will first briefly discuss the previous the same items are then weighted by the computed work on collaborative filtering, followed by the previous likelihood. The weighted average is used as the estimation work on active filtering. The previous work on active of ratings for the active user. By assuming that the learning for collaborative filtering will be discussed at the observed rating for the active user y’ on an item x is end of this section. assumed to be drawn from an independent normal True distribution with the mean as the true rating as R y ' ( x) , 2.1 Previous Work on Collaborative Filtering we have Most collaborative filtering methods fall into two − ( R y ' ( x ) − RTrue ( x ))2 2σ 2 (2) p ( R y ' ( x) | R True ( x)) ∝ e y' y' categories: Memory-based algorithms and Model-based algorithms (Breese et al. 1998). Memory-based where the standard deviation σ is set to be constant 1 in algorithms store rating examples of users in a training our experiments. The probability for the active user y’ to database. In the predicating phase, they predict the ratings be in the model of any user y in the training database can of an active user based on the corresponding ratings of the be written as: users in the training database that are similar to the active ∏e − ( Ry ( x ) − R y ' ( x )) 2 2σ 2 user. In contrast, model-based algorithms construct p( y'| y) ∝ (3) models that well explain the rating examples from the x∈X ( y t ) training database and apply the estimated model to predict Finally, the likelihood for the active user y’ to rate an the ratings for active users. Both types of approaches unseen item x as r is computed as: have been shown to be effective for collaborative filtering. In this subsection, we focus on the model-based collaborative filtering approaches, including the Aspect p( R y ' ( x) = r ) ∝ ∑ p( R y y' | R y )e − ( R y ( x ) − r ) 2 2σ 2 (4) Model (AM), the Personality Diagnosis (PD) and the Flexible Mixture Model (FMM). Previous empirical studies have shown that the personality diagnosis method is able to outperform For the convenience of discussion, we will first introduce several other approaches for collaborative filtering the annotation. Let items denoted by (Pennock et al., 2000). X = {x1 , x 2 ,......, x M } , users denoted by Y = { y1 , y 2 ,......, y N } , and the range of ratings denoted Flexible Mixture Model introduces two sets of hidden by {1,..., R} . A tuple ( x, y, r ) means that rating r is variables {z x , z y } , with z x for the class of items and z y assigned to item x by user y . Let X ( y ) denote the set for the class of users (Si and Jin, 2003; Jin et. al., 2003). of items rated by user y, and R y (x) stand for and the Similar to aspect model, the probability for each observed rating of item x by user y, respectively. tuple ( x, y, r ) is factorized into a sum over different classes for items and users, i.e., Aspect model is a probabilistic latent space model, which models individual preferences as a convex combination of 2
  • 3. p ( x, y , r ) = ∑ p( z x ) P( z y ) P( x | z x ) P( y | z y ) P(r | z x , z y ) (5) Correct Decision Boundary zx , z y f1 where parameter p(y|zy) stands for the likelihood for user Estimated Decision Boundary y to be in class zy, p(x|zx) stands for the likelihood for item x to be in class zx, and p(r|zx, zy) stands for the likelihood for users in class zy to rate items in class zx as r. All the parameters are estimated using Expectation Maximization algorithm (EM) (Dumpster et. al., 1976). The multiple- cause vector quantization (MCVQ) model (Boutilier and Zemel, 2003) uses the similar idea for collaborative filtering. 2.2 Previous Work on Active Learning The goal of active learning is to learn the correct model using only a small number of labeled examples. The f2 general approach is to find the example from the pool of unlabeled data that gives the largest reduction in the Figure 1: A learning scenario when the estimated expected loss function. The loss functions used by most model distribution is incorrect. The vertical line active learning methods can be categorized into two corresponds t the correct decision boundary and the groups: the loss functions based on model uncertainty and horizontal line corresponds to the estimated decision the loss function based on prediction errors. For the first boundary. type of loss functions, the goal is to achieve the largest reduction ratio in the space of hypothesis. One commonly examples, the most likely decision boundary is the used loss function is the entropy of the model distribution. horizontal line (i.e., the dash line) while the true decision Methods within this category usually select the example boundary is a vertical line (i.e., the dot line). If we only for which the model has the largest uncertainty in rely on the estimated model for computing the expected predicting the label (Seung et. al., 1992; Freud et. al, loss, the examples that will be selected for user’s 1997; Abe and Mamitsuka, 1998; Campbell et. al., 2000; feedback are most likely from the dot-shaded areas, which Tong and Koller, 2002). The second type of loss functions will not be very effective in adjusting the estimated involved the prediction errors. The largest reduction in the decision boundary (i.e. the horizontal line) to the correct volume of hypothesis space may not necessary be decision boundary (i.e. the vertical. On the other hand, if effective in cutting the prediction error. Freud et. al. we can use the model distribution for estimating the showed an example in (1997) , in which the largest expected loss, we will be able to adjust the decision reduction in the space of hypothesis does not bring the boundary more effectively since decision boundaries optimal improvement to the reduction of prediction errors. other than the estimated one (i.e. horizontal line) have Empirical studies in Bayesian Network (Tong and Koller, been considered in the computation of expected loss. 2000) and text categorization (Roy and MaCallum, 2001) There have been several studies on active learning that have shown that using the loss function that directly use the posterior distribution of models for estimating the targets on the prediction error is able to achieve better expected loss. The query by committee approach performance than the loss function that is only based on simulates the model distribution by sampling a set of the model uncertainty. The commonly used loss function models out of the posterior distribution. In the work of within this category is the entropy of the distribution of active learning for parameter estimation in Bayesian predicted labels. Network, a Dirichlet distribution for the parameters is used to estimate the change in the entropy function. In addition to the choice of loss function, how to estimate However, the general difficulty with the full Bayesian the expected loss is another important issue for active analysis for active learning is the computational learning. Many active learning methods compute the complexity. For complicated models, usually it is rather expected loss only based on the currently estimated model difficult to obtain the model posterior distribution, which without taking into account of the model uncertainty. leads to the sampling approaches such as Markov Chain Even though this simple strategy works fine for many Mote Carlo (MCMC) and Gibbs sampling. In this paper, applications, it can be very misleading, particularly when we follow an idea similar to the variational approach the estimated model is far from the true model. As an (Jordan et al., 1999). Instead of accurately computing the example, considering learning a classification model for posteriors, we approximate the posterior distribution with the data distribution in Figure 1, where spheres represent an analytic solution and apply it to estimate the expected data points of one class and stars represent data points of loss. Compared to the sampling approaches, this approach the other class. The four labeled examples are highlighted substantially simplifies the computation by avoiding by the line-shaded ellipsis. Based on these four training 3
  • 4. generating a large number of models and calculating the loss function values of all the generated models. θ* = arg max θ∈Θ ∏ p (r | x, y ';θ ) x∈X ( y ') (6) 2.3 Previous Work on Active Learning for = arg max θ∈Θ ∏ ∑ p(r | z, x)θ z x∈X ( y ') z∈Z Collaborative Filtering There have been only few studies on active learning for Then, the posterior distribution of model, collaborative filtering. In the paper by Kai et al (2003), a p (θ | D( y ')) where D( y ') includes all the rated method similar to Personality Diagnosis (PD) is used for items given the active user y’, can be written as: collaborative filtering. Each example is selected for user’s feedback in order to reduce the entropy of the like- p (θ | D( y ')) ∝ p ( D( y ') | θ) p (θ) mindness distribution or p ( y | y ') in Equation (3)). In the paper by Boutilier et al (2003), a method similar to the ≈ ∏ ∑ p(r | z, x)θ z (7) x∈X ( y ') z∈Z Flexible Mixture Model (FMM) named ‘multiple-cause vector quantification’ is used for collaborative filtering. In above, a uniform prior distribution is applied Unlike most other collaborative filtering research, which and is simply ignored. The posterior in (7) is try to minimize the prediction error, this paper only rather difficult to be used for estimating the concerns with the items that are strongly recommended by expected loss due to the multiple products and lack the system. The loss function is based on the expected of normalization. value of information (EVOI), which is computed based on the currently estimated model. One problem with these two approaches is that, the expected loss is computed only To approximate the posterior distribution, we will based on a single model, namely the currently estimated consider the expansion of the posterior function model. This is especially important, when the number of around the maximum point. Let θ* stand for the rated examples given by the active user is small and maximal parameters that are obtained from EM meanwhile the number of parameters to be estimated is algorithm by maximizing Equation (6). Let’s large. In the experiments, we will show that the selection consider the ratio of log p (θ | D( y ')) with respect based on the expected loss function with only the to log p (θ* | D( y ')) , which can be written as: currently estimated model can be even worse than simple random selection. p (θ | S ) ∑ p(ri | xi , z )θz log = ∑log z p (θ * | S ) i ∑ p(ri | xi , z )θz* z 3. A Bayesian Approach Toward Active p (ri | xi , z )θz* θ Learning for Collaborative Filtering ≥ ∑∑ log z (8) i z ∑ p(ri | xi , z )θz*' θz* z' ' In this section, we will first discuss the θz approximated analytic solution for the posterior = ∑(a z −1) log θz* distribution. Then, the approximated posterior z distribution will be used to estimate the expected loss. where the variable α z is defined as p( ri | xi , z )θ z* 3.1 Approximate the Model Posterior Distribution αz = ∑ +1 (9) i ∑ p(ri | xi , z ')θ z*' z' For the sake of simplicity, we will focus on the active learning of aspect model for collaborative Therefore, according to Equation (8), the filtering. However, the discussion in this section approximated posterior distribution p(θ | D( y ')) can be easily extended to other models for is a Dirichlet with hyper parameters α z defined in collaborative filtering. Equation (9), or As described in Equation (1), each conditional Γ(α * ) (10 p (θ | D) ≈ ∏ z θ α z −1 likelihood p (r | x, y ) is decomposed as the sum of ∏z Γ(α z ) ) ∑ z p(r | z, x) p( z | y) . For the active user y’, the most important task is to determine its user type, where α = ∑ z α z . Furthermore, it is easy to see * or p ( z | y ') . Let θ stands for the parameter space θ = {θ z = p ( z | x)}z . Usually, parameters θ are * αz −1 θz (11 estimated through the maximum likelihood = * ) estimation, i.e. αz' −1 θz' 4
  • 5. The above relation is true because the optimal optimization problem in Equation (13) is not solution obtained by EM algorithm is actual fixed appropriate for active learning of collaborative point of the following equation: filtering. p( ri | xi , z )θ z* In the ideal case, if the true model θtrue is given, θ z* = ∑ (12 i ∑ p(ri | xi , z ')θ z*' ) the optimal strategy in selecting example should z' be: Based on the property in Equation (11), we have (14 the optimal point derived from the approximated x* = arg min − x∈X ∑θ z true z log θ z|x ,r p ( r|x ,θtrue ) ) Dirichlet distribution is exactly θ* , which is the optimal solution found by EM algorithm. The goal of the optimization in Equation (14) is to find an example such that by adding the example 3.2 Estimate the Expected Loss to the pool of rated examples, the updated model θ| x ,r is adjusted toward the true model θtrue . Similar to many other active learning work (Tong Unfortunately, the true model is unknown. One and Koller, 2000; Seung et al, 1992; Kai et al, way to approximate the optimization goal in 2003), we use the entropy of the model function as Equation (14) is to replace the θtrue with the the loss function. Many previous studies of active expectation over the posterior distribution learning try to find the example that directly p (θ | D( y ')) . Therefore, Equation (14) will be minimizes the entropy of the model parameters transformed into the following optimization with currently estimated model. In the case of problem: aspect model, it can be formulated as the following (15 optimization problem: x* = arg min − x∈X ∑θ z ' z log θ z|x ,r p ( r| x ,θ ') p ( θ '|D ( y )) ) (13 x = arg min − * x∈X ∑θ z z| x , r log θ z|x ,r p ( r| x ,θ ) ) Now, the key issue is how to compute the integration efficiently. In effect, in our case the where θ denotes the currently estimated integration can be computed analytically as: parameters and θ z|x ,r denotes the optimal parameter that are estimated using one additional rated example, i.e., example x is rated as r. The ∑θz ' z log θ z| x ,r p ( r | x ,θ ' ) p (θ ' | D ( y )) basic idea of Equation (13) is that, we would like to find the example x such that the expected = ∑ r ∫ p(θ ' | D( y )) p(r | x, θ ' )∑ j θ 'j log θ j|r , x dθ ' entropy of distribution θ is minimized. There are two problems with this simple strategy. The first = ∑ r ∑ i ∑ j p( r | x, i )θi'θ 'j log θ j|r , x (16 p (θ ' | D ( y )) problem comes from the fact that the expected ) entropy is computed only based on the currently ∑ a j log θ j|r , x ∑ i ai p(r | x, i ) estimated model. As indicated in Equation (13), = ∑ r j a* (a* + 1) the expectation is evaluated over the distribution p(r | x, θ) . As already discussed in Section 2, this + ∑r ∑ j a j log θ j|r , x p(r | x, j ) simple strategy could be misleading particularly a* (a* + 1) when the currently estimated model is far from the true model. The second problem with the strategy where the function Ψ is the digamma function. stated in Equation (13) comes from the The details of the derivation can be found in the characteristics of collaborative filtering. Recall appendix. With the analytic solution for the that the parameter θ z stand for the likelihood for integration in (16), we can simply go through each the active user to be in the user class z, what the item in the set and find the one that minimizes the optimization in Equation (13) tries to achieve is to loss function specified in Equation (14). find the most efficient way to reduce the uncertainty in the type of user class for the active Finally, the computation involves the estimation of user. However, according to previous studies in the updated model θ|r , x . The standard way to collaborative filtering (Si and Jin, 2003), most obtain the updated model is simply to rerun the users should be of multiple types, instead of single full EM algorithm with one more additional rated types. Therefore, we would expect to the example (i.e., example x is rated as category r). distribution θ be of multiple modes. Apparently, This computation can be extremely expensive this is inconsistent with the optimization goal in since we will have an updated model for every Equation (13). Based on the above analysis, the item and every possible rating category. A more 5
  • 6. efficient way to compute the updated model is to Table 1: Characteristics of MovieRating and EachMovie. avoid the directly optimization of Equation (6). MovieRating EachMovie Instead, we use the following approximation, Number of Users 500 2000 Number of Items 1000 1682 θ* = arg max p (r | x, θ) θ∈Θ ∏ p (r | x ', y '; θ) Avg. # of rated Items/User 87.7 129.6 x '∈X ( y ') Number of Ratings 5 6 = arg max p (r | x, θ) p (θ | D( y ')) θ∈Θ (17) ≈ arg max p (r | x, θ) Dirichlet ({α z }) θ∈Θ utilizing the model distribution and see which one is more effective for collaborative filtering. The details ≈ arg max ∑ p (r | x, z ')θ z ' ∏ α θ z z −1 θ∈Θ of these two active learning methods will be z' z discussed later. The above Equation (17) can be again optimized using EM algorithm with updating equation as: 4.1 Experiment Design p (ri | xi , z )θ z + α z − 1 Two datasets of movie ratings are used in our θz = (18 experiments, i.e., ‘MovieRating’1 and ‘EachMovie’2. For α + ∑ ( p(ri | xi , z ')θ z ' − 1) * ) z' ‘EachMovie’, we extracted a subset of 2,000 users with more than 40 ratings. The details of these two datasets are The advantage of the updating equation in (18) listed in Table 1. For MovieRating, we took the first 200 versus the more general EM updating equation in (12) is that Equation (18) only relies on the rating users for training and all the others for testing. For r for item x while Equation (12) has to go through EachMovie, the first 400 users were used for training. For all the items that have been rated by the active each test user, we randomly selected three items with their user. Therefore, the computation cost of (18) is ratings as the starting seed for the active learning significantly less than that of Equation (12). algorithm to build the initial model. Furthermore, twenty rated items were reserved for each test user for evaluating In summary, in this section, we propose a the performance. Finally the active learning algorithm Bayesian approach for active learning of will have chance to ask for ratings of five different items collaborative filtering, which uses the model and the performance of the active learning algorithms is posterior distribution to compute the estimation of evaluated for each feedback. For simplicity, in this paper, loss function. To simplify the computation, we we assume that the active user will always be able to rate approximate the mode distribution with a Dirichlet the items presented by the active learning. Of course, as distribution. As a result, the expected loss can be pointed out in paper (Kai et al, 2003), this is not a correct calculated analytically. Furthermore, we also assumption because there are some items that the active propose an efficient way to compute the updated user might never see before and therefore it is impossible model by approximating the prior with a Dirichlet for him to rate those items. However, in this paper, we distribution. For easy reference, we call this will mainly focus on the behavior of active learning method ‘Bayesian method’. algorithms for collaborative filtering and leave this issue for future work. The aspect model is used for collaborative filtering as 4. Experiments already described in the Section 2. The number of user In this section, we will present experiment results in classes used in the experiment is 5 for MovieRating order to address the following two questions: dataset and 10 for EachMovie datasets. 1) Whether the proposed active learning algorithm is The evaluation metric used in our experiments was the effective for collaborative filtering? In the mean absolute error (MAE), which is the average absolute experiment, we will compare the proposed algorithm deviation of the predicted ratings to the actual ratings on to the method of randomly acquiring examples from items the test user has actually voted. the active user. 1 ^ MAE = ∑ | r(l ) − R y( l ) ( x(l ) ) | (19) 2) How important is the full Bayesian treatment? In this LTest l paper, we emphasize the importance of taking into where LTest is the number of the test ratings. account the model uncertainty using the posterior distribution. To illustrate this point, in this experiment, we will compare the proposed algorithm 1 to two commonly used active learning methods that http://www.cs.usyd.edu.au/~irena/movie_data.zip are only based on the estimated model without 2 http://research.compaq.com/SRC/eachmovie 6
  • 7. In this experiment, we consider three different baseline given by the active user is only three, which is relatively models for active learning of collaborative filtering: small given the number of parameters to determine is 10. Therefore, using the estimated model to compute the 1) Random Selection: This method simply randomly expected loss will be inaccurate for finding the most selects one item out of the pool of items for user’s informative item to rate. Furthermore, comparing the feedback. We refer this simple method as ‘random results for the ‘model entropy method’ to the results for method’. the ‘prediction entropy method’, we see that the model 2) Model Entropy based Sample Selection. This is the entropy method usually performs even worse. This can be method that has been described at the beginning of explained by the fact that the by minimize thing entropy Section 3.2. The idea is to find the item that is able to of the distribution of the user classes for the active user, reduce the entropy of the distribution of the active user the model entropy method tries to narrow down a single being in different user classes (in Equation (13)). As user type for the active user. However, most users are of aforementioned, the problems with this simple selection multiple types. Therefore, it is inappropriate to apply the strategy are in two aspects: computing the expected loss only based on the currently estimated model and being MovieRating conflict with the intuition that each user can be of multiple user classes. This simple method shares many 0.91 similarities as the proposed method except without random considering the model distribution. By comparing it to the Bayesian proposed active learning method for collaborative 0.88 model filtering, we will be able to see if the idea of using model prediction distribution for computing expected loss is worthwhile for collaborative filtering. For easy reference, we will call MAE 0.85 this method ‘model entropy method’. 3) Prediction Entropy based Sample Selection. Unlike the 0.82 previous method, which concerns with the uncertainty in the distribution of user classes, this method is focused on the uncertainty in the prediction of ratings for items. It 0.79 selects the item that is able to reduce the entropy of the 1 2 3 4 5 6 prediction most effectively. Formally, the selection num ber of feedback criterion can be formulated as the following optimization problem: Figure 1: MAE results for four different active learning algorithms for collaborative filtering, namely the random x = arg min − * x∈X ∑ p(r | x ', θ x' | x ,r ) log p ( r | x ', θ| x ,r ) (20) method, the Bayesian method, the model entropy method, p ( r | x ,θ ) and the prediction entropy method. The database is where notation θ|r , x stands for the updated model using ‘MovieRating’ and the number of training user is 200. The the extra rated example (r,x,y). Similar to the previous smaller MAE the better performance method, this approach uses the estimated model for computing the expected loss. However, unlike the model entropy method to the active learning for previous approach which tries to make the distribution of collaborative filtering. On the other hand, the prediction user types for the active user skew, this approach targets entropy method doesn’t have this defect because it on the prediction distribution. Therefore, this method is focuses on minimizing the uncertainty in the prediction not against the intuition that each user is of multiple instead of the uncertainty in the distribution of user types. We will refer this method as ‘prediction entropy classes. method’. The second observation from Figure 1 and 2 is that the proposed active learning method performs better than any 4.2 Results and Discussion of the three based line models for both ‘MovieRating’ and The results for the proposed active learning algorithm ‘EachMovie’. The most important difference between the together with the three baseline models for database proposed method and the other methods for active ‘MovieRating’ and ‘EachMovie’ are presented in Figure 1 learning is that the proposed method takes into account and 2, respectively. the model distribution, which makes it robust when even there is only three rated items given by the active user. The first thing noticed from this two diagrams is that, both the ‘model entropy method’ and the ‘prediction entropy method’ performs consistently worse than the 5. Conclusion simply ‘random method’. This phenomenon can be explained by the fact that the initial number of rated items 7
  • 8. Freund, Y. Seung, H. S., Shamir, E. and Tishby. N. Each Movie (1997) Selective sampling using the query by committee algorithm. Machine Learning 1.18 Hofmann, T., & Puzicha, J. (1999). Latent class models 1.15 for collaborative filtering. In Proceedings of International Joint Conference on Artificial 1.12 Intelligence. Hofmann, T. (2003). Gaussian latent semantic models for 1.09 collaborative filtering. In Proceedings of the 26th 1.06 Annual Int’l ACM SIGIR Conference on Research and Development in Information Retrieval. 1.03 Jin, R., Si, L., Zhai., C.X. and Callan J. (2003). Collaborative filtering with decoupled models for 1 preferences and ratings. In Proceedings of the 12th 1 2 3 4 5 6 International Conference on Information and num ber of feedbacks Knowledge Management. Figure 1: MAE results for four different active learning Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (1999). An introduction to variational methods for algorithms for collaborative filtering, namely the random graphical models. In M. I. Jordan (Ed.), Learning in method, the Bayesian method, the model entropy method, graphical Models, Cambridge: MIT Press. and the prediction entropy method. The database is ‘MovieRating’ and the number of training user is 200. The Roy, N. and McCallum A. (2001). Toward Optimal smaller MAE the better performance Active Learning through Sampling Estimation of Error Reduction. In Proceedings of the Eighteenth In this paper, we proposed a full Bayesian treatment of International Conference on Machine Learning. active learning for collaborative filtering. Different from previous studies of active learning for collaborative Pennock, D. M., Horvitz, E., Lawrence, S., and Giles, C. filtering, this method takes into account the model L. (2000). Collaborative filtering by personality distribution when computing the expected loss. In order to diagosis: a hybrid memory- and model-based approach. alleviate the computation complexity, we approximate the In Proceedings of the Sixteenth Conference on model posterior with a simple Dirichlet distribution. As a Uncertainty in Artificial Intelligence. result, the estimated loss can be computed analytically. Seung, H. S., Opper, M. and Sompolinsky, H. (1992). Furthermore, we propose using the approximated prior Query by committee. In Computational Learing Theory, distribution to simplify the computation of updated pages 287—294. model, which is required for every item and every possible rating category. As the current Bayesian version Si, L. and Jin, R. (2003). Flexible mixture model for of active learning for collaborative filtering only focus on collaborative filtering. In Proceedings of the Twentieth the model quality, we will extend this work to the International Conference on Machine Learning. prediction error, which usually is more informative Tong, S. and Koller, D. (2000). Active Learning for according to the previous studies of active learning. Parameter Estimation in Bayesian Networks. In Advances in Neural Information Processing Systems. Yu, K., Schwaighofer, A., Tresp, V., Ma, W.-Y. and References Zhang, H. J. (2003). Collaborative ensemble learning: Breese, J. S., Heckerman, D., Kadie C., (1998). Empirical combining collaborative and content-based Information Analysis of Predictive Algorthms for Collaborative filtering via hierarchical Bayes. In Proceedings of Filtering. In the Proceeding of the Fourteenth Nineteenth International Conference on Uncertainty in Conference on Uncertainty in Artificial Intelligence. Artificial Intelligence. Boutilier, C., Zemel, R. S., and Marlin, B. (2003). Active Campbell, C., Cistianini, N., and Smola, A. (2000). Query collaborative filtering. In the Proceedings of Nineteenth Learning with Large Margin Classifiers. In Proceedings Conference on Uncertainty in Artificial Intelligence. of 17th International Conference on Machine Learning. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Abe, N. and H. Mamitsuka (1998). Query learning Maximum likelihood from incomplete data via the EM strategies using boosting and bagging. In Proceedings algorithm. Journal of the Royal Statistical Society, B39: of the Fifteenth International Conference on Machine 1-38. Learning. 8
  • 9. 9