Data Mining: Concepts and Techniques
— Chapter 11 —
Additional Theme: Collaborative Filtering & Data Mining

Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved

Outline
   Motivation
   Systems in Action
   A Conceptual Framework
   User-User Methods
   Item-Item Methods
   Recent Advances and Open Problems




Motivation
   User Perspective
     Lots of online products, books, movies, etc.

     Reduce my choices…please…




   Manager Perspective
     “If I have 3 million customers on the web, I should have
      3 million stores on the web.”
                             — CEO of Amazon.com [SCH01]

Example: Recommendation




   Customers who bought this book also bought:

   •Data Preparation for Data Mining: by Dorian Pyle (Author)
   •The Elements of Statistical Learning: by T. Hastie et al.
   •Data Mining: Introductory and Advanced Topics: by Margaret H. Dunham
   •Mining the Web: Analysis of Hypertext and Semi Structured Data

Example: Personalization




Other Examples
   Movielens: movies
   Moviecritic: movies again
   My launch: music
   Gustos starrater: web pages
   Jester: jokes
   TV Recommender: TV shows
   Suggest 1.0: different products
   And much more…

How Does It Work?
   Each user has a profile
   Users rate items
     Explicitly: a score from 1 to 5
     Implicitly: web usage mining
        Time spent viewing the item
        Navigation path
        Etc.
   The system does the rest. How?
     That is what we will show today

Basic Approaches
   Collaborative Filtering (CF)
     Look at users' collective behavior
     Look at the active user's history
     Combine!

   Content-based Filtering
     Recommend items based on keywords
     More appropriate for information retrieval

Collaborative Filtering: A Framework

   Items: I = {i1, i2, …, ij, …, in}
   Users: U = {u1, u2, …, ui, …, um}
   Ratings matrix: cell rij holds user ui's rating of item ij
     (e.g., u2 has rated some items 3, 1.5, …, 2; most cells are unknown)

   The task:
     Q1: Find the unknown ratings rij = ?
     Q2: Which items should we recommend to this user?

   Unknown function f: U x I → R

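A minimal sketch (not from the slides) of how this framework is typically represented in code: a NumPy matrix with NaN marking the unknown ratings the recommender must predict. The toy values are illustrative.

```python
import numpy as np

# Toy user-item rating matrix R; NaN marks the unknown ratings rij.
R = np.array([
    [3.0,    1.5,    np.nan, np.nan, 2.0],     # u1
    [np.nan, 4.0,    2.0,    np.nan, np.nan],  # u2
    [1.0,    np.nan, np.nan, 5.0,    np.nan],  # u3
])

# Q1 amounts to filling these (user, item) cells;
# Q2 amounts to ranking the filled-in values per user.
unknown_cells = np.argwhere(np.isnan(R))
print(unknown_cells)
```
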
Collaborative Filtering Road Map
   User-User Methods
      Identify like-minded users

      Memory-based: K-NN

      Model-based: Clustering

   Item-Item Methods
      Identify buying patterns

      Correlation Analysis

      Linear Regression

      Belief Network

      Association Rule Mining




User-User Similarity: Intuition

   (Figure: a target customer and candidate neighbor customers)

   Q1: How to measure similarity?
   Q2: How to select neighbors?
   Q3: How to combine the neighbors' ratings?

How to Measure Similarity?

   Pearson correlation coefficient
     Computed over the items commonly rated by users a and i:

       w_p(a, i) = \frac{\sum_{j \in C} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}
                        {\sqrt{\sum_{j \in C} (r_{aj} - \bar{r}_a)^2}\,\sqrt{\sum_{j \in C} (r_{ij} - \bar{r}_i)^2}}

       where C is the set of commonly rated items

   Cosine measure
     Users are vectors in product-dimension space

       w_c(a, i) = \frac{\mathbf{r}_a \cdot \mathbf{r}_i}{\|\mathbf{r}_a\|_2\,\|\mathbf{r}_i\|_2}

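A minimal sketch of both similarity measures on NaN-padded rating vectors (NumPy). The function names and toy data are illustrative, not the authors' code; for simplicity the user means here are taken over the commonly rated items.

```python
import numpy as np

def pearson_similarity(ra, ri):
    """Pearson correlation over the items both users rated (NaN = not rated)."""
    common = ~np.isnan(ra) & ~np.isnan(ri)
    if common.sum() < 2:
        return 0.0                      # not enough overlap to correlate
    a, i = ra[common], ri[common]
    da, di = a - a.mean(), i - i.mean()
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((di ** 2).sum())
    return float((da * di).sum() / denom) if denom else 0.0

def cosine_similarity(ra, ri):
    """Cosine of the two rating vectors, treating missing ratings as 0."""
    a, i = np.nan_to_num(ra), np.nan_to_num(ri)
    denom = np.linalg.norm(a) * np.linalg.norm(i)
    return float(a @ i / denom) if denom else 0.0

ua = np.array([3.0, 1.5, np.nan, 4.0, 2.0])
ui = np.array([2.5, 2.0, 5.0,    3.5, np.nan])
print(pearson_similarity(ua, ui), cosine_similarity(ua, ui))
```
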
Nearest Neighbor Approaches [SAR00a]
   Offline phase:
     Do nothing…just store transactions
   Online phase:
     Identify users highly similar to the active one
        The best K ones
        All with a similarity greater than a threshold
   Prediction:

     r_{aj} = \bar{r}_a + \frac{\sum_i w(a, i)\,(r_{ij} - \bar{r}_i)}{\sum_i w(a, i)}

     \bar{r}_a: user a's neutral (mean) rating
     (r_{ij} - \bar{r}_i): user i's deviation on item j
     the weighted sum: user a's estimated deviation on item j

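A sketch of the online prediction step under the formula above. It reuses the hypothetical pearson_similarity helper from the earlier sketch; k and the fallback behaviour are illustrative choices, and the denominator uses |w| (a common variant) rather than the slide's plain sum, to avoid sign cancellation.

```python
import numpy as np

def predict_rating(R, a, j, k=2):
    """User-based k-NN prediction of user a's rating of item j.

    R is the user-item matrix with NaN for unknown ratings.
    Uses the k most similar users who actually rated item j.
    """
    mean_a = np.nanmean(R[a])
    sims = []
    for i in range(R.shape[0]):
        if i == a or np.isnan(R[i, j]):
            continue
        w = pearson_similarity(R[a], R[i])       # helper from the previous sketch
        sims.append((w, R[i, j] - np.nanmean(R[i])))
    sims.sort(key=lambda t: abs(t[0]), reverse=True)
    top = sims[:k]
    denom = sum(abs(w) for w, _ in top)          # |w| variant of the slide's denominator
    if denom == 0:
        return mean_a                            # fall back to user a's mean rating
    return mean_a + sum(w * dev for w, dev in top) / denom
```
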
Horting Method [AGG99]
   K-NN similarity is not transitive
   Horting takes advantage of transitivity
   Uses a new similarity measure: predictability
   User i predicts user a if
     They have rated sufficiently many common items
     There is an error-bounded linear transformation
      from user i's ratings to user a's ratings

How Does Horting Work?
   Offline phase: build the neighborhood graph
   Online phase: compute raj
     1. Identify users who predict ua
     2. Identify users who rated item j
     3. Find shortest paths from group 1 to group 2
     4. Backward propagation and averaging along the paths

   - Better for sparse environments
   - Not well evaluated

Clustering [BRE98]
   Offline phase:
     Build clusters: k-means, k-medoids, etc.
   Online phase:
     Identify the nearest cluster to the active user
     Prediction: either
        Use the center of the cluster (faster), or
        Use a weighted average over cluster members, with weights
         that depend on the active user (slower but a little more accurate)

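A minimal sketch of the cluster-then-predict idea using scikit-learn's KMeans on mean-filled rating vectors. The mean-filling preprocessing, cluster count, and function names are assumptions for illustration, not the setup used in [BRE98].

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(R, n_clusters=3, random_state=0):
    """Offline: cluster users on their rating vectors (NaNs filled with item means)."""
    item_means = np.nanmean(R, axis=0)
    X = np.where(np.isnan(R), item_means, R)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    return km, km.cluster_centers_, item_means

def predict_from_cluster(km, centers, item_means, user_ratings, item_j):
    """Online: assign the active user to the nearest cluster; use its center as the prediction."""
    x = np.where(np.isnan(user_ratings), item_means, user_ratings).reshape(1, -1)
    c = km.predict(x)[0]
    return centers[c, item_j]
```
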
Clustering vs. k-NN Approaches
   K-NN using the Pearson measure is slower but more accurate
   Clustering is more scalable

   (Figure: an active user who falls near a cluster boundary gets
   bad recommendations from the hard cluster assignment)

   We can use soft clustering instead, but then we lose the
   computational edge

Did We Answer the Questions?

   (Figure: the target customer and candidate neighbors, as before)

   Q1: How to measure similarity?
   Q2: How to select neighbors?
   Q3: How to combine the neighbors' ratings?

Are We Done?
   Q1: How to measure similarity?          Done… really??

   What about sparsity?
     Not enough common items implies spurious neighbors,
      and hence bad recommendations
     The similarity w_p(a, i) sums only over commonly rated
      items, so it breaks down when that set is small

   Sparsity results from a poor representation!
     U1 rates recycled letter pads high
     U2 rates recycled memo pads high
     Both of them like recycled office products
     They are similar, but the math won't capture that
     (example from [SAR00P])

   By working at the right level of abstraction we can
   eliminate sparsity

The Power of Representation [UNG98]

   (Figure: movies grouped into higher-level categories —
   Action, Foreign, Classic)

   Q1-B: How can we formalize this intuition?

How to Abstract?
   Semi-manual Methods
     Use product features

     Cluster products first, then cluster users

     Works only if we have descriptive features




   Automatic Methods
     Adjusted Product Taxonomy

     Latent Semantic Indexing




Adjusted Product Taxonomy [CHO04]
   Input: a product taxonomy
   Output: a modified taxonomy with an even distribution of
   transactions across categories

Adjusted Product Taxonomy (2)

   (Figure: number of transactions having each category, shown
   using the original taxonomy and using the adjusted taxonomy)

Latent Semantic Indexing [SAR00b]

   SVD: R = U·S·I'   (R is m×n, U is m×r, S is r×r, I' is r×n)
   Keep only the k largest singular values:

     R_k = U_k · S_k · I_k'   (U_k is m×k, S_k is k×k, I_k' is k×n)

   The reconstructed matrix Rk = Uk·Sk·Ik' is the closest
   rank-k matrix to the original matrix R.

     • Captures latent associations
     • Reduced space is less noisy

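A sketch of the rank-k reconstruction with NumPy's SVD. Filling the unknown ratings with item means first is one common choice, not necessarily the preprocessing used in [SAR00b]; the function name is illustrative.

```python
import numpy as np

def lsi_reconstruct(R, k=2):
    """Fill NaNs with item means, then return the closest rank-k approximation of R."""
    item_means = np.nanmean(R, axis=0)
    R_filled = np.where(np.isnan(R), item_means, R)
    U, s, It = np.linalg.svd(R_filled, full_matrices=False)   # R_filled = U @ diag(s) @ It
    Rk = U[:, :k] @ np.diag(s[:k]) @ It[:k, :]                # truncated reconstruction
    return Rk    # Rk[u, j] can serve as the predicted rating of user u for item j
```
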
Are We Done? (2)
   Q2: How to select neighbors?          Not adequately answered
   We don't expect to use the same neighbors for all products
     Neighbors should be product-category specific

   Q2-B: How can we determine whether or not a user is
   relevant to a given product?

Selecting Relevant Instances [YU01]

   (Example: predicting a user's rating of Batman)
     Superman and Batman are correlated
     Titanic and Batman are negatively correlated
     “Dances with Wolves” has nothing to do with Batman's rating
     Karen is not a good instance to consider

   How can we formalize this? → Mutual information:

     MI(X; Y) = H(X) − H(X|Y)

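A sketch of estimating the mutual information between two items from co-rated, binarized (like/dislike) ratings. The binarization, item names, and toy values are illustrative assumptions, not data from [YU01].

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """MI(X;Y) = H(X) - H(X|Y), estimated from paired discrete observations."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Items rated by the same users, binarized to like (1) / dislike (0):
batman   = [1, 1, 0, 1, 0, 1]
superman = [1, 1, 0, 1, 0, 1]   # rated identically here, so MI is maximal (= H(Batman))
dances   = [1, 0, 1, 0, 1, 0]   # a different rating pattern, lower MI
print(mutual_information(batman, superman), mutual_information(batman, dances))
```
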
Selecting Relevant Instances (2)
   Offline phase:
     Estimate the mutual information between items
     For each item:
        Find users who rated it
        Compute their strength (how many relevant items
         they also rated)
        Retain a subset of them (10% works fine)
   Online phase:
     To predict the target item's rating, run k-NN on
      its reduced instance space

   Better results with less data… quality, not quantity, is what matters

Are We Done? (3)
   Q3: How to combine?
   Weighted average
   Discover association rules in neighbors' transactions
    [LEE01, WAN04]
      For every x in this group:
       like(x, Item1) ^ like(x, Item2) → like(x, Item3)
      Use confidence and support to judge the quality of the
       prediction
      Prediction is done at the binary level (like, dislike)
      Costly to run online

User-User Methods Evaluation
   Achieve good quality in practice
   The more processing we push offline, the better
    the method scales
   However:
     User preferences are dynamic
        High update frequency of the offline-calculated
         information
     No recommendations for new users
        We don't know much about them yet

Collaborative Filtering Road Map
   User-User Methods
      Identify like-minded users

      Memory-based: K-NN

      Model-based: Clustering

   Item-Item Methods
      Identify buying patterns

      Correlation Analysis

      Linear Regression

      Belief Network

      Association Rule Mining




Item-Item Similarity: The Intuition
     Search for similarities among items
     All computations can be done offline
     Item-Item similarity is more stable than user-user
      similarity
        No need for frequent updates

     First Order Models
        Correlation Analysis

        Linear Regression

     Higher Order Models
        Belief Network

        Association Rule Mining



Correlation-based Methods [SAR01]

   Same as in user-user similarity, but on item vectors
   Pearson correlation coefficient
     Look only at users who rated both items:

       s_{ij} = \frac{\sum_{u \in B} (r_{uj} - \bar{r}_j)(r_{ui} - \bar{r}_i)}
                     {\sqrt{\sum_{u \in B} (r_{uj} - \bar{r}_j)^2}\,\sqrt{\sum_{u \in B} (r_{ui} - \bar{r}_i)^2}}

       where B is the set of users who rated both items i and j

Correlation-based Methods (2)
   Offline phase:
     Calculate the n(n−1) similarity measures
     For each item, determine its k most similar items
   Online phase:
     Predict the rating for a given user-item pair as a
      weighted sum over similar items that the user rated:

       r_{aj} = \frac{\sum_{i \in S_j} s_{ij}\, r_{ai}}{\sum_{i \in S_j} s_{ij}}

       where S_j is the set of items similar to j that user a rated
       (e.g., user a rated similar items 2, 3, 4 and item j is unknown)

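A sketch of the offline item-item similarities plus the online weighted-sum prediction (NumPy). The O(n²) loop, the |s| in the denominator, and the function names are illustrative simplifications of the scheme described above.

```python
import numpy as np

def item_similarities(R):
    """Offline: Pearson correlation between item columns, over users who rated both."""
    n_items = R.shape[1]
    S = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
            if both.sum() < 2:
                continue
            ri, rj = R[both, i], R[both, j]
            di, dj = ri - ri.mean(), rj - rj.mean()
            denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
            S[i, j] = (di * dj).sum() / denom if denom else 0.0
    return S

def predict_item_based(R, S, a, j, k=2):
    """Online: weighted sum of user a's ratings on the k items most similar to j."""
    rated = [i for i in range(R.shape[1]) if not np.isnan(R[a, i]) and i != j]
    rated.sort(key=lambda i: abs(S[i, j]), reverse=True)
    top = rated[:k]
    denom = sum(abs(S[i, j]) for i in top)   # |s| is a common variant of the slide's denominator
    return sum(S[i, j] * R[a, i] for i in top) / denom if denom else np.nan
```
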
Regression-Based Methods [VUC00]
   Offline phase:
     Fit n(n−1) linear regressions
     f_ij(x) is a linear transformation of a user's rating on
      item i to his rating on item j
   Online phase:
     Same as the previous method
     The weights are inversely proportional to the
      regression error rates:

       r_{aj} = \frac{\sum_{i \in I_a} w_{ij}\, f_{ij}(r_{ai})}{\sum_{i \in I_a} w_{ij}}

       where I_a is the set of items rated by user a

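A sketch of fitting one of the pairwise regressions by least squares and weighting it by the inverse of its training error. The exact weight definition and fallback are assumptions for illustration; [VUC00] may define them differently.

```python
import numpy as np

def fit_pairwise_regression(R, i, j):
    """Fit r_j ≈ alpha * r_i + beta over users who rated both items i and j.

    Returns (alpha, beta, weight), where weight = 1 / (mean squared error + eps).
    """
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
    x, y = R[both, i], R[both, j]
    if len(x) < 2:
        # not enough co-rating data: fall back to the item mean with zero weight
        return 0.0, float(np.nanmean(R[:, j])), 0.0
    alpha, beta = np.polyfit(x, y, deg=1)              # least-squares line f_ij
    mse = float(np.mean((alpha * x + beta - y) ** 2))
    return float(alpha), float(beta), 1.0 / (mse + 1e-6)
```
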
Higher Order Models
   Previous approaches used a Naïve-Bayes-like
    assumption
      The effects of other items on a given item are independent

   Not always true
   Higher order models can do better
      Belief Network

      Association Rule Mining




Bayesian Belief Network: introduction

   Bayesian belief network allows a subset of the variables to
    be conditionally independent
   A graphical model of causal relationships
      Represents dependency among the variables

      Gives a specification of joint probability distribution




   Nodes: random variables
   Links: dependency
   Example (figure): X and Y are the parents of Z, and Y is the
   parent of P; there is no direct dependency between Z and P
   The graph has no loops or cycles

Bayesian Belief Network: An Example

   (Network figure with nodes: FamilyHistory, Smoker, LungCancer,
   Emphysema, PositiveXRay, Dyspnea)

   The conditional probability table (CPT) for the variable
   LungCancer shows the conditional probability for each possible
   combination of its parents (FamilyHistory, Smoker):

              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC        0.8       0.5        0.7        0.1
      ~LC       0.2       0.5        0.3        0.9

   Joint probability for a Bayesian belief network:

     P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))

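A small sketch of the factored joint probability, using the LungCancer CPT above and assuming FamilyHistory and Smoker have no parents in the network. The prior probabilities for FH and S are made up for illustration.

```python
# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), following the factorization above.
p_fh = {True: 0.1, False: 0.9}          # assumed prior P(FamilyHistory)
p_s  = {True: 0.3, False: 0.7}          # assumed prior P(Smoker)
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}   # P(LC = yes | FH, S) from the CPT

def joint(fh, s, lc):
    p_lc_given = p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc_given if lc else 1.0 - p_lc_given)

print(joint(fh=True, s=True, lc=True))   # 0.1 * 0.3 * 0.8 = 0.024
```
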
Belief Network for CF [BRE98]
   Every item is a node
   Binary ratings (like, dislike)
   Learn a belief network offline over the training data
   The CPT at each node is represented as a decision tree
   Use greedy algorithms to determine the best network
    structure
   Use probabilistic inference for online prediction

Belief Network for CF: An Example

   (Figure: a probability decision tree serving as the CPT for the
   random variable “Melrose Place” in the movie domain, with
   parents such as “Friends” and “B.H.”; the leaves give the
   probability of liking “Melrose Place”)

Association Rule Mining
   Offline processing:
     Work at the binary level (like, dislike)
     View each user as a market basket containing the
      items liked by that user
     Discover association rules between items
   Online processing:
     Match the items the active user likes against the rules'
      left-hand sides
     Recommend the rules' consequents, ranked by
      support and confidence

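A brute-force sketch of this pipeline for small catalogs: mine one- and two-item-antecedent rules from like-baskets, then match them against the active user's likes. In practice Apriori or FP-growth would replace the enumeration; the thresholds, names, and toy baskets are illustrative.

```python
from itertools import combinations
from collections import defaultdict

def mine_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Offline: rules {A} -> B and {A, B} -> C from like-baskets, with support and confidence."""
    n = len(baskets)
    counts = defaultdict(int)
    for b in baskets:
        for size in (1, 2, 3):
            for itemset in combinations(sorted(b), size):
                counts[itemset] += 1
    rules = []
    for itemset, c in counts.items():
        if len(itemset) < 2 or c / n < min_support:
            continue
        for consequent in itemset:
            antecedent = tuple(x for x in itemset if x != consequent)
            conf = c / counts[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, c / n, conf))
    return rules

def recommend(rules, liked):
    """Online: fire rules whose antecedent is contained in the user's liked items."""
    scores = defaultdict(float)
    for antecedent, consequent, support, conf in rules:
        if set(antecedent) <= liked and consequent not in liked:
            scores[consequent] = max(scores[consequent], conf)
    return sorted(scores, key=scores.get, reverse=True)

baskets = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"A", "C"}]
rules = mine_rules(baskets)
print(recommend(rules, liked={"A", "B"}))
```
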
Association Rule Mining : Problems
   High support threshold leads to low coverage and may
    eliminate important, but infrequent items from
    consideration

   Low support thresholds result in very large model sizes, a
    computationally expensive offline pattern-discovery phase,
    and a slower online matching phase

   Solution:
      Adaptive Association Rule Mining




Adaptive Association Rule Mining [LIN01]

   Given:
     A transaction dataset
     A target item
     A desired range for the number of rules
     A specified minimum confidence (minConfidence)
   (Figure: minSupport is not fixed in advance; it is adjusted
   per target item)

   Find: a set S of association rules for the target item such that
     The number of rules in S is in the given range
     Rules in S satisfy the minimum confidence constraint
     Rules in S have higher support than rules not in S that
      satisfy the above constraints

Adaptive Association Rule Mining (2)

   Discover rules with a single item in the head:
     Like(x, item1) ^ Like(x, item2) → Like(x, target)

   The miner discovers association rules iteratively
    (for each target item) until the desired number of
    rules is extracted

   The minimum support is adjusted per item

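A sketch of the adaptive-support idea: search for a per-target minSupport that yields a rule count inside the desired range. The binary-search loop is a guess at the spirit of [LIN01], not its actual algorithm; mine_rules is the hypothetical helper from the earlier association-rule sketch.

```python
def adaptive_rules_for_target(baskets, target, n_rules_range=(3, 10),
                              min_confidence=0.6, max_iters=20):
    """Adjust minSupport until the number of rules for `target` falls in n_rules_range
    (or the iteration budget runs out)."""
    lo, hi = 0.0, 1.0                       # search interval for minSupport
    best = []
    for _ in range(max_iters):
        support = (lo + hi) / 2
        rules = [r for r in mine_rules(baskets, support, min_confidence)
                 if r[1] == target]         # keep only rules whose head is the target item
        best = rules
        if len(rules) > n_rules_range[1]:
            lo = support                    # too many rules: raise minSupport
        elif len(rules) < n_rules_range[0]:
            hi = support                    # too few rules: lower minSupport
        else:
            break
    return best
```
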
Item-Item Methods: Why Do They Work?

   Like(x, Book1) ^ Like(x, Book2) → Like(x, Book3)
   Like(x, Movie1) → Like(x, Movie2)

   (Figure: a “book gang” of users supplies the support for the
   book rule, and a separate “movie gang” supplies the support
   for the movie rule)

   We use the right neighbors for each item without discovering
   the groups themselves, thus eliminating costly online matching

   In general, better quality than user-user methods and better
   response time [LIN03]

Recent Work and Open Problems
   Order-based methods
      Ordering items is more informative than rating them

      [KAM03] developed k-o’means to work on orders

   Preference-based methods
      Total ordering of items is not feasible

      Work on partial orders (preferences) [COH99]

   Integrating background knowledge
      User demographic information, item features, etc.

   Modeling time
      Sequential patterns




References (1)
   Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, Philip S. Yu: Horting Hatches
    an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD
    1999: 201-212
   J. Breese, D. Heckerman, C. Kadie: Empirical Analysis of Predictive
    Algorithms for Collaborative Filtering. In Proc. 14th Conf. Uncertainty in
    Artificial Intelligence, Madison, July 1998.
   Yoon Ho Cho and Jae Kyeong Kim: Application of Web usage mining and
    product taxonomy to collaborative recommendations in e-commerce. Expert
    Systems with Applications, 26(2), 2004
   William W. Cohen, Robert E. Schapire, and Yoram Singer: Learning to order
    things. In Advances in Neural Information Processing Systems 10, Denver, CO, 1997
   Jiawei Han, Fall 2003 online course notes, available at:
    http://www-courses.cs.uiuc.edu/~cs397han/slides/05.ppt
   Toshihiro Kamishima: Nantonac collaborative filtering: recommendation
    based on order responses. KDD 2003: 583-588
   Lee, C.-H, Kim, Y.-H., Rhee, P.-K. Web personalization expert with combining
    collaborative filtering and association rule mining technique. Expert Systems
    with Applications, v 21, n 3, October, 2001, p 131-137

References (2)
   W. Lin, 2001P, online presentation available at: http://www.wiwi.hu-
    berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_W
    ebKDD2000.ppt
   Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz. Efficient adaptive-support
    association rule mining for recommender systems. Data Mining and
    Knowledge Discovery, 6:83--105, 2002
   G. Linden, B. Smith, and J. York: "Amazon.com Recommendations: Item-to-Item
    Collaborative Filtering", IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80,
    Jan. 2003.
    Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Analysis
    of recommendation algorithms for e-commerce. ACM Conf. Electronic
    Commerce 2000: 158-167
   B. Sarwar, G. Karypis, J. Konstan, and J. Riedl: Application of dimensionality
    reduction in recommender systems--a case study. In ACM WebKDD 2000
    Web Mining for E-Commerce Workshop, 2000.
   B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based
    collaborative filtering recommendation algorithms. WWW’01

References (3)
   B. Sarwar, 2000P, online presentation available at: http://www.wiwi.hu-
    berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
   J. Ben Schafer, Joseph A. Konstan, John Riedl: E-Commerce
    Recommendation Applications. Data Mining and Knowledge Discovery 5(1/2):
    115-153, 2001
   L.H. Ungar and D.P. Foster: Clustering Methods for Collaborative Filtering,
    AAAI Workshop on Recommendation Systems, 1998.
   Yi-Fan Wang, Yu-Liang Chuang, Mei-Hua Hsu and Huan-Chao Keh: A
    personalized recommender system for the cosmetic business. Expert
    Systems with Applications, v 26, n 3, April, 2004 Pages 427-434
   S. Vucetic and Z. Obradovic. A regression-based approach for scaling-up
    personalized recommender systems in e-commerce. In ACM WebKDD 2000
    Web Mining for E-Commerce Workshop, 2000.
   Kai Yu, Xiaowei Xu, Martin Ester, and Hans-Peter Kriegel: Selecting relevant
    instances for efficient accurate collaborative filtering. In Proceedings of the
    10th CIKM, pages 239--246. ACM Press, 2001.
   Cheng Zhai, Spring 2003 online course notes available at:
     http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt


Más contenido relacionado

La actualidad más candente

Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns associationDeepaR42
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?Seval Çapraz
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Associations1
Associations1Associations1
Associations1mancnilu
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 

La actualidad más candente (20)

18231979 Data Mining
18231979 Data Mining18231979 Data Mining
18231979 Data Mining
 
05
0505
05
 
Dbm630 Lecture01
Dbm630 Lecture01Dbm630 Lecture01
Dbm630 Lecture01
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns association
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
Associations1
Associations1Associations1
Associations1
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
3 classification
3  classification3  classification
3 classification
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Data mining
Data miningData mining
Data mining
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Data mining
Data miningData mining
Data mining
 
Ej36829834
Ej36829834Ej36829834
Ej36829834
 
From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)
 

Similar a Chapter -11 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining TechniquesHouw Liong The
 
2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheonMark Reynolds
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTIJERA Editor
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
Real time analytics @ netflix
Real time analytics @ netflixReal time analytics @ netflix
Real time analytics @ netflixCody Rioux
 
Project Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social NetworksProject Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social Networksamirhhz
 
Collective Intelligence, part II
Collective Intelligence, part IICollective Intelligence, part II
Collective Intelligence, part IIAli Abbasi
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...KamleshKumar394
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 

Similar a Chapter -11 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber (20)

093
093093
093
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining Techniques
 
2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOT
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Real time analytics @ netflix
Real time analytics @ netflixReal time analytics @ netflix
Real time analytics @ netflix
 
5desc
5desc5desc
5desc
 
Project Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social NetworksProject Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social Networks
 
Big data in GIS Environment
Big data in GIS Environment Big data in GIS Environment
Big data in GIS Environment
 
Collective Intelligence, part II
Collective Intelligence, part IICollective Intelligence, part II
Collective Intelligence, part II
 
data mining
data miningdata mining
data mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
8clst
8clst8clst
8clst
 

Último

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Último (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Chapter -11 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

  • 1. Data Mining: Concepts and Techniques — Chapter 11 — Additional Theme: Collaborative Filtering & Data Mining Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj ©2006 Jiawei Han and Micheline Kamber. All rights reserved 04/18/13 Data Mining: Principles and Algorithms 1
  • 2. 04/18/13 Data Mining: Principles and Algorithms 2
  • 3. Outline  Motivation  Systems in Action  A Conceptual Framework  User-User Methods  Item-Item Methods  Recent Advances and Open Problems 04/18/13 Data Mining: Principles and Algorithms 3
  • 4. Motivation  User Perspective  Lots of online products, books, movies, etc.  Reduce my choices…please…  Manager Perspective “ if I have 3 million customers on the web, I should have 3 million stores on the web.” CEO of Amazon.com [SCH01] 04/18/13 Data Mining: Principles and Algorithms 4
  • 5. Example: Recommendation Customers who bought this book also bought: •Data Preparation for Data Mining: by Dorian Pyle (Author) •The Elements of Statistical Learning: by T. Hastie, et al •Data Mining: Introductory and Advanced Topics: by Margaret H. Dunham •Mining the Web: Analysis of Hypertext and Semi Structured Data 04/18/13 Data Mining: Principles and Algorithms 5
  • 6. Example: Personalization 04/18/13 Data Mining: Principles and Algorithms 6
  • 7. Other Examples  Movielens: movies  Moviecritic: movies again  My launch: music  Gustos starrater: web pages  Jester: Jokes  TV Recommender: TV shows  Suggest 1.0 : different products  And much more… 04/18/13 Data Mining: Principles and Algorithms 7
  • 8. How it Works?  Each user has a profile  Users rate items  Explicitly: score from 1..5  Implicitly: web usage mining  Time spent in viewing the item  Navigation path  Etc…  System does the rest, How?  This is what we will show today 04/18/13 Data Mining: Principles and Algorithms 8
  • 9. Basic Approaches  Collaborative Filtering (CF)  Look at users collective behavior  Look at the active user history  Combine!  Content-based Filtering  Recommend items based on key-words  More appropriate for information retrieval 04/18/13 Data Mining: Principles and Algorithms 9
  • 10. Collaborative Filtering: A Framework Items: I i1 i2 … ij … in u1 The task: u2 3 1.5 …. … 2 Q1: Find Unknown ratings? Q2: Which items should we … rij=? recommend to this user? . ui 2 . ... . 1 Users: U um Unknown function f: U x I→ R 04/18/13 Data Mining: Principles and Algorithms 10
  • 11. Collaborative Filtering Road Map  User-User Methods  Identify like-minded users  Memory-based: K-NN  Model-based: Clustering  Item-Item Method  Identify buying patterns  Correlation Analysis  Linear Regression  Belief Network  Association Rule Mining 04/18/13 Data Mining: Principles and Algorithms 11
  • 12. User-User Similarity: Intuition Target Customer Q3: How to combine? Q1: How to measure similarity? Q2: How to select neighbors? 04/18/13 Data Mining: Principles and Algorithms 12
  • 13. How to Measure Similarity? i1 in  Pearson correlation coefficient ui ∑Rated Itemsra )(rij − ri ) j∈ Commonly (raj − ua w p ( a, i ) = ∑ (raj − ra ) 2 j∈Commonly Rated Items ∑ ( rij − ri ) 2 j∈Commonly Rated Items  Cosine measure  Users are vectors in product-dimension space ra .ri wc (a, i ) = r a 2 * ri 2 04/18/13 Data Mining: Principles and Algorithms 13
  • 14. Nearest Neighbor Approaches [SAR00a]  Offline phase:  Do nothing…just store transactions  Online phase:  Identify highly similar users to the active one  Best K ones  All with a measure greater than a threshold  Prediction ∑ w(a, i)(r − r ) ij i raj = ra + i User a’s neutral ∑ w(a, i) i User i’s deviation User a’s estimated deviation 04/18/13 Data Mining: Principles and Algorithms 14
  • 15. Horting Method [ AGG99 ]  K-NN is not transitive  Horting takes advantage of transitivity  Uses new similarity measure: Predictability  User i predicts user a if  They have rated sufficiently common items  There is an error-bounded linear transformation from user i’s ratings to a’s ones 04/18/13 Data Mining: Principles and Algorithms 15
  • 16. How Horting Works?  Offline phase: build neighborhood graph  Online phase: Compute raj 1- Identify users who predict ua 2- Identify users who rated j Ua 3- Find shortest paths from group1 to 2 4- Backward propagation and averaging - Better for sparse environments - Not well evaluated 04/18/13 Data Mining: Principles and Algorithms 16
  • 17. Clustering [BRE98]  Offline phase:  Build clusters: k-mean, k-medoid, etc.  Online phase:  Identify the nearest cluster to the active user  Prediction:  Use the center of the cluster  Weighted average between cluster members  Weights depend on the active user Faster Slower but a little more accurate 04/18/13 Data Mining: Principles and Algorithms 17
  • 18. Clustering vs. k-NN Approaches  K-NN using Pearson measure is slower but more accurate  Clustering is more scalable Active user Bad recommendations We can use soft clustering but will lose computational edge 04/18/13 Data Mining: Principles and Algorithms 18
  • 19. Did We Answer the Questions? Target Customer Q3: How to combine? Q1: How to measure similarity? Q2: How to select neighbors? 04/18/13 Data Mining: Principles and Algorithms 19
  • 20. Are We Done?  Q1:How to measure similarity? Done... Really?? Sparsity results from the poor representation! ∑ ...... j∈ Commonly Rated Items w p ( a, i ) = U1 rates recycled letter pads High ..... U2 rates recycled memo pads High Both of them like Recycled office products They are similar but the math won’t work for that What about Sparsity? Not enough common Items Example from [SAR00P] implies spurious neighbors and hence bad recommendations By working at the right level of abstraction we can eliminate sparsity 04/18/13 Data Mining: Principles and Algorithms 20
  • 21. The Power of Representation [UNG98] Action Foreign Classic Q1-B: How can we formalize this intuition? 04/18/13 Data Mining: Principles and Algorithms 21
  • 22. How to Abstract?  Semi-manual Methods  Use product features  Cluster products first, then cluster users  Works only if we have descriptive features  Automatic Methods  Adjusted Product Taxonomy  Latent Semantic Indexing 04/18/13 Data Mining: Principles and Algorithms 22
  • 23. Adjusted Product Taxonomy [CHO04] • Input : product taxonomy •Output: modified taxonomy with even distribution 04/18/13 Data Mining: Principles and Algorithms 23
  • 24. Adjusted Product Taxonomy (2) Using original taxonomy Number of transactions having this category Using adjusted taxonomy 04/18/13 Data Mining: Principles and Algorithms 24
  • 25. Latent Semantic Indexing [SAR00b] = Sk I’ R R UUk S k Ik’ mXn k mXr rXr k k k rXn The reconstructed matrix Rk = Uk.Sk.Ik’ is the closest rank-k matrix to the original matrix R. • Captures latent associations • Reduced space is less-noisy 04/18/13 Data Mining: Principles and Algorithms 25
  • 26. Are We Done? (2) Not adequately  Q2:How to Select Neighbors? answered  We don’t expect to use the same neighbors for all products  Neighbors should be product-category specific Q2-B. How can we determine whether or not a user is relevant to a given product? 04/18/13 Data Mining: Principles and Algorithms 26
  • 27. Selecting Relevant Instances [YU01]  Superman and Batman and correlated Predict this  Titanic and Batman are negatively correlated  “Dances with Wolves” has nothing to do with Batman’s rating  Karen is not a good instance to consider How can we formalize this?  Mutual Information  MI(X;Y) = H(X) – H(X|Y) 04/18/13 Data Mining: Principles and Algorithms 27
  • 28. Selecting Relevant Instances (2)  Offline phase:  Estimate mutual information between items  For each item:  Find users who rated it  Compute their strength (how many relevant items they also rated)  Retain subset of them (10% works fine)  Online phase:  To predict the target item’s rating, run k-NN on its reduced instance space Better results with less data… quality not quantity is what matter 04/18/13 Data Mining: Principles and Algorithms 28
  • 29. Are We Done? (3)  Q3:How to combine?  Weighted average  Discover association rules in neighbors’ transactions [LEE01, WAN04]  For every x in this group: like(x, Item1) ^ like(x, Item2) like(x, Item3)  Use confidence and support to judge the quality of the prediction  Prediction is done on the binary level (like, dislike)  Costly to run online 04/18/13 Data Mining: Principles and Algorithms 29
  • 30. User-User Methods Evaluation  Achieve good quality in practice  The more processing we push offline, the better the method scale  However:  User preference is dynamic  High update frequency of offline-calculated information  No recommendation for new users  We don’t know much about them yet 04/18/13 Data Mining: Principles and Algorithms 30
  • 31. Collaborative Filtering Road Map  User-User Methods  Identify like-minded users  Memory-based: K-NN  Model-based: Clustering  Item-Item Method  Identify buying patterns  Correlation Analysis  Linear Regression  Belief Network  Association Rule Mining 04/18/13 Data Mining: Principles and Algorithms 31
  • 32. Item-Item Similarity: The Intuition  Search for similarities among items  All computations can be done offline  Item-Item similarity is more stable that user-user similarity  No need for frequent updates  First Order Models  Correlation Analysis  Linear Regression  Higher Order Models  Belief Network  Association Rule Mining 04/18/13 Data Mining: Principles and Algorithms 32
  • 33. Correlation-based Methods [SAR01]  Same as in user-user similarity but on item vectors  Pearson correlation coefficient  Look for users who rated both items i1 ii ij in ∑ (r uj − r )(rui − ri ) j u1 sij = u∈ Users Rated Both Items ∑ (ruj − rj ) 2 u∈Users Rated Both Items ∑ (rui − ri ) 2 u∈Users Rated Both Items um 04/18/13 Data Mining: Principles and Algorithms 33
  • 34. Correlation-based Methods (2)  Offline phase:  Calculate n(n-1) similarity measures  For each item  Determine its k-most similar items  Online phase:  Predict rating for a given user-item pair as a weighted sum over similar items that he rated ∑s r ij ai raj = i∈similar items Ua 2 3 ? 4 ∑s ij i∈similar items j 04/18/13 Data Mining: Principles and Algorithms 34
• 35. Regression-Based Methods [VUC00]
  - Offline phase:
    - Fit n(n-1) linear regressions
    - $f_{ij}(x)$ is a linear transformation of a user's rating on item i into a rating on item j
  - Online phase:
    - Same as the previous method, but the weights are inversely proportional to the regression error rates:
      $$ r_{aj} = \frac{\sum_{i \in \text{items rated by } a} w_{ij}\, f_{ij}(r_{ai})}{\sum_{i \in \text{items rated by } a} w_{ij}} $$
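A toy sketch of the offline fitting step, using np.polyfit for the pairwise linear models and the inverse mean-squared error as the weight; [VUC00] handles sparsity and regularization more carefully, so treat this as illustrative only:

```python
import numpy as np

def fit_pairwise_regressions(R, unrated=0):
    """Offline: fit f_ij(x) = a*x + b for every ordered item pair, using the users
    who rated both items; the weight is the inverse of the fit's MSE."""
    n = R.shape[1]
    models = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            both = (R[:, i] != unrated) & (R[:, j] != unrated)
            if both.sum() < 3:
                continue
            a, b = np.polyfit(R[both, i], R[both, j], deg=1)
            mse = np.mean((a * R[both, i] + b - R[both, j]) ** 2)
            models[(i, j)] = (a, b, 1.0 / (mse + 1e-6))   # (slope, intercept, weight)
    return models

R = np.array([[5, 4, 1, 0],
              [4, 5, 2, 1],
              [1, 2, 5, 4],
              [2, 1, 4, 5]], dtype=float)
models = fit_pairwise_regressions(R)
a, b, w = models[(0, 1)]
print(round(a, 2), round(b, 2), round(w, 1))  # how a rating on item 0 maps to item 1
```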
• 36. Higher-Order Models
  - The previous approaches use the naïve Bayes assumption: item effects on a given item are independent
  - This is not always true
  - Higher-order models can do better:
    - Belief networks
    - Association rule mining
• 37. Bayesian Belief Networks: Introduction
  - A Bayesian belief network allows a subset of the variables to be conditionally independent
  - A graphical model of causal relationships
    - Represents dependencies among the variables
    - Gives a specification of the joint probability distribution
  - Nodes are random variables; links are dependencies
    - Example graph: X and Y are the parents of Z, and Y is the parent of P
    - There is no dependency between Z and P
    - The graph has no loops or cycles
• 38. Bayesian Belief Network: An Example
  - Figure: a network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC
  - The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

            (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC      0.8       0.5        0.7        0.1
      ~LC     0.2       0.5        0.3        0.9

  - The joint distribution factorizes over each node's parents:
    $$ P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i)) $$
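A tiny sketch of this factorization over three of the nodes. The LungCancer CPT is taken from the slide; the priors for FamilyHistory and Smoker are made-up illustrative numbers, not from the deck:

```python
# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)
p_fh = {True: 0.1, False: 0.9}          # P(FamilyHistory) -- assumed prior
p_s = {True: 0.3, False: 0.7}           # P(Smoker)        -- assumed prior
p_lc = {                                 # P(LungCancer | FH, S), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """Joint probability of one assignment, multiplying each node's CPT entry."""
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(joint(fh=True, s=True, lc=True))   # 0.1 * 0.3 * 0.8 = 0.024
```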
• 39. Belief Networks for CF [BRE98]
  - Every item is a node
  - Binary rating (like, dislike)
  - Learn a belief network offline over the training data
    - The CPT at each node is represented as a decision tree
    - Use greedy algorithms to determine the best network structure
  - Use probabilistic inference for online prediction
• 40. Belief Networks for CF: An Example
  - Figure: the probability decision tree (the CPT) for the random variable "Melrose Place" in the TV/movie domain, with parents such as "Friends" and "B.H."
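As a crude stand-in for the per-node decision-tree CPT idea (not [BRE98]'s actual algorithm: there is no structure search and no probabilistic inference here), one can fit a shallow decision tree that predicts one show's like/dislike from the others; the show names follow the figure and the ratings are made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Binary like(1)/dislike(0) matrix: rows are users, columns are the shows
# ["Friends", "B.H.", "Melrose Place"] -- illustrative data only.
X_all = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [0, 1, 1],
                  [0, 0, 0],
                  [1, 1, 1],
                  [0, 0, 0]])

target_col = 2                       # model the "Melrose Place" node
X = np.delete(X_all, target_col, axis=1)
y = X_all[:, target_col]

# A shallow decision tree plays the role of that node's CPT: each leaf stores
# the empirical probability of liking the target given the parents' values.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Friends", "B.H."]))
print(tree.predict_proba([[1, 0]]))  # [P(dislike), P(like)] for a user who likes only Friends
```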
• 41. Association Rule Mining
  - Offline processing:
    - Work on the binary level (like, dislike)
    - View each user as a market basket containing the items that user liked
    - Discover association rules between items
  - Online processing:
    - Match the items the active user likes against the rules' left-hand sides
    - Recommend the rules' consequents, ranked by support and confidence
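A minimal sketch of the online matching step, assuming the rules (antecedent, consequent, confidence, support) were already mined offline; all values are illustrative:

```python
rules = [
    ({"Book1", "Book2"}, "Book3", 0.80, 0.15),   # (antecedent, consequent, confidence, support)
    ({"Book1"},          "Book4", 0.60, 0.30),
    ({"Movie1"},         "Movie2", 0.90, 0.10),
]

def recommend(liked, rules, top_n=2):
    """Fire every rule whose left-hand side is contained in the user's liked set,
    then rank the consequents by confidence (ties broken by support)."""
    fired = [(conf, supp, cons) for ante, cons, conf, supp in rules
             if ante <= liked and cons not in liked]
    fired.sort(reverse=True)
    return [cons for _, _, cons in fired[:top_n]]

print(recommend({"Book1", "Book2", "Movie1"}, rules))   # ['Movie2', 'Book3']
```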
• 42. Association Rule Mining: Problems
  - A high support threshold leads to low coverage and may eliminate important but infrequent items from consideration
  - Low support thresholds result in very large models, a computationally expensive offline pattern-discovery phase, and a slower online matching phase
  - Solution: adaptive association rule mining
• 43. Adaptive Association Rule Mining [LIN01]
  - Given:
    - a transaction dataset
    - a target item
    - a desired range for the number of rules
    - a specified minimum confidence (minConfidence)
  - Find: a set S of association rules for the target item such that
    - the number of rules in S is in the given range
    - the rules in S satisfy the minimum confidence constraint
    - the rules in S have higher support than rules not in S that satisfy the above constraints (i.e., minSupport is adjusted rather than fixed)
• 44. Adaptive Association Rule Mining (2)
  - Discover rules with a single item in the head: like(x, item1) ∧ like(x, item2) → like(x, target)
  - The miner discovers association rules iteratively (for each target item) until the desired number of rules is extracted
  - Support is adjusted per item
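A toy, brute-force sketch of the adaptive idea: lower minSupport step by step until the number of rules for the target item falls into the desired range. The transactions, thresholds, and the miner itself are illustrative assumptions, not [LIN01]'s algorithm:

```python
from itertools import combinations

# Toy transactions: the set of items each user liked (illustrative data only).
transactions = [
    {"A", "B", "T"}, {"A", "T"}, {"B", "C", "T"},
    {"A", "B"}, {"C"}, {"A", "B", "C", "T"},
]

def rules_for_target(target, min_support, min_confidence, max_antecedent=2):
    """Brute-force rules of the form antecedent -> target (fine for toy data)."""
    n = len(transactions)
    items = sorted(set().union(*transactions) - {target})
    rules = []
    for size in range(1, max_antecedent + 1):
        for ante in combinations(items, size):
            ante = set(ante)
            covered = [t for t in transactions if ante <= t]
            hits = [t for t in covered if target in t]
            support = len(hits) / n
            confidence = (len(hits) / len(covered)) if covered else 0.0
            if support >= min_support and confidence >= min_confidence:
                rules.append((frozenset(ante), support, confidence))
    return rules

def adaptive_mine(target, want=(2, 4), min_confidence=0.6):
    """Lower minSupport until the rule count for this target falls in the desired range."""
    for min_support in (0.5, 0.4, 0.3, 0.2, 0.1):
        rules = rules_for_target(target, min_support, min_confidence)
        if want[0] <= len(rules) <= want[1]:
            break
    return min_support, rules

support, rules = adaptive_mine("T")
print(support, [(sorted(a), round(s, 2), round(c, 2)) for a, s, c in rules])
```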
• 45. Item-Item Methods: Why Do They Work?
  - Example: like(x, Book1) ∧ like(x, Book2) → like(x, Book3) is supported by the "Book gang", while like(x, Movie1) → like(x, Movie2) is supported by the "Movie gang"
  - We use the right neighbors for each item without discovering the groups themselves, thus eliminating costly online matching
  - In general, better quality than user-user methods and better response time [LIN03]
• 46. Recent Work and Open Problems
  - Order-based methods
    - Ordering items is more informative than rating them
    - [KAM03] developed k-o'means to work on orders
  - Preference-based methods
    - A total ordering of items is not feasible
    - Work on partial orders (preferences) [COH99]
  - Integrating background knowledge
    - User demographic information, item features, etc.
  - Modeling time
    - Sequential patterns
• 47. References (1)
  - Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, Philip S. Yu: Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD 1999: 201-212
  - J. Breese, D. Heckerman, C. Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998
  - Yoon Ho Cho and Jae Kyeong Kim: Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003
  - William W. Cohen, Robert E. Schapire, and Yoram Singer: Learning to order things. In Advances in Neural Information Processing Systems 10, Denver, CO, 1997
  - Jiawei Han, Fall 2003 online course notes, available at: http://www-courses.cs.uiuc.edu/~cs397han/slides/05.ppt
  - Toshihiro Kamishima: Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588
  - Lee, C.-H., Kim, Y.-H., Rhee, P.-K.: Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3), October 2001, pp. 131-137
• 48. References (2)
  - W. Lin, 2001P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
  - Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz: Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002
  - G. Linden, B. Smith, and J. York: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, Jan. 2003
  - Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167
  - B. Sarwar, G. Karypis, J. Konstan, and J. Riedl: Application of dimensionality reduction in recommender systems -- a case study. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
  - B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl: Item-based collaborative filtering recommendation algorithms. WWW 2001
• 49. References (3)
  - B. Sarwar, 2000P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
  - J. Ben Schafer, Joseph A. Konstan, John Riedl: E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1/2): 115-153, 2001
  - L. H. Ungar and D. P. Foster: Clustering Methods for Collaborative Filtering. AAAI Workshop on Recommendation Systems, 1998
  - Yi-Fan Wang, Yu-Liang Chuang, Mei-Hua Hsu, and Huan-Chao Keh: A personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3), April 2004, pp. 427-434
  - S. Vucetic and Z. Obradovic: A regression-based approach for scaling-up personalized recommender systems in e-commerce. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
  - Kai Yu, Xiaowei Xu, Martin Ester, and Hans-Peter Kriegel: Selecting relevant instances for efficient and accurate collaborative filtering. Proc. 10th CIKM, pp. 239-246, ACM Press, 2001
  - Cheng Zhai, Spring 2003 online course notes, available at: http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt