Link Mining

Lise Getoor
University of Maryland, College Park



              August 22, 2012
Alternate Title…..
                 What
Machine Learning/Statistics/Data Mining
           can do for YOU!

1. Predict future values
2. Fill in missing values       Supervised Learning

3. Identify anomalies
4. Find patterns                Unsupervised Learning
5. Identify clusters

What are some common machine learning algorithms?
So, what’s Link Mining???
   Machine learning when you have graphs (or networks)
       Nodes are entities
         •   People
         •   Places
         •   Organizations
         •   Text
       Links are relationships
         •   Friends
         •   MemberOf
         •   LivesIn
         •   Tweeted
         •   Posted
       e.g., heterogeneous multi-relational data, multimodal
        data …..
Ex: Social Media Relationships
                         User-User
                           Friends
                           Collaborators
                           Family
          Ua     Ub        Fan/Follower
                           Replies
                           Co-Edits
                           Co-Mentions, etc.
                         User-Doc
     U         Doc1        Comments
                           Edits, etc.


U    Q         URL    User-Query-Click


U   Tag        Doc     User-Tag-Doc
Link Mining Tasks
   Node Labeling
   Link Prediction
   Entity Resolution
    Group Detection
Node Labeling

       What is Harry’s
     political persuasion?




                  Harry




 Natasha
Link Prediction




            Friends?
Entity Resolution
   Aka: deduplication, co-reference resolution, record
    linkage, reference consolidation, etc.
Abstract Problem Statement
Real World            Digital World
                      Records / Mentions
Deduplication Problem Statement
   Cluster the records/mentions that correspond to the
    same entity
       Intensional Variant: Compute cluster representative
Record Linkage Problem Statement
   Link records that match across databases

                                               B
A
Reference Matching Problem
   Match noisy records to clean records in a reference
    table

                                          Reference
                                            Table
InfoVis Co-Author Network Fragment




  before              after
Group Detection
Link Mining Algorithms
   Node Labeling       1. Relational Classifiers
                        2. Collective Classifiers
   Link Prediction
   Entity Resolution
    Group Detection
Relational Classifiers
 Given: [Figure: graph over nodes a-e, 1-5, and w-z]

    Task: Predict attribute of some of the entities
    Alternate task: Predict existence of relationship between entities

 [Figure: one feature row per entity (1, 2, ..., 5) or per entity pair ((1, 2), (1, 3), ..., (4, 5)), each with an unknown target "?"]

    Local features
    Relational features: avg value of neighbors, number of neighbors,
     number of shared neighbors, same-attribute-value, participate in relation
Relational Classifiers
   Values are represented as a fixed-length feature
    vector

   Instances are treated independently of each other

   Relational features are computed by aggregating
    over related entities

   Any classification or regression model can be used
    for learning and prediction
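
A minimal sketch of the idea above, assuming a toy undirected graph with one numeric attribute per node. The node names, attribute values, and the `feature_vector` helper are invented for illustration; they are not from the talk. Each node gets a fixed-length vector made of one local feature and two relational features aggregated over its neighbors.

```python
from statistics import mean

# Toy graph (hypothetical): adjacency sets and one numeric attribute per node.
neighbors = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
}
value = {"a": 1.0, "b": 3.0, "c": 5.0, "d": 7.0}


def feature_vector(node):
    """Return [own value, number of neighbors, avg value of neighbors]."""
    nbrs = neighbors[node]
    # Aggregate over related entities; fall back to 0.0 for isolated nodes.
    avg_nbr = mean(value[n] for n in nbrs) if nbrs else 0.0
    return [value[node], len(nbrs), avg_nbr]


# One independent, fixed-length instance per node.
features = {node: feature_vector(node) for node in neighbors}
```

Because every instance ends up as a plain vector, the rows in `features` could be handed to any off-the-shelf classifier or regressor.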
Application Case Studies
   Two example applications that use relational
    classifiers
       Focus is on types of relational features used

   Case Study 1: Predicting click-through rate of
    search result ads
   Case Study 2: Predicting friendships in a social
    network
Case Study 1:
         Predicting Ad Click-Through Rate
   Task: Predict the click-through rate (CTR) of an
    online ad, given that it is seen by the user, where
    the ad is described by:
       URL to which user is sent when clicking on ad
       Bid terms used to determine when to display ad
       Title and text of ad

   Our description is based on approach by
       [Richardson et al., WWW07]
Relational Features Used
   [Figure: target ad (CTR?) with bid terms BT1-BT3 via contains-bid-term (according to search engine); related bid terms BT4-BT6 via related-bid-term (containing subsets or supersets of the term); ads Ad1-Ad3 and Ad4-Ad6 each summarized by Average CTR; queried-bid-term links summarized by Count]
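
As a rough illustration of the "Average CTR" and "Count" features named in the figure, the sketch below assumes a small in-memory table of existing ads. The ad names reuse the figure's labels, but the bid-term assignments, CTR values, and the `related_ad_features` helper are made up for this example and do not come from Richardson et al.

```python
from statistics import mean

# Hypothetical historical data: bid terms per ad and observed CTR per ad.
ad_terms = {"Ad1": {"BT1"}, "Ad2": {"BT1", "BT2"}, "Ad3": {"BT3"}}
ctr = {"Ad1": 0.02, "Ad2": 0.04, "Ad3": 0.10}


def related_ad_features(bid_terms):
    """Aggregate over existing ads that share at least one bid term."""
    related = [ad for ad, terms in ad_terms.items() if terms & bid_terms]
    avg = mean(ctr[ad] for ad in related) if related else None
    return {"count": len(related), "avg_ctr": avg}


# Features for a new ad that bids on BT1 only.
new_ad = related_ad_features({"BT1"})
```

The same aggregation could be repeated over ads that use related bid terms (subsets or supersets of the term) to obtain a second pair of features.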
Case Study 2:
         Predicting Friendships

   Task: Predict new friendships among users, based
    on their descriptive attributes, their existing
    friendships, and their family ties.

   Our description is based on approach by
       [Zheleva et al., SNAKDD08]
Relational Features Used
   “Petworks” - social networks of pets

   [Figure: candidate friendship (Friends?) between pets P1 and P2, surrounded by pets P3-P11 and families F1 and F2. Labels name the features: count; count, density; count, proportion; Jaccard coeff; in-family; same-breed]
Key Idea: Feature Construction
   Feature informativeness is key to the success of a
    relational classifier

   Features can be:
       Attributes of entity/entities
       Match predicate on attributes of entities
       Attributes of related entities
       Encode structural features
       Based on overlap in sets
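
The feature types listed above can be sketched for a candidate pair of entities. This is an illustrative example only: the friend sets, the breed values, and the `pair_features` helper are assumptions, loosely modeled on the pet network from Case Study 2.

```python
def pair_features(u, v, neighbors, breed):
    """Features for a candidate link (u, v)."""
    shared = neighbors[u] & neighbors[v]
    union = neighbors[u] | neighbors[v]
    return {
        # Structural feature: number of shared neighbors.
        "shared_neighbors": len(shared),
        # Overlap in sets: Jaccard coefficient of the two neighbor sets.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Match predicate on an attribute of both entities.
        "same_breed": breed[u] == breed[v],
    }


# Hypothetical data for two pets and their current friends.
neighbors = {"P1": {"P3", "P4", "P6"}, "P2": {"P4", "P8", "P9"}}
breed = {"P1": "beagle", "P2": "beagle"}
feats = pair_features("P1", "P2", neighbors, breed)
```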
Link Mining Algorithms
   Node Labeling       1. Relational Classifiers
                        2. Collective Classifiers
   Link Prediction
   Entity Resolution
    Group Detection
Collective Classification
             [Neville & Jensen, SRL00; Lu & Getoor, ICML03, Sen et al. AI Mag08]


   Extends relational classifiers by allowing relational
    features to be functions of predicted attributes/relations
    of neighbors
   At training time, these features are computed based on
    observed values in the training set
   At inference time, the algorithm iterates, computing
    relational features based on the current prediction for
    any unobserved attributes
       In the first, bootstrap, iteration, only local features are
        used
CC: Learning
   label set:

   [Figure: fully labeled training network over nodes P1-P10]

       Learn models (local and relational) from
         fully labeled training set
CC: Inference (1)


   [Figure: network over nodes P1-P5]

Step 1: Bootstrap using entity attributes only
CC: Inference (2)


   [Figure: network over nodes P1-P5]

Step 2: Iteratively update the category of each entity,
  based on related entities’ categories
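
The two inference steps can be sketched as a simple iterative loop. This is a minimal illustration, not the method from the cited papers: `local_score` and `relational_score` are hand-written stand-ins for learned models, and the attribute values, graph, and 0.5 decision threshold are assumptions.

```python
def local_score(x):
    """Stand-in local model: score for class 1 from the node's own attribute."""
    return x


def relational_score(x, frac_pos):
    """Stand-in relational model: mix own attribute with neighbors' labels."""
    return 0.5 * x + 0.5 * frac_pos


def collective_classify(attrs, neighbors, observed, max_iters=10):
    labels = dict(observed)
    unknown = [n for n in attrs if n not in observed]
    # Step 1: bootstrap unobserved nodes using local features only.
    for n in unknown:
        labels[n] = int(local_score(attrs[n]) >= 0.5)
    # Step 2: re-predict each unobserved node from the current labels
    # of its neighbors until nothing changes (incremental updates).
    for _ in range(max_iters):
        changed = False
        for n in unknown:
            nbrs = neighbors[n]
            frac = sum(labels[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
            new = int(relational_score(attrs[n], frac) >= 0.5)
            if new != labels[n]:
                labels[n], changed = new, True
        if not changed:
            break
    return labels


# Hypothetical five-node network; P1 and P5 have observed labels.
attrs = {"P1": 0.9, "P2": 0.8, "P3": 0.4, "P4": 0.45, "P5": 0.1}
neighbors = {
    "P1": {"P2", "P3"}, "P2": {"P1", "P3"}, "P3": {"P1", "P2", "P4"},
    "P4": {"P3", "P5"}, "P5": {"P4"},
}
result = collective_classify(attrs, neighbors, observed={"P1": 1, "P5": 0})
```

In this toy run P3 starts as class 0 after the bootstrap step and flips to class 1 once its neighbors' labels are taken into account, which is the propagation effect the slides describe.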
CC Key Idea
   Rather than make predictions independently, begin
    with relational classifier, and then ‘propagate’
    classification

   Variations:
       Propagate probabilities, rather than mode (related to
        Gibbs Sampling)
       Batch vs. Incremental updates
       Ordering strategies


   Active area of research: active learning, semi-
    supervised learning, more principled joint
    probabilistic models, etc.
Link Mining Algorithms
   Node Labeling
   Link Prediction
   Entity Resolution
    Group Detection
The Entity Resolution Problem
   [Figure: real-world entities John Smith, James Smith, and Jonathan Smith, and the mentions that refer to them: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith”]

   Issues:
   1.   Identification
   2.   Disambiguation
Relational Identification




                   Very similar names.
                   Added evidence from
                   shared co-authors
Relational Disambiguation




                     Very similar names
                     but no shared
                     collaborators
Collective Entity Resolution




                       One resolution
                       provides evidence
                       for another => joint
                       resolution
P1: “JOSTLE: Partitioning of Unstructured Meshes for
    Massively Parallel Machines”, C. Walshaw, M. Cross,
    M. G. Everett, S. Johnson

P2: “Partitioning & Mapping of Unstructured Meshes to
    Parallel Machine Topologies”, C. Walshaw, M. Cross, M.
    G. Everett, S. Johnson, K. McManus

P3: “Dynamic Mesh Partitioning: A Unified Optimisation and
    Load-Balancing Algorithm”, C. Walshaw, M. Cross, M.
    G. Everett

P4: “Code Generation for Machines with Multiregister
    Operations”, Alfred V. Aho, Stephen C. Johnson,
    Jefferey D. Ullman

P5: “Deterministic Parsing of Ambiguous Grammars”, A.
    Aho, S. Johnson, J. Ullman

P6: “Compilers: Principles, Techniques, and Tools”, A. Aho,
    R. Sethi, J. Ullman
Relational Clustering (RC-ER)

P1   C. Walshaw      M. Cross   M. G. Everett     S. Johnson

P2   C. Walshaw      M. Cross   M. Everett        S. Johnson   K. McManus

P4   Alfred V. Aho     Jefferey D. Ullman     Stephen C. Johnson

P5   A. Aho            J. Ullman              S. Johnson
Cut-based Formulation of RC-ER

   [Figure: two alternative clusterings of the references M. G. Everett, M. Everett, A. Aho, Alfred V. Aho, S. Johnson (three mentions), and Stephen C. Johnson]

   Left clustering:
       Good separation of attributes
       Many cluster-cluster relationships
        Aho-Johnson1, Aho-Johnson2, Everett-Johnson1

   Right clustering:
       Worse in terms of attributes
       Fewer cluster-cluster relationships
        Aho-Johnson1, Everett-Johnson2
Objective Function
   Minimize:

       \sum_i \sum_j \left[ w_A \, sim_A(c_i, c_j) + w_R \, sim_R(c_i, c_j) \right]

         w_A               weight for attributes
         sim_A(c_i, c_j)   similarity of attributes
         w_R               weight for relations
         sim_R(c_i, c_j)   similarity based on relational edges between c_i and c_j

   Greedy clustering algorithm: merge cluster pair with max
    reduction in objective function

       \Delta(c_i, c_j) = w_A \, sim_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)|

         sim_A(c_i, c_j)        similarity of attributes
         |N(c_i) \cap N(c_j)|   common cluster neighborhood
Relational Clustering Algorithm
1.   Find similar references using ‘blocking’
2.   Bootstrap clusters using attributes and relations
3.   Compute similarities for cluster pairs and insert into
     priority queue

4.   Repeat until priority queue is empty
5.        Find ‘closest’ cluster pair
6.        Stop if similarity below threshold
7.        Merge to create new cluster
8.        Update similarity for ‘related’ clusters

    O(n k log n) algorithm w/ efficient implementation
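
The greedy loop in steps 3 to 8 can be sketched as follows. This is a simplified illustration under several assumptions, not the RC-ER implementation: blocking and bootstrapping (steps 1 and 2) are skipped, `difflib.SequenceMatcher` stands in for the name similarity measure, the relational term is the Jaccard overlap of cluster neighborhoods (a normalized variant of the common-neighborhood count), and the weights and threshold are arbitrary. It also re-scores pairs naively, so it does not achieve the O(n k log n) bound.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def relational_clustering(names, papers, w_a=0.5, w_r=0.5, threshold=0.45):
    """Greedy agglomerative clustering over author references."""
    # co[r] = references that appear on the same paper as reference r.
    co = {r: set() for r in range(len(names))}
    for refs in papers:
        for r in refs:
            co[r].update(x for x in refs if x != r)

    clusters = {r: {r} for r in range(len(names))}  # cluster id -> references
    owner = list(range(len(names)))                 # reference -> cluster id
    next_id = len(names)

    def nbrs(c):
        # Cluster neighborhood N(c): clusters of co-occurring references.
        return {owner[x] for r in clusters[c] for x in co[r]}

    def sim(a, b):
        s_attr = max(SequenceMatcher(None, names[x], names[y]).ratio()
                     for x in clusters[a] for y in clusters[b])
        na, nb = nbrs(a), nbrs(b)
        s_rel = len(na & nb) / len(na | nb) if na | nb else 0.0
        return w_a * s_attr + w_r * s_rel

    # Step 3: score all cluster pairs and load the priority queue.
    heap = [(-sim(a, b), a, b) for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)

    while heap:                                    # step 4
        neg, a, b = heapq.heappop(heap)            # step 5
        # Skip entries whose clusters were merged or whose score is outdated.
        if a not in clusters or b not in clusters or abs(sim(a, b) + neg) > 1e-9:
            continue
        if -neg < threshold:                       # step 6
            break
        new, next_id = next_id, next_id + 1        # step 7
        clusters[new] = clusters.pop(a) | clusters.pop(b)
        for r in clusters[new]:
            owner[r] = new
        # Step 8: re-score pairs touching the new cluster or its neighborhood.
        for c in {new} | nbrs(new):
            for d in clusters:
                if d != c:
                    heapq.heappush(heap, (-sim(c, d), min(c, d), max(c, d)))
    return clusters


# Author references from papers P1, P2, and P4 of the running example.
names = ["M. G. Everett", "S. Johnson", "M. Everett", "S. Johnson",
         "Alfred V. Aho", "Stephen C. Johnson"]
papers = [[0, 1], [2, 3], [4, 5]]
groups = sorted(sorted(c) for c in relational_clustering(names, papers).values())
```

With these settings the two "S. Johnson" references merge first on name similarity alone. That merge gives the two Everett references a shared cluster neighbor, which lifts their score above the threshold, while "Stephen C. Johnson" stays separate because it shares no collaborators with the others.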
Evaluation Datasets
   CiteSeer
       1,504 citations to machine learning papers (Lawrence et al.)
       2,892 references to 1,165 author entities


   arXiv
       29,555 publications from High Energy Physics (KDD Cup’03)
       58,515 refs to 9,200 authors


   Elsevier BioBase
       156,156 Biology papers (IBM KDD Challenge ’05)
       831,991 author refs
       Keywords, topic classifications, language, country and affiliation
        of corresponding author, etc.
Baselines
   A: Pair-wise duplicate decisions w/ attributes only
       Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
       Other textual attributes: TF-IDF
   A*: Transitive closure over A


   A+N: Add attribute similarity of co-occurring refs
   A+N*: Transitive closure over A+N

   Evaluate pair-wise decisions over references
   F1-measure (harmonic mean of precision and recall)
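
Two pieces of this setup lend themselves to a short sketch: taking the transitive closure of pair-wise duplicate decisions, and scoring the result with pair-wise F1. The helper names and the toy data below are assumptions for illustration; the closure uses a standard union-find structure.

```python
from itertools import combinations


def transitive_closure(n, matched_pairs):
    """Union-find: return a cluster label for each of n references."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    return [find(x) for x in range(n)]


def pairwise_f1(predicted, truth):
    """F1 over all reference pairs; a pair is positive if co-clustered."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair-wise decisions over five references.
labels = transitive_closure(5, [(0, 1), (1, 2)])
truth = [0, 0, 0, 1, 1]          # true entity id per reference
f1 = pairwise_f1(labels, truth)
```

Here the closure adds the implied match between references 0 and 2, while the missed pair (3, 4) lowers recall and therefore F1.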
ER over Entire Dataset
        Algorithm      CiteSeer        arXiv        BioBase
            A            0.980         0.976         0.568
            A*           0.990         0.971         0.559
           A+N           0.973         0.938         0.710
           A+N*          0.984         0.934         0.753
          RC-ER          0.995         0.985         0.818



   RC-ER outperforms baselines in all datasets
   Collective resolution better than naïve relational resolution
   CiteSeer: Near perfect resolution; 22% error reduction
   arXiv: 6,500 additional correct resolutions; 20% error reduction
   BioBase: Biggest improvement over baselines
Flipside….
Privacy breaches in OSNs
       Identity disclosure
         A mapping from a record
          to a specific individual
          (e.g., "Who is [person]?")

       Attribute disclosure
         Find attribute value that the
          user intended to stay private
          (e.g., "Is [person] liberal?")

       Social link disclosure
         Participation in a sensitive
          relationship or communication
          (e.g., "Friends?")

       Affiliation link disclosure
         Participation in a group revealing
          a sensitive attribute value
          (e.g., "Support gay marriage")
Other Linqs Projects
   Key Opinion Leader Identification
   Active Surveying in Social Networks
   Ontology Alignment and Folksonomy construction
   Label Acquisition & Active Learning in Network Data
   Inference & Search in Camera Networks
   Identifying Roles in Social Networks
   Group Recommendation in Social Networks
   Social Search
   Analysis of Dynamic Networks: loyalty, stability, diversity
   Ranking and Retrieval in Biological Networks
   Discourse-level sentiment analysis
   Bilingual Word Sense Disambiguation
   Visual Analytics:
        D-Dupe, C-Group, G-Pare
   Others …
                                       http://www.cs.umd.edu/linqs
Conclusion
   Link mining algorithms can be useful tools for social
    media
   Need algorithms that can handle the multi-modal,
    multi-relational, temporal nature of social media
   Collective algorithms make use of
       Structure to define features and propagate
        information, which allows us to improve the overall accuracy
   While there are important pitfalls to take into
    account (confidence and privacy), there are
    many potential benefits and payoffs (improved
    personalization and context-aware predictions!)
http://www.cs.umd.edu/linqs



      Work sponsored by the National Science Foundation,
Maryland Industrial Partners (MIPS), National Geospatial Agency,
           Air Force Research Laboratory, DARPA,
                 Google, Microsoft, and Yahoo!

Adding Semantics to Social Software Engineering (by Steffen Lohmann & Thomas ...Adding Semantics to Social Software Engineering (by Steffen Lohmann & Thomas ...
Adding Semantics to Social Software Engineering (by Steffen Lohmann & Thomas ...
 
Metaphors as design points for collaboration 2012
Metaphors as design points for collaboration 2012Metaphors as design points for collaboration 2012
Metaphors as design points for collaboration 2012
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender Systems
 
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
 
Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts
 
Web services for supporting the interactions of learners in the social web - ...
Web services for supporting the interactions of learners in the social web - ...Web services for supporting the interactions of learners in the social web - ...
Web services for supporting the interactions of learners in the social web - ...
 
Content, Connections, and Context
Content, Connections, and ContextContent, Connections, and Context
Content, Connections, and Context
 
Adapting Rankers Online, Maarten de Rijke
Adapting Rankers Online, Maarten de RijkeAdapting Rankers Online, Maarten de Rijke
Adapting Rankers Online, Maarten de Rijke
 
Adapto\ing Rankers Online, Maarten de Rijke
Adapto\ing Rankers Online, Maarten de RijkeAdapto\ing Rankers Online, Maarten de Rijke
Adapto\ing Rankers Online, Maarten de Rijke
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahout
 
Context-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity LinkingContext-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity Linking
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Lise Getoor, "

  • 1. Link Mining. Lise Getoor, University of Maryland, College Park. August 22, 2012
  • 2. Alternate Title: What Machine Learning/Statistics/Data Mining can do for YOU! Supervised Learning: 1. Predict future values, 2. Fill in missing values. 3. Identify anomalies. Unsupervised Learning: 4. Find patterns, 5. Identify clusters. What are some common machine learning algorithms?
  • 3. So, what’s Link Mining? Machine learning when you have graphs (or networks). Nodes are entities: People, Places, Organizations, Text. Links are relationships: Friends, MemberOf, LivesIn, Tweeted, Posted. E.g., heterogeneous multi-relational data, multimodal data …
  • 4. Ex: Social Media Relationships. User-User (Ua, Ub): Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc. User-Doc (U, Doc1): Comments, Edits, etc. User-Query-Click (U, Q, URL). User-Tag-Doc (U, Tag, Doc).
  • 5. Link Mining Tasks: Node Labeling; Link Prediction; Entity Resolution; Group Detection
  • 6. Node Labeling: What is Harry’s political persuasion? (Figure: network with nodes Harry and Natasha.)
  • 7. Link Prediction Friends?
  • 8. Entity Resolution. Aka: deduplication, co-reference resolution, record linkage, reference consolidation, etc.
  • 9. Abstract Problem Statement. Real World; Digital World: Records / Mentions
  • 10. Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity
  • 11. Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity. Intensional Variant: Compute cluster representative
  • 12. Record Linkage Problem Statement: Link records that match across databases A and B
  • 13. Reference Matching Problem: Match noisy records to clean records in a reference table
  • 14. InfoVis Co-Author Network Fragment (figure: before and after)
  • 16. Link Mining Algorithms: Node Labeling; Link Prediction; Entity Resolution; Group Detection
  • 17. Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers); Link Prediction; Entity Resolution; Group Detection
  • 18. Relational Classifiers. Given: a graph of entities and relationships (figure). Task: Predict attribute of some of the entities. Alternate task: Predict existence of relationship between entities. Features: local features and relational features, e.g., avg value of neighbors, number of neighbors, same-attribute-value, number of shared neighbors, participate in relation.
  • 19. Relational Classifiers. Values are represented as a fixed-length feature vector. Instances are treated independently of each other. Relational features are computed by aggregating over related entities. Any classification or regression model can be used for learning and prediction.
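The aggregation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the toy graph, the `attribute` values, and the function name `node_features` are all hypothetical. Each node gets a fixed-length vector made of one local value plus two aggregates over its neighbors.

```python
from statistics import mean

# Hypothetical toy graph: adjacency sets and one numeric attribute per node.
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
attribute = {"a": 1.0, "b": 3.0, "c": 5.0, "d": 2.0}


def node_features(node):
    """Return [local value, number of neighbors, average neighbor value]."""
    nbrs = neighbors[node]
    avg = mean(attribute[n] for n in nbrs) if nbrs else 0.0
    return [attribute[node], len(nbrs), avg]


# One fixed-length row per node; any off-the-shelf classifier or
# regression model could be trained on rows like these.
features = {n: node_features(n) for n in neighbors}
```

Because every node maps to a vector of the same length regardless of its degree, the instances can be handed to a standard learner and treated independently.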
  • 20. Application Case Studies. Two example applications that use relational classifiers. Focus is on types of relational features used. Case Study 1: Predicting click-through rate of search result ads. Case Study 2: Predicting friendships in a social network.
  • 21. Case Study 1: Predicting Ad Click-Through Rate. Task: Predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by: URL to which user is sent when clicking on ad; bid terms used to determine when to display ad; title and text of ad. Our description is based on the approach by [Richardson et al., WWW07].
  • 22. Relational Features Used. (Figure: the target Ad (CTR?) and ads Ad1-Ad6 linked to bid terms BT1-BT6. Edge labels: contains-bid-term, related-bid-term (containing subsets or supersets of the term), queried-bid-term. Other labels: "according to search engine"; aggregates: Average CTR, Count.)
  • 23. Case Study 2: Predicting Friendships. Task: Predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties. Our description is based on the approach by [Zheleva et al., SNAKDD08].
  • 24. Relational Features Used. “Petworks”: social networks of pets. (Figure: pets P1-P11 in families F1 and F2, with the query "Friends?". Feature labels: count, density; count, proportion; count; Jaccard coeff; in-family; same-breed.)
  • 25. Key Idea: Feature Construction. Feature informativeness is key to the success of a relational classifier. Features can be: attributes of entity/entities; match predicate on attributes of entities; attributes of related entities; encode structural features; based on overlap in sets.
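For the link prediction case, the feature types listed above can be illustrated on a pair of entities. The following Python sketch uses made-up pet data; the dictionaries `friends`, `breed`, and `family` and the function `pair_features` are assumptions for illustration and are not taken from the cited work. It shows a match predicate (same breed), a relation check (same family), a structural count (shared friends), and a set-overlap feature (Jaccard coefficient).

```python
# Hypothetical "petwork": friendship sets, breeds, and owning families.
friends = {
    "P1": {"P3", "P4"},
    "P2": {"P4", "P5"},
    "P3": {"P1"},
    "P4": {"P1", "P2"},
    "P5": {"P2"},
}
breed = {"P1": "beagle", "P2": "beagle", "P3": "pug", "P4": "collie", "P5": "pug"}
family = {"P1": "F1", "P2": "F2", "P3": "F1", "P4": "F2", "P5": "F2"}


def pair_features(u, v):
    """Features describing a candidate friendship between pets u and v."""
    shared = friends[u] & friends[v]
    union = friends[u] | friends[v]
    return {
        "same_breed": breed[u] == breed[v],    # match predicate on attributes
        "in_family": family[u] == family[v],   # participation in a relation
        "shared_friends": len(shared),         # structural feature
        "jaccard": len(shared) / len(union) if union else 0.0,  # set overlap
    }


example = pair_features("P1", "P2")
```

Each candidate pair yields one row, so the link prediction task reduces to ordinary binary classification over these rows.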
  • 26. Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers); Link Prediction; Entity Resolution; Group Detection
  • 27. Collective Classification [Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag08]. Extends relational classifiers by allowing relational features to be functions of predicted attributes/relations of neighbors. At training time, these features are computed based on observed values in the training set. At inference time, the algorithm iterates, computing relational features based on the current prediction for any unobserved attributes. In the first, bootstrap, iteration, only local features are used.
  • 28. CC: Learning. Learn models (local and relational) from fully labeled training set. (Figure: label set and labeled entities P1-P10.)
  • 29. CC: Inference (1). Step 1: Bootstrap using entity attributes only. (Figure: entities P1-P5.)
  • 30. CC: Inference (2). Step 2: Iteratively update the category of each entity, based on related entities’ categories. (Figure: entities P1-P5.)
  • 31. CC Key Idea. Rather than make predictions independently, begin with relational classifier, and then ‘propagate’ classification. Variations: propagate probabilities, rather than mode (related to Gibbs Sampling); batch vs. incremental updates; ordering strategies. Active area of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.
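The two inference steps (bootstrap, then iterative update) can be sketched as follows. This is a simplified illustration under stated assumptions: the learned local and relational models are replaced by a hand-set scoring rule (an equal-weight mix of a hypothetical `local_score` and the fraction of positively labeled neighbors), labels are binary, and updates are incremental in a fixed order.

```python
def iterative_classification(neighbors, local_score, observed, max_iters=10):
    """Label unobserved nodes with 0/1, propagating neighbor labels.

    neighbors:   node -> set of adjacent nodes
    local_score: node -> score in [0, 1] from local features only (assumed given)
    observed:    node -> known label (0 or 1)
    """
    labels = dict(observed)
    unknown = [n for n in neighbors if n not in observed]

    # Step 1: bootstrap using local features only.
    for n in unknown:
        labels[n] = int(local_score[n] >= 0.5)

    # Step 2: iteratively update each node from its neighbors' current labels.
    for _ in range(max_iters):
        changed = False
        for n in unknown:  # incremental updates, fixed ordering
            nbrs = neighbors[n]
            frac_pos = sum(labels[m] for m in nbrs) / len(nbrs) if nbrs else 0.5
            new_label = int(0.5 * local_score[n] + 0.5 * frac_pos >= 0.5)
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical chain P1 - P2 - P3 - P4 - P5 with known labels at both ends.
chain = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2", "P4"},
         "P4": {"P3", "P5"}, "P5": {"P4"}}
scores = {"P2": 0.4, "P3": 0.6, "P4": 0.3}
result = iterative_classification(chain, scores, observed={"P1": 1, "P5": 0})
```

In this toy run P2 starts as 0 from its local score alone and flips to 1 once its neighbors' labels are taken into account, which is the propagation effect the slide describes. Swapping the hard labels for probabilities, or updating all nodes in a batch, gives the variations listed above.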
  • 32. Link Mining Algorithms: Node Labeling; Link Prediction; Entity Resolution; Group Detection
  • 33. The Entity Resolution Problem. Entities: James Smith, John Smith, Jonathan Smith. Mentions: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith”. Issues: 1. Identification 2. Disambiguation
  • 34. Relational Identification Very similar names. Added evidence from shared co-authors
  • 35. Relational Disambiguation Very similar names but no shared collaborators
  • 36. Collective Entity Resolution: One resolution provides evidence for another => joint resolution
  • 37. P1: “JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson. P2: “Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus. P3: “Dynamic Mesh Partitioning: A Unified Optimisation and Dynamic Load-Balancing Algorithm”, C. Walshaw, M. Cross, M. G. Everett. P4: “Code Generation for Machines with Multiregister Operations”, Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman. P5: “Deterministic Parsing of Ambiguous Grammars”, A. Aho, S. Johnson, J. Ullman. P6: “Compilers: Principles, Techniques, and Tools”, A. Aho, R. Sethi, J. Ullman
  • 39. Relational Clustering (RC-ER). P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson. P2: C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus. P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman. P5: A. Aho, S. Johnson, J. Ullman.
  • 43. Cut-based Formulation of RC-ER. (Figure: two candidate clusterings of the references M. G. Everett, M. Everett, S. Johnson, Stephen C. Johnson, A. Aho, and Alfred V. Aho.) One clustering: good separation of attributes; many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1). The other clustering: worse in terms of attributes; fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).
  • 44. Objective Function. Minimize: $\sum_{i}\sum_{j} \left[ w_A\,\mathrm{sim}_A(c_i, c_j) + w_R\,\mathrm{sim}_R(c_i, c_j) \right]$, where $w_A$ is the weight for attributes, $\mathrm{sim}_A$ is the similarity of attributes, $w_R$ is the weight for relations, and $\mathrm{sim}_R$ is the similarity based on relational edges between $c_i$ and $c_j$. Greedy clustering algorithm: merge the cluster pair with max reduction in objective function: $\Delta(c_i, c_j) = w_A\,\mathrm{sim}_A(c_i, c_j) + w_R\,|N(c_i) \cap N(c_j)|$ (similarity of attributes plus common cluster neighborhood).
  • 45. Relational Clustering Algorithm. 1. Find similar references using ‘blocking’. 2. Bootstrap clusters using attributes and relations. 3. Compute similarities for cluster pairs and insert into priority queue. 4. Repeat until priority queue is empty: 5. Find ‘closest’ cluster pair. 6. Stop if similarity below threshold. 7. Merge to create new cluster. 8. Update similarity for ‘related’ clusters. O(n k log n) algorithm w/ efficient implementation.
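The steps above can be sketched in Python on three of the example papers. This is a simplified illustration and makes several assumptions that are not from the talk: `name_sim` is a toy stand-in for a real attribute similarity, the relational term is a Jaccard overlap of cluster neighborhoods, the bootstrap merges identical names that share an identically named co-author, and the weights and threshold are arbitrary. It also rescans all blocked pairs on every round instead of maintaining a priority queue, so it does not reach the efficient running time quoted on the slide.

```python
from itertools import combinations


def name_sim(a, b):
    """Toy attribute similarity (assumption): 1.0 if identical,
    0.5 if same last name and first initial, else 0.0."""
    if a == b:
        return 1.0
    ta, tb = a.split(), b.split()
    return 0.5 if ta[-1] == tb[-1] and ta[0][0] == tb[0][0] else 0.0


def resolve(papers, w_a=0.5, w_r=0.5, threshold=0.55):
    """papers: list of author-name lists. Returns clusters of author names."""
    name = {(p, i): n for p, authors in enumerate(papers) for i, n in enumerate(authors)}
    cluster = {r: r for r in name}      # reference -> cluster id
    members = {r: {r} for r in name}    # cluster id -> set of references

    def coauthors(r):
        return [q for q in name if q[0] == r[0] and q != r]

    def merge(c1, c2):
        members[c1] |= members.pop(c2)
        for r in members[c1]:
            cluster[r] = c1

    def neighborhood(c):
        return {cluster[q] for r in members[c] for q in coauthors(r)} - {c}

    def score(c1, c2):
        sim_a = max(name_sim(name[a], name[b]) for a in members[c1] for b in members[c2])
        n1, n2 = neighborhood(c1), neighborhood(c2)
        sim_r = len(n1 & n2) / len(n1 | n2) if n1 | n2 else 0.0
        return w_a * sim_a + w_r * sim_r

    def blocked_pairs():
        # Step 1 (blocking): only compare clusters that share a last name.
        last = lambda c: {name[r].split()[-1] for r in members[c]}
        return [(a, b) for a, b in combinations(sorted(members), 2) if last(a) & last(b)]

    # Step 2 (bootstrap): identical names with an identically named co-author.
    for a, b in combinations(sorted(name), 2):
        shared = {name[q] for q in coauthors(a)} & {name[q] for q in coauthors(b)}
        if name[a] == name[b] and shared and cluster[a] != cluster[b]:
            merge(cluster[a], cluster[b])

    # Steps 3-8: merge the closest pair until the best score drops below threshold.
    while True:
        scored = [(score(a, b), a, b) for a, b in blocked_pairs()]
        if not scored or max(scored)[0] < threshold:
            break
        _, a, b = max(scored)
        merge(a, b)
    return sorted(sorted(name[r] for r in m) for m in members.values())


papers = [
    ["C. Walshaw", "M. Cross", "M. G. Everett", "S. Johnson"],
    ["C. Walshaw", "M. Cross", "M. Everett", "S. Johnson", "K. McManus"],
    ["A. Aho", "S. Johnson", "J. Ullman"],
]
clusters = resolve(papers)
```

With these toy settings, "M. G. Everett" and "M. Everett" are merged because their co-author clusters overlap, while the "S. Johnson" who publishes with Aho and Ullman stays separate from the one who publishes with Walshaw and Cross, even though the name strings are identical.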
  • 46. Evaluation Datasets. CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities. arXiv: 29,555 publications from High Energy Physics (KDD Cup ’03); 58,515 refs to 9,200 authors. Elsevier BioBase: 156,156 Biology papers (IBM KDD Challenge ’05); 831,991 author refs; keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  • 47. Baselines. A: Pair-wise duplicate decisions w/ attributes only (names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler; other textual attributes: TF-IDF). A*: Transitive closure over A. A+N: Add attribute similarity of co-occurring refs. A+N*: Transitive closure over A+N. Evaluate pair-wise decisions over references: F1-measure (harmonic mean of precision and recall).
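Two pieces of this evaluation setup are easy to make concrete: taking the transitive closure of pairwise match decisions (the starred baselines) and scoring a clustering by pairwise F1. The Python sketch below is illustrative only; the function names and the five-reference example are hypothetical.

```python
from itertools import combinations


def transitive_closure(items, matched_pairs):
    """Group items connected by any chain of pairwise matches (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def pairwise_f1(predicted, truth):
    """F1 over reference pairs placed in the same cluster."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    pred, gold = pairs(predicted), pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


refs = ["r1", "r2", "r3", "r4", "r5"]
predicted = transitive_closure(refs, [("r1", "r2"), ("r2", "r3")])
truth = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
f1 = pairwise_f1(predicted, truth)
```

Here the closure adds the pair (r1, r3) that was never matched directly. One of the three predicted pairs is correct (precision 1/3) and one of the two true pairs is found (recall 1/2), giving F1 = 0.4. This also shows how a closure can lower precision, as in the arXiv column where A* scores below A.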
  • 48. ER over Entire Dataset
        Algorithm   CiteSeer   arXiv   BioBase
        A           0.980      0.976   0.568
        A*          0.990      0.971   0.559
        A+N         0.973      0.938   0.710
        A+N*        0.984      0.934   0.753
        RC-ER       0.995      0.985   0.818
      RC-ER outperforms baselines in all datasets. Collective resolution better than naïve relational resolution.
  • 49. ER over Entire Dataset
        Algorithm   CiteSeer   arXiv   BioBase
        A           0.980      0.976   0.568
        A*          0.990      0.971   0.559
        A+N         0.973      0.938   0.710
        A+N*        0.984      0.934   0.753
        RC-ER       0.995      0.985   0.818
      CiteSeer: Near perfect resolution; 22% error reduction. arXiv: 6,500 additional correct resolutions; 20% error reduction. BioBase: Biggest improvement over baselines.
  • 51. Privacy breaches in OSNs. Identity disclosure: a mapping from a record to a specific individual ("Who is ?"). Attribute disclosure: find attribute value that the user intended to stay private ("Is liberal?"). Social link disclosure: participation in a sensitive relationship or communication ("Friends?"). Affiliation link disclosure: participation in a group revealing a sensitive attribute value ("Support gay marriage").
  • 52. Other Linqs Projects: Key Opinion Leader Identification; Active Surveying in Social Networks; Ontology Alignment and Folksonomy Construction; Label Acquisition & Active Learning in Network Data; Inference & Search in Camera Networks; Identifying Roles in Social Networks; Group Recommendation in Social Networks; Social Search; Analysis of Dynamic Networks: loyalty, stability, diversity; Ranking and Retrieval in Biological Networks; Discourse-level sentiment analysis; Bilingual Word Sense Disambiguation; Visual Analytics: D-Dupe, C-Group, G-Pare; Others … http://www.cs.umd.edu/linqs
  • 54. Conclusion. Link mining algorithms can be useful tools for social media. Need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media. Collective algorithms make use of structure to define features and propagate information, which allows us to improve the overall accuracy. While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!).
  • 55. http://www.cs.umd.edu/linqs Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), National Geospatial Agency, Airforce Research Laboratory, DARPA, Google, Microsoft, and Yahoo!