SlideShare una empresa de Scribd logo
1 de 129
Descargar para leer sin conexión
Computer-Assisted Clustering and Conceptualization

                                                  Gary King

                                 Institute for Quantitative Social Science
                                             Harvard University


                        Parthemos Lecture at University of Georgia, 3/4/2011




1
    Based on joint work with Justin Grimmer (Harvard   Stanford)
                                                                        Parthemos Lecture at University of
Gary King (Harvard IQSS)                       Quantitative Discovery                               / 20
A Method for Computer Assisted Conceptualization


    Conceptualization through Classification: “one of the most central
    and generic of all our conceptual exercises. . . . the foundation not
    only for conceptualization, language, and speech, but also for
    mathematics, statistics, and data analysis. . . . Without classification,
    there could be no advanced conceptualization, reasoning, language,
    data analysis or,for that matter, social science research.” (Bailey,
    1994).




                                                        Parthemos Lecture at University of
  Gary King (Harvard IQSS)     Quantitative Discovery                               / 20
A Method for Computer Assisted Conceptualization


    Conceptualization through Classification: “one of the most central
    and generic of all our conceptual exercises. . . . the foundation not
    only for conceptualization, language, and speech, but also for
    mathematics, statistics, and data analysis. . . . Without classification,
    there could be no advanced conceptualization, reasoning, language,
    data analysis or,for that matter, social science research.” (Bailey,
    1994).
    Cluster Analysis: simultaneously (1) invents categories and (2)
    assigns documents to categories




                                                        Parthemos Lecture at University of
  Gary King (Harvard IQSS)     Quantitative Discovery                               / 20
A Method for Computer Assisted Conceptualization


    Conceptualization through Classification: “one of the most central
    and generic of all our conceptual exercises. . . . the foundation not
    only for conceptualization, language, and speech, but also for
    mathematics, statistics, and data analysis. . . . Without classification,
    there could be no advanced conceptualization, reasoning, language,
    data analysis or,for that matter, social science research.” (Bailey,
    1994).
    Cluster Analysis: simultaneously (1) invents categories and (2)
    assigns documents to categories
    We focus on unstructured text; methods apply more broadly.




                                                        Parthemos Lecture at University of
  Gary King (Harvard IQSS)     Quantitative Discovery                               / 20
A Method for Computer Assisted Conceptualization


    Conceptualization through Classification: “one of the most central
    and generic of all our conceptual exercises. . . . the foundation not
    only for conceptualization, language, and speech, but also for
    mathematics, statistics, and data analysis. . . . Without classification,
    there could be no advanced conceptualization, reasoning, language,
    data analysis or,for that matter, social science research.” (Bailey,
    1994).
    Cluster Analysis: simultaneously (1) invents categories and (2)
    assigns documents to categories
    We focus on unstructured text; methods apply more broadly.
    Main goal: Switch from Fully Automated to Computer Assisted



                                                        Parthemos Lecture at University of
  Gary King (Harvard IQSS)     Quantitative Discovery                               / 20
What’s Hard about Clustering?




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
     Bell(5) = 52




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
     Bell(5) = 52
     Bell(100) ≈




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
     Bell(5) = 52
     Bell(100) ≈ 1028 × Number of elementary particles in the universe




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
     Bell(5) = 52
     Bell(100) ≈ 1028 × Number of elementary particles in the universe
     Now imagine choosing the optimal classification scheme by hand!




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)




     Clustering seems easy; its not!
     Bell(n) = number of ways of partitioning n objects
     Bell(2) = 2 (AB, A B)
     Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
     Bell(5) = 52
     Bell(100) ≈ 1028 × Number of elementary particles in the universe
     Now imagine choosing the optimal classification scheme by hand!
     Fully automated algorithms can help, but which ones?




                                                           Parthemos Lecture at University of
   Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations
            How to add substantive knowledge: With few exceptions, unclear




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations
            How to add substantive knowledge: With few exceptions, unclear
            The literature: little guidance on when methods apply




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations
            How to add substantive knowledge: With few exceptions, unclear
            The literature: little guidance on when methods apply
            Deriving such guidance: difficult or impossible




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations
            How to add substantive knowledge: With few exceptions, unclear
            The literature: little guidance on when methods apply
            Deriving such guidance: difficult or impossible
    Deep problem: full automation requires more information



                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
The Problem with Fully Automated Clustering


    The (Impossible) Goal: optimal, fully automated,
    application-independent cluster analysis
    No free lunch theorem: every possible clustering method performs
    equally well on average over all possible substantive applications
    Existing methods:
            Many choices: model-based, subspace, spectral, grid-based, graph-
            based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
            Well-defined statistical, data analytic, or machine learning foundations
            How to add substantive knowledge: With few exceptions, unclear
            The literature: little guidance on when methods apply
            Deriving such guidance: difficult or impossible
    Deep problem: full automation requires more information
    No surprise: everyone’s tried cluster analysis; very few are satisfied


                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best
            Impossible in practice: Too hard for us mere humans!




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best
            Impossible in practice: Too hard for us mere humans!
            An organized list will make the search possible




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best
            Impossible in practice: Too hard for us mere humans!
            An organized list will make the search possible
            Insight: Many clusterings are perceptually identical




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best
            Impossible in practice: Too hard for us mere humans!
            An organized list will make the search possible
            Insight: Many clusterings are perceptually identical
            E.g.,: consider two clusterings that differ only because one document
            (of 10,000) moves from category 5 to 6




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Switch from Fully Automated to Computer Assisted



    Fully Automated Clustering may succeed sometimes, but fails in
    general: too hard to understand when each model applies
    An alternative: Computer-Assisted Clustering
            Easy in theory: list all clusterings; choose the best
            Impossible in practice: Too hard for us mere humans!
            An organized list will make the search possible
            Insight: Many clusterings are perceptually identical
            E.g.,: consider two clusterings that differ only because one document
            (of 10,000) moves from category 5 to 6
    Question: How to organize clusterings so humans can understand?




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Our Idea: Meaning Through Geography

Set of clusterings




                                                       Parthemos Lecture at University of
   Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Our Idea: Meaning Through Geography

Set of clusterings ≈
A list of unconnected addresses




                                                       Parthemos Lecture at University of
  Gary King (Harvard IQSS)    Quantitative Discovery                               / 20
Our Idea: Meaning Through Geography

Set of clusterings ≈
A list of unconnected addresses




                                                       Parthemos Lecture at University of
  Gary King (Harvard IQSS)    Quantitative Discovery                               / 20
Our Idea: Meaning Through Geography

Set of clusterings ≈
A list of unconnected addresses




                     We develop a (conceptual) geography of clusterings
                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices




                                                              Parthemos Lecture at University of
    Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)




                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)




                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)
   3     (Too much for a person to understand, but organization will help)




                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)
   3     (Too much for a person to understand, but organization will help)
   4     Develop an application-independent distance metric between
         clusterings, a metric space of clusterings, and a 2-D projection




                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)
   3     (Too much for a person to understand, but organization will help)
   4     Develop an application-independent distance metric between
         clusterings, a metric space of clusterings, and a 2-D projection
   5     “Local cluster ensemble” creates a new clustering at any point, based
         on weighted average of nearby clusterings




                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)
   3     (Too much for a person to understand, but organization will help)
   4     Develop an application-independent distance metric between
         clusterings, a metric space of clusterings, and a 2-D projection
   5     “Local cluster ensemble” creates a new clustering at any point, based
         on weighted average of nearby clusterings
   6     A new animated visualization to explore the space of clusterings
         (smoothly morphing from one into others)



                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
A New Strategy
Make it easy to choose best clustering from millions of choices



   1     Code text as numbers (in one or more of several ways)
   2     Apply all clustering methods we can find to the data — each
         representing different (unstated) substantive assumptions (<15 mins)
   3     (Too much for a person to understand, but organization will help)
   4     Develop an application-independent distance metric between
         clusterings, a metric space of clusterings, and a 2-D projection
   5     “Local cluster ensemble” creates a new clustering at any point, based
         on weighted average of nearby clusterings
   6     A new animated visualization to explore the space of clusterings
         (smoothly morphing from one into others)
   7          Millions of clusterings, easily comprehended


                                                              Parthemos Lecture at University of
       Gary King (Harvard IQSS)      Quantitative Discovery                               / 20
Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information,. . .

                 Obama                                                                                   Space of mixvmf
                                                                                                                  Clusterings                                                                                        Clustering 2
          Ford                 Clustering 1
                                                                                                                                        affprop info.costs                                                Carter
      Nixon
                                                                                               kmedoids stand.euc                                                                                        Johnson

 Carter            Eisenhower
                                                                                                                            rock              affprop maximum                                        Ford                 Roosevelt
                                                                          kmeans correlation         hclust correlation single
                                                                                                   hclust pearson single
                                                                                                                                                                                                            Eisenhower
     Truman                                                                                                                                                                                              Truman

     Johnson               Roosevelt                                                                                             hclust maximum single
                                                                                                                      hclust correlation median
                                                                                        hclust binary hclust correlationmedian
                                                                                                      centroidpearson centroid
                                                                                                       hclust pearson centroid
                                                                                                       hclust                                                                      spec_max                 Nixon
                                  ``Other                                              hclust canberra centroid
                                                                                                                                                                                                                              ``Roosevelt
                                                                                                             hclust correlationaverage
                                                                                                                                average
                                                                                                              hclust pearson mcquitty
                                                                                                                                mcquitty
                                                                                                                                       hclust kendall single                       hclust maximum ward
                                 Presidents ''                                                                                           hclust euclidean centroid                                                             To Carter''
                                                                  hclust canberra mcquitty binary median
                                                                                    hclust
                                                                                   hclust canberra median
                                                               kmeans kendall hclust canberra single                                                        mspec_max
                                                                                 hclust binary single     biclust_spectral
                                                                                                         affprop manhattan
                                                                                                           affprop cosine
                                                                                                    q
                     Clinton                                                                                                              hclust manhattan centroid
                                                                                                                              hclust manhattanmedian
                                                                                                                               hclust maximum single
                                                                                                                             hclust spearman centroid
                                                                                                                              hclust maximum centroid
                                                                                                                                kmedoids manhattan
                                                                                                                                      kendall centroid
                                                                      mspec_canb                                                               hclust euclidean median
                                                          hclust canberra average                            hclust correlation complete
                                                                                                              hclust pearson complete
                                                                                       divisive stand.euc
                                                                                                 mspec_cos                          hclust kendall average
                                                                                                                                  hclust manhattan median
                                                                                                                                   hclust spearman median
                                                                                                                                    hclust kendall median                             kmeans maximum
                                                                                                                                        hclust euclideanaverage
                                                                                                                                                         single
                                                                                                                                       hclust maximum mcquitty
                                                                                                                                      hclust maximum complete
                                                                     kmeans pearson                                          affprop euclidean hclust mcquitty average
                                                                                                                                   hclust manhattan average
                                                                                                                                                       euclidean
                                                                                                                                                                                                                    Kennedy
                 Kennedy                                                                                         hclust spearman single q                     divisive euclidean
                                                 Bushkmeans binary
                                                               hclust binary average                                                                 kmedoids euclidean
                                                                                                   som
                                                                                                                                      hclust spearman average           spec_mink
                                                                                                                                   mspec_euc
                                                                                                                                   mspec_mink

                                                                      hclust binary complete
                                                                      hclust binary mcquitty                                             divisive manhattan
                                                                                                                                  mspec_man
                                                                                                                              hclust euclidean mcquitty
                                                                                                                              hclust euclidean complete hclust kendall complete
                                                                 hclust correlation ward complete
                                                                          hclust canberra                                                                                                                                                    Bush
                                                                                                clust_convex
                                                                                                                                   hclust euclidean ward
                                                                                                             hclust spearman mcquitty
                                                                                                              hclust kendall mcquitty      dismea                                                                               Obama
                                                                        hclust binary ward
                                                                            hclust canberra ward                                                      hclust spearman complete
                                                                                                                             hclust manhattan complete
                                                                         spec_canb               hclust kendall ward
                                                                                                                   mixvmfVA
                                                                                 spec_cos                                       spec_euc
                                                                                                            hclust manhattan ward kmeans euclidean
                                                                                                                                               kmeans manhattan
                                                                                     spec_man
                                                                                            hclust pearson ward
              ``Reagan                                                                                                                                                                                          `` Reagan To
             Republicans''                                                                              hclust spearman ward                                                                                       Obama ''
                                                                                          kmeans spearman



                                        Reagan
                                                                                                              kmeans canberra
                                                                                                                                                                                                                                           HWBush
                                            HWBush
                                                                                                                                                                                                                                 Clinton
                                                                                                                                                                                                                                  Reagan
                                                                                                                                          mult_dirproc




                                                                                                                                                                              Parthemos Lecture at University of
      Gary King (Harvard IQSS)                                                                     Quantitative Discovery                                                                                                                           / 20
Software Screenshot




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluating Performance




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluating Performance



    Goals:




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation
            Inject human judgement: relying on insights from survey research




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation
            Inject human judgement: relying on insights from survey research
    We now present three evaluations




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation
            Inject human judgement: relying on insights from survey research
    We now present three evaluations
            Cluster Quality ⇒ RA coders




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation
            Inject human judgement: relying on insights from survey research
    We now present three evaluations
            Cluster Quality ⇒ RA coders
            Informative discoveries ⇒ Experienced scholars analyzing texts




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluating Performance



    Goals:
            Validate Claim: computer-assisted conceptualization outperforms
            human conceptualization
            Demonstrate: new experimental designs for cluster evaluation
            Inject human judgement: relying on insights from survey research
    We now present three evaluations
            Cluster Quality ⇒ RA coders
            Informative discoveries ⇒ Experienced scholars analyzing texts
            Discovery ⇒ You’re the judge




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head




                                                          Parthemos Lecture at University of
  Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time




                                                          Parthemos Lecture at University of
  Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs




                                                          Parthemos Lecture at University of
  Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality




                                                          Parthemos Lecture at University of
  Gary King (Harvard IQSS)       Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering
            many pairs of documents




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering
            many pairs of documents
            for coders: (1) unrelated, (2) loosely related, (3) closely related




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering
            many pairs of documents
            for coders: (1) unrelated, (2) loosely related, (3) closely related
            Quality = mean(within cluster) - mean(between clusters)




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering
            many pairs of documents
            for coders: (1) unrelated, (2) loosely related, (3) closely related
            Quality = mean(within cluster) - mean(between clusters)
            Bias results against ourselves by not letting evaluators choose clustering




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality



    What Are Humans Good For?
            They can’t: keep many documents & clusters in their head
            They can: compare two documents at a time
            =⇒ Cluster quality evaluation: human judgement of document pairs
    Experimental Design to Assess Cluster Quality
            automated visualization to choose one clustering
            many pairs of documents
            for coders: (1) unrelated, (2) loosely related, (3) closely related
            Quality = mean(within cluster) - mean(between clusters)
            Bias results against ourselves by not letting evaluators choose clustering




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 1: Cluster Quality




                 −0.3        −0.2   −0.1                     0.1      0.2        0.3

                                    (Our Method) − (Human Coders)




                                                               Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                                 / 20
Evaluation 1: Cluster Quality


                                                              Lautenberg Press Releases
                                                                            q




                 −0.3        −0.2   −0.1                     0.1      0.2        0.3

                                    (Our Method) − (Human Coders)



Lautenberg: 200 Senate Press Releases (appropriations, economy,
education, tax, veterans, . . . )

                                                               Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                                 / 20
Evaluation 1: Cluster Quality


                                                               Lautenberg Press Releases
                                                                               q




                                                                    Policy Agendas Project
                                                                                   q




                  −0.3        −0.2   −0.1                     0.1        0.2           0.3

                                     (Our Method) − (Human Coders)



Policy Agendas: 213 quasi-sentences from Bush’s State of the Union
(agriculture, banking & commerce, civil rights/liberties, defense, . . . )

                                                                Parthemos Lecture at University of
   Gary King (Harvard IQSS)          Quantitative Discovery                                  / 20
Evaluation 1: Cluster Quality


                                                                    Lautenberg Press Releases
                                                                                     q




                                                                          Policy Agendas Project
                                                                                         q




                                                              Reuter's Gold Standard
                                                                    q




                  −0.3        −0.2   −0.1                          0.1         0.2           0.3

                                     (Our Method) − (Human Coders)



Reuter’s: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold
standard” for supervised learning studies

                                                                        Parthemos Lecture at University of
   Gary King (Harvard IQSS)          Quantitative Discovery                                         / 20
Evaluation 2: More Informative Discoveries




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work




                                                       Parthemos Lecture at University of
  Gary King (Harvard IQSS)    Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:




                                                       Parthemos Lecture at University of
  Gary King (Harvard IQSS)    Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)




                                                            Parthemos Lecture at University of
  Gary King (Harvard IQSS)         Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)
            2 clusterings from each of 2 other methods (varying tuning parameters)




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)
            2 clusterings from each of 2 other methods (varying tuning parameters)
    Created info packet on each clustering (for each cluster: exemplar
    document, automated content summary)




                                                           Parthemos Lecture at University of
  Gary King (Harvard IQSS)        Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)
            2 clusterings from each of 2 other methods (varying tuning parameters)
    Created info packet on each clustering (for each cluster: exemplar
    document, automated content summary)
                     6
    Asked for        2   =15 pairwise comparisons




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)
            2 clusterings from each of 2 other methods (varying tuning parameters)
    Created info packet on each clustering (for each cluster: exemplar
    document, automated content summary)
                     6
    Asked for        2   =15 pairwise comparisons
    User chooses ⇒ only care about the one clustering that wins




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work
    Created 6 clusterings:
            2 clusterings selected with our method (biased against us)
            2 clusterings from each of 2 other methods (varying tuning parameters)
    Created info packet on each clustering (for each cluster: exemplar
    document, automated content summary)
                     6
    Asked for        2   =15 pairwise comparisons
    User chooses ⇒ only care about the one clustering that wins
    Both cases a Condorcet winner:




                                                             Parthemos Lecture at University of
  Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

     Found 2 scholars analyzing lots of textual data for their work
     Created 6 clusterings:
             2 clusterings selected with our method (biased against us)
             2 clusterings from each of 2 other methods (varying tuning parameters)
     Created info packet on each clustering (for each cluster: exemplar
     document, automated content summary)
                      6
     Asked for        2   =15 pairwise comparisons
     User chooses ⇒ only care about the one clustering that wins
     Both cases a Condorcet winner:
“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2




                                                              Parthemos Lecture at University of
   Gary King (Harvard IQSS)          Quantitative Discovery                               / 20
Evaluation 2: More Informative Discoveries

      Found 2 scholars analyzing lots of textual data for their work
      Created 6 clusterings:
             2 clusterings selected with our method (biased against us)
             2 clusterings from each of 2 other methods (varying tuning parameters)
      Created info packet on each clustering (for each cluster: exemplar
      document, automated content summary)
                      6
      Asked for       2   =15 pairwise comparisons
      User chooses ⇒ only care about the one clustering that wins
      Both cases a Condorcet winner:
“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2


“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2
                                                       Parthemos Lecture at University of
   Gary King (Harvard IQSS)          Quantitative Discovery                         / 20
Evaluation 3: What Do Members of Congress Do?




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology
         - Advertising




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology
         - Advertising
         - Credit Claiming




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology
         - Advertising
         - Credit Claiming
         - Position Taking




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology
         - Advertising
         - Credit Claiming
         - Position Taking
  - Data: 200 press releases from Frank Lautenberg’s office (D-NJ)




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Evaluation 3: What Do Members of Congress Do?




  - David Mayhew’s (1974) famous typology
         - Advertising
         - Credit Claiming
         - Position Taking
  - Data: 200 press releases from Frank Lautenberg’s office (D-NJ)
  - Apply our method




                                                      Parthemos Lecture at University of
  Gary King (Harvard IQSS)   Quantitative Discovery                               / 20
Example Discovery

                                          mult_dirproc
                                                                                           kmeans correlation
                                                                          hclust canberra ward                     sot_cor
                                                                                                  divisive stand.euc
                                                                                       mixvmf
                                                                                              hclust correlationmixvmfVAbinary complete
                                                                                                                   hclust
                                                                                                                mcquitty
                                   hclust pearson single                              affprop cosine
                         hclust pearson median
                                          hclust correlation single                                   hclust pearson mcquitty
                  hclust correlation median
                                                                mec                           hclust pearson average hclust correlation complete
                                                                                            hclust correlation averagehclust pearson complete
                                 hclust binary single
                                                                                                    hclust binary average        kmeans pearson
                                               som
                 hclust correlation centroid rock
                hclust pearson centroid
                                hclust binary median                                                 hclust binary mcquitty
                          hclust canberra single
           biclust_spectral                       hclust spearman complete
                                                                                                                                        spec_man
                                                                                                                                     spec_cos
                    kmeans kendall median
                     hclust canberra
                                                                                          hclust canberra average
                                                                                                                                   spec_mink
                                                                                                                                          spec_euc
                                                                                                                                   spec_max
                                                                                                                             mspec_minkspec_canb
                                                                                                                                    mspec_man
affprop maximum     kmeans spearman                              kmeans manhattan                                                   mspec_max
                                                                                                                                   mspec_cos
                                                                                                                                  mspec_canb
                                                                                                                                    mspec_euc
                  kmeansbinary centroid
                   hclust canberra hclust kendall single
                                   hclust spearman centroid
                              hclusthclust kendall centroid
                                     kendall medianaverage
                                                   average
                                     spearmankendall mcquitty
                                         hclust median
                                 hclust spearman single
                                 hclust spearman mcquitty kendall complete
                  hclust canberra centroid              hclust
                          hclust manhattan medianmanhattan
                               euclideankmedoids
                        hclusthclustmanhattan centroid
                             hclust manhattan single
                              hclust manhattan average
                                           single manhattan
                                           affprop
                          hclust euclidean median
                      hclust maximum single manhattan
                                       divisive
                          hclust euclidean centroid
                              hclust euclidean average
                           hclust manhattan mcquitty            clust_convex                                       hclust correlation ward
                                   hclust euclidean mcquitty
                                                 kmedoids euclidean                                                 hclust pearson wardstand.euc
                                                                                                                             kmedoids
                            hclust maximummedian
                                    divisive centroid
                           hclust maximumeuclideanaffprop euclidean                                    hclust canberra mcquitty
                                          hclust maximum average
                                        hclust maximum complete
                                     hclust euclidean complete
                             hclust manhattan maximum mcquitty
                                         hclust complete
                                                                                         dist_ebinary
                                                                                          dist_binary
                                                                                          dist_fbinary
                                                                                        dist_minkowski
                                                                                           dist_canb
                                                                                           dist_max
                                                                                            dist_cos
                                                                                            dismea
                                                                       hclust manhattan ward                                                   affprop info.costs
                                                      kmeanshclust euclidean ward
                                                             euclidean                                           hclust canberra complete
                                                   sot_euc
                                                                                                        hclust binary ward


                                                                      hclusthclust spearman ward
                                                                             kendall ward
                                             hclust maximum ward                                                 kmeans binary

                                                                                         kmeans maximum




                                                                                                                                                                    Parthemos Lecture at University of
          Gary King (Harvard IQSS)                                                                                                  Quantitative Discovery                                      / 20
Example Discovery

                                          mult_dirproc
                                                                                           kmeans correlation
                                                                          hclust canberra ward                     sot_cor
                                                                                                  divisive stand.euc
                                                                                       mixvmf
                                                                                              hclust correlationmixvmfVAbinary complete
                                                                                                                   hclust
                                                                                                                mcquitty
                                   hclust pearson single                              affprop cosine
                         hclust pearson median
                                          hclust correlation single                                   hclust pearson mcquitty
                  hclust correlation median
                                                                mec                           hclust pearson average hclust correlation complete
                                                                                            hclust correlation averagehclust pearson complete
                                 hclust binary single
                                                                                                    hclust binary average
                                               som
                 hclust correlation centroid rock
                hclust pearson centroid
                                hclust binary median
                          hclust canberra single
           biclust_spectral
                                                            affprop cosine
                                                  hclust spearman complete
                                                                                                     hclust binary mcquitty
                                                                                                                                 kmeans pearson


                                                                                                                                        spec_man
                                                                                                                                     spec_cos
                    kmeans kendall median
                     hclust canberra
                                                                                          hclust canberra average
                                                                                                                                   spec_mink
                                                                                                                                          spec_euc
                                                                                                                                   spec_max
                                                                                                                             mspec_minkspec_canb
                                                                                                                                    mspec_man
affprop maximum     kmeans spearman                              kmeans manhattan                                                   mspec_max
                                                                                                                                   mspec_cos
                                                                                                                                  mspec_canb
                                                                                                                                    mspec_euc
                  kmeansbinary centroid
                   hclust canberra hclust kendall single
                                   hclust spearman centroid
                              hclusthclust kendall centroid
                                     kendall medianaverage
                                                   average
                                     spearmankendall mcquitty
                                         hclust median
                                 hclust spearman single
                                 hclust spearman mcquitty kendall complete
                  hclust canberra centroid              hclust
                          hclust manhattan medianmanhattan
                               euclideankmedoids
                        hclusthclustmanhattan centroid
                             hclust manhattan single
                              hclust manhattan average
                                           single manhattan
                                           affprop
                          hclust euclidean median
                      hclust maximum single manhattan
                                       divisive
                          hclust euclidean centroid
                              hclust euclidean average
                           hclust manhattan mcquitty            clust_convex                                       hclust correlation ward
                                   hclust euclidean mcquitty
                                                 kmedoids euclidean                                                 hclust pearson wardstand.euc
                                                                                                                             kmedoids
                            hclust maximummedian
                                    divisive centroid
                           hclust maximumeuclideanaffprop euclidean
                                          hclust maximum average
                                        hclust maximum complete
                                     hclust euclidean complete
                             hclust manhattan maximum mcquitty
                                         hclust complete
                                                                                         dist_ebinary
                                                                                          dist_binary
                                                                                          dist_fbinary
                                                                                        dist_minkowski
                                                                                           dist_canb
                                                                                           dist_max
                                                                       hclust manhattan ward
                                                                                            dist_cos
                                                                                            dismea
                                                                                                       hclust canberra mcquitty

                                                                                                                                                                    Red point: a clustering by
                                                                                                                                               affprop info.costs
                                                      kmeanshclust euclidean ward
                                                   sot_euc
                                                             euclidean                                           hclust canberra complete
                                                                                                        hclust binary ward                                          Affinity Propagation-Cosine
                                             hclust maximum ward
                                                                      hclusthclust spearman ward
                                                                             kendall ward

                                                                                                                 kmeans binary                                      (Dueck and Frey 2007)
                                                                                         kmeans maximum




                                                                                                                                                                            Parthemos Lecture at University of
          Gary King (Harvard IQSS)                                                                                                  Quantitative Discovery                                              / 20
Example Discovery

                                          mult_dirproc

                                                     mixvmf                                kmeans correlation
                                                                          hclust canberra ward                     sot_cor
                                                                                                  divisive stand.euc
                                                                                       mixvmf
                                                                                              hclust correlationmixvmfVAbinary complete
                                                                                                                   hclust
                                                                                                                mcquitty
                                   hclust pearson single                              affprop cosine
                         hclust pearson median
                                          hclust correlation single                                   hclust pearson mcquitty
                  hclust correlation median
                                                                mec                           hclust pearson average hclust correlation complete
                                                                                            hclust correlation averagehclust pearson complete
                                 hclust binary single
                                                                                                    hclust binary average
                                               som
                 hclust correlation centroid rock
                hclust pearson centroid
                                hclust binary median
                          hclust canberra single
           biclust_spectral
                                                            affprop cosine
                                                  hclust spearman complete
                                                                                                     hclust binary mcquitty
                                                                                                                                 kmeans pearson


                                                                                                                                        spec_man
                                                                                                                                     spec_cos
                    kmeans kendall median
                     hclust canberra
                                                                                          hclust canberra average
                                                                                                                                   spec_mink
                                                                                                                                          spec_euc
                                                                                                                                   spec_max
                                                                                                                             mspec_minkspec_canb
                                                                                                                                    mspec_man
affprop maximum     kmeans spearman                              kmeans manhattan                                                   mspec_max
                                                                                                                                   mspec_cos
                                                                                                                                  mspec_canb
                                                                                                                                    mspec_euc
                  kmeansbinary centroid
                   hclust canberra hclust kendall single
                                   hclust spearman centroid
                              hclusthclust kendall centroid
                                     kendall medianaverage
                                                   average
                                     spearmankendall mcquitty
                                         hclust median
                                 hclust spearman single
                                 hclust spearman mcquitty kendall complete
                  hclust canberra centroid              hclust
                          hclust manhattan medianmanhattan
                               euclideankmedoids
                        hclusthclustmanhattan centroid
                             hclust manhattan single
                              hclust manhattan average
                                           single manhattan
                                           affprop
                          hclust euclidean median
                      hclust maximum single manhattan
                                       divisive
                          hclust euclidean centroid
                              hclust euclidean average
                           hclust manhattan mcquitty            clust_convex                                       hclust correlation ward
                                   hclust euclidean mcquitty
                                                 kmedoids euclidean                                                 hclust pearson wardstand.euc
                                                                                                                             kmedoids
                            hclust maximummedian
                                    divisive centroid
                           hclust maximumeuclideanaffprop euclidean
                                          hclust maximum average
                                        hclust maximum complete
                                     hclust euclidean complete
                             hclust manhattan maximum mcquitty
                                         hclust complete
                                                                                         dist_ebinary
                                                                                          dist_binary
                                                                                          dist_fbinary
                                                                                        dist_minkowski
                                                                                           dist_canb
                                                                                           dist_max
                                                                       hclust manhattan ward
                                                                                            dist_cos
                                                                                            dismea
                                                                                                       hclust canberra mcquitty

                                                                                                                                                                    Red point: a clustering by
                                                                                                                                               affprop info.costs
                                                      kmeanshclust euclidean ward
                                                   sot_euc
                                                             euclidean                                           hclust canberra complete
                                                                                                        hclust binary ward                                          Affinity Propagation-Cosine
                                             hclust maximum ward
                                                                      hclusthclust spearman ward
                                                                             kendall ward

                                                                                                                 kmeans binary                                      (Dueck and Frey 2007)
                                                                                         kmeans maximum


                                                                                                                                                                    Close to:
                                                                                                                                                                    Mixture of von Mises-Fisher
                                                                                                                                                                    distributions (Banerjee et. al.
                                                                                                                                                                    2005)



                                                                                                                                                                             Parthemos Lecture at University of
          Gary King (Harvard IQSS)                                                                                                  Quantitative Discovery                                               / 20
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf
discov-uga_1.pdf

Más contenido relacionado

Destacado

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

discov-uga_1.pdf

  • 1. Computer-Assisted Clustering and Conceptualization Gary King Institute for Quantitative Social Science Harvard University Parthemos Lecture at University of Georgia, 3/4/2011 1 Based on joint work with Justin Grimmer (Harvard Stanford) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 2. A Method for Computer Assisted Conceptualization Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research.” (Bailey, 1994). Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 3. A Method for Computer Assisted Conceptualization Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research.” (Bailey, 1994). Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 4. A Method for Computer Assisted Conceptualization Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research.” (Bailey, 1994). Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories We focus on unstructured text; methods apply more broadly. Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 5. A Method for Computer Assisted Conceptualization Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research.” (Bailey, 1994). Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories We focus on unstructured text; methods apply more broadly. Main goal: Switch from Fully Automated to Computer Assisted Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 6. What’s Hard about Clustering? Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 7. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 8. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 9. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 10. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 11. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 12. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 13. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) ≈ Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 14. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) ≈ 1028 × Number of elementary particles in the universe Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 15. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) ≈ 1028 × Number of elementary particles in the universe Now imagine choosing the optimal classification scheme by hand! Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 16. What’s Hard about Clustering? (aka Why Johnny Can’t Classify) Clustering seems easy; its not! Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) ≈ 1028 × Number of elementary particles in the universe Now imagine choosing the optimal classification scheme by hand! Fully automated algorithms can help, but which ones? Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 17. The Problem with Fully Automated Clustering Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 18. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 19. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 20. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 21. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 22. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 23. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations How to add substantive knowledge: With few exceptions, unclear Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 24. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations How to add substantive knowledge: With few exceptions, unclear The literature: little guidance on when methods apply Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 25. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations How to add substantive knowledge: With few exceptions, unclear The literature: little guidance on when methods apply Deriving such guidance: difficult or impossible Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 26. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations How to add substantive knowledge: With few exceptions, unclear The literature: little guidance on when methods apply Deriving such guidance: difficult or impossible Deep problem: full automation requires more information Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 27. The Problem with Fully Automated Clustering The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications Existing methods: Many choices: model-based, subspace, spectral, grid-based, graph- based, fuzzy k-modes, affinity propagation, self-organizing maps,. . . Well-defined statistical, data analytic, or machine learning foundations How to add substantive knowledge: With few exceptions, unclear The literature: little guidance on when methods apply Deriving such guidance: difficult or impossible Deep problem: full automation requires more information No surprise: everyone’s tried cluster analysis; very few are satisfied Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 28. Switch from Fully Automated to Computer Assisted Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 29. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 30. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 31. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 32. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 33. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 34. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Insight: Many clusterings are perceptually identical Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 35. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Insight: Many clusterings are perceptually identical E.g.,: consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6 Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 36. Switch from Fully Automated to Computer Assisted Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies An alternative: Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Insight: Many clusterings are perceptually identical E.g.,: consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6 Question: How to organize clusterings so humans can understand? Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 37. Our Idea: Meaning Through Geography Set of clusterings Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 38. Our Idea: Meaning Through Geography Set of clusterings ≈ A list of unconnected addresses Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 39. Our Idea: Meaning Through Geography Set of clusterings ≈ A list of unconnected addresses Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 40. Our Idea: Meaning Through Geography Set of clusterings ≈ A list of unconnected addresses We develop a (conceptual) geography of clusterings Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 41. A New Strategy Make it easy to choose best clustering from millions of choices Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 42. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 43. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 44. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) 3 (Too much for a person to understand, but organization will help) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 45. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) 3 (Too much for a person to understand, but organization will help) 4 Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 46. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) 3 (Too much for a person to understand, but organization will help) 4 Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection 5 “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 47. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) 3 (Too much for a person to understand, but organization will help) 4 Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection 5 “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings 6 A new animated visualization to explore the space of clusterings (smoothly morphing from one into others) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 48. A New Strategy Make it easy to choose best clustering from millions of choices 1 Code text as numbers (in one or more of several ways) 2 Apply all clustering methods we can find to the data — each representing different (unstated) substantive assumptions (<15 mins) 3 (Too much for a person to understand, but organization will help) 4 Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection 5 “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings 6 A new animated visualization to explore the space of clusterings (smoothly morphing from one into others) 7 Millions of clusterings, easily comprehended Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 49. Many Thousands of Clusterings, Sorted & Organized You choose one (or more), based on insight, discovery, useful information,. . . Obama Space of mixvmf Clusterings Clustering 2 Ford Clustering 1 affprop info.costs Carter Nixon kmedoids stand.euc Johnson Carter Eisenhower rock affprop maximum Ford Roosevelt kmeans correlation hclust correlation single hclust pearson single Eisenhower Truman Truman Johnson Roosevelt hclust maximum single hclust correlation median hclust binary hclust correlationmedian centroidpearson centroid hclust pearson centroid hclust spec_max Nixon ``Other hclust canberra centroid ``Roosevelt hclust correlationaverage average hclust pearson mcquitty mcquitty hclust kendall single hclust maximum ward Presidents '' hclust euclidean centroid To Carter'' hclust canberra mcquitty binary median hclust hclust canberra median kmeans kendall hclust canberra single mspec_max hclust binary single biclust_spectral affprop manhattan affprop cosine q Clinton hclust manhattan centroid hclust manhattanmedian hclust maximum single hclust spearman centroid hclust maximum centroid kmedoids manhattan kendall centroid mspec_canb hclust euclidean median hclust canberra average hclust correlation complete hclust pearson complete divisive stand.euc mspec_cos hclust kendall average hclust manhattan median hclust spearman median hclust kendall median kmeans maximum hclust euclideanaverage single hclust maximum mcquitty hclust maximum complete kmeans pearson affprop euclidean hclust mcquitty average hclust manhattan average euclidean Kennedy Kennedy hclust spearman single q divisive euclidean Bushkmeans binary hclust binary average kmedoids euclidean som hclust spearman average spec_mink mspec_euc mspec_mink hclust binary complete hclust binary mcquitty divisive manhattan mspec_man hclust euclidean mcquitty hclust euclidean complete hclust kendall complete hclust correlation ward complete hclust canberra Bush clust_convex hclust euclidean ward hclust spearman mcquitty hclust kendall mcquitty dismea Obama hclust binary ward hclust canberra ward hclust spearman complete hclust manhattan complete spec_canb hclust kendall ward mixvmfVA spec_cos spec_euc hclust manhattan ward kmeans euclidean kmeans manhattan spec_man hclust pearson ward ``Reagan `` Reagan To Republicans'' hclust spearman ward Obama '' kmeans spearman Reagan kmeans canberra HWBush HWBush Clinton Reagan mult_dirproc Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 50. Software Screenshot Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 51. Evaluating Performance Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 52. Evaluating Performance Goals: Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 53. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 54. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 55. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Inject human judgement: relying on insights from survey research Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 56. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Inject human judgement: relying on insights from survey research We now present three evaluations Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 57. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Inject human judgement: relying on insights from survey research We now present three evaluations Cluster Quality ⇒ RA coders Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 58. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Inject human judgement: relying on insights from survey research We now present three evaluations Cluster Quality ⇒ RA coders Informative discoveries ⇒ Experienced scholars analyzing texts Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 59. Evaluating Performance Goals: Validate Claim: computer-assisted conceptualization outperforms human conceptualization Demonstrate: new experimental designs for cluster evaluation Inject human judgement: relying on insights from survey research We now present three evaluations Cluster Quality ⇒ RA coders Informative discoveries ⇒ Experienced scholars analyzing texts Discovery ⇒ You’re the judge Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 60. Evaluation 1: Cluster Quality Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 61. Evaluation 1: Cluster Quality What Are Humans Good For? Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 62. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 63. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 64. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 65. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 66. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 67. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering many pairs of documents Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 68. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering many pairs of documents for coders: (1) unrelated, (2) loosely related, (3) closely related Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 69. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering many pairs of documents for coders: (1) unrelated, (2) loosely related, (3) closely related Quality = mean(within cluster) - mean(between clusters) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 70. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering many pairs of documents for coders: (1) unrelated, (2) loosely related, (3) closely related Quality = mean(within cluster) - mean(between clusters) Bias results against ourselves by not letting evaluators choose clustering Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 71. Evaluation 1: Cluster Quality What Are Humans Good For? They can’t: keep many documents & clusters in their head They can: compare two documents at a time =⇒ Cluster quality evaluation: human judgement of document pairs Experimental Design to Assess Cluster Quality automated visualization to choose one clustering many pairs of documents for coders: (1) unrelated, (2) loosely related, (3) closely related Quality = mean(within cluster) - mean(between clusters) Bias results against ourselves by not letting evaluators choose clustering Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 72. Evaluation 1: Cluster Quality −0.3 −0.2 −0.1 0.1 0.2 0.3 (Our Method) − (Human Coders) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 73. Evaluation 1: Cluster Quality Lautenberg Press Releases q −0.3 −0.2 −0.1 0.1 0.2 0.3 (Our Method) − (Human Coders) Lautenberg: 200 Senate Press Releases (appropriations, economy, education, tax, veterans, . . . ) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 74. Evaluation 1: Cluster Quality Lautenberg Press Releases q Policy Agendas Project q −0.3 −0.2 −0.1 0.1 0.2 0.3 (Our Method) − (Human Coders) Policy Agendas: 213 quasi-sentences from Bush’s State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . ) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 75. Evaluation 1: Cluster Quality Lautenberg Press Releases q Policy Agendas Project q Reuter's Gold Standard q −0.3 −0.2 −0.1 0.1 0.2 0.3 (Our Method) − (Human Coders) Reuter’s: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold standard” for supervised learning studies Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 76. Evaluation 2: More Informative Discoveries Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 77. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 78. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 79. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 80. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 81. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 82. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 83. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 84. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Both cases a Condorcet winner: Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 85. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Both cases a Condorcet winner: “Immigration”: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2 Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 86. Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Both cases a Condorcet winner: “Immigration”: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2 “Genetic testing”: Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2 Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 87. Evaluation 3: What Do Members of Congress Do? Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 88. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 89. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 90. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising - Credit Claiming Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 91. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising - Credit Claiming - Position Taking Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 92. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising - Credit Claiming - Position Taking - Data: 200 press releases from Frank Lautenberg’s office (D-NJ) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 93. Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising - Credit Claiming - Position Taking - Data: 200 press releases from Frank Lautenberg’s office (D-NJ) - Apply our method Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 94. Example Discovery mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust correlationmixvmfVAbinary complete hclust mcquitty hclust pearson single affprop cosine hclust pearson median hclust correlation single hclust pearson mcquitty hclust correlation median mec hclust pearson average hclust correlation complete hclust correlation averagehclust pearson complete hclust binary single hclust binary average kmeans pearson som hclust correlation centroid rock hclust pearson centroid hclust binary median hclust binary mcquitty hclust canberra single biclust_spectral hclust spearman complete spec_man spec_cos kmeans kendall median hclust canberra hclust canberra average spec_mink spec_euc spec_max mspec_minkspec_canb mspec_man affprop maximum kmeans spearman kmeans manhattan mspec_max mspec_cos mspec_canb mspec_euc kmeansbinary centroid hclust canberra hclust kendall single hclust spearman centroid hclusthclust kendall centroid kendall medianaverage average spearmankendall mcquitty hclust median hclust spearman single hclust spearman mcquitty kendall complete hclust canberra centroid hclust hclust manhattan medianmanhattan euclideankmedoids hclusthclustmanhattan centroid hclust manhattan single hclust manhattan average single manhattan affprop hclust euclidean median hclust maximum single manhattan divisive hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust correlation ward hclust euclidean mcquitty kmedoids euclidean hclust pearson wardstand.euc kmedoids hclust maximummedian divisive centroid hclust maximumeuclideanaffprop euclidean hclust canberra mcquitty hclust maximum average hclust maximum complete hclust euclidean complete hclust manhattan maximum mcquitty hclust complete dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max dist_cos dismea hclust manhattan ward affprop info.costs kmeanshclust euclidean ward euclidean hclust canberra complete sot_euc hclust binary ward hclusthclust spearman ward kendall ward hclust maximum ward kmeans binary kmeans maximum Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 95. Example Discovery mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust correlationmixvmfVAbinary complete hclust mcquitty hclust pearson single affprop cosine hclust pearson median hclust correlation single hclust pearson mcquitty hclust correlation median mec hclust pearson average hclust correlation complete hclust correlation averagehclust pearson complete hclust binary single hclust binary average som hclust correlation centroid rock hclust pearson centroid hclust binary median hclust canberra single biclust_spectral affprop cosine hclust spearman complete hclust binary mcquitty kmeans pearson spec_man spec_cos kmeans kendall median hclust canberra hclust canberra average spec_mink spec_euc spec_max mspec_minkspec_canb mspec_man affprop maximum kmeans spearman kmeans manhattan mspec_max mspec_cos mspec_canb mspec_euc kmeansbinary centroid hclust canberra hclust kendall single hclust spearman centroid hclusthclust kendall centroid kendall medianaverage average spearmankendall mcquitty hclust median hclust spearman single hclust spearman mcquitty kendall complete hclust canberra centroid hclust hclust manhattan medianmanhattan euclideankmedoids hclusthclustmanhattan centroid hclust manhattan single hclust manhattan average single manhattan affprop hclust euclidean median hclust maximum single manhattan divisive hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust correlation ward hclust euclidean mcquitty kmedoids euclidean hclust pearson wardstand.euc kmedoids hclust maximummedian divisive centroid hclust maximumeuclideanaffprop euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust manhattan maximum mcquitty hclust complete dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max hclust manhattan ward dist_cos dismea hclust canberra mcquitty Red point: a clustering by affprop info.costs kmeanshclust euclidean ward sot_euc euclidean hclust canberra complete hclust binary ward Affinity Propagation-Cosine hclust maximum ward hclusthclust spearman ward kendall ward kmeans binary (Dueck and Frey 2007) kmeans maximum Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
  • 96. Example Discovery mult_dirproc mixvmf kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust correlationmixvmfVAbinary complete hclust mcquitty hclust pearson single affprop cosine hclust pearson median hclust correlation single hclust pearson mcquitty hclust correlation median mec hclust pearson average hclust correlation complete hclust correlation averagehclust pearson complete hclust binary single hclust binary average som hclust correlation centroid rock hclust pearson centroid hclust binary median hclust canberra single biclust_spectral affprop cosine hclust spearman complete hclust binary mcquitty kmeans pearson spec_man spec_cos kmeans kendall median hclust canberra hclust canberra average spec_mink spec_euc spec_max mspec_minkspec_canb mspec_man affprop maximum kmeans spearman kmeans manhattan mspec_max mspec_cos mspec_canb mspec_euc kmeansbinary centroid hclust canberra hclust kendall single hclust spearman centroid hclusthclust kendall centroid kendall medianaverage average spearmankendall mcquitty hclust median hclust spearman single hclust spearman mcquitty kendall complete hclust canberra centroid hclust hclust manhattan medianmanhattan euclideankmedoids hclusthclustmanhattan centroid hclust manhattan single hclust manhattan average single manhattan affprop hclust euclidean median hclust maximum single manhattan divisive hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust correlation ward hclust euclidean mcquitty kmedoids euclidean hclust pearson wardstand.euc kmedoids hclust maximummedian divisive centroid hclust maximumeuclideanaffprop euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust manhattan maximum mcquitty hclust complete dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max hclust manhattan ward dist_cos dismea hclust canberra mcquitty Red point: a clustering by affprop info.costs kmeanshclust euclidean ward sot_euc euclidean hclust canberra complete hclust binary ward Affinity Propagation-Cosine hclust maximum ward hclusthclust spearman ward kendall ward kmeans binary (Dueck and Frey 2007) kmeans maximum Close to: Mixture of von Mises-Fisher distributions (Banerjee et. al. 2005) Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20