Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
discov-uga_1.pdf
1. Computer-Assisted Clustering and Conceptualization
Gary King
Institute for Quantitative Social Science
Harvard University
Parthemos Lecture at University of Georgia, 3/4/2011
1
Based on joint work with Justin Grimmer (Harvard Stanford)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
2. A Method for Computer Assisted Conceptualization
Conceptualization through Classification: “one of the most central
and generic of all our conceptual exercises. . . . the foundation not
only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classification,
there could be no advanced conceptualization, reasoning, language,
data analysis or,for that matter, social science research.” (Bailey,
1994).
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
3. A Method for Computer Assisted Conceptualization
Conceptualization through Classification: “one of the most central
and generic of all our conceptual exercises. . . . the foundation not
only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classification,
there could be no advanced conceptualization, reasoning, language,
data analysis or,for that matter, social science research.” (Bailey,
1994).
Cluster Analysis: simultaneously (1) invents categories and (2)
assigns documents to categories
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
4. A Method for Computer Assisted Conceptualization
Conceptualization through Classification: “one of the most central
and generic of all our conceptual exercises. . . . the foundation not
only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classification,
there could be no advanced conceptualization, reasoning, language,
data analysis or,for that matter, social science research.” (Bailey,
1994).
Cluster Analysis: simultaneously (1) invents categories and (2)
assigns documents to categories
We focus on unstructured text; methods apply more broadly.
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
5. A Method for Computer Assisted Conceptualization
Conceptualization through Classification: “one of the most central
and generic of all our conceptual exercises. . . . the foundation not
only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classification,
there could be no advanced conceptualization, reasoning, language,
data analysis or,for that matter, social science research.” (Bailey,
1994).
Cluster Analysis: simultaneously (1) invents categories and (2)
assigns documents to categories
We focus on unstructured text; methods apply more broadly.
Main goal: Switch from Fully Automated to Computer Assisted
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
6. What’s Hard about Clustering?
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
7. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
8. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
9. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
10. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
11. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
12. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
13. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
14. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 1028 × Number of elementary particles in the universe
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
15. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 1028 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
16. What’s Hard about Clustering?
(aka Why Johnny Can’t Classify)
Clustering seems easy; its not!
Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 1028 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
Fully automated algorithms can help, but which ones?
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
17. The Problem with Fully Automated Clustering
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
18. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
19. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
20. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
21. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
22. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
23. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, unclear
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
24. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, unclear
The literature: little guidance on when methods apply
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
25. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, unclear
The literature: little guidance on when methods apply
Deriving such guidance: difficult or impossible
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
26. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, unclear
The literature: little guidance on when methods apply
Deriving such guidance: difficult or impossible
Deep problem: full automation requires more information
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
27. The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis
No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, unclear
The literature: little guidance on when methods apply
Deriving such guidance: difficult or impossible
Deep problem: full automation requires more information
No surprise: everyone’s tried cluster analysis; very few are satisfied
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
28. Switch from Fully Automated to Computer Assisted
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
29. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
30. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
31. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
32. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Impossible in practice: Too hard for us mere humans!
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
33. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Impossible in practice: Too hard for us mere humans!
An organized list will make the search possible
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
34. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Impossible in practice: Too hard for us mere humans!
An organized list will make the search possible
Insight: Many clusterings are perceptually identical
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
35. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Impossible in practice: Too hard for us mere humans!
An organized list will make the search possible
Insight: Many clusterings are perceptually identical
E.g.,: consider two clusterings that differ only because one document
(of 10,000) moves from category 5 to 6
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
36. Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
Easy in theory: list all clusterings; choose the best
Impossible in practice: Too hard for us mere humans!
An organized list will make the search possible
Insight: Many clusterings are perceptually identical
E.g.,: consider two clusterings that differ only because one document
(of 10,000) moves from category 5 to 6
Question: How to organize clusterings so humans can understand?
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
37. Our Idea: Meaning Through Geography
Set of clusterings
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
38. Our Idea: Meaning Through Geography
Set of clusterings ≈
A list of unconnected addresses
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
39. Our Idea: Meaning Through Geography
Set of clusterings ≈
A list of unconnected addresses
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
40. Our Idea: Meaning Through Geography
Set of clusterings ≈
A list of unconnected addresses
We develop a (conceptual) geography of clusterings
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
41. A New Strategy
Make it easy to choose best clustering from millions of choices
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
42. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
43. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
44. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
45. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
46. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection
5 “Local cluster ensemble” creates a new clustering at any point, based
on weighted average of nearby clusterings
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
47. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection
5 “Local cluster ensemble” creates a new clustering at any point, based
on weighted average of nearby clusterings
6 A new animated visualization to explore the space of clusterings
(smoothly morphing from one into others)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
48. A New Strategy
Make it easy to choose best clustering from millions of choices
1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data — each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection
5 “Local cluster ensemble” creates a new clustering at any point, based
on weighted average of nearby clusterings
6 A new animated visualization to explore the space of clusterings
(smoothly morphing from one into others)
7 Millions of clusterings, easily comprehended
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
49. Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information,. . .
Obama Space of mixvmf
Clusterings Clustering 2
Ford Clustering 1
affprop info.costs Carter
Nixon
kmedoids stand.euc Johnson
Carter Eisenhower
rock affprop maximum Ford Roosevelt
kmeans correlation hclust correlation single
hclust pearson single
Eisenhower
Truman Truman
Johnson Roosevelt hclust maximum single
hclust correlation median
hclust binary hclust correlationmedian
centroidpearson centroid
hclust pearson centroid
hclust spec_max Nixon
``Other hclust canberra centroid
``Roosevelt
hclust correlationaverage
average
hclust pearson mcquitty
mcquitty
hclust kendall single hclust maximum ward
Presidents '' hclust euclidean centroid To Carter''
hclust canberra mcquitty binary median
hclust
hclust canberra median
kmeans kendall hclust canberra single mspec_max
hclust binary single biclust_spectral
affprop manhattan
affprop cosine
q
Clinton hclust manhattan centroid
hclust manhattanmedian
hclust maximum single
hclust spearman centroid
hclust maximum centroid
kmedoids manhattan
kendall centroid
mspec_canb hclust euclidean median
hclust canberra average hclust correlation complete
hclust pearson complete
divisive stand.euc
mspec_cos hclust kendall average
hclust manhattan median
hclust spearman median
hclust kendall median kmeans maximum
hclust euclideanaverage
single
hclust maximum mcquitty
hclust maximum complete
kmeans pearson affprop euclidean hclust mcquitty average
hclust manhattan average
euclidean
Kennedy
Kennedy hclust spearman single q divisive euclidean
Bushkmeans binary
hclust binary average kmedoids euclidean
som
hclust spearman average spec_mink
mspec_euc
mspec_mink
hclust binary complete
hclust binary mcquitty divisive manhattan
mspec_man
hclust euclidean mcquitty
hclust euclidean complete hclust kendall complete
hclust correlation ward complete
hclust canberra Bush
clust_convex
hclust euclidean ward
hclust spearman mcquitty
hclust kendall mcquitty dismea Obama
hclust binary ward
hclust canberra ward hclust spearman complete
hclust manhattan complete
spec_canb hclust kendall ward
mixvmfVA
spec_cos spec_euc
hclust manhattan ward kmeans euclidean
kmeans manhattan
spec_man
hclust pearson ward
``Reagan `` Reagan To
Republicans'' hclust spearman ward Obama ''
kmeans spearman
Reagan
kmeans canberra
HWBush
HWBush
Clinton
Reagan
mult_dirproc
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
50. Software Screenshot
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
51. Evaluating Performance
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
52. Evaluating Performance
Goals:
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
53. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
54. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
55. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
56. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
We now present three evaluations
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
57. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
We now present three evaluations
Cluster Quality ⇒ RA coders
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
58. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
We now present three evaluations
Cluster Quality ⇒ RA coders
Informative discoveries ⇒ Experienced scholars analyzing texts
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
59. Evaluating Performance
Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
We now present three evaluations
Cluster Quality ⇒ RA coders
Informative discoveries ⇒ Experienced scholars analyzing texts
Discovery ⇒ You’re the judge
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
60. Evaluation 1: Cluster Quality
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
61. Evaluation 1: Cluster Quality
What Are Humans Good For?
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
62. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
63. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
64. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
65. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
66. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
67. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
many pairs of documents
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
68. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
many pairs of documents
for coders: (1) unrelated, (2) loosely related, (3) closely related
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
69. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
many pairs of documents
for coders: (1) unrelated, (2) loosely related, (3) closely related
Quality = mean(within cluster) - mean(between clusters)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
70. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
many pairs of documents
for coders: (1) unrelated, (2) loosely related, (3) closely related
Quality = mean(within cluster) - mean(between clusters)
Bias results against ourselves by not letting evaluators choose clustering
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
71. Evaluation 1: Cluster Quality
What Are Humans Good For?
They can’t: keep many documents & clusters in their head
They can: compare two documents at a time
=⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
automated visualization to choose one clustering
many pairs of documents
for coders: (1) unrelated, (2) loosely related, (3) closely related
Quality = mean(within cluster) - mean(between clusters)
Bias results against ourselves by not letting evaluators choose clustering
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
72. Evaluation 1: Cluster Quality
−0.3 −0.2 −0.1 0.1 0.2 0.3
(Our Method) − (Human Coders)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
73. Evaluation 1: Cluster Quality
Lautenberg Press Releases
q
−0.3 −0.2 −0.1 0.1 0.2 0.3
(Our Method) − (Human Coders)
Lautenberg: 200 Senate Press Releases (appropriations, economy,
education, tax, veterans, . . . )
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
74. Evaluation 1: Cluster Quality
Lautenberg Press Releases
q
Policy Agendas Project
q
−0.3 −0.2 −0.1 0.1 0.2 0.3
(Our Method) − (Human Coders)
Policy Agendas: 213 quasi-sentences from Bush’s State of the Union
(agriculture, banking & commerce, civil rights/liberties, defense, . . . )
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
75. Evaluation 1: Cluster Quality
Lautenberg Press Releases
q
Policy Agendas Project
q
Reuter's Gold Standard
q
−0.3 −0.2 −0.1 0.1 0.2 0.3
(Our Method) − (Human Coders)
Reuter’s: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold
standard” for supervised learning studies
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
76. Evaluation 2: More Informative Discoveries
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
77. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
78. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
79. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
80. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
81. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
82. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
6
Asked for 2 =15 pairwise comparisons
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
83. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
6
Asked for 2 =15 pairwise comparisons
User chooses ⇒ only care about the one clustering that wins
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
84. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
6
Asked for 2 =15 pairwise comparisons
User chooses ⇒ only care about the one clustering that wins
Both cases a Condorcet winner:
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
85. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
6
Asked for 2 =15 pairwise comparisons
User chooses ⇒ only care about the one clustering that wins
Both cases a Condorcet winner:
“Immigration”:
Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
86. Evaluation 2: More Informative Discoveries
Found 2 scholars analyzing lots of textual data for their work
Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)
Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)
6
Asked for 2 =15 pairwise comparisons
User chooses ⇒ only care about the one clustering that wins
Both cases a Condorcet winner:
“Immigration”:
Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
“Genetic testing”:
Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
87. Evaluation 3: What Do Members of Congress Do?
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
88. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
89. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
- Advertising
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
90. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
- Advertising
- Credit Claiming
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
91. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
- Advertising
- Credit Claiming
- Position Taking
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
92. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
93. Evaluation 3: What Do Members of Congress Do?
- David Mayhew’s (1974) famous typology
- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)
- Apply our method
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
94. Example Discovery
mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc
mixvmf
hclust correlationmixvmfVAbinary complete
hclust
mcquitty
hclust pearson single affprop cosine
hclust pearson median
hclust correlation single hclust pearson mcquitty
hclust correlation median
mec hclust pearson average hclust correlation complete
hclust correlation averagehclust pearson complete
hclust binary single
hclust binary average kmeans pearson
som
hclust correlation centroid rock
hclust pearson centroid
hclust binary median hclust binary mcquitty
hclust canberra single
biclust_spectral hclust spearman complete
spec_man
spec_cos
kmeans kendall median
hclust canberra
hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeansbinary centroid
hclust canberra hclust kendall single
hclust spearman centroid
hclusthclust kendall centroid
kendall medianaverage
average
spearmankendall mcquitty
hclust median
hclust spearman single
hclust spearman mcquitty kendall complete
hclust canberra centroid hclust
hclust manhattan medianmanhattan
euclideankmedoids
hclusthclustmanhattan centroid
hclust manhattan single
hclust manhattan average
single manhattan
affprop
hclust euclidean median
hclust maximum single manhattan
divisive
hclust euclidean centroid
hclust euclidean average
hclust manhattan mcquitty clust_convex hclust correlation ward
hclust euclidean mcquitty
kmedoids euclidean hclust pearson wardstand.euc
kmedoids
hclust maximummedian
divisive centroid
hclust maximumeuclideanaffprop euclidean hclust canberra mcquitty
hclust maximum average
hclust maximum complete
hclust euclidean complete
hclust manhattan maximum mcquitty
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
hclust manhattan ward affprop info.costs
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward
hclusthclust spearman ward
kendall ward
hclust maximum ward kmeans binary
kmeans maximum
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
95. Example Discovery
mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc
mixvmf
hclust correlationmixvmfVAbinary complete
hclust
mcquitty
hclust pearson single affprop cosine
hclust pearson median
hclust correlation single hclust pearson mcquitty
hclust correlation median
mec hclust pearson average hclust correlation complete
hclust correlation averagehclust pearson complete
hclust binary single
hclust binary average
som
hclust correlation centroid rock
hclust pearson centroid
hclust binary median
hclust canberra single
biclust_spectral
affprop cosine
hclust spearman complete
hclust binary mcquitty
kmeans pearson
spec_man
spec_cos
kmeans kendall median
hclust canberra
hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeansbinary centroid
hclust canberra hclust kendall single
hclust spearman centroid
hclusthclust kendall centroid
kendall medianaverage
average
spearmankendall mcquitty
hclust median
hclust spearman single
hclust spearman mcquitty kendall complete
hclust canberra centroid hclust
hclust manhattan medianmanhattan
euclideankmedoids
hclusthclustmanhattan centroid
hclust manhattan single
hclust manhattan average
single manhattan
affprop
hclust euclidean median
hclust maximum single manhattan
divisive
hclust euclidean centroid
hclust euclidean average
hclust manhattan mcquitty clust_convex hclust correlation ward
hclust euclidean mcquitty
kmedoids euclidean hclust pearson wardstand.euc
kmedoids
hclust maximummedian
divisive centroid
hclust maximumeuclideanaffprop euclidean
hclust maximum average
hclust maximum complete
hclust euclidean complete
hclust manhattan maximum mcquitty
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
hclust canberra mcquitty
Red point: a clustering by
affprop info.costs
kmeanshclust euclidean ward
sot_euc
euclidean hclust canberra complete
hclust binary ward Affinity Propagation-Cosine
hclust maximum ward
hclusthclust spearman ward
kendall ward
kmeans binary (Dueck and Frey 2007)
kmeans maximum
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20
96. Example Discovery
mult_dirproc
mixvmf kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc
mixvmf
hclust correlationmixvmfVAbinary complete
hclust
mcquitty
hclust pearson single affprop cosine
hclust pearson median
hclust correlation single hclust pearson mcquitty
hclust correlation median
mec hclust pearson average hclust correlation complete
hclust correlation averagehclust pearson complete
hclust binary single
hclust binary average
som
hclust correlation centroid rock
hclust pearson centroid
hclust binary median
hclust canberra single
biclust_spectral
affprop cosine
hclust spearman complete
hclust binary mcquitty
kmeans pearson
spec_man
spec_cos
kmeans kendall median
hclust canberra
hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeansbinary centroid
hclust canberra hclust kendall single
hclust spearman centroid
hclusthclust kendall centroid
kendall medianaverage
average
spearmankendall mcquitty
hclust median
hclust spearman single
hclust spearman mcquitty kendall complete
hclust canberra centroid hclust
hclust manhattan medianmanhattan
euclideankmedoids
hclusthclustmanhattan centroid
hclust manhattan single
hclust manhattan average
single manhattan
affprop
hclust euclidean median
hclust maximum single manhattan
divisive
hclust euclidean centroid
hclust euclidean average
hclust manhattan mcquitty clust_convex hclust correlation ward
hclust euclidean mcquitty
kmedoids euclidean hclust pearson wardstand.euc
kmedoids
hclust maximummedian
divisive centroid
hclust maximumeuclideanaffprop euclidean
hclust maximum average
hclust maximum complete
hclust euclidean complete
hclust manhattan maximum mcquitty
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
hclust canberra mcquitty
Red point: a clustering by
affprop info.costs
kmeanshclust euclidean ward
sot_euc
euclidean hclust canberra complete
hclust binary ward Affinity Propagation-Cosine
hclust maximum ward
hclusthclust spearman ward
kendall ward
kmeans binary (Dueck and Frey 2007)
kmeans maximum
Close to:
Mixture of von Mises-Fisher
distributions (Banerjee et. al.
2005)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20