discov-uga_1.pdf

Computer-Assisted Clustering and Conceptualization

Gary King

Institute for Quantitative Social Science
Harvard University

Parthemos Lecture at University of Georgia, 3/4/2011

1
Based on joint work with Justin Grimmer (Harvard Stanford)
Parthemos Lecture at University of
Gary King (Harvard IQSS) Quantitative Discovery / 20

A Method for Computer Assisted Conceptualization

Conceptualization through Classiﬁcation: “one of the most central
and generic of all our conceptual exercises. . . . the foundation not
only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classiﬁcation,
there could be no advanced conceptualization, reasoning, language,
data analysis or,for that matter, social science research.” (Bailey,
1994).



1994).
Cluster Analysis: simultaneously (1) invents categories and (2)
assigns documents to categories



1994).
We focus on unstructured text; methods apply more broadly.



1994).
We focus on unstructured text; methods apply more broadly.
Main goal: Switch from Fully Automated to Computer Assisted


What’s Hard about Clustering?


(aka Why Johnny Can’t Classify)



Clustering seems easy; its not!



Bell(n) = number of ways of partitioning n objects



Bell(2) = 2 (AB, A B)



Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)



Bell(2) = 2 (AB, A B)
Bell(5) = 52



Bell(2) = 2 (AB, A B)
Bell(5) = 52
Bell(100) ≈



Bell(2) = 2 (AB, A B)
Bell(5) = 52
Bell(100) ≈ 1028 × Number of elementary particles in the universe



Bell(2) = 2 (AB, A B)
Bell(5) = 52
Now imagine choosing the optimal classiﬁcation scheme by hand!



Bell(2) = 2 (AB, A B)
Bell(5) = 52
Now imagine choosing the optimal classiﬁcation scheme by hand!
Fully automated algorithms can help, but which ones?


The Problem with Fully Automated Clustering



The (Impossible) Goal: optimal, fully automated,
application-independent cluster analysis



No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications



Existing methods:



Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-
based, fuzzy k-modes, aﬃnity propagation, self-organizing maps,. . .



Existing methods:
Well-deﬁned statistical, data analytic, or machine learning foundations



Existing methods:
How to add substantive knowledge: With few exceptions, unclear



Existing methods:
The literature: little guidance on when methods apply



Existing methods:
Deriving such guidance: diﬃcult or impossible



Existing methods:
Deep problem: full automation requires more information



Existing methods:
Deep problem: full automation requires more information
No surprise: everyone’s tried cluster analysis; very few are satisﬁed


Switch from Fully Automated to Computer Assisted



Fully Automated Clustering may succeed sometimes, but fails in
general: too hard to understand when each model applies



An alternative: Computer-Assisted Clustering



Easy in theory: list all clusterings; choose the best



Impossible in practice: Too hard for us mere humans!



An organized list will make the search possible



Insight: Many clusterings are perceptually identical



E.g.,: consider two clusterings that diﬀer only because one document
(of 10,000) moves from category 5 to 6



E.g.,: consider two clusterings that diﬀer only because one document
(of 10,000) moves from category 5 to 6
Question: How to organize clusterings so humans can understand?


Our Idea: Meaning Through Geography

Set of clusterings



Set of clusterings ≈
A list of unconnected addresses



Set of clusterings ≈
A list of unconnected addresses

We develop a (conceptual) geography of clusterings

A New Strategy
Make it easy to choose best clustering from millions of choices


A New Strategy

1 Code text as numbers (in one or more of several ways)


A New Strategy

2 Apply all clustering methods we can ﬁnd to the data — each
representing diﬀerent (unstated) substantive assumptions (<15 mins)


A New Strategy

3 (Too much for a person to understand, but organization will help)


A New Strategy

4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection


A New Strategy

5 “Local cluster ensemble” creates a new clustering at any point, based
on weighted average of nearby clusterings


A New Strategy

6 A new animated visualization to explore the space of clusterings
(smoothly morphing from one into others)


A New Strategy

6 A new animated visualization to explore the space of clusterings
(smoothly morphing from one into others)
7 Millions of clusterings, easily comprehended


Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information,. . .

Obama Space of mixvmf
Clusterings Clustering 2
Ford Clustering 1
affprop info.costs Carter
Nixon
kmedoids stand.euc Johnson

Carter Eisenhower
rock affprop maximum Ford Roosevelt
kmeans correlation hclust correlation single
hclust pearson single
Eisenhower
Truman Truman

Johnson Roosevelt hclust maximum single
hclust correlation median
hclust binary hclust correlationmedian
centroidpearson centroid
hclust pearson centroid
hclust spec_max Nixon
``Other hclust canberra centroid
``Roosevelt
hclust correlationaverage
average
hclust pearson mcquitty
mcquitty
hclust kendall single hclust maximum ward
Presidents '' hclust euclidean centroid To Carter''
hclust canberra mcquitty binary median
hclust
hclust canberra median
kmeans kendall hclust canberra single mspec_max
hclust binary single biclust_spectral
affprop manhattan
affprop cosine
q
Clinton hclust manhattan centroid
hclust manhattanmedian
hclust maximum single
hclust spearman centroid
hclust maximum centroid
kmedoids manhattan
kendall centroid
mspec_canb hclust euclidean median
hclust canberra average hclust correlation complete
hclust pearson complete
divisive stand.euc
mspec_cos hclust kendall average
hclust manhattan median
hclust spearman median
hclust kendall median kmeans maximum
hclust euclideanaverage
single
hclust maximum mcquitty
hclust maximum complete
kmeans pearson affprop euclidean hclust mcquitty average
hclust manhattan average
euclidean
Kennedy
Kennedy hclust spearman single q divisive euclidean
Bushkmeans binary
hclust binary average kmedoids euclidean
som
hclust spearman average spec_mink
mspec_euc
mspec_mink

hclust binary complete
hclust binary mcquitty divisive manhattan
mspec_man
hclust euclidean mcquitty
hclust euclidean complete hclust kendall complete
hclust correlation ward complete
hclust canberra Bush
clust_convex
hclust euclidean ward
hclust spearman mcquitty
hclust kendall mcquitty dismea Obama
hclust binary ward
hclust canberra ward hclust spearman complete
hclust manhattan complete
spec_canb hclust kendall ward
mixvmfVA
spec_cos spec_euc
hclust manhattan ward kmeans euclidean
kmeans manhattan
spec_man
hclust pearson ward
``Reagan `` Reagan To
Republicans'' hclust spearman ward Obama ''
kmeans spearman

Reagan
kmeans canberra
HWBush
HWBush
Clinton
Reagan
mult_dirproc


Software Screenshot


Evaluating Performance



Goals:



Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization



Goals:
Demonstrate: new experimental designs for cluster evaluation



Goals:
Inject human judgement: relying on insights from survey research



Goals:
We now present three evaluations



Goals:
Cluster Quality ⇒ RA coders



Goals:
Informative discoveries ⇒ Experienced scholars analyzing texts



Goals:
Informative discoveries ⇒ Experienced scholars analyzing texts
Discovery ⇒ You’re the judge


Evaluation 1: Cluster Quality



What Are Humans Good For?



They can’t: keep many documents & clusters in their head



They can: compare two documents at a time



=⇒ Cluster quality evaluation: human judgement of document pairs



Experimental Design to Assess Cluster Quality



automated visualization to choose one clustering



many pairs of documents



for coders: (1) unrelated, (2) loosely related, (3) closely related



Quality = mean(within cluster) - mean(between clusters)



Quality = mean(within cluster) - mean(between clusters)
Bias results against ourselves by not letting evaluators choose clustering



−0.3 −0.2 −0.1 0.1 0.2 0.3

(Our Method) − (Human Coders)



Lautenberg Press Releases
q

−0.3 −0.2 −0.1 0.1 0.2 0.3


Lautenberg: 200 Senate Press Releases (appropriations, economy,
education, tax, veterans, . . . )



q

Policy Agendas Project
q

−0.3 −0.2 −0.1 0.1 0.2 0.3


Policy Agendas: 213 quasi-sentences from Bush’s State of the Union
(agriculture, banking & commerce, civil rights/liberties, defense, . . . )



q

Policy Agendas Project
q

Reuter's Gold Standard
q

−0.3 −0.2 −0.1 0.1 0.2 0.3


Reuter’s: ﬁnancial news (trade, earnings, copper, gold, coﬀee, . . . ); “gold
standard” for supervised learning studies


Evaluation 2: More Informative Discoveries



Found 2 scholars analyzing lots of textual data for their work



Created 6 clusterings:



2 clusterings selected with our method (biased against us)



2 clusterings from each of 2 other methods (varying tuning parameters)



Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)



6
Asked for 2 =15 pairwise comparisons



6
User chooses ⇒ only care about the one clustering that wins



6
Both cases a Condorcet winner:



6
“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2



6
“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Evaluation 3: What Do Members of Congress Do?



- David Mayhew’s (1974) famous typology



- Advertising



- Advertising
- Credit Claiming



- Advertising
- Credit Claiming
- Position Taking



- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg’s oﬃce (D-NJ)



- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg’s oﬃce (D-NJ)
- Apply our method


Example Discovery

mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc
mixvmf
hclust correlationmixvmfVAbinary complete
hclust
mcquitty
hclust pearson single affprop cosine
hclust pearson median
hclust correlation single hclust pearson mcquitty
mec hclust pearson average hclust correlation complete
hclust correlation averagehclust pearson complete
hclust binary single
hclust binary average kmeans pearson
som
hclust correlation centroid rock
hclust binary median hclust binary mcquitty
hclust canberra single
biclust_spectral hclust spearman complete
spec_man
spec_cos
kmeans kendall median
hclust canberra
hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeansbinary centroid
hclust canberra hclust kendall single
hclusthclust kendall centroid
kendall medianaverage
average
spearmankendall mcquitty
hclust median
hclust spearman single
hclust spearman mcquitty kendall complete
hclust canberra centroid hclust
hclust manhattan medianmanhattan
euclideankmedoids
hclusthclustmanhattan centroid
hclust manhattan single
single manhattan
affprop
hclust euclidean median
hclust maximum single manhattan
divisive
hclust euclidean centroid
hclust euclidean average
hclust manhattan mcquitty clust_convex hclust correlation ward
kmedoids euclidean hclust pearson wardstand.euc
kmedoids
hclust maximummedian
divisive centroid
hclust maximumeuclideanaffprop euclidean hclust canberra mcquitty
hclust maximum average
hclust euclidean complete
hclust manhattan maximum mcquitty
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
hclust manhattan ward affprop info.costs
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward

hclusthclust spearman ward
kendall ward
hclust maximum ward kmeans binary

kmeans maximum


Example Discovery

mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
hclust
mcquitty
hclust binary average
som
hclust binary median
biclust_spectral
affprop cosine
hclust spearman complete
hclust binary mcquitty
kmeans pearson

spec_man
spec_cos
hclust canberra
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
average
hclust median
euclideankmedoids
single manhattan
affprop
divisive
kmedoids
divisive centroid
hclust maximumeuclideanaffprop euclidean
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
hclust canberra mcquitty

Red point: a clustering by
affprop info.costs
sot_euc
hclust binary ward Aﬃnity Propagation-Cosine
hclust maximum ward
kendall ward

kmeans binary (Dueck and Frey 2007)
kmeans maximum


Example Discovery

mult_dirproc

mixvmf kmeans correlation
divisive stand.euc
mixvmf
hclust
mcquitty
hclust binary average
som
hclust binary median
biclust_spectral
affprop cosine
hclust spearman complete
hclust binary mcquitty
kmeans pearson

spec_man
spec_cos
hclust canberra
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
average
hclust median
euclideankmedoids
single manhattan
affprop
divisive
kmedoids
divisive centroid
hclust maximumeuclideanaffprop euclidean
hclust complete
dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
hclust canberra mcquitty

Red point: a clustering by
affprop info.costs
sot_euc
hclust binary ward Aﬃnity Propagation-Cosine
hclust maximum ward
kendall ward

kmeans binary (Dueck and Frey 2007)
kmeans maximum

Close to:
Mixture of von Mises-Fisher
distributions (Banerjee et. al.
2005)


discov-uga_1.pdf

Recomendados

Recomendados

Más contenido relacionado

Destacado

Destacado (20)

discov-uga_1.pdf