Legal Analytics Course - Class 9 - Clustering Algorithms (K-Means & Hierarchical Clustering) - Professor Daniel Martin Katz + Professor Michael J Bommarito
1. Class 9
K-Means & Hierarchical Clustering
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
5. Task = Can We Determine to Which
Group the Agent Belongs?
Clustering (Unsupervised Learning)
[Figure: a function f(·) maps each agent to a group (cluster)]
access more at legalanalyticscourse.com
12. “Similar” is the Key Idea (but it is a slippery concept)
Clustering is a Method of Grouping Similar Objects
Clustering is typically Unsupervised Learning
14. There are a variety of methods used in this area
(Agglomerative versus Divisive Methods)
Remember real data is n-dimensional
(which makes implementation / accuracy challenging)
21. in clustering, we are interested in trying
to formalize the idea of ‘similarity’
22. A typical approach is to project
n-dimensional data into
a unidimensional ‘similarity index’
[Figure: a similarity or distance function f(·) collapses dimensions 1 through n into a unidimensional similarity index]
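The projection above can be sketched as an ordinary distance function. A minimal Python example, with hypothetical three-feature vectors standing in for the n dimensions:

```python
import math

def euclidean_distance(a, b):
    """Collapse two n-dimensional points into a single scalar distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two hypothetical objects described by n = 3 features each
case_a = [1.0, 2.0, 3.0]
case_b = [4.0, 6.0, 3.0]
print(euclidean_distance(case_a, case_b))  # 5.0
```

Euclidean distance is just one common choice; as the next slides stress, the right function depends on the substantive problem.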
23. everything in its own cluster
(i.e. everyone is a special snowflake)
everything in one cluster
unidimensional similarity spectrum
24. as we slide across the spectrum above, from a 0% similarity threshold to a 100% similarity threshold, is where the groupings become interesting
the hard question is where to stop as we move from left to right
25. The Heavy Lifting is to
develop/apply the optimal
similarity/distance function
for the substantive problem at issue
27. Goal for Any Clustering Method:
Achieve High Within-Cluster Similarity
Achieve Low Cross-Cluster Similarity
28. We Want to Develop a Notion
of Distance Between Objects
Similarity is inversely related to distance
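One common way to make "similarity is inversely related to distance" concrete (an illustrative choice, not the only one) is the transform s = 1 / (1 + d):

```python
def similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]:
    distance 0 gives similarity 1; larger distances give smaller similarity."""
    return 1.0 / (1.0 + distance)

print(similarity(0.0))  # 1.0 (identical objects)
print(similarity(4.0))  # 0.2 (distant objects)
```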
33. K Means
How do we find the clusters in a given data set?
We select K clusters in advance
Iteratively seek to minimize the sum of squared distances
34. K Means Optimization
We start with K clusters with unknown centers
We are attempting to minimize the sum of squared distances
(i.e. the within-cluster sum-of-squares objective function)
The tricky part is that this minimization problem
cannot be solved analytically
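The "sum of squared distances" objective can be written out explicitly; this is the standard k-means within-cluster sum of squares:

```latex
J \;=\; \sum_{k=1}^{K} \;\sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

where mu_k is the center (mean) of cluster C_k; both the assignment of points to clusters and the centers themselves are chosen to minimize J, and it is this joint choice that has no closed-form solution, hence the iterative heuristic that follows.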
35. Stuart Lloyd proposed a simple heuristic solution
“Lloyd’s algorithm” aka “k-means” is a good candidate solution
K Means Optimization
from Flach, page 248
37. K-Means
where k = 2
Adapted from Example by Piyush Rai
initialization step
38. First Iteration - Assigning Points
39. First Iteration - Recalculate the Center of the Cluster
40. Second Iteration - Assigning Points
41. Second Iteration - Recalculate the Center of the Cluster
42. Third Iteration - Assigning Points
43. Third Iteration - Recalculate the Center of the Cluster
44. K Means Clustering
Fast method, but can converge to a local minimum
Should repeat from different starting conditions
(and then keep the best of the resulting solutions)
An important weakness is that it is often not clear what value of K to use
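The restart idea can be sketched as: run Lloyd's algorithm from several random starts and keep the run with the lowest sum of squared distances (the data, seeds, and names here are all illustrative):

```python
import random

def kmeans_sse(points, k, seed):
    """Run one pass of Lloyd's algorithm from a random start; return the
    resulting sum of squared distances (lower is better)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(50):
        # assignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step (keep the old center if a cluster went empty)
        centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    # objective: each point's squared distance to its nearest center
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0),
        (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
# restart from several seeds and keep the lowest objective
sses = [kmeans_sse(data, k=3, seed=s) for s in range(10)]
best = min(sses)
```

Choosing K is a separate problem; a common heuristic is to plot the best objective against several values of K and look for a bend ("elbow") in the curve.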
48. Hierarchical Clustering
Partitions can be visualized using a tree structure (a dendrogram)
Does not need the number of clusters as input
Possible to view partitions at different levels of granularity
(i.e., can refine/coarsen clusters) using different K
Description via: Piyush Rai
50. Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all
observations start in one cluster, and splits are
performed recursively as one moves down the
hierarchy.
Agglomerative versus Divisive Methods
62. Hierarchical Clustering
“(1) Start by assigning each item to a cluster, so that if you have
N items, you now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as
the distances (similarities) between the items they contain.
(2) Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one cluster less.
(3) Compute distances (similarities) between the new cluster and
each of the old clusters.
(4) Repeat steps 2 and 3 until all items are clustered into a single
cluster of size N. (*)”
S. C. Johnson (1967): "Hierarchical Clustering Schemes" Psychometrika, 32:241-254
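Johnson's four steps translate almost directly into code. A minimal sketch using single-linkage distances and stopping once a target number of clusters remains (the data and names are illustrative):

```python
def agglomerative(points, target_k):
    """Johnson's bottom-up scheme with single-linkage distances:
    every point starts in its own cluster, then the closest pair of
    clusters is repeatedly merged until target_k clusters remain."""

    def dist2(a, b):  # squared Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def cluster_dist(c1, c2):  # single linkage: closest pair of members
        return min(dist2(a, b) for a in c1 for b in c2)

    # step 1: every item starts in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # step 2: find the closest (most similar) pair of clusters...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge them into a single cluster (steps 3-4 = repeat loop)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2)]
result = agglomerative(data, target_k=2)
```

Johnson's step 4 runs until a single cluster of size N remains; stopping at a target count (or recording every merge, which yields the dendrogram) is the usual practical variant.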
63. Hierarchical Clustering
There are a variety of different approaches to Step 3
(3) Compute distances (similarities) between the new
cluster and each of the old clusters.
single-linkage clustering
complete-linkage clustering
average-linkage clustering
centroid linkage clustering
(see pages 253-258 of Flach)
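The first three variants differ only in how they aggregate the pairwise distances between two clusters; centroid linkage instead measures the distance between the cluster means. A sketch with a toy 1-D distance:

```python
def single_link(d, c1, c2):
    """Cluster distance = their closest pair of members."""
    return min(d(a, b) for a in c1 for b in c2)

def complete_link(d, c1, c2):
    """Cluster distance = their farthest pair of members."""
    return max(d(a, b) for a in c1 for b in c2)

def average_link(d, c1, c2):
    """Cluster distance = mean over all cross-cluster pairs."""
    return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

dist = lambda a, b: abs(a - b)  # 1-D distance, for illustration only
c1, c2 = [0.0, 1.0], [4.0, 6.0]
print(single_link(dist, c1, c2))    # 3.0  (1 vs 4)
print(complete_link(dist, c1, c2))  # 6.0  (0 vs 6)
print(average_link(dist, c1, c2))   # 4.5  ((4 + 6 + 3 + 5) / 4)
```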
79. Legal Analytics
Class 9 - K-Means & Hierarchical Clustering
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com

michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com

more content available at legalanalyticscourse.com