Text Clustering
(Part-2)
1
What is clustering?
Text Clustering 2
[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized]
 Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups.
 Partition unlabeled examples into disjoint subsets of clusters, such that:
• Examples within a cluster are very similar
• Examples in different clusters are very different
 Discover new categories in an unsupervised manner (no sample category labels
provided).
• Grouping of text documents into meaningful clusters in an unsupervised
manner.
• Cluster Hypothesis : Relevant documents tend to be more similar to each
other than to non-relevant ones.
Text Clustering 3
Document Clustering
Government
Science
Arts
Document Clustering
Motivation
• Automatically group related documents based on their contents
• No predetermined training sets or taxonomies
• Generate a taxonomy at runtime
Clustering Process
• Data preprocessing:
– remove stop words, stem, feature extraction, lexical analysis, etc.
• Hierarchical clustering:
– compute similarities applying clustering algorithms.
• Model-Based clustering (Neural Network Approach):
– clusters are represented by “exemplars”. (e.g.: SOM)
Text Clustering 4
Document Clustering
Clustering of documents based on the similarity of their content improves the
search effectiveness in terms of:
1. Improving Search Recall
When a query matches a document its whole cluster can be returned.
2. Improving Search Precision
Grouping the documents into a much smaller number of groups of related
documents
3. Scatter/Gather
Enhance the efficiency of human browsing of a document collection when
a specific search query cannot be formulated.
4. Query-Specific Clustering
The most related documents will appear in the small tight clusters, nested
inside bigger clusters containing less similar documents.
Text Clustering 5
Text Clustering 6
Text Clustering
Overview
Standard (Text) Clustering Methods
 Bisecting k-means
 Agglomerative Hierarchical Clustering
Specialised Text Clustering Methods
 Suffix Tree Clustering
 Frequent-Termset-Based Clustering
Joint Cluster Analysis
 Attribute data: text content
 Relationship data: hyperlinks
Text Clustering 7
Text Clustering
Bisecting k-means [Steinbach, Karypis & Kumar 2000]
K-means
Bisecting k-means
 Partition the database into 2 clusters
 Repeat: partition the largest cluster into 2 clusters . . .
 Until k clusters have been discovered
[Figure: 2-D example of bisecting k-means; the largest cluster is repeatedly split into two until k clusters remain]
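A minimal sketch of this bisecting loop, assuming the documents are already vectorised (e.g. as tf-idf rows of a NumPy array X) and using scikit-learn's KMeans for the 2-way splits; the function name and parameters are illustrative, not from the original paper:

    # Sketch of bisecting k-means: repeatedly split the largest cluster with 2-means.
    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, k, random_state=0):
        clusters = [np.arange(X.shape[0])]          # start with one cluster: all row indices
        while len(clusters) < k:
            largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            idx = clusters.pop(largest)              # take the largest cluster ...
            labels = KMeans(n_clusters=2, n_init=10,
                            random_state=random_state).fit_predict(X[idx])
            clusters.append(idx[labels == 0])        # ... and replace it by its two halves
            clusters.append(idx[labels == 1])
        return clusters                              # list of k index arrays

    # Toy usage: three well-separated 2-D blobs come back as three clusters of ~20 points
    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4, np.random.randn(20, 2) + 8])
    print([len(c) for c in bisecting_kmeans(X, k=3)])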
Text Clustering 8
Text Clustering
Bisecting k-means
Two types of clusterings
• Hierarchical clustering
• Flat clustering: any cut of this hierarchy
Distance Function
• Cosine measure (similarity measure)
s(c1, c2) = ⟨ g(c1), g(c2) ⟩ / ( |g(c1)| · |g(c2)| ), where g(c) denotes the centroid of cluster c
Text Clustering 9
Text Clustering
Agglomerative and Hierarchical Clustering
1. Form initial clusters consisting of a singleton object, and compute
the distance between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects:
Stop, otherwise go to step 2.
Representation of a Cluster C
rep(C) = ( Σ d∈C n(d, t) ) t∈V
(the component-wise sum of the term-frequency vectors of the documents in C, where n(d, t) is the frequency of term t in document d and V is the vocabulary)
Text Clustering 10
Text Clustering
Experimental Comparison
[Steinbach, Karypis & Kumar 2000]
Clustering Quality
 Measured as entropy on a prelabeled test data set
 Using several text and web data sets
 Bisecting k-means outperforms k-means.
 Bisecting k-means outperforms agglomerative hierarchical
clustering.
Efficiency
 Bisecting k-means is much more efficient than agglomerative
hierarchical clustering.
 O(n) vs. O(n²)
Text Clustering 11
Text Clustering
Suffix Tree Clustering
[Zamir & Etzioni 1998]
Forming Clusters
Not by similar feature vectors
But by common terms
Strengths of Suffix Tree Clustering (STC)
Efficiency: runtime O(n) for n text documents
Overlapping clusters
Method
1. Identification of “ basic clusters“
2. Combination of basic clusters
Text Clustering 12
Text Clustering
Identification of Basic Clusters
• Basic Cluster: set of documents sharing one specific phrase
• Phrase: multi-word term
• Efficient identification of basic clusters using a suffix-tree
Insertion of (1) “ cat ate cheese“
cat ate cheese
cheese
ate cheese
1, 1 1, 2 1, 3
Insertion of (2) “ mouse ate cheese too“
cat ate cheese
cheese
ate cheese
too too
mouse ate cheese too
too
1, 1 1, 2
2, 2
1, 3
2, 3
2, 1
2, 4
Text Clustering 13
Text Clustering
Combination of Basic Clusters
• Basic clusters are highly overlapping
• Merge basic clusters having too much overlap
• Basic clusters graph: nodes represent basic clusters
Edge between A and B iff |A ∩ B| / |A| > 0.5 and |A ∩ B| / |B| > 0.5
• Composite cluster:
a component of the basic clusters graph
• Drawback of this approach:
Distant members of the same component need not be similar
No evaluation on standard test data
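To make the merging step concrete, here is a small illustrative sketch (not code from the Grouper system) that builds the basic-clusters graph with the 0.5 overlap rule above and returns its connected components as composite clusters; basic clusters are assumed to be given as Python sets of document ids:

    # Sketch: merge STC basic clusters by connecting clusters with > 0.5 mutual overlap
    # and taking connected components of the resulting graph.
    def merge_basic_clusters(basic_clusters):
        n = len(basic_clusters)
        adj = {i: set() for i in range(n)}
        for i in range(n):
            for j in range(i + 1, n):
                a, b = basic_clusters[i], basic_clusters[j]
                common = len(a & b)
                if common / len(a) > 0.5 and common / len(b) > 0.5:
                    adj[i].add(j)
                    adj[j].add(i)
        # connected components via depth-first search
        seen, components = set(), []
        for start in range(n):
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:
                node = stack.pop()
                if node in comp:
                    continue
                comp.add(node)
                stack.extend(adj[node] - comp)
            seen |= comp
            # a composite cluster is the union of the documents of its basic clusters
            components.append(set().union(*(basic_clusters[i] for i in comp)))
        return components

    # Example: the first two basic clusters overlap by more than half on both sides
    # and are merged; the third stays separate.
    print(merge_basic_clusters([{1, 2, 3}, {2, 3, 4}, {7, 8}]))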
Text Clustering 14
Text Clustering
Example from the Grouper System
Text Clustering 15
Text Clustering
Frequent-Term-Based Clustering
[Beil, Ester & Xu 2002]
• Frequent term set:
description of a cluster
• Set of documents containing
all terms of the frequent term
set: cluster
• Clustering: subset of set of all
frequent term sets covering
the DB with a low mutual
overlap
{} → {D1, . . ., D16}
{sun} → {D1, D2, D4, D5, D6, D8, D9, D10, D11, D13, D15}
{fun} → {D1, D3, D4, D6, D7, D8, D10, D11, D14, D15, D16}
{beach} → {D2, D7, D8, D9, D10, D12, D13, D14, D15}
{sun, fun} → {D1, D4, D6, D8, D10, D11, D15}
{sun, beach} → {D2, D8, D9, D10, D11, D15}
{fun, surf} → . . .
{beach, surf} → . . .
{sun, fun, surf} → {D1, D6, D10, D11}
{sun, beach, fun} → {D8, D10, D11, D15}
Text Clustering 16
Text Clustering
Method
• Task: Efficient calculation of the overlap of a given cluster
(description) Fi with the union of the other cluster
(description)s
• F: set of all frequent term sets in D
• fj: the number of all frequent term sets supported by document Dj, i.e.
fj = |{ Fi ∈ F : Fi ⊆ Dj }|
• Standard overlap of a cluster Ci:
SO(Ci) = ( Σ Dj∈Ci (fj - 1) ) / |Ci|
Text Clustering 17
Text Clustering
Algorithm FTC
FTC(database D, float minsup)
SelectedTermSets := {};
n := |D|;
RemainingTermSets := DetermineFrequentTermsets(D, minsup);
while |cov(SelectedTermSets)| ≠ n do
for each set in RemainingTermSets do
Calculate overlap for set;
BestCandidate := element of RemainingTermSets with minimum overlap;
SelectedTermSets := SelectedTermSets ∪ {BestCandidate};
RemainingTermSets := RemainingTermSets - {BestCandidate};
Remove all documents in cov(BestCandidate) from D and from the
coverage of all of the RemainingTermSets;
return SelectedTermSets;
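The same greedy loop as a runnable sketch, under the simplifying assumptions that documents are given as sets of terms and that the frequent term sets have already been mined; the helper names are mine, not from the paper, and this version also stops early if the term sets run out:

    # Sketch of FTC's greedy loop: repeatedly pick the remaining frequent term set
    # whose cover has minimum standard overlap with the documents chosen so far.
    def cover(term_set, docs):
        return {j for j, d in docs.items() if term_set <= d}

    def ftc_select(docs, frequent_term_sets):
        docs = dict(docs)                       # doc_id -> set of terms (copy; shrinks below)
        remaining = list(frequent_term_sets)
        selected = []
        while docs and remaining:
            # f_j = number of remaining frequent term sets supported by document j
            f = {j: sum(1 for ts in remaining if ts <= d) for j, d in docs.items()}
            def overlap(ts):
                cov = cover(ts, docs)
                return sum(f[j] - 1 for j in cov) / len(cov) if cov else float("inf")
            best = min(remaining, key=overlap)
            selected.append(best)
            remaining.remove(best)
            for j in cover(best, docs):         # remove the covered documents
                del docs[j]
        return selected

    docs = {1: {"sun", "fun"}, 2: {"sun", "beach"}, 3: {"beach", "surf"}, 4: {"fun", "surf"}}
    # picks {fun, surf} first (lowest overlap), then {sun}, then {beach}
    print(ftc_select(docs, [frozenset(s) for s in ({"sun"}, {"beach"}, {"fun", "surf"})]))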
Text Clustering 18
Text Clustering
Joint Cluster Analysis [Ester, Ge, Gao et al 2006]
Attribute data: intrinsic properties of entities
Relationship data: extrinsic properties of entities
Existing clustering algorithms use either attribute or relationship data
Often: attribute and relationship data somewhat related, but contain
complementary information
⇒ Joint cluster analysis of attribute and relationship data
Edges = relationships
2D location = attributes
Informative
Graph
Text Clustering 19
Text Clustering
The Connected k-Center Problem
Given:
• Dataset D as informative graph
• k - # of clusters
Connectivity constraint:
Clusters need to be connected graph components
Objective function:
maximum (attribute) distance (radius) between nodes and cluster center
Task:
Find k centers (nodes) and a corresponding partitioning of D into k clusters
such that
• each clusters satisfies the connectivity constraint and
• the objective function is minimized
Text Clustering 20
Text Clustering
The Hardness of the Connected k-Center Problem
• Given k centers and a radius threshold r, finding the optimal assignment for the
remaining nodes is NP-hard
• Because of articulation nodes (e.g., a)
Assignment of a implies
the assignment of b
Assignment of a to C1 minimizes the radius
[Figure: clusters C1 and C2 with articulation node a adjacent to node b]
Text Clustering 21
Text Clustering
Example: CS Publications
1073: Kempe, Dobra, Gehrke: “Gossip-based computation of aggregate information”, FOCS'03
1483: Fagin, Kleinberg, Raghavan: “Query strategies for priced information”, STOC'02
True cluster labels based on conference
Text Clustering 22
Text Clustering
Application for Web Pages
• Attribute data
text content
• Relationship data
hyperlinks
• Connectivity constraint
clusters must be connected
• New challenges
clusters can be overlapping
not all web pages must belong to a cluster (noise)
. . .
The General Clustering Problem
A clustering task may include the following components
(Jain, Murty, and Flynn 1999):
• Problem representation, including feature extraction, selection, or both,
• Definition of proximity measure suitable to the domain,
• Actual clustering of objects,
• Data abstraction, and
• Evaluation.
Text Clustering 23
The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or
the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
• New document is assigned to the most likely category based on vector
similarity.
Text Clustering 24
Graphic Representation
Text Clustering 25
[Figure: D1, D2 and Q plotted as vectors in the 3-D term space spanned by T1, T2 and T3]
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of
similarity? Distance? Angle?
Projection?
Document Collection
• A collection of n documents can be represented in the vector space model by
a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document;
zero means the term has no significance in the document or it simply doesn’t
exist in the document.
Text Clustering 26
     T1    T2    ....  Tt
D1   w11   w21   ...   wt1
D2   w12   w22   ...   wt2
:    :     :           :
Dn   w1n   w2n   ...   wtn
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more
indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency
of the most common term in the document:
tfij = fij / maxi{fij}
Text Clustering 27
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of
overall topic
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
Text Clustering 28
TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
wij = tfij idfi = tfij log2 (N/ dfi)
• A term occurring frequently in the document but rarely in the rest of the
collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.
Text Clustering 29
Computing TF-IDF -- An Example
• Given a document containing terms with given frequencies:
A(3), B(2), C(1)
• Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
Text Clustering 30
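The numbers above can be reproduced in a few lines, assuming the max-tf normalisation and the log2-based idf defined on the previous slides:

    # Reproduce the tf-idf example: raw term frequencies, max-tf normalisation, log2 idf.
    from math import log2

    N = 10_000                                   # documents in the collection
    freq = {"A": 3, "B": 2, "C": 1}              # term frequencies in this document
    df = {"A": 50, "B": 1300, "C": 250}          # document frequencies in the collection

    max_f = max(freq.values())
    for term in freq:
        tf = freq[term] / max_f
        idf = log2(N / df[term])
        print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
    # A 1.0 7.6 7.6   B 0.67 2.9 2.0   C 0.33 5.3 1.8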
Query Vector
• Query vector is typically treated as a document and also tf-idf
weighted.
• Alternative is for the user to supply weights for the given query
terms.
Text Clustering 31
Similarity Measure
• A similarity measure is a function that computes the degree of similarity
between two vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed
relevance.
– It is possible to enforce a certain threshold so that the size of the
retrieved set can be controlled.
Text Clustering 32
Similarity Measure - Inner Product
• Similarity between vectors for the document di and query q can be computed
as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σ i=1..t (wij · wiq)
Where,
wij is the weight of term i in document j and
wiq is the weight of term i in the query
• For binary vectors, the inner product is the number of matched query terms in
the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the
matched terms.
Text Clustering 33
Properties of Inner Product
Text Clustering 34
• The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not
matched.
Text Clustering 35
Inner Product -- Examples
Binary:
• D = 1, 1, 1, 0, 1, 1, 0
• Q = 1, 0 , 1, 0, 0, 1, 1
sim(D, Q) = 3
Size of vector = size of vocabulary = 7
0 means corresponding term not found in
document or query
Weighted:
D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
Cosine Similarity Measure
Text Clustering 36
• Cosine similarity measures the cosine of the angle
between two vectors.
• Inner product normalized by the vector lengths.
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
2
t3
t1
t2
D1
D2
Q
1
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
CosSim(dj, q) = (dj • q) / (|dj| · |q|) = ( Σ i=1..t wij · wiq ) / √( ( Σ i=1..t wij² ) · ( Σ i=1..t wiq² ) )
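A short sketch reproducing both measures for the example vectors D1, D2 and Q (plain Python lists, no IR library assumed):

    # Inner product vs. cosine similarity on the running example.
    from math import sqrt

    def inner(d, q):
        return sum(wd * wq for wd, wq in zip(d, q))

    def cos_sim(d, q):
        return inner(d, q) / (sqrt(inner(d, d)) * sqrt(inner(q, q)))

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(inner(D1, Q), inner(D2, Q))                            # 10 2
    print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))    # 0.81 0.13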
Naïve Implementation
• Convert all documents in collection D to tf-idf weighted vectors, dj, for
keyword vocabulary V.
• Convert query to a tf-idf-weighted vector q.
• For each dj in D do
Compute score sj = cosSim(dj, q)
• Sort documents by decreasing score.
• Present top ranked documents to the user.
Time complexity: O(|V|·|D|) Bad for large V & D !
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
Text Clustering 37
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious weaknesses.
• Allows efficient implementation for large document collections.
Text Clustering 38
Problems with Vector Space Model
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure, word order, proximity
information).
• Assumption of term independence (e.g. ignores synonymy).
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a
document).
– Given a two-term query “A B”, may prefer a document containing A
frequently but not B, over a document that contains both A and B, but
both less frequently.
Text Clustering 39
Clustering Algorithms
Problem variant
• A flat (or partitional) clustering produces a single partition of a set of
objects into disjoint groups.
• A hierarchical clustering results in a nested series of partitions.
Distance-based Clustering Algorithms
• Agglomerative and Hierarchical Clustering Algorithms
• Distance-based Partitioning Algorithms (k-Means)
• A Hybrid Approach: The Scatter-Gather Method (Buckshot,
Fractionation)
Text Clustering 40
Hard/Soft Clustering
Hard Clustering
• Every object may belong to exactly one cluster.
Soft Clustering
• The membership is fuzzy - Objects may belong to several clusters with a
fractional degree of membership in each.
Text Clustering 41
Clustering Algorithms
The most commonly used algorithms are
 The K-means (hard, flat, shuffling),
 The EM-based mixture resolving (soft, flat, probabilistic), and
 The HAC (hierarchical, agglomerative).
Text Clustering 42
Clustering Algorithms
Document hierarchical clustering
• Bottom-up, agglomerative
• Top-down, divisive
Document partitioning (flat clustering)
• K-means
• Probabilistic clustering using the Naïve Bayes or Gaussian mixture
model, etc.
Document clustering based on graph model
Text Clustering 43
K-Means Algorithm
Partitions a collection of vectors {x1, x2, . . . xn} into a set of clusters {C1, C2, . . . Ck}.
The algorithm needs k cluster seeds for initialization. The algorithm proceeds as
follows:
Initialization:
• k seeds, either given or selected randomly, form the core of k clusters.
• Every other vector is assigned to the cluster of the closest seed.
Iteration:
• The centroids Mi of the current clusters are computed:
• Each vector is reassigned to the cluster with the closest centroid.
Stopping condition:
• At convergence – when no more changes occur.
• The K-means algorithm maximizes the clustering quality function Q:
Text Clustering 44
Centroid of cluster Ci: Mi = (1 / |Ci|) · Σ x∈Ci x
Quality function: Q(C1, C2, …, Ck) = Σ i=1..k Σ x∈Ci Sim(x, Mi)
K-Means
Works when we know k, the number of clusters
Idea:
• Randomly pick k points as the “centroids” of the k clusters
• Loop:
– For each point, add it to the cluster with the nearest centroid
– Recompute the cluster centroids
– Repeat loop (until no change)
Text Clustering 45
Iterative improvement of the objective function:
Sum of the squared distance from each point to the centroid of its cluster.
Finding the global optimum is NP-hard.
The k-means algorithm is guaranteed to converge to a local optimum.
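A compact sketch of this loop for plain numeric vectors (Euclidean distance, centroid = mean); it assumes the initial centroids are supplied, as in the 1-D example on the next slide:

    # Minimal k-means: assign each point to the nearest centroid, recompute centroids,
    # repeat until the assignment no longer changes.
    import numpy as np

    def kmeans(X, centroids, max_iter=100):
        centroids = np.asarray(centroids, dtype=float)
        labels = None
        for _ in range(max_iter):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                              # converged: no reassignment
            labels = new_labels
            centroids = np.array([X[labels == i].mean(axis=0) for i in range(len(centroids))])
        return labels, centroids

    # 1-D example from the next slide: objects 1, 2, 5, 6, 7 with seeds 5 and 6
    X = np.array([[1.], [2.], [5.], [6.], [7.]])
    labels, centroids = kmeans(X, [[5.], [6.]])
    print(labels, centroids.ravel())               # [0 0 1 1 1] [1.5 6. ]

The printed labels and centroids match the worked 1-D example that follows.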
K-means Example
For simplicity, 1-dimension objects and k=2.
• Numerical difference is used as the distance
Objects: 1, 2, 5, 6,7
K-means:
Randomly select 5 and 6 as centroids;
=> Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5
=> {1,2}, {5,6,7}; meanC1=1.5, meanC2=6
=> no change.
Aggregate dissimilarity
– (sum of squared distances of each point of each cluster from its cluster
center, i.e. the intra-cluster distance)
= |1-1.5|² + |2-1.5|² + |5-6|² + |6-6|² + |7-6|² = 0.5² + 0.5² + 1² + 0² + 1² = 2.5
Text Clustering 46
Text Clustering 47
K Means Example (K=2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, recompute centroids, reassign clusters, converged!]
Time Complexity
• Assume computing distance between two instances is O(m) where m is the
dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: Each instance vector gets added once to some
centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors; assuming a fixed number of iterations, it is more
efficient than O(n²) HAC.
Text Clustering 48
Problems with K-means
Need to know k in advance
• Could try out several k?
– Cluster tightness increases with increasing K.
– Look for a kink in the tightness vs. K curve
Tends to go to local minima that are sensitive to the starting centroids
• Try out multiple starting points
Disjoint and exhaustive
• Doesn’t have a notion of “outliers”
– Outlier problem can be handled by K-medoid or neighborhood-based
algorithms
Assumes clusters are spherical in vector space
• Sensitive to coordinate changes, weighting etc.
Text Clustering 49
EM-based Probabilistic Clustering Algorithm
Mixture-resolving Algorithms - Underlying Assumption:
• The objects to be clustered are drawn from k distributions, and
• The goal is to identify the parameters of each that would allow the
calculation of the probability P(Ci | x) of the given object’s belonging to
the cluster Ci
The Expectation Maximization (EM)
• A general purpose framework for estimating the parameters of distribution
in the presence of hidden variables in observable data.
• Probabilistic method for soft clustering.
• Direct method that assumes k clusters {c1, c2, …, ck}; a soft version of k-means.
• For text, typically assume a naïve-Bayes category model.
– Parameters θ = {P(ci), P(wj | ci): i ∈ {1,…,k}, j ∈ {1,…,|V|}}
Text Clustering 50
EM Algorithm
Initialization:
The initial parameters of k distributions are selected either randomly or
externally.
Iteration:
E-Step: Compute the P(Ci |x) for all objects x by using the current
parameters of the distributions. Re-label all objects according to the
computed probabilities.
M-Step: Re-estimate the parameters of the distributions to maximize the
likelihood of the objects’ assuming their current labeling.
Stopping condition:
At convergence - when the change in log-likelihood after each iteration
becomes small.
Text Clustering 51
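The E-step/M-step structure can be illustrated with a tiny mixture-resolving example. The slides assume a naive-Bayes model for text; the sketch below instead uses a two-component 1-D Gaussian mixture purely to keep the code short, so the parameter updates are the Gaussian ones, not the naive-Bayes ones:

    # Sketch of the E/M loop for a two-component 1-D Gaussian mixture (soft clustering).
    import numpy as np

    def gaussian(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def em_gmm(x, k=2, iters=50):
        mu = np.linspace(x.min(), x.max(), k)      # simple deterministic initialisation
        var = np.full(k, x.var())
        pi = np.full(k, 1.0 / k)
        for _ in range(iters):
            # E-step: responsibilities P(C_i | x_j) under the current parameters
            resp = np.array([pi[i] * gaussian(x, mu[i], var[i]) for i in range(k)])
            resp /= resp.sum(axis=0)
            # M-step: re-estimate the parameters to maximise the likelihood
            nk = resp.sum(axis=1)
            mu = (resp * x).sum(axis=1) / nk
            var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
            pi = nk / len(x)
        return mu, resp

    x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.7])
    mu, resp = em_gmm(x)
    print(np.round(mu, 2))        # component means end up near 1.0 and 5.0
    print(np.round(resp, 2))      # soft membership of each point in each component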
EM
Text Clustering 52
[Figure: a set of unlabeled examples]
Initialize: Assign random probabilistic labels to the unlabeled data.
EM
Text Clustering 53
Initialize: Give the soft-labeled training data to a probabilistic learner.
EM
Text Clustering 54
Initialize: The probabilistic learner produces a probabilistic classifier.
EM
Text Clustering 55
E step: Relabel the unlabeled data using the trained classifier.
EM
Text Clustering 56
M step: Retrain the classifier on the relabeled data.
Continue EM iterations until the probabilistic labels on the unlabeled data converge.
Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two
instances.
• Starts with all instances in a separate cluster and then repeatedly joins the
two clusters that are most similar until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.
HAC Algorithm
Initialization:
Every object is put into a separate cluster.
Iteration:
Find the pair of most similar clusters and merge them.
Stopping condition:
When everything is merged into a single cluster.
Text Clustering 57
Cluster Similarity
Assume a similarity function that determines the similarity of two instances:
sim(x,y).
• Cosine similarity of document vectors.
How to compute similarity of two clusters each possibly containing multiple
instances?
• Single Link: Similarity of two most similar members.
• Complete Link: Similarity of two least similar members.
• Group Average: Average similarity between members.
Text Clustering 58
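A naive sketch of the whole HAC loop over a precomputed similarity matrix, with the three cluster-similarity options above; note that the 'average' option here averages the pairs between the two clusters (a common simplification of the merged-cluster average defined a few slides later), and the O(n³) implementation is for illustration only:

    # Naive HAC over a precomputed similarity matrix (e.g. pairwise cosine similarities).
    # linkage: 'single' = most similar pair, 'complete' = least similar pair, 'average'.
    import numpy as np

    def hac(sim, linkage="single", n_clusters=1):
        clusters = [[i] for i in range(len(sim))]
        agg = {"single": max, "complete": min, "average": np.mean}[linkage]
        merges = []
        while len(clusters) > n_clusters:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    s = agg([sim[i][j] for i in clusters[a] for j in clusters[b]])
                    if best is None or s > best[0]:
                        best = (s, a, b)
            s, a, b = best
            merges.append((clusters[a], clusters[b], s))     # record the merge history
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return clusters, merges

    # Toy similarity matrix for 4 documents (1.0 on the diagonal).
    sim = np.array([[1.0, 0.9, 0.2, 0.1],
                    [0.9, 1.0, 0.3, 0.2],
                    [0.2, 0.3, 1.0, 0.8],
                    [0.1, 0.2, 0.8, 1.0]])
    print(hac(sim, linkage="complete", n_clusters=2)[0])     # [[0, 1], [2, 3]]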
Clustering functions - Merging criteria
Text Clustering 59
Extending the distance measure from samples to sets of samples
Similarity of 2 most similar
members
Similarity of 2 least similar
members
Average similarity between
members
Single Link Agglomerative Clustering
Use maximum similarity of pairs:
Can result in “straggly” (long and thin) clusters due to chaining effect.
• Appropriate in some domains, such as clustering islands.
Text Clustering 60
sim(ci, cj) = max { sim(x, y) : x ∈ ci, y ∈ cj }
Text Clustering 61
Single Link Example
Complete Link Agglomerative Clustering
Use minimum similarity of pairs:
Makes more “tight,” spherical clusters that are typically preferable.
Text Clustering 62
sim(ci, cj) = min { sim(x, y) : x ∈ ci, y ∈ cj }
Complete Link Example
Text Clustering 63
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all
pairs of n individual instances, which is O(n²).
• In each of the subsequent n-2 merging iterations, it must compute the
distance between the most recently created cluster and all other existing
clusters.
• In order to maintain an overall O(n²) performance, computing similarity to
each other cluster must be done in constant time.
Text Clustering 64
Computing Cluster Similarity
After merging ci and cj, the similarity of the resulting cluster to any
other cluster, ck, can be computed by:
• Single Link:
• Complete Link:
Text Clustering 65
Single link: sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )
Complete link: sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )
Group Average Agglomerative Clustering
• Use average similarity across all pairs within the merged cluster to
measure the similarity of two clusters.
• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster instead of
unordered pairs between the two clusters (to encourage tighter final
clusters).
Text Clustering 66
sim(ci, cj) = ( 1 / ( |ci ∪ cj| · (|ci ∪ cj| - 1) ) ) · Σ x∈(ci ∪ cj) Σ y∈(ci ∪ cj), y≠x sim(x, y)
Other Clustering Algorithms
Minimal Spanning Tree (MST)
• Single-link clusters are sub-graphs of the MST, which are also the
connected components
• Complete-link clusters are the maximal complete sub-graphs
The Nearest Neighbor Clustering
• Assigns each object to the cluster of its nearest labeled neighbor object
provided the similarity to that neighbor is sufficiently high.
• The process continues until all objects are labeled.
The Buckshot Algorithm
• Combines HAC and K-Means clustering.
• First randomly take a sample of instances of size √n
• Run group-average HAC on this sample, which takes only O(n) time.
• Use the results of HAC as initial seeds for K-means.
• Overall algorithm is O(n) and avoids problems of bad seed selection.
Text Clustering 67
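A sketch of Buckshot with scikit-learn building blocks (AgglomerativeClustering for the group-average HAC phase, its cluster means as seeds for KMeans); the sample-size choice and the toy data are illustrative assumptions:

    # Buckshot sketch: HAC on a random sample of about sqrt(n) points,
    # then use the resulting cluster means as seeds for k-means on the full data.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    def buckshot(X, k, random_state=0):
        rng = np.random.default_rng(random_state)
        sample_size = max(k, int(np.sqrt(len(X))))
        sample = X[rng.choice(len(X), size=sample_size, replace=False)]
        hac_labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(sample)
        seeds = np.array([sample[hac_labels == i].mean(axis=0) for i in range(k)])
        return KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X)

    # Toy usage: two well-separated 2-D blobs are recovered, roughly [50 50]
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    print(np.bincount(buckshot(X, k=2)))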
Word and Phrase-based Clustering
• Since text documents come from a high-dimensional domain, important clusters of words
may be found and utilized for finding clusters of documents.
Example:
In a corpus containing d terms and n documents, view a term-
document matrix as an n × d matrix, in which the (i, j)th entry is the
frequency of the jth term in the ith document.
• The problem of clustering rows in this matrix is clustering documents,
clustering columns in this matrix is clustering words.
• The most general technique for simultaneous word and document clustering
is referred to as co-clustering.
• The problem of word clustering is related to dimensionality reduction,
whereas document clustering is related to traditional clustering
• Probabilistic framework which determines word clusters and document
clusters simultaneously is referred to as topic modeling
Text Clustering 68
Word and Phrase-based Clustering
Approaches
• Clustering with Frequent Word Patterns
• Leveraging Word Clusters for Document Clusters
• Co-clustering Words and Documents
• Clustering with Frequent Phrases
Text Clustering 69
Clustering with Frequent Word Patterns
• A frequent item set in the context of text data is referred to as a frequent
term set.
• The clustering is defined on the basis of the overlaps between the
supporting documents of the different frequent term sets.
Idea:
Cluster the low dimensional frequent term sets as cluster candidates.
Flat Clustering:
The documents covered by the selected frequent term are removed from the
database, and the overlap in the next iteration is computed with respect to
the remaining documents.
Hierarchical Clustering:
Each level of the clustering is applied to a set of term sets containing a fixed
number k of terms.
Text Clustering 70
Leveraging Word Clusters for Document Clusters
A two phase clustering procedure, to perform document clustering:
First Phase
• Determine word-clusters from the documents in such a way that most of
mutual information between words and documents is preserved when
representing the documents in terms of word clusters rather than words.
Second Phase
• Use the condensed representation of the documents in terms of word-
clusters in order to perform the final document clustering.
• Specifically, replace the word occurrences in documents with word-cluster
occurrences in order to perform the document clustering.
Advantage of two-phase procedure
• Significant reduction in the noise in the representation.
Text Clustering 71
Co-clustering Words and Documents
• Co-clustering is defined as a pair of maps from rows to row-cluster indices
and columns to column-cluster indices.
• These maps are determined simultaneously by the algorithm in order to
optimize the corresponding cluster representations
• Matrix Factorization Approach
 Discovers word clusters and document clusters simultaneously.
 Transform knowledge from word space to document space in the
context of document clustering techniques.
• Co-clustering is a form of local feature selection, in which the features
selected are specific to each cluster
• Methods for document co-clustering
 Co-clustering with graph partitioning
 Information-Theoretic Co-clustering
Text Clustering 72
Clustering with Frequent Phrases
Phrase cluster
• A phrase cluster is a phrase that is shared by at least two documents, and
the group of documents that contain the phrase.
• Each document is treated as a string of words, rather than characters.
• It uses an indexing method in order to organize the phrases in the
document collection, and then uses this organization to create the clusters.
Steps to create the clusters
Step 1:
• perform the cleaning of the strings representing the documents.
• A light stemming algorithm is used by deleting word prefixes and suffixes
and reducing plural to singular.
• Sentence boundaries are marked and non-word tokens are stripped.
Text Clustering 73
Clustering with Frequent Phrases
Step 2:
• Identification of base clusters.
• These are defined by the frequent phases in the collection which are
represented in the form of a suffix tree.
Step 3:
• An important characteristic of the base clusters created by the suffix tree is
that they do not define a strict partitioning and have overlaps with one
another
• The algorithm merges the clusters based on the similarity of their
underlying document sets.
Let P and Q be the document sets corresponding to two clusters. The base
similarity BS(P,Q) is defined as follows:
Text Clustering 74
BS(P, Q) = 1 if |P ∩ Q| / max{ |P|, |Q| } > 0.5, and 0 otherwise
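The rule can be written directly as a tiny helper (illustrative only, not code from the original system):

    # Base similarity between two base clusters, given their document-id sets.
    def base_similarity(p, q):
        return 1 if len(p & q) / max(len(p), len(q)) > 0.5 else 0

    print(base_similarity({1, 2, 3, 4}, {2, 3, 4}))   # 1: overlap 3 / max(4, 3) = 0.75
    print(base_similarity({1, 2, 3, 4}, {4, 5, 6}))   # 0: overlap 1 / max(4, 3) = 0.25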
Benefits of Word Clustering
Useful Semantic word clustering
• Automatically generates a “Thesaurus”
Higher classification accuracy
Smaller classification models
• size reductions as dramatic as 50000 → 50
Text Clustering 75
Semi-supervised clustering: problem definition
Input:
• A set of unlabeled objects, each described by a set of attributes (numeric
and/or categorical)
• A small amount of domain knowledge
Output:
• A partitioning of the objects into k clusters (possibly with some discarded
as outliers)
Objective:
• Maximum intra-cluster similarity
• Minimum inter-cluster similarity
• High consistency between the partitioning and the domain knowledge
Text Clustering 76
Why semi-supervised clustering?
Why not clustering?
• The clusters produced may not be the ones required.
• Sometimes there are multiple possible groupings.
Why not classification?
• Sometimes there are insufficient labeled data.
Potential applications
• Bioinformatics (gene and protein clustering)
• Document hierarchy construction
• News/email categorization
• Image categorization
Text Clustering 77
Semi-Supervised Clustering
Domain knowledge
• Partial label information is given
• Apply some constraints (must-links and cannot-links)
Approaches
• Search-based Semi-Supervised Clustering
– Alter the clustering algorithm using the constraints
• Similarity-based Semi-Supervised Clustering
– Alter the similarity measure based on the constraints
• Combination of both
Text Clustering 78
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
• Modifying the objective function to give a reward for obeying labels on
the supervised data [Demeriz:ANNIE99].
• Enforcing constraints (must-link, cannot-link) on the labeled data during
clustering [Wagstaff:ICML00, Wagstaff:ICML01].
• Use the labeled data to initialize clusters in an iterative refinement
algorithm (k-Means,) [Basu:ICML02].
Text Clustering 79
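As a concrete illustration of the third option, a seeded-k-means sketch in the spirit of [Basu:ICML02]: labeled examples only fix the initial centroids, after which ordinary k-means runs over all data; the details below are a simplified illustration, not the published algorithm:

    # Seeded k-means: initialise each cluster centroid from the labeled examples of
    # that class, then run ordinary k-means over labeled + unlabeled data together.
    import numpy as np
    from sklearn.cluster import KMeans

    def seeded_kmeans(X_labeled, y_labeled, X_unlabeled, k):
        seeds = np.array([X_labeled[y_labeled == c].mean(axis=0) for c in range(k)])
        X = np.vstack([X_labeled, X_unlabeled])
        return KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X)

    # Toy usage: two labeled points per class, unlabeled points near class 1
    X_lab = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
    y_lab = np.array([0, 0, 1, 1])
    X_unl = np.random.randn(20, 2) + [5, 5]
    print(seeded_kmeans(X_lab, y_lab, X_unl, k=2))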
Similarity-based semi-supervised clustering
• Alter the similarity measure based on the constraints
• Two types of constraints: Must-links and Cannot-links
• Clustering algorithm: Hierarchical clustering
Text Clustering 80
Transfer Learning
Text Clustering 81
It is motivated by human learning. People can often transfer knowledge
learnt previously to novel situations
• Chess → Checkers
• Mathematics → Computer Science
• Table Tennis → Tennis
TL addresses the fundamental problem of different distributions between the
training and testing data.
Transfer Learning (TL):
The ability of a system to recognize and apply knowledge and skills
learned in previous tasks to novel tasks (in new domains)
Traditional ML vs. TL
Text Clustering 82
Traditional ML in
multiple domains
Transfer of learning
across domains
[Figure: training items and test items in each domain under the two settings]
Humans can learn in many domains. Humans can also transfer from one
domain to other domains.
Traditional ML vs. TL
Text Clustering 83
Learning Process of
Traditional ML
Learning Process of
Transfer Learning
[Figure: traditional ML learns a separate system per task from its own training items; transfer learning also passes knowledge from the source learning system to the target learning system]
Why Transfer Learning?
• In some domains, labeled data are in short supply.
• In some domains, the calibration effort is very expensive.
• In some domains, the learning process is time consuming
Text Clustering 84
Transfer learning techniques may help!
 How to extract knowledge learnt from related domains to help learning
in a target domain with a few labeled data?
 How to extract knowledge learnt from related domains to speed up
learning in a target domain?
Settings of Transfer Learning
Text Clustering 85
Transfer learning settings:
• Inductive Transfer Learning: labeled data in the source domain × or √, labeled data in the target domain √; tasks: classification, regression, …
• Transductive Transfer Learning: labeled data in the source domain √, labeled data in the target domain ×; tasks: classification, regression, …
• Unsupervised Transfer Learning: labeled data in the source domain ×, labeled data in the target domain ×; tasks: clustering, …
Approaches to Transfer Learning
Text Clustering 86
Transfer learning approaches:
• Instance-transfer: re-weight some labeled data in a source domain for use in the target domain
• Feature-representation-transfer: find a “good” feature representation that reduces the difference between a source and a target domain or minimizes the error of models
• Model-transfer: discover shared parameters or priors of models between a source domain and a target domain
• Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain
Approaches to Transfer Learning
Text Clustering 87
Which approaches are used in which settings:
• Instance-transfer: inductive, transductive
• Feature-representation-transfer: inductive, transductive, unsupervised
• Model-transfer: inductive
• Relational-knowledge-transfer: inductive
Unsupervised Transfer Learning
Feature-representation-transfer Approaches
Self-taught Clustering (STC)
[Dai et al. ICML-08]
Text Clustering 88
Input: A lot of unlabeled data in a source domain and a few unlabeled data in
a target domain.
Goal: Clustering the target domain data.
Assumption: The source domain and target domain data share some common
features, which can help clustering in the target domain.
Main Idea: To extend the information theoretic co-clustering algorithm
[Dhillon et al. KDD-03] for transfer learning.

Más contenido relacionado

La actualidad más candente

An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
butest
 

La actualidad más candente (20)

Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Text mining
Text miningText mining
Text mining
 
NLP
NLPNLP
NLP
 
NLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit DistanceNLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit Distance
 
Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 

Destacado

Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
KU Leuven
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
KU Leuven
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 

Destacado (6)

Text data mining1
Text data mining1Text data mining1
Text data mining1
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
 

Similar a Text clustering

Query expansion_Team42_IRE2k14
Query expansion_Team42_IRE2k14Query expansion_Team42_IRE2k14
Query expansion_Team42_IRE2k14
sudhir11292rt
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspective
ankurpandeyinfo
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
James Wong
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
Young Alista
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
Fraboni Ec
 

Similar a Text clustering (20)

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Cluster
ClusterCluster
Cluster
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Introduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIntroduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering Ensemble
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Query expansion_Team42_IRE2k14
Query expansion_Team42_IRE2k14Query expansion_Team42_IRE2k14
Query expansion_Team42_IRE2k14
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspective
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Data clustring
Data clustring Data clustring
Data clustring
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
 
Graph classification problem.pptx
Graph classification problem.pptxGraph classification problem.pptx
Graph classification problem.pptx
 
Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
 

Text clustering

  • 2. What is clustering? Text Clustering 2 Inter-cluster distances are maximized Intra-cluster distances are minimized  Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.  Partition unlabeled examples into disjoint subsets of clusters, such that: • Examples within a cluster are very similar • Examples in different clusters are very different  Discover new categories in an unsupervised manner (no sample category labels provided).
  • 3. • Grouping of text documents into meaningful clusters in an unsupervised manner. • Cluster Hypothesis : Relevant documents tend to be more similar to each other than to non-relevant ones. Text Clustering 3 Document Clustering Government Science Arts
  • 4. Document Clustering Motivation • Automatically group related documents based on their contents • No predetermined training sets or taxonomies • Generate a taxonomy at runtime Clustering Process • Data preprocessing: – remove stop words, stem, feature extraction, lexical analysis, etc. • Hierarchical clustering: – compute similarities applying clustering algorithms. • Model-Based clustering (Neural Network Approach): – clusters are represented by “exemplars”. (e.g.: SOM) Text Clustering 4
  • 5. Document Clustering Clustering of documents based on the similarity of their content improve the search effectiveness in terms of: 1. Improving Search Recall When a query matches a document its whole cluster can be returned. 2. Improving Search Precision Grouping the documents into a much smaller number of groups of related documents 3. Scatter/Gather Enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated. 4. Query-Specific Clustering The most related documents will appear in the small tight clusters, nested inside bigger clusters containing less similar documents. Text Clustering 5
  • 6. Text Clustering 6 Text Clustering Overview Standard (Text) Clustering Methods  Bisecting k-means  Agglomerative Hierarchical Clustering Specialised Text Clustering Methods  Suffix Tree Clustering  Frequent-Termset-Based Clustering Joint Cluster Analysis  Attribute data: text content  Relationship data: hyperlinks
  • 7. Text Clustering 7 Text Clustering Bisecting k-means [Steinbach, Karypis & Kumar 2000] K-means Bisecting k-means  Partition the database into 2 clusters  Repeat: partition the largest cluster into 2 clusters . . .  Until k clusters have been discovered 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 8. Text Clustering 8 Text Clustering Bisecting k-means Two types of clusterings • Hierarchical clustering • Flat clustering: any cut of this hierarchy Distance Function • Cosine measure (similarity measure) ))(())(( ))(()),(( ),(    cgcg cgcg s  
  • 9. Text Clustering 9 Text Clustering Agglomerative and Hierarchical Clustering 1. Form initial clusters consisting of a singleton object, and compute the distance between each pair of clusters. 2. Merge the two clusters having minimum distance. 3. Calculate the distance between the new cluster and all other clusters. 4. If there is only one cluster containing all objects: Stop, otherwise go to step 2. Representation of a Cluster C Vt i Cd i tdnCrep          ),()(
  • 10. Text Clustering 10 Text Clustering Experimental Comparison [Steinbach, Karypis & Kumar 2000] Clustering Quality  Measured as entropy on a prelabeled test data set  Using several text and web data sets  Bisecting k-means outperforms k-means.  Bisecting k-means outperforms agglomerative hierarchical clustering. Efficiency  Bisecting k-means is much more efficient than agglomerative hierarchical clustering.  O(n) vs. O(n2)
  • 11. Text Clustering 11 Text Clustering Suffix Tree Clustering [Zamir & Etzioni 1998] Forming Clusters Not by similar feature vectors But by common terms Strengths of Suffix Tree Clustering (STC) Efficiency: runtime O(n) for n text documents Overlapping clusters Method 1. Identification of “ basic clusters“ 2. Combination of basic clusters
  • 12. Text Clustering 12 Text Clustering Identification of Basic Clusters • Basic Cluster: set of documents sharing one specific phrase • Phrase: multi-word term • Efficient identification of basic clusters using a suffix-tree Insertion of (1) “ cat ate cheese“ cat ate cheese cheese ate cheese 1, 1 1, 2 1, 3 Insertion of (2) “ mouse ate cheese too“ cat ate cheese cheese ate cheese too too mouse ate cheese too too 1, 1 1, 2 2, 2 1, 3 2, 3 2, 1 2, 4
  • 13. Text Clustering 13 Text Clustering Combination of Basic Clusters • Basic clusters are highly overlapping • Merge basic clusters having too much overlap • Basic clusters graph: nodes represent basic clusters Edge between A and B iff |A  B| / |A| > 0,5 and |A  B| / |B| > 0,5 • Composite cluster: a component of the basic clusters graph • Drawback of this approach: Distant members of the same component need not be similar No evaluation on standard test data
  • 14. Text Clustering 14 Text Clustering Example from the Grouper System
  • 15. Text Clustering 15 Text Clustering Frequent-Term-Based Clustering [Beil, Ester & Xu 2002] • Frequent term set: description of a cluster • Set of documents containing all terms of the frequent term set: cluster • Clustering: subset of set of all frequent term sets covering the DB with a low mutual overlap {} {D1, . . ., D16} {sun} {fun} {beach} {D1, D2, D4, D5, {D1, D3, D4, D6, {D2, D7, D8, D6, D8, D9, D10, D7, D8, D10, D11, D9, D10, D12, D11, D13, D15} D14, D15, D16} D13,D14, D15} {sun, fun} {sun, beach} {fun, surf} {beach, surf} {D1, D4, D6, D8, {D2, D8, D9, . . . . . . D10, D11, D15} D10, D11, D15} {sun, fun, surf} {sun, beach, fun} {D1, D6, D10, D11} {D8, D10, D11, D15}
  • 16. Text Clustering 16 Text Clustering Method • Task: Efficient calculation of the overlap of a given cluster (description) Fi with the union of the other cluster (description)s • F: set of all frequent term sets in D • f j : the number of all frequent term sets supported by document Dj • Standard overlap of a cluster Ci: |}|{| jiij DFFFf  || )1( )( i CiDj j i C f CSO   
  • 17. Text Clustering 17 Text Clustering Algorithm FTC FTC(database D, float minsup) SelectedTermSets:= {}; n:= |D|; RemainingTermSets:= DetermineFrequentTermsets(D, minsup); while |cov(SelectedTermSets)| ≠ n do for each set in RemainingTermSets do Calculate overlap for set; BestCandidate:=element of Remaining TermSets with minimum overlap; SelectedTermSets:= SelectedTermSets  {BestCandidate}; RemainingTermSets:= RemainingTermSets - {BestCandidate}; Remove all documents in cov(BestCandidate) from D and from the coverage of all of the RemainingTermSets; return SelectedTermSets;
  • 18. Text Clustering 18 Text Clustering Joint Cluster Analysis [Ester, Ge, Gao et al 2006] Attribute data: intrinsic properties of entities Relationship data: extrinsic properties of entities Existing clustering algorithms use either attribute or relationship data Often: attribute and relationship data somewhat related, but contain complementary information  Joint cluster analysis of attribute and relationship data Edges = relationships 2D location = attributes Informative Graph
  • 19. Text Clustering 19 Text Clustering The Connected k-Center Problem Given: • Dataset D as informative graph • k - # of clusters Connectivity constraint: Clusters need to be connected graph components Objective function: maximum (attribute) distance (radius) between nodes and cluster center Task: Find k centers (nodes) and a corresponding partitioning of D into k clusters such that • each clusters satisfies the connectivity constraint and • the objective function is minimized
  • 20. Text Clustering 20 Text Clustering The Hardness of the Connected k-Center Problem • Given k centers and a radius threshold r, finding the optimal assignment for the remaining nodes is NP-hard • Because of articulation nodes (e.g., a) Assignment of a implies the assignment of b Assignment of a to C1 minimizes the radiusa b C1 C2
  • 21. Text Clustering 21 Text Clustering Example: CS Publications 1073: Kempe, Dobra, Gehrke: “Gossip-based computation of aggregate information”, FOCS'03 1483: Fagin, Kleinberg, Raghavan: “Query strategies for priced information”, STOC'02 True cluster labels based on conference
  • 22. Text Clustering 22 Text Clustering Application for Web Pages • Attribute data text content • Relationship data hyperlinks • Connectivity constraint clusters must be connected • New challenges clusters can be overlapping not all web pages must belong to a cluster (noise) . . .
  • 23. The General Clustering Problem A clustering task may include the following components (Jain, Murty, and Flynn 1999): • Problem representation, including feature extraction, selection, or both, • Definition of proximity measure suitable to the domain, • Actual clustering of objects, • Data abstraction, and • Evaluation. Text Clustering 23
  • 24. The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term, i, in a document or query, j, is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj) • New document is assigned to the most likely category based on vector similarity. Text Clustering 24
  • 25. Graphic Representation Text Clustering 25 T3 T1 T2 D1 = 2T1+ 3T2 + 5T3 D2 = 3T1 + 7T2 + T3 Q = 0T1 + 0T2 + 2T3 7 32 5 Example: D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + T3 Q = 0T1 + 0T2 + 2T3 • Is D1 or D2 more similar to Q? • How to measure the degree of similarity? Distance? Angle? Projection?
  • 26. Document Collection • A collection of n documents can be represented in the vector space model by a term-document matrix. • An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document. Text Clustering 26 T1 T2 …. Tt D1 w11 w21 … wt1 D2 w12 w22 … wt2 : : : : : : : : Dn w1n w2n … T1 T2 …. Tt D1 w11 w21 … wt1 D2 w12 w22 … wt2 : : : : : : : : Dn w1n w2n … wtn
  • 27. Term Weights: Term Frequency • More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document: tfij = fij / maxi{fij} Text Clustering 27
  • 28. Term Weights: Inverse Document Frequency • Terms that appear in many different documents are less indicative of overall topic df i = document frequency of term i = number of documents containing term i idfi = inverse document frequency of term i, = log2 (N/ df i) (N: total number of documents) • An indication of a term’s discrimination power. • Log used to dampen the effect relative to tf. Text Clustering 28
  • 29. TF-IDF Weighting • A typical combined term importance indicator is tf-idf weighting: wij = tfij idfi = tfij log2 (N/ dfi) • A term occurring frequently in the document but rarely in the rest of the collection is given high weight. • Many other ways of determining term weights have been proposed. • Experimentally, tf-idf has been found to work well. Text Clustering 29
  • 30. Computing TF-IDF -- An Example • Given a document containing terms with given frequencies: A(3), B(2), C(1) • Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250) Then: A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6 B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0 C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8 Text Clustering 30
  • 31. Query Vector • Query vector is typically treated as a document and also tf-idf weighted. • Alternative is for the user to supply weights for the given query terms. Text Clustering 31
  • 32. Similarity Measure • A similarity measure is a function that computes the degree of similarity between two vectors. • Using a similarity measure between the query and each document: – It is possible to rank the retrieved documents in the order of presumed relevance. – It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled. Text Clustering 32
  • 33. Similarity Measure - Inner Product • Similarity between vectors for the document di and query q can be computed as the vector inner product (a.k.a. dot product): sim(dj,q) = dj•q = Where, wij is the weight of term i in document j and wiq is the weight of term i in the query • For binary vectors, the inner product is the number of matched query terms in the document (size of intersection). • For weighted term vectors, it is the sum of the products of the weights of the matched terms. Text Clustering 33 iq t i ij ww1
  • 34. Properties of Inner Product Text Clustering 34 • The inner product is unbounded. • Favors long documents with a large number of unique terms. • Measures how many terms matched but not how many terms are not matched.
  • 35. Text Clustering 3535 Inner Product -- Examples Binary: • D = 1, 1, 1, 0, 1, 1, 0 • Q = 1, 0 , 1, 0, 0, 1, 1 sim(D, Q) = 3 Size of vector = size of vocabulary = 7 0 means corresponding term not found in document or query Weighted: D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3 Q = 0T1 + 0T2 + 2T3 sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10 sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
  • 36. Cosine Similarity Measure Text Clustering 36 • Cosine similarity measures the cosine of the angle between two vectors. • Inner product normalized by the vector lengths. D1 = 2T1 + 3T2 + 5T3 CosSim(D1 , Q) = 10 / (4+9+25)(0+0+4) = 0.81 D2 = 3T1 + 7T2 + 1T3 CosSim(D2 , Q) = 2 / (9+49+1)(0+0+4) = 0.13 Q = 0T1 + 0T2 + 2T3 2 t3 t1 t2 D1 D2 Q 1 D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.            t i t i t i ww ww qd qd iqij iqij j j 1 1 22 1 )(   CosSim(dj, q) =
  • 37. Naïve Implementation • Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V. • Convert query to a tf-idf-weighted vector q. • For each dj in D do Compute score sj = cosSim(dj, q) • Sort documents by decreasing score. • Present top ranked documents to the user. Time complexity: O(|V|·|D|) Bad for large V & D ! |V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000 Text Clustering 37
  • 38. Comments on Vector Space Models • Simple, mathematically based approach. • Considers both local (tf) and global (idf) word occurrence frequencies. • Provides partial matching and ranked results. • Tends to work quite well in practice despite obvious weaknesses. • Allows efficient implementation for large document collections. Text Clustering 38
• 39. Problems with Vector Space Model • Missing semantic information (e.g. word sense). • Missing syntactic information (e.g. phrase structure, word order, proximity information). • Assumption of term independence (e.g. ignores synonymy). • Lacks the control of a Boolean model (e.g., requiring a term to appear in a document). – Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently. Text Clustering 39
• 40. Clustering Algorithms Problem variant • A flat (or partitional) clustering produces a single partition of a set of objects into disjoint groups. • A hierarchical clustering results in a nested series of partitions. Distance-based Clustering Algorithms • Agglomerative and Hierarchical Clustering Algorithms • Distance-based Partitioning Algorithms (k-Means) • A Hybrid Approach: The Scatter-Gather Method (Buckshot, Fractionation) Text Clustering 40
• 41. Hard/Soft Clustering Hard Clustering • Every object belongs to exactly one cluster. Soft Clustering • The membership is fuzzy - objects may belong to several clusters with a fractional degree of membership in each. Text Clustering 41
  • 42. Clustering Algorithms The most commonly used algorithms are  The K-means (hard, flat, shuffling),  The EM-based mixture resolving (soft, flat, probabilistic), and  The HAC (hierarchical, agglomerative). Text Clustering 42
  • 43. Clustering Algorithms Document hierarchical clustering • Bottom-up, agglomerative • Top-down, divisive Document partitioning (flat clustering) • K-means • Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc. Document clustering based on graph model Text Clustering 43
• 44. K-Means Algorithm Partitions a collection of vectors {x1, x2, . . ., xn} into a set of clusters {C1, C2, . . ., Ck}. The algorithm needs k cluster seeds for initialization. The algorithm proceeds as follows: Initialization: • k seeds, either given or selected randomly, form the core of k clusters. • Every other vector is assigned to the cluster of the closest seed. Iteration: • The centroids M_i of the current clusters are computed: M_i = (1/|C_i|) Σ_{x∈C_i} x • Each vector is reassigned to the cluster with the closest centroid. Stopping condition: • At convergence – when no more changes occur. • The K-means algorithm maximizes the clustering quality function Q: Q(C_1, C_2, ..., C_k) = Σ_{i=1}^{k} Σ_{x∈C_i} Sim(x, M_i) Text Clustering 44
• 45. K-Means Works when we know k, the number of clusters Idea: • Randomly pick k points as the “centroids” of the k clusters • Loop: – Assign each point to the cluster with the nearest centroid – Recompute the cluster centroids – Repeat loop (until no change) Text Clustering 45 Iterative improvement of the objective function: sum of the squared distance from each point to the centroid of its cluster. Finding the global optimum is NP-hard. The k-means algorithm is guaranteed to converge to a local optimum.
• 46. K-means Example For simplicity, 1-dimensional objects and k=2. • Numerical difference is used as the distance Objects: 1, 2, 5, 6, 7 K-means: Randomly select 5 and 6 as centroids; => Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5 => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6 => no change. Aggregate dissimilarity (sum of squared distances of each point from its cluster center, i.e. the intra-cluster distance) = |1-1.5|^2 + |2-1.5|^2 + |5-6|^2 + |6-6|^2 + |7-6|^2 = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5 Text Clustering 46
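A minimal 1-dimensional k-means sketch that reproduces the trace above (seeds 5 and 6, absolute difference as the distance; empty clusters are not handled, since this is only an illustration):

    def kmeans_1d(points, seeds, iterations=10):
        centroids = list(seeds)
        for _ in range(iterations):
            # assignment step: each point goes to the cluster with the nearest centroid
            clusters = [[] for _ in centroids]
            for p in points:
                nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
                clusters[nearest].append(p)
            # update step: recompute each centroid as the mean of its cluster
            new_centroids = [sum(c) / len(c) for c in clusters]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return clusters, centroids

    clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], seeds=[5, 6])
    print(clusters)     # [[1, 2], [5, 6, 7]]
    print(centroids)    # [1.5, 6.0]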
• 47. K-Means Example (K=2) (Figure: pick seeds → assign points to clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.) Text Clustering 47
• 48. Time Complexity • Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors. • Reassigning clusters: O(kn) distance computations, or O(knm). • Computing centroids: each instance vector gets added once to some centroid: O(nm). • Assume these two steps are each done once for I iterations: O(Iknm). • Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n^2) HAC. Text Clustering 48
  • 49. Problems with K-means Need to know k in advance • Could try out several k? – Cluster tightness increases with increasing K. – Look for a kink in the tightness vs. K curve Tends to go to local minima that are sensitive to the starting centroids • Try out multiple starting points Disjoint and exhaustive • Doesn’t have a notion of “outliers” – Outlier problem can be handled by K-medoid or neighborhood-based algorithms Assumes clusters are spherical in vector space • Sensitive to coordinate changes, weighting etc. Text Clustering 49
• 50. EM-based Probabilistic Clustering Algorithm Mixture-resolving Algorithms - Underlying Assumption: • The objects to be clustered are drawn from k distributions, and • The goal is to identify the parameters of each distribution that would allow the calculation of the probability P(Ci | x) of a given object's belonging to the cluster Ci The Expectation Maximization (EM) • A general-purpose framework for estimating the parameters of distributions in the presence of hidden variables in observable data. • Probabilistic method for soft clustering. • Direct method that assumes k clusters {c1, c2, …, ck}; a soft version of k-means. • For text, typically assume a naïve-Bayes category model. – Parameters θ = {P(ci), P(wj | ci): i∈{1,…,k}, j∈{1,…,|V|}} Text Clustering 50
  • 51. EM Algorithm Initialization: The initial parameters of k distributions are selected either randomly or externally. Iteration: E-Step: Compute the P(Ci |x) for all objects x by using the current parameters of the distributions. Re-label all objects according to the computed probabilities. M-Step: Re-estimate the parameters of the distributions to maximize the likelihood of the objects’ assuming their current labeling. Stopping condition: At convergence - when the change in log-likelihood after each iteration becomes small. Text Clustering 51
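To make the E and M steps concrete, here is a toy sketch that, instead of the naïve-Bayes text model, uses a 1-dimensional Gaussian mixture with equal priors and a shared fixed variance (an assumption made only for brevity), run on the same 1-D points as the k-means example:

    import math

    def em_1d(points, means, variance=1.0, iterations=20):
        # Soft clustering with a 1-D Gaussian mixture; a stripped-down illustration only.
        k = len(means)
        for _ in range(iterations):
            # E-step: compute P(C_i | x) for every point from the current parameters
            resp = []
            for x in points:
                likes = [math.exp(-(x - m) ** 2 / (2 * variance)) for m in means]
                total = sum(likes)
                resp.append([l / total for l in likes])
            # M-step: re-estimate the means to maximize the likelihood under the soft labels
            means = [sum(r[i] * x for r, x in zip(resp, points)) / sum(r[i] for r in resp)
                     for i in range(k)]
        return means, resp

    means, resp = em_1d([1, 2, 5, 6, 7], means=[5.0, 6.0])
    print([round(m, 2) for m in means])   # approximately [1.5, 6.0] for this toy data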
• 52. EM Initialize: Assign random probabilistic labels to the unlabeled examples. Text Clustering 52
• 53. EM Initialize: Give the soft-labeled training data to a probabilistic learner. Text Clustering 53
• 54. EM Initialize: The learner produces a probabilistic classifier. Text Clustering 54
• 55. EM E step: Relabel the unlabeled data using the trained classifier. Text Clustering 55
• 56. EM M step: Retrain the classifier on the relabeled data; continue EM iterations until the probabilistic labels on the unlabeled data converge. Text Clustering 56
  • 57. Hierarchical Agglomerative Clustering (HAC) • Assumes a similarity function for determining the similarity of two instances. • Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. • The history of merging forms a binary tree or hierarchy. HAC Algorithm Initialization: Every object is put into a separate cluster. Iteration: Find the pair of most similar clusters and merge them. Stopping condition: When everything is merged into a single cluster. Text Clustering 57
  • 58. Cluster Similarity Assume a similarity function that determines the similarity of two instances: sim(x,y). • Cosine similarity of document vectors. How to compute similarity of two clusters each possibly containing multiple instances? • Single Link: Similarity of two most similar members. • Complete Link: Similarity of two least similar members. • Group Average: Average similarity between members. Text Clustering 58
• 59. Clustering functions - Merging criteria Extending the distance measure from samples to sets of samples: • Single link: similarity of the 2 most similar members • Complete link: similarity of the 2 least similar members • Group average: average similarity between members Text Clustering 59
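A naive HAC sketch with a pluggable merging criterion; sim is assumed to be a pairwise similarity such as the cosine measure from the earlier slides, the recomputation of all pair similarities is for clarity only (not the constant-time update discussed below), and the group-average variant here averages between-cluster pairs rather than all pairs within the merged cluster:

    def hac(items, sim, linkage=max):
        # linkage=max -> single link, linkage=min -> complete link,
        # linkage=lambda sims: sum(sims) / len(sims) -> group average (between-cluster pairs)
        clusters = [[x] for x in items]               # every object starts in its own cluster
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = linkage([sim(x, y) for x in clusters[i] for y in clusters[j]])
                    if best is None or s > best[0]:
                        best = (s, i, j)
            _, i, j = best
            clusters[i] = clusters[i] + clusters[j]   # merge the most similar pair of clusters
            del clusters[j]
            # recording each merge here would yield the binary tree / hierarchy
        return clusters[0]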
• 60. Single Link Agglomerative Clustering Use maximum similarity of pairs: sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y) Can result in “straggly” (long and thin) clusters due to the chaining effect. • Appropriate in some domains, such as clustering islands. Text Clustering 60
• 62. Complete Link Agglomerative Clustering Use minimum similarity of pairs: sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y) Makes more “tight,” spherical clusters that are typically preferable. Text Clustering 62
  • 63. Complete Link Example Text Clustering 63
• 64. Computational Complexity • In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2). • In each of the subsequent n-2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters. • In order to maintain an overall O(n^2) performance, computing similarity to each other cluster must be done in constant time. Text Clustering 64
• 65. Computing Cluster Similarity After merging c_i and c_j, the similarity of the resulting cluster to any other cluster, c_k, can be computed by: • Single Link: sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k)) • Complete Link: sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k)) Text Clustering 65
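Assuming the pairwise cluster similarities are cached (here in a dictionary keyed by unordered cluster-id pairs, an illustrative data layout rather than a prescribed one), the constant-time update after a merge might look like this sketch:

    def update_similarities(sim_cache, merged_id, i, j, other_ids, linkage='single'):
        # sim_cache maps frozenset({a, b}) -> similarity between clusters a and b
        combine = max if linkage == 'single' else min      # complete link uses min
        for k in other_ids:
            sim_cache[frozenset({merged_id, k})] = combine(
                sim_cache[frozenset({i, k})], sim_cache[frozenset({j, k})])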
• 66. Group Average Agglomerative Clustering • Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • Compromise between single and complete link. • Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters). sim(c_i, c_j) = 1 / (|c_i ∪ c_j| (|c_i ∪ c_j| - 1)) Σ_{x∈(c_i ∪ c_j)} Σ_{y∈(c_i ∪ c_j): y≠x} sim(x, y) Text Clustering 66
• 67. Other Clustering Algorithms Minimal Spanning Tree (MST) • Single-link clusters are sub-graphs of the MST, which are also the connected components • Complete-link clusters are the maximal complete sub-graphs The Nearest Neighbor Clustering • Assigns each object to the cluster of its nearest labeled neighbor object provided the similarity to that neighbor is sufficiently high. • The process continues until all objects are labeled. The Buckshot Algorithm • Combines HAC and K-Means clustering. • First randomly take a sample of instances of size √n • Run group-average HAC on this sample, which takes only O(n) time. • Use the results of HAC as initial seeds for K-means. • Overall algorithm is O(n) and avoids problems of bad seed selection. Text Clustering 67
• 68. Word and Phrase-based Clustering • Since text documents come from a high-dimensional domain, important clusters of words may be found and utilized for finding clusters of documents. Example: In a corpus containing d terms and n documents, view the term-document matrix as an n × d matrix, in which the (i, j)th entry is the frequency of the jth term in the ith document. • Clustering the rows of this matrix is clustering documents; clustering the columns of this matrix is clustering words. • The most general technique for simultaneous word and document clustering is referred to as co-clustering. • The problem of word clustering is related to dimensionality reduction, whereas document clustering is related to traditional clustering. • A probabilistic framework which determines word clusters and document clusters simultaneously is referred to as topic modeling. Text Clustering 68
  • 69. Word and Phrase-based Clustering Approaches • Clustering with Frequent Word Patterns • Leveraging Word Clusters for Document Clusters • Co-clustering Words and Documents • Clustering with Frequent Phrases Text Clustering 69
• 70. Clustering with Frequent Word Patterns • A frequent item set in the context of text data is referred to as a frequent term set. • Clustering with frequent term sets is defined on the basis of the overlaps between the supporting documents of the different frequent term sets. Idea: Use the low-dimensional frequent term sets as cluster candidates. Flat Clustering: The documents covered by the selected frequent term set are removed from the database, and the overlap in the next iteration is computed with respect to the remaining documents. Hierarchical Clustering: Each level of the clustering is applied to a set of term sets containing a fixed number k of terms. Text Clustering 70
• 71. Leveraging Word Clusters for Document Clusters A two-phase clustering procedure to perform document clustering: First Phase • Determine word-clusters from the documents in such a way that most of the mutual information between words and documents is preserved when representing the documents in terms of word clusters rather than words. Second Phase • Use the condensed representation of the documents in terms of word-clusters in order to perform the final document clustering. • Specifically, replace the word occurrences in documents with word-cluster occurrences in order to perform the document clustering. Advantage of the two-phase procedure • Significant reduction in the noise in the representation. Text Clustering 71
  • 72. Co-clustering Words and Documents • Co-clustering is defined as a pair of maps from rows to row-cluster indices and columns to column-cluster indices. • These maps are determined simultaneously by the algorithm in order to optimize the corresponding cluster representations • Matrix Factorization Approach  Discovers word clusters and document clusters simultaneously.  Transform knowledge from word space to document space in the context of document clustering techniques. • Co-clustering is a form of local feature selection, in which the features selected are specific to each cluster • Methods for document co-clustering  Co-clustering with graph partitioning  Information-Theoretic Co-clustering Text Clustering 72
  • 73. Clustering with Frequent Phrases Phrase cluster • A phrase cluster is a phrase that is shared by at least two documents, and the group of documents that contain the phrase. • Each document is treated as a string of words, rather than characters. • It uses an indexing method in order to organize the phrases in the document collection, and then uses this organization to create the clusters. Steps to create the clusters Step 1: • perform the cleaning of the strings representing the documents. • A light stemming algorithm is used by deleting word prefixes and suffixes and reducing plural to singular. • Sentence boundaries are marked and non-word tokens are stripped. Text Clustering 73
• 74. Clustering with Frequent Phrases Step 2: • Identification of base clusters. • These are defined by the frequent phrases in the collection, which are represented in the form of a suffix tree. Step 3: • An important characteristic of the base clusters created by the suffix tree is that they do not define a strict partitioning and have overlaps with one another • The algorithm merges the clusters based on the similarity of their underlying document sets. Let P and Q be the document sets corresponding to two clusters. The base similarity BS(P, Q) is defined as follows: BS(P, Q) = ⌊ |P ∩ Q| / max{|P|, |Q|} + 0.5 ⌋ Text Clustering 74
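A small sketch of the base-similarity test above on plain Python sets; the document IDs are arbitrary placeholders:

    import math

    def base_similarity(p, q):
        # 1 when the two base clusters share at least half of the larger document set, else 0
        overlap = len(p & q) / max(len(p), len(q))
        return math.floor(overlap + 0.5)

    P = {'d1', 'd2', 'd3', 'd4'}
    Q = {'d2', 'd3', 'd4', 'd5', 'd6'}
    print(base_similarity(P, Q))    # overlap = 3/5 = 0.6 -> 1, so these base clusters get merged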
• 75. Benefits of Word Clustering Useful semantic word clustering • Automatically generates a “Thesaurus” Higher classification accuracy Smaller classification models • size reductions as dramatic as 50000 → 50 Text Clustering 75
  • 76. Semi-supervised clustering: problem definition Input: • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) • A small amount of domain knowledge Output: • A partitioning of the objects into k clusters (possibly with some discarded as outliers) Objective: • Maximum intra-cluster similarity • Minimum inter-cluster similarity • High consistency between the partitioning and the domain knowledge Text Clustering 76
  • 77. Why semi-supervised clustering? Why not clustering? • The clusters produced may not be the ones required. • Sometimes there are multiple possible groupings. Why not classification? • Sometimes there are insufficient labeled data. Potential applications • Bioinformatics (gene and protein clustering) • Document hierarchy construction • News/email categorization • Image categorization Text Clustering 77
  • 78. Semi-Supervised Clustering Domain knowledge • Partial label information is given • Apply some constraints (must-links and cannot-links) Approaches • Search-based Semi-Supervised Clustering – Alter the clustering algorithm using the constraints • Similarity-based Semi-Supervised Clustering – Alter the similarity measure based on the constraints • Combination of both Text Clustering 78
• 79. Search-Based Semi-Supervised Clustering Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Using the labeled data to initialize clusters in an iterative refinement algorithm (k-Means) [Basu:ICML02]. Text Clustering 79
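A sketch of how must-link / cannot-link constraints can be checked during the assignment step of an iterative algorithm, in the spirit of the constraint-enforcing approach cited above; the function name and data layout are illustrative, not taken from the cited papers:

    def violates_constraints(point, cluster_id, assignment, must_link, cannot_link):
        # assignment: point -> cluster_id for the points assigned so far;
        # must_link / cannot_link: sets of frozenset({a, b}) pairs of point ids
        for other, cid in assignment.items():
            pair = frozenset({point, other})
            if pair in must_link and cid != cluster_id:
                return True      # a must-link partner sits in a different cluster
            if pair in cannot_link and cid == cluster_id:
                return True      # a cannot-link partner already sits in this cluster
        return False

    # During assignment, a point is placed in the nearest cluster for which this returns False.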
  • 80. Similarity-based semi-supervised clustering • Alter the similarity measure based on the constraints • Two types of constraints: Must-links and Cannot-links • Clustering algorithm: Hierarchical clustering Text Clustering 80
• 81. Transfer Learning Text Clustering 81 It is motivated by human learning. People can often transfer knowledge learnt previously to novel situations:  Chess → Checkers  Mathematics → Computer Science  Table Tennis → Tennis TL solves the fundamental problem of different distributions between the training and testing data. Transfer Learning (TL): The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks (in new domains)
• 82. Traditional ML vs. TL Text Clustering 82 (Figure, two panels: "Traditional ML in multiple domains", where training items and test items stay within each domain, and "Transfer of learning across domains", where knowledge from one domain's training items is reused for another domain's test items.) Humans can learn in many domains. Humans can also transfer from one domain to other domains.
• 83. Traditional ML vs. TL Text Clustering 83 (Figure: "Learning Process of Traditional ML" shows each learning system trained only on its own training items; "Learning Process of Transfer Learning" shows knowledge passed from one learning system to another in addition to the training items.)
  • 84. Why Transfer Learning? • In some domains, labeled data are in short supply. • In some domains, the calibration effort is very expensive. • In some domains, the learning process is time consuming Text Clustering 84 Transfer learning techniques may help!  How to extract knowledge learnt from related domains to help learning in a target domain with a few labeled data?  How to extract knowledge learnt from related domains to speed up learning in a target domain?
• 85. Settings of Transfer Learning Text Clustering 85 Transfer learning settings (labeled data in a source domain / labeled data in a target domain / tasks): • Inductive Transfer Learning: source × or √ / target √ / Classification, Regression, … • Transductive Transfer Learning: source √ / target × / Classification, Regression, … • Unsupervised Transfer Learning: source × / target × / Clustering, …
• 86. Approaches to Transfer Learning Text Clustering 86 Transfer learning approaches and their descriptions: • Instance-transfer: re-weight some labeled data in a source domain for use in the target domain. • Feature-representation-transfer: find a “good” feature representation that reduces the difference between a source and a target domain or minimizes the error of models. • Model-transfer: discover shared parameters or priors of models between a source domain and a target domain. • Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain.
• 87. Approaches to Transfer Learning Text Clustering 87 Which approaches apply in which settings: • Instance-transfer: Inductive √, Transductive √ • Feature-representation-transfer: Inductive √, Transductive √, Unsupervised √ • Model-transfer: Inductive √ • Relational-knowledge-transfer: Inductive √
  • 88. Unsupervised Transfer Learning Feature-representation-transfer Approaches Self-taught Clustering (STC) [Dai et al. ICML-08] Text Clustering 88 Input: A lot of unlabeled data in a source domain and a few unlabeled data in a target domain. Goal: Clustering the target domain data. Assumption: The source domain and target domain data share some common features, which can help clustering in the target domain. Main Idea: To extend the information theoretic co-clustering algorithm [Dhillon et al. KDD-03] for transfer learning.