SlideShare una empresa de Scribd logo
1 de 47
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Introduction   Definitions  Clustering  Discussion Open Problems CONTENTS 
Prof. Dr. Benno Stein  Dr. Sven Meyer zu Eissen  Weimar University,  Germany Ideas,  Materials,  Collaboration - Structuring and Indexing - AI search  in IR  - Semi-Supervised Learning Dr. Xiaojin Zhu University of Wisconsin, Madison ,  USA
TEXTUAL DATA Subject of Grouping NON TEXTUAL DATA Local Terminology It is not important what is the source of data: textual or non textual. Data :  work in the space of numerical   parameters   Texts :  work in the space of  words   Example: typical dialog between  passenger (  US   ) and railway  directory inquires (  DI  )
TEXTS  (‘indexed’) Presentation of Textual Data  Vector model (‘parameterized’) Local Terminology Indexed   texts  are only parameterized texts in the  space of words
TEXTS  (‘indexed’) Presentation of Textual Data  TEXTS (‘parameterized’)  Local Terminology  Indexed   texts  are only parameterized texts in the  space of themes   = >  Category /Context  Vector Models... Example :  manually parameterized dialogs in the  space of parameters  (transport service and passenger needs)
Introduction   Definitions   Clustering  Discussion Open Problems CONTENTS 
Unsupervised Learning Types  of  Grouping Supervised Learning Semi-Supervised Learning We know nothing about data sructure We know well  data sructure We know something about data sructure
Clustering  Classification Characteristics:  Characteristics : Absence of  patterns  or  descriptions   Presence of  patterns  or  descriptiones of classes, so the results  are  of classes, so the results are  defined by the  nature   of  defined by the  user   (  N>=1  ) the data themselves  (  N >1  ) Synonyms:  Synonyms : Classification   without teacher   Classificatio n   with teacher Unsupervised   learning   Supervised   learning Number of clusters   Specials terms :   [  ] is known   exactly   Categorization   (of documents) [x] is known  approximately   Diagnostics   (technics, medicine) [  ]  is  not   known  =>  searching  Recognition  (technics, science) Types  of  Grouping
“ Semi Clustering/Classification”  Classification Characteristics:  Characteristics : Presence of  limited number   Presence of  patterns  or of patterns, so the results  are  descriptiones   of classes, so the results defined both by the  user  or  are defined by the  user   (  N>=1  ) by  the  data  themselves  (  N >1  ) Synonyms:  Synonyms : Semi-Classification   Classificatio n   with teacher Semi Supervised   learning   Supervised   learning Number of clusters/categories  Specials terms :   [  ] is known   exactly   Categorization   (of documents) [x] is known  approximately  Diagnostics   (technics, medicine) [  ]  is  not   known  =>  searching  Recognition  (technics, science) Types  of  Grouping
Objectives of Grouping 1. Organization  (structuring) of an object set  Process is named   data structuring   2. Searching interesting patterns  Process is named   navigation 3. Grouping for other applications: -  Knowledge   discovery (clustering) -  Summarization   of documents   Note:  Do not mix  the   type   of grouping  and its  objective
Classification of methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Based on data presentation Methods oriented on  free metric space Every object is presented as a point in a free space Methods oriented on graphs Every object is presented as an element on graph
Hard grouping   Hard   clustering Hard   categorization Soft grouping Soft   clustering Soft   categorization Example The distribution of letters of Moscovites to the Government is  soft categorization   (numbers in the table reflect the relative weight of each theme)  Fuzzy Grouping
Preprocessing  <= Processing General Scheme of Clustering Process - I Principal idea : To transform texts to  num erical  form in order to use  matematical   tools   Remember :   our  problem is grouping textual  documents but  not  undestanding Here: Both rude and good matrixes are matrix  Object/Attributes
General Scheme of Clustering Process - II Preprocessing  Processing  <= Here: matrix   Attribute/Attribute  can be used instead of matrix  Object/Object
Matrixes to be Considered
Clustering for Categorization ,[object Object],[object Object],Matrix contains the value of word co-occurrences in texts.  Red :  if value more than some threshold. White :  if less.
Clustering for Categorizatión ,[object Object],[object Object],Words are groupped. Cluster = >   Subdictionary Absence of blocks means absence  of  Subthemes
Importance of Preprocessing  ( it takes   60%-90% of efforts)
Introduction   Definitions  Clustering   Discussion Open Problems CONTENTS 
Definitions Def. 1  “Let us V be the set of objects. Clustering  C  = { Ci   |  Ci  є  V   }   of V  is  division  of V  on subsets,  for which we have :  U i Ci = V  and  Ci  ∩ Cj = 0  i  ≠j“ Def. 2  “Let us V be the set of nodes, E be arcs,  φ   is weight function that reflects the distance between objects, so we have a weighted graph  G  = { V,E,  φ  }.   In this case  C   is named as clustering of  G .” In the framework of the second definition every Ci  produced subgraph G(Ci).  Both subsets  Ci  and subgraphs  G(Ci)   are  named  clusters . Graph Set Clique
Definitions Principal note Both  definitions  SAYS NOTHING :  -  about quality of clusters  -  about  numbers of  clusters   Reason of difficulties Nowadays  there is no any general agreement   about any universal defintion of the term  ‘ cluster ’ What means that clustering  is good ? 1. Closeness between  objets  inside clusters  is  essentially more  than  the  closeness  between  clusters   themselves 2. Constructed  clusters correspond  to  intuitive presentations  of users  ( they are  natural   clusters)
Classification  of  methods ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Based on the way of grouping
Hierarchy based methods Neighbors. Every object is cluster ,[object Object],[object Object],[object Object],[object Object],[object Object]
Hierarchy  based  methods Nearest neighbor  method   (NN)
Exemplar based methods K - means,  centroid
Method K-means ,[object Object],[object Object],[object Object],[object Object],Exemplar based methods
Method X-means (Dan Pelleg, Andrew Moor) Approach Using evaluation of object distribution  Selection of the most  likely points Advantage - More rapid - Number of cluster  is not fixed (in all cases it tends to be less)   Exemplar based methods
Density based methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Suboptimal solution Only part  of neighbors are considered on every step  (to save time, to avoid mergence) .
Density based methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Preprocessing for MajorClust Many weak links  can be stronger than the several  strongest ones that disfigures  results.  So: weak links should be  eliminated before clustering
Cluster Validity ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dunn   index (to be  max )
Cluster Validity ,[object Object],[object Object],[object Object],[object Object],Dunn  index (to be  max ) is too sensible to extremal cases
Cluster Usability ,[object Object],[object Object],[object Object],[object Object],[object Object],Cluster  F -measure  (  Benno Stein  ) Data Expert Method Here:  i, j  are indexes of clusses and clusters C * i   , C j  are classes and clusters  prec(i,j), rec(i,j)  are precision and recall
Validity  and  Usability Conclusion Density expected measure  corresponds to  F -measure   reflecting expert ’s opinion. So,  DEM  can be an indicator of expert  opinion
Tecnologies of  Clustering ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Meta Methods Algorithm (example)  Notations: N   is  the number of objects in a given cluster D   is the diagonal of a given cluster  Initially  N 0   and their   centers   Ci  are   given Steps 1. Method  K - medoid (or any other one) is performed 2. If  N   >  N max  or  D  >  D max   (in any cluster), then this cluster is divided on 2 parts. Go to p.1 3. If  N  <  N min   or  D   <  D min  (in any cluster), then  this and the closest clusters are joined. Go to p.1 4. When the number of iteration  I  >  I max , Stop Otherwise go to p.1
Visual Clustering Clustering on dendrite  Clustering in space of factors
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Authorship References : Labbe C., Labbe D.  Inter-textual distance and authorship attribution Corneille and Molier.  Journ. of Quantitative Linguistics.  2001. Vol.8, N_3, pp.213-331
Clustering Authorship Results  1) 18 comedies of Molier  should be belonged to  Corneille   2) 15 comedies of Mollier  are weak connected with all his other works.  So, they can be written by  two authors   3) 2 comedies of Corneille now are considered as works of  Molier .  etc. Note : During a certain time Molier and Corneille were friends
Special and Universal packages with algorithms of  С lustering 1.  ClustAn  ( Scotland )  www. clustan.com   Clustan Graphics-7 (2006) 2.  MatLab  Descriptions are in Internet   3.  Statistica  Descriptions are in Internet   Learning Journals and  С ongresses about Clustering 1. Journal  “Journal of Classification”,   Springer   2. IFCS  -  International Federation of Classification Societies, Conferences 3. CSNA  -  Classification Society of North America, Seminars, Workshops
Introduction   Definitions  Clustering  Discussion Open Problems CONTENTS 
Certain Observations The numbers of methods for grouping data is a little bit more than the numbers of researchers working in this area.   Problem does not consist in searching the  best method  for all cases.  Problem consists in searching the  method being relevant  for your data.  Only you know what methods are the best for you own data . Principal problems consist in choice of indexes (parameters) and measure of closeness to be adecuate  to a given problem and given data  Frecuently the results are bad  because of the  bad indexes   and   bad measure   but not the  bad   method   !
Certain Observations Antipodal methods To be sure that results are really good and  do not depend on the method  used  one should test these results using any  antipodal  methods Solomon G, 1977: “The most antipodes are:   NN-method   and  K-means ” Sensibility To be sure that results  do not depend  essentially on the  method’s parameters   one should perform the analysis of sensibility by changing parameters  of  adjustment.
Introduction   Definitions  Clustering  Conclusions Open Problems CONTENTS 
Some Problems Question 1  How to reveal  alien   objects? Solution (idea) Revealing  a stable  structure on different sets of objects. They are subsets of a given set.   Object distribution reflects: real structure ( nature )  +  noise ( alien objects )
Some Problems Question 2  How to  accelerate  classification? Solution (idea) Filtering  objects, which give  a minimum contribution to  decisive function  Representative objects of each cluster
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Más contenido relacionado

La actualidad más candente

K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Syed Atif Naseem
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Simplilearn
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Marina Santini
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Simplilearn
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 

La actualidad más candente (20)

K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
PCA
PCAPCA
PCA
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods
 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
Neural network
Neural networkNeural network
Neural network
 
KNN
KNNKNN
KNN
 
K Nearest Neighbor Algorithm
K Nearest Neighbor AlgorithmK Nearest Neighbor Algorithm
K Nearest Neighbor Algorithm
 

Similar a Clustering

Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentssriharipatilin
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478IJRAT
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfVIKASGUPTA127897
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxAmy Aung
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptSamPrem3
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptPalaniKumarR2
 

Similar a Clustering (20)

Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year students
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdf
 
47 292-298
47 292-29847 292-298
47 292-298
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
My8clst
My8clstMy8clst
My8clst
 
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptx
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
ClusetrigBasic.ppt
ClusetrigBasic.pptClusetrigBasic.ppt
ClusetrigBasic.ppt
 

Más de NLPseminar

[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна ЛандоNLPseminar
 
клышинский
клышинскийклышинский
клышинскийNLPseminar
 
конф ии и ея гаврилова
конф ии и ея  гавриловаконф ии и ея  гаврилова
конф ии и ея гавриловаNLPseminar
 
кудрявцев V3
кудрявцев V3кудрявцев V3
кудрявцев V3NLPseminar
 
акинина осмоловская
акинина осмоловскаяакинина осмоловская
акинина осмоловскаяNLPseminar
 
потапов
потаповпотапов
потаповNLPseminar
 
molchanov(promt)
molchanov(promt)molchanov(promt)
molchanov(promt)NLPseminar
 
белканова
белкановабелканова
белкановаNLPseminar
 
гвоздикин
гвоздикингвоздикин
гвоздикинNLPseminar
 
веселов
веселоввеселов
веселовNLPseminar
 

Más de NLPseminar (20)

[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
 
Events
EventsEvents
Events
 
Tomita
TomitaTomita
Tomita
 
бетин
бетинбетин
бетин
 
Andreev
AndreevAndreev
Andreev
 
клышинский
клышинскийклышинский
клышинский
 
конф ии и ея гаврилова
конф ии и ея  гавриловаконф ии и ея  гаврилова
конф ии и ея гаврилова
 
кудрявцев V3
кудрявцев V3кудрявцев V3
кудрявцев V3
 
rubashkin
rubashkinrubashkin
rubashkin
 
Vlasova
VlasovaVlasova
Vlasova
 
Ageev
AgeevAgeev
Ageev
 
Khomitsevich
Khomitsevich Khomitsevich
Khomitsevich
 
акинина осмоловская
акинина осмоловскаяакинина осмоловская
акинина осмоловская
 
Serebryakov
SerebryakovSerebryakov
Serebryakov
 
потапов
потаповпотапов
потапов
 
molchanov(promt)
molchanov(promt)molchanov(promt)
molchanov(promt)
 
белканова
белкановабелканова
белканова
 
Skatov
SkatovSkatov
Skatov
 
гвоздикин
гвоздикингвоздикин
гвоздикин
 
веселов
веселоввеселов
веселов
 

Último

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 

Último (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 

Clustering

  • 1.
  • 2. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 3. Prof. Dr. Benno Stein Dr. Sven Meyer zu Eissen Weimar University, Germany Ideas, Materials, Collaboration - Structuring and Indexing - AI search in IR - Semi-Supervised Learning Dr. Xiaojin Zhu University of Wisconsin, Madison , USA
  • 4. TEXTUAL DATA Subject of Grouping NON TEXTUAL DATA Local Terminology It is not important what is the source of data: textual or non textual. Data : work in the space of numerical parameters Texts : work in the space of words Example: typical dialog between passenger ( US ) and railway directory inquires ( DI )
  • 5. TEXTS (‘indexed’) Presentation of Textual Data Vector model (‘parameterized’) Local Terminology Indexed texts are only parameterized texts in the space of words
  • 6. TEXTS (‘indexed’) Presentation of Textual Data TEXTS (‘parameterized’) Local Terminology Indexed texts are only parameterized texts in the space of themes = > Category /Context Vector Models... Example : manually parameterized dialogs in the space of parameters (transport service and passenger needs)
  • 7. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 8. Unsupervised Learning Types of Grouping Supervised Learning Semi-Supervised Learning We know nothing about data sructure We know well data sructure We know something about data sructure
  • 9. Clustering Classification Characteristics: Characteristics : Absence of patterns or descriptions Presence of patterns or descriptiones of classes, so the results are of classes, so the results are defined by the nature of defined by the user ( N>=1 ) the data themselves ( N >1 ) Synonyms: Synonyms : Classification without teacher Classificatio n with teacher Unsupervised learning Supervised learning Number of clusters Specials terms : [ ] is known exactly Categorization (of documents) [x] is known approximately Diagnostics (technics, medicine) [ ] is not known => searching Recognition (technics, science) Types of Grouping
  • 10. “ Semi Clustering/Classification” Classification Characteristics: Characteristics : Presence of limited number Presence of patterns or of patterns, so the results are descriptiones of classes, so the results defined both by the user or are defined by the user ( N>=1 ) by the data themselves ( N >1 ) Synonyms: Synonyms : Semi-Classification Classificatio n with teacher Semi Supervised learning Supervised learning Number of clusters/categories Specials terms : [ ] is known exactly Categorization (of documents) [x] is known approximately Diagnostics (technics, medicine) [ ] is not known => searching Recognition (technics, science) Types of Grouping
  • 11. Objectives of Grouping 1. Organization (structuring) of an object set Process is named data structuring 2. Searching interesting patterns Process is named navigation 3. Grouping for other applications: - Knowledge discovery (clustering) - Summarization of documents Note: Do not mix the type of grouping and its objective
  • 12.
  • 13. Hard grouping Hard clustering Hard categorization Soft grouping Soft clustering Soft categorization Example The distribution of letters of Moscovites to the Government is soft categorization (numbers in the table reflect the relative weight of each theme) Fuzzy Grouping
  • 14. Preprocessing <= Processing General Scheme of Clustering Process - I Principal idea : To transform texts to num erical form in order to use matematical tools Remember : our problem is grouping textual documents but not undestanding Here: Both rude and good matrixes are matrix Object/Attributes
  • 15. General Scheme of Clustering Process - II Preprocessing Processing <= Here: matrix Attribute/Attribute can be used instead of matrix Object/Object
  • 16. Matrixes to be Considered
  • 17.
  • 18.
  • 19. Importance of Preprocessing ( it takes 60%-90% of efforts)
  • 20. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 21. Definitions Def. 1 “Let us V be the set of objects. Clustering C = { Ci | Ci є V } of V is division of V on subsets, for which we have : U i Ci = V and Ci ∩ Cj = 0 i ≠j“ Def. 2 “Let us V be the set of nodes, E be arcs, φ is weight function that reflects the distance between objects, so we have a weighted graph G = { V,E, φ }. In this case C is named as clustering of G .” In the framework of the second definition every Ci produced subgraph G(Ci). Both subsets Ci and subgraphs G(Ci) are named clusters . Graph Set Clique
  • 22. Definitions Principal note Both definitions SAYS NOTHING : - about quality of clusters - about numbers of clusters Reason of difficulties Nowadays there is no any general agreement about any universal defintion of the term ‘ cluster ’ What means that clustering is good ? 1. Closeness between objets inside clusters is essentially more than the closeness between clusters themselves 2. Constructed clusters correspond to intuitive presentations of users ( they are natural clusters)
  • 23.
  • 24.
  • 25. Hierarchy based methods Nearest neighbor method (NN)
  • 26. Exemplar based methods K - means, centroid
  • 27.
  • 28. Method X-means (Dan Pelleg, Andrew Moor) Approach Using evaluation of object distribution Selection of the most likely points Advantage - More rapid - Number of cluster is not fixed (in all cases it tends to be less) Exemplar based methods
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Validity and Usability Conclusion Density expected measure corresponds to F -measure reflecting expert ’s opinion. So, DEM can be an indicator of expert opinion
  • 35.
  • 36. Meta Methods Algorithm (example) Notations: N is the number of objects in a given cluster D is the diagonal of a given cluster Initially N 0 and their centers Ci are given Steps 1. Method K - medoid (or any other one) is performed 2. If N > N max or D > D max (in any cluster), then this cluster is divided on 2 parts. Go to p.1 3. If N < N min or D < D min (in any cluster), then this and the closest clusters are joined. Go to p.1 4. When the number of iteration I > I max , Stop Otherwise go to p.1
  • 37. Visual Clustering Clustering on dendrite Clustering in space of factors
  • 38.
  • 39. Clustering Authorship Results 1) 18 comedies of Molier should be belonged to Corneille 2) 15 comedies of Mollier are weak connected with all his other works. So, they can be written by two authors 3) 2 comedies of Corneille now are considered as works of Molier . etc. Note : During a certain time Molier and Corneille were friends
  • 40. Special and Universal packages with algorithms of С lustering 1. ClustAn ( Scotland ) www. clustan.com Clustan Graphics-7 (2006) 2. MatLab Descriptions are in Internet 3. Statistica Descriptions are in Internet Learning Journals and С ongresses about Clustering 1. Journal “Journal of Classification”, Springer 2. IFCS - International Federation of Classification Societies, Conferences 3. CSNA - Classification Society of North America, Seminars, Workshops
  • 41. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 42. Certain Observations The numbers of methods for grouping data is a little bit more than the numbers of researchers working in this area. Problem does not consist in searching the best method for all cases. Problem consists in searching the method being relevant for your data. Only you know what methods are the best for you own data . Principal problems consist in choice of indexes (parameters) and measure of closeness to be adecuate to a given problem and given data Frecuently the results are bad because of the bad indexes and bad measure but not the bad method !
  • 43. Certain Observations Antipodal methods To be sure that results are really good and do not depend on the method used one should test these results using any antipodal methods Solomon G, 1977: “The most antipodes are: NN-method and K-means ” Sensibility To be sure that results do not depend essentially on the method’s parameters one should perform the analysis of sensibility by changing parameters of adjustment.
  • 44. Introduction Definitions Clustering Conclusions Open Problems CONTENTS 
  • 45. Some Problems Question 1 How to reveal alien objects? Solution (idea) Revealing a stable structure on different sets of objects. They are subsets of a given set. Object distribution reflects: real structure ( nature ) + noise ( alien objects )
  • 46. Some Problems Question 2 How to accelerate classification? Solution (idea) Filtering objects, which give a minimum contribution to decisive function Representative objects of each cluster
  • 47.