Canopy k-means using Hadoop

  • 1. Canopy Clustering and K-Means Clustering. Machine Learning Big Data at Hacker Dojo. Anandha L Ranganathan (Anand), analog76@gmail.com, MLBigData
  • 2. Movie Dataset. Download the movie dataset from http://www.grouplens.org/node/73. The data is in the format UserID::MovieID::Rating::Timestamp, for example:
      1::1193::5::978300760
      2::1194::4::978300762
      7::1123::1::978300760
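The record format above can be read with a plain string split. This is a minimal sketch, not code from the talk: the `RatingRecord` class and its field names are hypothetical, and it assumes ratings are whole numbers as in the sample rows.

```java
// Hypothetical helper for illustration; the class and field names are assumptions.
class RatingRecord {
    final int userId;
    final int movieId;
    final int rating;
    final long timestamp;

    RatingRecord(int userId, int movieId, int rating, long timestamp) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    // Parses one line of the form UserID::MovieID::Rating::Timestamp.
    static RatingRecord parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new RatingRecord(
                Integer.parseInt(fields[0]),
                Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]),
                Long.parseLong(fields[3]));
    }

    public static void main(String[] args) {
        RatingRecord r = RatingRecord.parse("1::1193::5::978300760");
        System.out.println(r.userId + " rated movie " + r.movieId + " with " + r.rating);
    }
}
```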
  • 3. Similarity Measure: Jaccard similarity coefficient, cosine similarity
  • 4. Jaccard Index. Similarity = (number of movies watched by both User A and User B) / (total number of movies watched by either user). In other words, |A ∩ B| / |A ∪ B|. For our application I am going to compare subsets of users z₁ and z₂, where z₁, z₂ ∈ Z. http://en.wikipedia.org/wiki/Jaccard_index
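  • 5. Jaccard Similarity Coefficient.
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      public static double similarity(String[] s1, String[] s2) {
          List<String> lstSx = Arrays.asList(s1);
          List<String> lstSy = Arrays.asList(s2);
          // Union: every item that appears in either array
          Set<String> unionSxSy = new HashSet<String>(lstSx);
          unionSxSy.addAll(lstSy);
          // Intersection: only the items that appear in both arrays
          Set<String> intersectionSxSy = new HashSet<String>(lstSx);
          intersectionSxSy.retainAll(lstSy);
          // Cast to double so the division is not truncated to an integer
          return intersectionSxSy.size() / (double) unionSxSy.size();
      }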
  • 6. Cosine Similarity. similarity(A, B) = (A · B) / (||A|| * ||B||), where A · B is the dot (inner) product. A simple distance calculation will be used for canopy clustering. An expensive distance calculation will be used for K-means clustering.
  • 7. Canopy Clustering – Mapper. A canopy cluster is a subset of the total population. The points in a cluster are movies. If a subset z₁ of the whole population rated movie M1, and the same subset also rated movie M2, then M1 and M2 belong to the same canopy cluster.
  • 8. Canopy Cluster – Mapper.
      1. The first point received becomes the center of a canopy. Call it P1.
      2. Receive the second point, P2. If its distance from the canopy center is less than T2, it is a point of that canopy.
      3. If d(P1, P2) > T2, then P2 becomes a new canopy center.
      4. If d(P1, P2) < T2, then P2 is a point of the canopy centered at P1.
      5. Repeat steps 2, 3 and 4 until the mapper completes its job.
    Distances are measured on a scale from 0 to 1. The T2 value is 0.005, and I expect around 200 canopy clusters. The T1 value is 0.0010.
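The mapper steps above can be sketched as a single pass over the incoming points. This is an illustration under assumptions, not the talk's code: `CanopyCenterSketch` and `pickCenters` are hypothetical names, each movie is represented by the set of user IDs that rated it, distance is taken as 1 minus the Jaccard similarity, and the T2 of 0.5 is picked for the toy data rather than the 0.005 quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CanopyCenterSketch {
    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B| (assumed way to turn similarity into distance).
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<Integer>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // two empty sets are treated as identical
        }
        Set<Integer> intersection = new HashSet<Integer>(a);
        intersection.retainAll(b);
        return 1.0 - intersection.size() / (double) union.size();
    }

    // Returns the canopy centers chosen from the points, in arrival order.
    static List<Set<Integer>> pickCenters(List<Set<Integer>> points, double t2) {
        List<Set<Integer>> centers = new ArrayList<Set<Integer>>();
        for (Set<Integer> point : points) {
            boolean withinT2 = false;
            for (Set<Integer> center : centers) {
                if (jaccardDistance(center, point) < t2) {
                    withinT2 = true; // the point joins an existing canopy
                    break;
                }
            }
            if (!withinT2) {
                centers.add(point); // first point, or too far from every center so far
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Toy data: each point is the set of user IDs that rated one movie.
        List<Set<Integer>> movies = new ArrayList<Set<Integer>>();
        movies.add(new HashSet<Integer>(Arrays.asList(1, 2, 3)));
        movies.add(new HashSet<Integer>(Arrays.asList(1, 2, 3, 4)));
        movies.add(new HashSet<Integer>(Arrays.asList(7, 8, 9)));
        System.out.println(pickCenters(movies, 0.5).size() + " canopy centers"); // 2 canopy centers
    }
}
```

In the toy run the second movie is at distance 0.25 from the first, so it joins that canopy, while the third shares no users with it and becomes a new center.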
  • 9. Canopy Cluster – Mapper. Pseudocode:
      boolean pointStronglyBoundToCanopyCenter = false;
      for (Canopy canopy : canopies) {
          double centerPoint = canopy.getPoint();
          // The point is strongly bound if it is similar enough to any existing center
          if (distanceMeasure.similarity(centerPoint, movie_id) > T1) {
              pointStronglyBoundToCanopyCenter = true;
          }
      }
      // No center is close enough, so start a new canopy
      if (!pointStronglyBoundToCanopyCenter) {
          canopies.add(new Canopy(0.0d));
      }
  • 10. Data Massaging. Convert the data into the required format. In this case the converted data is keyed by movie, as <MovieId, List of Users>, that is, <MovieId, List<userId, rating>>.
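The conversion above amounts to grouping the raw rating lines by movie. The sketch below does this in memory to show the target shape. It is an assumption-laden illustration: `GroupByMovieSketch` is a hypothetical name, each (userId, rating) pair is stored as a two-element `int[]`, and in an actual MapReduce job the grouping would come from emitting the movie ID as the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupByMovieSketch {
    // Groups "UserID::MovieID::Rating::Timestamp" lines into
    // MovieId -> list of {userId, rating} pairs.
    static Map<Integer, List<int[]>> groupByMovie(List<String> lines) {
        Map<Integer, List<int[]>> byMovie = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split("::");
            int userId = Integer.parseInt(f[0]);
            int movieId = Integer.parseInt(f[1]);
            int rating = Integer.parseInt(f[2]);
            byMovie.computeIfAbsent(movieId, k -> new ArrayList<>())
                   .add(new int[] {userId, rating});
        }
        return byMovie;
    }

    public static void main(String[] args) {
        // Toy input: the second row is altered so that two users rate the same movie.
        List<String> lines = Arrays.asList(
                "1::1193::5::978300760",
                "2::1193::4::978300762",
                "7::1123::1::978300760");
        for (Map.Entry<Integer, List<int[]>> e : groupByMovie(lines).entrySet()) {
            System.out.println("movie " + e.getKey() + " -> " + e.getValue().size() + " ratings");
        }
    }
}
```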
  • 11. Canopy Cluster – Mapper A
  • 12. Threshold value
  • 13. Correction: the T1 and T2 labels are wrong. The inner circle is T2 and the outer circle is T1.
  • 19. Reducer. Mapper A: red center. Mapper B: green center.
  • 20. Redundant centers within the threshold of each other.
  • 21. Add a small error: Threshold + ξ
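One possible reading of slides 19 to 21 is that the reducer receives the centers chosen by every mapper and drops any center that lies within Threshold + ξ of a center it has already kept. The sketch below assumes that reading. `CanopyReducerSketch`, `mergeCenters`, the 1-D toy points and the numeric values are all hypothetical and stand in for the movie data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class CanopyReducerSketch {
    // Keeps a center only if it is at least (threshold + xi) away
    // from every center that has already been kept.
    static <P> List<P> mergeCenters(List<P> mapperCenters,
                                    ToDoubleBiFunction<P, P> distance,
                                    double threshold, double xi) {
        List<P> merged = new ArrayList<>();
        for (P candidate : mapperCenters) {
            boolean redundant = false;
            for (P kept : merged) {
                if (distance.applyAsDouble(kept, candidate) < threshold + xi) {
                    redundant = true; // too close to a center we already have
                    break;
                }
            }
            if (!redundant) {
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Toy 1-D centers on a 0-to-1 scale: the first two from "Mapper A", the last two from "Mapper B".
        List<Double> centers = Arrays.asList(0.10, 0.60, 0.12, 0.90);
        List<Double> merged = mergeCenters(centers, (a, b) -> Math.abs(a - b), 0.05, 0.01);
        System.out.println(merged); // [0.1, 0.6, 0.9]
    }
}
```

Here 0.12 is dropped because it sits within 0.06 of the center 0.10 that another mapper already produced.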
  • 22. So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center. The canopy clusters are ready when that job completes. What would the result look like?
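A sketch of what that second job might compute, assuming the usual canopy formulation in which a point belongs to every canopy whose center is closer than the loose threshold T1, so one point can sit in several canopies. `CanopyAssignSketch`, `assign`, the 1-D toy points and the threshold value are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

class CanopyAssignSketch {
    // Maps each center to the points whose distance from it is below t1.
    static <P> Map<P, List<P>> assign(List<P> centers, List<P> points,
                                      ToDoubleBiFunction<P, P> distance, double t1) {
        Map<P, List<P>> canopies = new LinkedHashMap<>();
        for (P center : centers) {
            canopies.put(center, new ArrayList<>());
        }
        for (P point : points) {
            for (P center : centers) {
                if (distance.applyAsDouble(center, point) < t1) {
                    canopies.get(center).add(point);
                }
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<Double> centers = Arrays.asList(0.2, 0.5);
        List<Double> points = Arrays.asList(0.1, 0.2, 0.35, 0.5);
        Map<Double, List<Double>> canopies =
                assign(centers, points, (a, b) -> Math.abs(a - b), 0.2);
        System.out.println(canopies); // {0.2=[0.1, 0.2, 0.35], 0.5=[0.35, 0.5]}
    }
}
```

The point 0.35 lands in both canopies, which is what lets the later K-means step restrict its comparisons to points that share a canopy.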
  • 23. Canopy Cluster – Before MR job: sparse matrix
  • 24. Canopy Cluster – After MR job
  • 25. Cells with value 1 are grouped together, and users are moved from their original positions.
  • 26. K-Means Clustering. The output of canopy clustering becomes the input of K-means clustering. Apply the cosine similarity metric to find similar users. To compute cosine similarity, create a vector per user in the format <UserId, List<Movies>>, for example <UserId, {m1, m2, m3, m4, m5}>.
  • 28. Worked example:
      Vector(A) = 1111000
      Vector(B) = 0100111
      Vector(C) = 1110010
      similarity(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
      Vector(A) · Vector(B) = 1
      ||A|| * ||B|| = 2 * 2 = 4
      similarity(A, B) = 1/4 = 0.25
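The arithmetic above can be checked with a few lines of code. This is a minimal sketch with hypothetical names (`CosineSketch`, `cosine`). It treats each vector as an array of 0/1 flags, one per movie, and it returns NaN for an all-zero vector because the denominator is then zero.

```java
class CosineSketch {
    // Cosine similarity: dot(a, b) / (||a|| * ||b||).
    static double cosine(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        int[] a = {1, 1, 1, 1, 0, 0, 0};
        int[] b = {0, 1, 0, 0, 1, 1, 1};
        int[] c = {1, 1, 1, 0, 0, 1, 0};
        System.out.println(cosine(a, b)); // 0.25
        System.out.println(cosine(a, c)); // 0.75
        System.out.println(cosine(b, c)); // 0.5
    }
}
```

By this measure user C is closer to A (0.75) than to B (0.5).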
  • 29. Find the k neighbors from the same canopy cluster. Do not take any point from another canopy cluster if you want a small number of neighbors. The number of K-means clusters is greater than the number of canopy clusters. After a couple of MapReduce jobs, the K-means clusters are ready.
  • 30. Find Nearest Cluster of a Point – Map.
      public void addPointToCluster(Point point, Iterable<KMeansCluster> lstKMeansCluster) {
          KMeansCluster closestCluster = null;
          double closestDistance = CanopyThresholdT1 / 3;
          for (KMeansCluster cluster : lstKMeansCluster) {
              double distance = distance(cluster.getCenter(), point);
              // Take the first cluster unconditionally, then any cluster that is closer
              if (closestCluster == null || closestDistance > distance) {
                  closestCluster = cluster;
                  closestDistance = distance;
              }
          }
          // Guard against an empty cluster list
          if (closestCluster != null) {
              closestCluster.add(point);
          }
      }
  • 31. Compute the centroid until it converges.
      public void computeConvergence(Iterable<KMeansCluster> clusters) {
          for (KMeansCluster cluster : clusters) {
              // Point is assumed as the centroid type, matching getCenter() on the previous slide
              Point newCentroid = cluster.computeCentroid(cluster);
              // Compare by value; == would only compare object references
              if (cluster.getCentroid().equals(newCentroid)) {
                  cluster.converged = true;
              } else {
                  cluster.setCentroid(newCentroid);
              }
          }
      }
    Repeat the nearest-cluster assignment and the centroid computation until the centroids stop changing.
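The two steps on slides 30 and 31 can be combined into one small loop that runs in memory. This is a sketch under assumptions rather than the MapReduce version from the talk: `KMeansSketch` and its methods are hypothetical names, distance is taken as 1 minus cosine similarity, the initial centroids are picked by hand, and the restriction to points from the same canopy is left out to keep the example short.

```java
import java.util.Arrays;

class KMeansSketch {
    // Cosine distance = 1 - cosine similarity (assumed distance for this sketch).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Index of the centroid closest to the point (ties go to the lower index).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (cosineDistance(point, centroids[c]) < cosineDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Alternates assignment and centroid update until the centroids stop changing.
    static double[][] cluster(double[][] points, double[][] initial, int maxIterations) {
        double[][] centroids = initial;
        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            double[][] updated = new double[centroids.length][];
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    updated[c] = centroids[c]; // an empty cluster keeps its old centroid
                    continue;
                }
                updated[c] = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[c][d] = sums[c][d] / counts[c];
                }
            }
            if (Arrays.deepEquals(updated, centroids)) {
                return updated; // converged: no centroid moved
            }
            centroids = updated;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] users = {{1, 1, 0, 0}, {1, 1, 1, 0}, {0, 0, 1, 1}, {0, 0, 0, 1}};
        double[][] result = cluster(users, new double[][] {users[0], users[2]}, 10);
        System.out.println(Arrays.deepToString(result));
        // [[1.0, 1.0, 0.5, 0.0], [0.0, 0.0, 0.5, 1.0]]
    }
}
```

The iteration cap is a safety net for data where the centroids keep shifting by tiny amounts. A tolerance-based comparison would be the usual alternative to exact equality.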
  • 32. All points – before clustering
  • 33. Canopy clustering
  • 34. Canopy clustering and K-means clustering
  • 35. Questions?
  • 36. References
      Apache Mahout: https://cwiki.apache.org/MAHOUT/canopy-clustering.html
      Canopy Clustering: http://code.google.com/p/canopy-clustering/
      Google Lectures: http://www.youtube.com/watch?v=1ZDybXl212Q
      http://cs.boisestate.edu/~amit/research/makho_ngazimbi_project.pdf