Mining the Social Web: Chapter 6
LinkedIn: Clustering Your Professional
    Network for Fun (and Profit?)

               chois79
Contents
• Introduction
• Motivation for Clustering
• Clustering Contacts by Job Title
   –   Standardizing and Counting Job Titles
   –   Common Similarity Metrics for Clustering
   –   A Greedy Approach to Clustering
   –   Hierarchical and k-Means Clustering

• Fetching Extended Profile Information
• Closing Remarks
Introduction
• LinkedIn is a popular social networking site focused
  on professional and business relationships
• There are two primary ways to access your LinkedIn data
   – Exporting it as address book data
   – Using the LinkedIn API
• This chapter introduces fundamental clustering
  techniques to answer the following kinds of queries
   – Which of your connections are the most similar based
     upon a criterion like job title?
   – Which of your connections have worked in companies you
     want to work for?
   – Where do most of your connections reside geographically?
Motivation for Clustering
• Which of your connections are the most similar based
  upon a criterion like job title?
   – LinkedIn members enter their professional
     information as free text.
      • job titles, company names, professional interests…

• There are two issues
   – How to measure similarity between two values
      • Ex) Chief Executive Officer, Chief Technology Officer
   – How to cluster all of the contacts
      • It would be ideal to compare every member to every other member
      • This is an n-squared problem
Clustering Contacts by Job Title
   Standardizing and Counting Job Titles
• Standardize job titles, then count them
   – Use patterns to transform common job title variants into a standard form
   – Perform a basic frequency analysis on the standardized titles
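
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own code: the `TRANSFORMS` table and the function names `standardize` and `count_titles` are assumptions made for the example, and plain substring replacement is deliberately naive (it would also rewrite "VP" inside a longer abbreviation).

```python
from collections import Counter

# Hypothetical abbreviation table; a real one would be tuned to your own contacts.
TRANSFORMS = [
    ("Sr.", "Senior"),
    ("Jr.", "Junior"),
    ("CEO", "Chief Executive Officer"),
    ("CTO", "Chief Technology Officer"),
    ("VP", "Vice President"),
]


def standardize(title):
    """Expand known abbreviations so equivalent titles compare equal."""
    for short, full in TRANSFORMS:
        title = title.replace(short, full)
    return title.strip()


def count_titles(titles):
    """Frequency analysis over the standardized titles."""
    return Counter(standardize(t) for t in titles)


counts = count_titles(["Sr. Engineer", "Senior Engineer", "CEO"])
```

After standardizing, "Sr. Engineer" and "Senior Engineer" fall into the same bucket, which is the point of doing this before any counting.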
Clustering Contacts by Job Title
 Common Similarity Metrics for Clustering
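• Edit distance (Levenshtein distance)
   – The minimum number of single-character insertions, deletions, and
     substitutions required to transform one string into the other
   – Ex1) dad into bad = 1
   – Ex2) park into spake = 3 (□ marks a gap; the cheapest alignment gives the distance)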
          s    p   a   k   e     distance
          □    p   a   r   k        3
          p    a   r   k   □        4
          p   □    a   r   k        4
          p    a   r   □   k        5
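
The alignment table above can be checked with a small dynamic-programming routine. This is a sketch written for illustration (the function name `levenshtein` is an assumption, not a library call), using unit cost for each insertion, deletion, and substitution.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when the characters match
            ))
        prev = curr
    return prev[-1]


dad_bad = levenshtein("dad", "bad")
park_spake = levenshtein("park", "spake")
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.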
Clustering Contacts by Job Title
 Common Similarity Metrics for Clustering
• N-gram similarity
   – Terse way of expressing each possible consecutive sequence of n
     tokens from a text
   – Ex) bi-gram (n = 2)
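
As a small illustration of the bi-gram case, the helper below slides a window of n tokens across a sequence. The name `ngrams` is chosen here for the example and is written in plain Python so that it does not depend on any particular library.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title_bigrams = ngrams("Chief Executive Officer".split())
char_bigrams = ngrams("park")
```

The same function works on a list of words or on a string of characters; two titles can then be compared by how many n-grams they share.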
Clustering Contacts by Job Title
 Common Similarity Metrics for Clustering
• Jaccard distance
   – Jaccard similarity: the number of items in common between the two
     sets divided by the total number of distinct items
      • |Set1 intersection Set2| / |Set1 union Set2|
   – Jaccard distance is 1 minus that similarity. In nltk.metrics.distance.jaccard_distance
      • (len(X.union(Y)) - len(X.intersection(Y))) / float(len(X.union(Y)))

• MASI distance
   – Weighted version of Jaccard similarity
      • adjusts the score to result in a smaller distance than Jaccard when
        only a partial overlap between the sets exists
      • 1 - float(len(X.intersection(Y))) / float(max(len(X), len(Y)))
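
To make the difference concrete, the two formulas on this slide can be written as plain Python functions and applied to the same pair of titles. This is a sketch of the formulas exactly as the slide states them; the MASI implementation shipped with a given NLTK release may differ, and both functions assume non-empty sets.

```python
def jaccard_distance(x, y):
    """1 minus the share of distinct items that the two sets have in common."""
    union = x | y
    return (len(union) - len(x & y)) / len(union)


def masi_distance(x, y):
    """MASI distance as given on the slide: overlap measured against the larger set."""
    return 1 - len(x & y) / max(len(x), len(y))


ceo = set("Chief Executive Officer".split())
cto = set("Chief Technology Officer".split())

jaccard = jaccard_distance(ceo, cto)  # 2 shared words out of 4 distinct words
masi = masi_distance(ceo, cto)        # 2 shared words, larger set has 3 words
```

For this partially overlapping pair the MASI value is the smaller of the two, which matches the slide's description.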
Clustering Contacts by Job Title
 Common Similarity Metrics for Clustering
• Jaccard distance vs. MASI distance
Clustering Contacts by Job Title
      A Greedy Approach to Clustering
• Cluster job titles by comparing them using MASI
  distance
   – Comparing every title with every other title is an n-squared problem

• Scalable clustering sure ain't easy
   – An O(n^2) algorithm is simply unacceptable as n grows
      • The comparison runs len(all_titles) * len(all_titles) times
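
The nested comparison described above might look like the following sketch. It is an illustration under assumptions rather than the chapter's code: `greedy_cluster`, the `threshold` default of 0.5, and the sample titles are all made up for the example, and titles are assumed to be non-empty strings.

```python
def masi_distance(x, y):
    """MASI distance as stated earlier in the slides (assumes non-empty sets)."""
    return 1 - len(x & y) / max(len(x), len(y))


def greedy_cluster(titles, threshold=0.5):
    """Attach to each title every other title whose word set is close enough."""
    clusters = {}
    for t1 in titles:
        clusters[t1] = []
        # Inner loop over all titles: len(titles) * len(titles) passes in total.
        for t2 in titles:
            if t1 == t2:
                continue
            if masi_distance(set(t1.split()), set(t2.split())) < threshold:
                clusters[t1].append(t2)
    return clusters


all_titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
]
clusters = greedy_cluster(all_titles)
```

With three titles the quadratic cost is invisible; with tens of thousands of contacts the inner loop dominates the running time.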
Clustering Contacts by Job Title
      A Greedy Approach to Clustering
• Score each title against a random sample of titles
  instead of against every title
   – The inner loop then executes a much smaller, fixed number of
     times
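
A sampled variant of the greedy loop can be sketched as follows. The function name `sampled_cluster` and the parameters `sample_size`, `threshold`, and `seed` are assumptions introduced for this example; the trade-off is that a similar title is missed whenever it does not land in the sample.

```python
import random


def masi_distance(x, y):
    """MASI distance as stated earlier in the slides (assumes non-empty sets)."""
    return 1 - len(x & y) / max(len(x), len(y))


def sampled_cluster(titles, sample_size, threshold=0.5, seed=None):
    """Compare each title with a fixed-size random sample rather than all titles."""
    rng = random.Random(seed)
    clusters = {}
    for t1 in titles:
        # The inner loop now runs at most sample_size times per title.
        sample = rng.sample(titles, min(sample_size, len(titles)))
        clusters[t1] = [
            t2 for t2 in sample
            if t2 != t1
            and masi_distance(set(t1.split()), set(t2.split())) < threshold
        ]
    return clusters


all_titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
]
full = sampled_cluster(all_titles, sample_size=len(all_titles), seed=0)
small = sampled_cluster(all_titles, sample_size=1, seed=0)
```

The total work drops from len(titles) * len(titles) comparisons to len(titles) * sample_size.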
Clustering Contacts by Job Title
             Hierarchical Clustering
• Hierarchical clustering (agglomerative clustering)
   – Deterministic and exhaustive technique
      • Computes the full matrix of distances between all items
      • Walks through the matrix, clustering items that meet a minimum
        distance threshold
   – About 0.5 * n^2 distance computations (dynamic programming), since distance is symmetric
      • Ex) (abc, def) and (def, abc) have the same distance, so it is computed once
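
The idea of computing each pair once and then merging close items can be sketched as below. This is a simplified threshold-merging illustration, not the chapter's implementation: the names `pairwise_distances` and `threshold_clusters` are assumptions, titles are assumed to be unique and non-empty, and the merge step links any two items closer than the threshold.

```python
from itertools import combinations


def masi_distance(x, y):
    """MASI distance as stated earlier in the slides (assumes non-empty sets)."""
    return 1 - len(x & y) / max(len(x), len(y))


def pairwise_distances(titles):
    """Distance for each unordered pair: n * (n - 1) / 2 entries, not n * n."""
    return {
        (a, b): masi_distance(set(a.split()), set(b.split()))
        for a, b in combinations(titles, 2)
    }


def threshold_clusters(titles, threshold=0.5):
    """Merge items whose distance is under the threshold into shared groups."""
    parent = {t: t for t in titles}

    def find(t):
        # Follow parent links up to the representative of t's group.
        while parent[t] != t:
            t = parent[t]
        return t

    for (a, b), d in pairwise_distances(titles).items():
        if d < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in titles:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())


titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
    "Senior Software Engineer",
]
distances = pairwise_distances(titles)
groups = threshold_clusters(titles)
```

Four titles yield six distance computations instead of sixteen, and the result does not depend on random choices.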
Clustering Contacts by Job Title
             K-Means Clustering
• K-Means Clustering
  – Generally executes on the order of k*n comparisons per pass
  – Steps
     1. Randomly pick k points in the data space as initial values that will
        be used to compute the k clusters: K1, K2, …, Kk.
     2. Assign each of the n points to a cluster by finding the nearest Ki,
        effectively creating k clusters and requiring k * n comparisons.
     3. For each of the k clusters, calculate the centroid, or the mean of the
        cluster, and reassign its Ki value to be that value.
     4. Repeat steps 2-3 until the members of the clusters do not change
        between iterations. Generally speaking, relatively few iterations are
        required for convergence.
  – http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
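
The four steps can be sketched in plain Python for points given as coordinate tuples. This is a minimal illustration under assumptions: the function and parameter names (`kmeans`, `init`, `max_iter`, `seed`) are invented for the example, initial centers are sampled from the data itself unless `init` is supplied, and an empty cluster simply keeps its previous center.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, init=None, max_iter=100, seed=None):
    rng = random.Random(seed)
    # Step 1: choose k starting centers (sampled from the data unless given).
    centers = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (k * n comparisons).
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its current members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = centroid(members)
    return assignment, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, k=2, init=[(0, 0), (10, 10)])
```

Passing `init` makes the run reproducible; with random starting points the result can vary between runs.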
Fetching Extended Profile Information (1/2)

• OAuth
  – Open standard for authorization
  – Allows users to share their private resources stored on one site
    with another site without having to hand out their credentials
Fetching Extended Profile Information (2/2)

• Example
  1. Request token
  2. Redirect to the authorization page
  3. Access token
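
The order of the three steps can be illustrated with a toy, in-memory model. Everything here is hypothetical: `ToyProvider` and its methods are invented for the example and do not correspond to LinkedIn's API or to any OAuth library, and signing, callbacks, and HTTP are left out entirely.

```python
import secrets


class ToyProvider:
    """Hypothetical stand-in for the site that holds the user's data."""

    def __init__(self):
        self.request_tokens = {}  # request token -> has the user approved it?
        self.access_tokens = set()

    def issue_request_token(self):
        # Step 1: the application asks for a temporary request token.
        token = secrets.token_hex(8)
        self.request_tokens[token] = False
        return token

    def authorize(self, request_token):
        # Step 2: the user is redirected to the provider's own page and approves.
        # The application never sees the user's password.
        self.request_tokens[request_token] = True

    def exchange(self, request_token):
        # Step 3: an approved request token is traded for an access token.
        if not self.request_tokens.get(request_token, False):
            raise PermissionError("request token was not authorized")
        del self.request_tokens[request_token]
        access_token = secrets.token_hex(8)
        self.access_tokens.add(access_token)
        return access_token


provider = ToyProvider()
request_token = provider.issue_request_token()
provider.authorize(request_token)
access_token = provider.exchange(request_token)
```

Skipping the authorization step makes the exchange fail, which is the property that lets a user grant access without handing out credentials.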
Closing Remarks
• This chapter covered some serious ground
  – Introduced fundamental clustering techniques
  – Applied them to your professional network data on LinkedIn in a
    variety of ways

Mining the social web 6

  • 1. Mining the Social Web: Chapter6 LinkedIn: Clustering Your Professional Network for Fun(And Profit?) chois79
  • 2. Contents • Introduction • Motivation for Clustering • Clustering Contacts by Job Title – Standardizing and Counting Job Titles – Common Similarity Metrics for Clustering – A Greedy Approach to Clustering – Hierarchical and k-Means Clustering • Fetching Extended Profile Information • Closing Remarks
  • 3. Introduction • LinkedIn is a popular social networking site focused on professional and business relationships • The two primary ways you can access your LinkedIn data – Exporting it as address book data – Using the LinkedIn API • This chapter introduces fundamental clustering techniques to answer the following kinds of queries – Which of your connections are the most similar based upon a criterion like job title? – Which of your connections have worked in companies you want to work for? – Where do most of your connections reside geographically?
  • 4. Motivation for Clustering • Which of your connections are the most similar based upon a criterion like job title? – LinkedIn members can enter their professional information as free text • Job titles, company names, professional interests… • There are two issues – How to measure similarity between two values • Ex) Chief Executive Officer, Chief Technology Officer – How to cluster all of your contacts • It would be ideal to compare every member to every other member • This is an n-squared problem
  • 5. Clustering Contacts by Job Title: Standardizing and Counting Job Titles • Standardizing and Counting Job Titles – Use patterns to transform common job titles into a standard form – Perform a basic frequency analysis [Figure: standardizing]
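The two steps on slide 5 can be sketched in a few lines of Python. This is only an illustration: the replacement rules and the sample titles below are made up, since the slide does not list the actual patterns, and plain substring replacement is a deliberate simplification.

```python
from collections import Counter

# Hypothetical abbreviation-to-full-form rules (not taken from the slides).
TRANSFORMS = [
    ("Sr.", "Senior"),
    ("Jr.", "Junior"),
    ("CEO", "Chief Executive Officer"),
    ("CTO", "Chief Technology Officer"),
]


def standardize(title):
    """Apply each replacement rule in order, then trim surrounding whitespace."""
    for abbreviation, full_form in TRANSFORMS:
        title = title.replace(abbreviation, full_form)
    return title.strip()


def count_titles(titles):
    """Return a Counter mapping each standardized title to its frequency."""
    return Counter(standardize(title) for title in titles)


# Made-up sample data for illustration only.
sample = ["Sr. Software Engineer", "Senior Software Engineer", "CEO", "Chief Executive Officer "]
for title, frequency in count_titles(sample).most_common():
    print(frequency, title)
```

Substring replacement can misfire on titles that merely contain an abbreviation, so a real rule set would likely need word-boundary matching.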
  • 6. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering • Edit distance (Levenshtein distance) – The minimum number of single-character insertions, deletions, and substitutions required to transform one string into the other – Ex1) dad into bad = 1 – Ex2) park into spake = 3 – Distance of each gapped alignment of park against spake (□ marks the gap): □park = 3, park□ = 4, p□ark = 4, par□k = 5
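The edit distance on slide 6 can be computed with the standard dynamic-programming recurrence. The sketch below is a plain-Python version written for illustration; the function name `edit_distance` is chosen here and is not a library call.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    # previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i chars of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


print(edit_distance("dad", "bad"))     # 1
print(edit_distance("park", "spake"))  # 3
```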
  • 7. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering • N-gram similarity – A terse way of expressing each possible consecutive sequence of n tokens from a text – Ex) bi-gram (n = 2)
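As a small illustration of the n-gram idea on slide 7, the helper below lists every run of n consecutive tokens. The name `ngrams` and the sample title are chosen here for the example.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple (empty if there are fewer than n tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title = "Chief Executive Officer".split()
print(ngrams(title, 2))
# [('Chief', 'Executive'), ('Executive', 'Officer')]
```

Two titles can then be compared by how many n-grams they share, for example with one of the set distances on the next slide.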
  • 8. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering • Jaccard distance – Jaccard similarity is the number of items in common between the two sets divided by the total number of distinct items • |Set1 intersection Set2| / |Set1 union Set2| – nltk.metrics.distance.jaccard_distance returns one minus that similarity • (len(X.union(Y)) - len(X.intersection(Y))) / float(len(X.union(Y))) • MASI distance – A weighted version of Jaccard similarity • Adjusts the score to give a smaller distance than Jaccard when the sets only partially overlap • 1 - float(len(X.intersection(Y))) / float(max(len(X), len(Y)))
  • 9. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering • Jaccard distance vs. MASI distance
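To make the Jaccard vs. MASI comparison concrete, the sketch below implements the two formulas exactly as quoted on slide 8 in plain Python rather than calling NLTK. Library versions may compute MASI differently, so treat the numbers as an illustration of the slide's formulas only. Both functions assume non-empty sets.

```python
def jaccard_distance(x, y):
    """One minus |x intersection y| / |x union y|; assumes at least one set is non-empty."""
    union = x | y
    return (len(union) - len(x & y)) / len(union)


def masi_distance(x, y):
    """MASI distance as written on slide 8; assumes at least one set is non-empty."""
    return 1 - len(x & y) / max(len(x), len(y))


ceo = set("Chief Executive Officer".split())
cto = set("Chief Technology Officer".split())

# The sets share 2 of 4 distinct words, and the larger set has 3 words.
print(jaccard_distance(ceo, cto))  # 0.5
print(masi_distance(ceo, cto))     # roughly 0.333
```

Because the larger set can never have more items than the union, this MASI formula never exceeds the Jaccard distance, which matches the claim that partial overlaps score as closer.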
  • 10. Clustering Contacts by Job Title: A Greedy Approach to Clustering • Cluster job titles by comparing them using MASI distance (an n-squared problem) • Scalable clustering sure ain’t easy – An O(n^2) algorithm is simply unacceptable for large networks • The distance function is called len(all_titles) * len(all_titles) times
  • 11. Clustering Contacts by Job Title: A Greedy Approach to Clustering • A random sample is selected for the scoring function – The inner loop then executes a much smaller, fixed number of times
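The sampling idea on slides 10 and 11 might look roughly like the sketch below. It is an assumption-laden illustration, not the chapter's code: the function name, the `threshold` and `sample_size` parameters, and the sample titles are all chosen here, and the MASI formula is the one quoted on slide 8.

```python
import random


def masi_distance(x, y):
    """MASI distance as written on slide 8; assumes non-empty sets."""
    return 1 - len(x & y) / max(len(x), len(y))


def greedy_cluster(titles, threshold=0.5, sample_size=100, seed=None):
    """Group each title with sampled titles closer than threshold.

    The inner loop runs at most sample_size times per title, so the total
    number of comparisons is n * sample_size instead of n * n.
    """
    rng = random.Random(seed)
    clusters = {}
    for title1 in titles:
        clusters[title1] = [title1]
        # Compare against a fixed-size random sample instead of every title.
        sample = rng.sample(titles, min(sample_size, len(titles)))
        for title2 in sample:
            if title2 == title1:
                continue
            distance = masi_distance(set(title1.split()), set(title2.split()))
            if distance < threshold:
                clusters[title1].append(title2)
    return clusters


# Made-up titles for illustration only.
titles = ["Chief Executive Officer", "Chief Technology Officer", "Software Engineer"]
for key, members in greedy_cluster(titles, seed=0).items():
    print(key, "->", members)
```

The trade-off is that two similar titles are only grouped if one happens to land in the other's sample, so results can vary between runs once the list is larger than `sample_size`.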
  • 12. Clustering Contacts by Job Title: Hierarchical Clustering • Hierarchical clustering (agglomerative clustering) – Deterministic and exhaustive technique • Computes the full matrix of distances between all items • Walks through the matrix, clustering items that meet a minimum distance threshold – About 0.5*n^2 distance computations (dynamic programming) • Ex) (abc, def) and (def, abc) have the same distance, so it is computed only once
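The sketch below illustrates the two points on slide 12 under a simplifying assumption: instead of building a full dendrogram, it performs single-linkage agglomerative clustering cut at one fixed threshold, which amounts to merging any items linked by a chain of close pairs. The function names are chosen here, and items are assumed to be unique and hashable.

```python
from itertools import combinations


def pairwise_distances(items, distance):
    """Compute each unordered pair once: n*(n-1)/2 calls instead of n*n."""
    return {(a, b): distance(a, b) for a, b in combinations(items, 2)}


def threshold_clusters(items, distance, threshold):
    """Merge items into clusters whenever a pair is closer than threshold."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links to the cluster representative, shortening the path as we go.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), d in pairwise_distances(items, distance).items():
        if d < threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())


# Made-up numeric data; absolute difference stands in for a title distance.
values = [1, 2, 10, 11, 20]
print(threshold_clusters(values, lambda a, b: abs(a - b), threshold=3))
```

Unlike the sampled greedy approach, this visits every pair, so the result is deterministic but the cost still grows quadratically with the number of items.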
  • 13. Clustering Contacts by Job Title: K-Means Clustering • K-Means Clustering – Generally requires on the order of k*n comparisons per iteration – Steps: 1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, …, Kk. 2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k * n comparisons. 3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. 4. Repeat steps 2-3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence – http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
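The four steps on slide 13 can be sketched in plain Python as follows. One assumption to flag: step 1 here picks the initial values from the data points themselves, a common variant of "random points in the data space". The function names, the `max_iterations` cap, and the sample points are chosen for this illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=None):
    """Return (centroids, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial values (assumption: sampled from the data itself).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k * n comparisons).
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(points, k=2, seed=0))
```

Applying this to job titles would first require turning each title into a numeric vector, which the slides do not cover.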
  • 14. Fetching Extended Profile Information (1/2) • OAuth – Open standard for authorization – Allows users to share their private resources stored on one site with another site without having to hand out their credentials
  • 15. Fetching Extended Profile Information (2/2) • Example flow: request token → redirect to authorization page → access token
  • 16. Closing Remarks • This chapter covered some serious ground – Introduced fundamental clustering techniques – Applied them to your professional network data on LinkedIn in a variety of ways