Mining of Massive Datasets
Ashic Mahtab 
@ashic 
www.heartysoft.com
Stream Processing 
 Have I already processed this? 
 How many distinct queries were made? 
 How many hits did I get?
Stream Processing – Bloom Filters 
 Guaranteed detection of negatives. 
 Possible false positive.
Stream Processing – Bloom Filters
• Have a collection of hash functions (h1, h2, h3…).
• For an input, run each hash function and map the results to positions in a bit array.
• If all of those bits are already set in the working store, the input might have been processed (false positives are possible).
• If any of those bits is not set in the working store, the input has definitely not been seen and needs processing (no false negatives).
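The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the class name `BloomFilter`, the default sizes, and the trick of deriving h1, h2, h3 by salting SHA-256 with the hash index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: several salted hashes over one bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # the "working store"

    def _positions(self, item):
        # Assumption: h1, h2, ... are derived by salting one digest with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Light every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True  -> possibly seen before (false positives can happen).
        # False -> definitely not seen (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("Foo")
print(seen.might_contain("Foo"))  # True once "Foo" has been added
```

A larger bit array or a better-tuned number of hash functions lowers the false-positive rate; the guarantee that a `False` answer is always correct does not depend on those choices.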
Stream Processing – Bloom Filters 
1 0 0 1 1 0 1 1 1 0 
0 0 1 0 0 1 1 0 0 0 
0 0 1 1 1 0 0 0 0 1 
0 1 0 0 0 1 0 0 0 1 
Input 1: “Foo” hashes to: 
1 0 0 1 1 0 0 0 0 0 
Input 2: “Bar” hashes to: 
1 0 1 1 1 0 0 0 0 0
Stream Processing – Bloom Filters
• Not just for streams (everything is a stream, right?)
• Cassandra uses Bloom filters to check whether some data might be in a low-level storage file.
Map Reduce 
 A little smarts goes a l-o-o-o-n-g way.
Map Reduce – Multiway Joins 
 R join S join T 
 size(R) = r, size(S) = s, size(T) = t 
 Probability of match for R and S = p 
 Probability of match for S and T = p 
 Which do we join first?
Map Reduce – Multiway Joins 
 R (A, B) join S(B, C) join T(C, D) 
 size(R) = r, size(S) = s, size(T) = t 
 Probability of match for R and S = p 
 Probability of match for S and T = p 
 Communication cost: 
* If we join R and S first: O(r + s + t + prs)
* If we join S and T first: O(r + s + t + pst)
Map Reduce – Multiway Joins 
 Can we do better?
Map Reduce – Multiway Joins
• Hash B to b buckets, C to c buckets.
• bc = k (number of reducers)
• Cost ~ r + 2s + t + 2 * sqrt(krt)
Usually, r + t can be neglected compared to the k term, leaving
2s + 2 * sqrt(krt)
[Single MR job]
• vs (r + s + t + prs)
[Two MR jobs]
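To get a feel for the two figures above, they can be evaluated side by side. The function names and the sample sizes below are hypothetical; the formulas are the ones on the slide, taken at face value with constant factors ignored.

```python
import math


def two_job_cost(r, s, t, p):
    """Two cascaded two-way joins: r + s + t + prs (the two-job figure)."""
    return r + s + t + p * r * s


def single_job_cost(r, s, t, k):
    """One three-way join over k = b*c reducers: r + 2s + t + 2*sqrt(krt)."""
    return r + 2 * s + t + 2 * math.sqrt(k * r * t)


# Hypothetical sizes: one million tuples per relation, 100 reducers.
r = s = t = 10**6
for p in (1e-6, 1e-4):
    print(p, two_job_cost(r, s, t, p), single_job_cost(r, s, t, 100))
```

With these made-up numbers the single job costs more when matches are rare (p = 1e-6) and less when the intermediate join is large (p = 1e-4), so neither plan wins in every case.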
Map Reduce – Multiway Joins 
 So…is this always better?
Map Reduce – Complexity
• Replication rate (r):
Number of outputs by all Map tasks / number of inputs
• Reducer size (q):
Max number of values per key at the reducers
• p = number of inputs
• For n x n matrix multiplication (p = 2n^2 inputs):
qr >= 2n^2
r >= p / q
Map Reduce – Matrix Multiplication 
 Approach 1 
 Matrix M, N 
 M(i, j), N(j, k) 
 Map1: Map matrices to (j, (M, i, mij)), (j, (N, k, njk)) 
 Reduce1: for each key, output ((i, k), mij*njk) 
 Map2: Identity 
 Reduce2: For each key, (i, k) get the sum of values.
Map Reduce – Matrix Multiplication 
 Approach 2 
 One step: 
 Map: 
For M, produce ((i, k), (M, j, mij)) for k = 1…Ncolumns_in_N
For N, produce ((i, k), (N, j, njk)) for i = 1…Nrows_in_M
 Reduce: 
For each key (i, k), multiply the values of M and N that share the same j, and sum the products.
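A small in-memory simulation of the one-step approach makes the replication visible: every element of M is copied once per column of N, and every element of N once per row of M. The function name `one_pass_multiply` and the dictionary-based matrices are assumptions made for this sketch.

```python
from collections import defaultdict


def one_pass_multiply(A, B, rows_in_A, cols_in_B):
    """Single-phase product of sparse matrices stored as {(row, col): value}."""
    grouped = defaultdict(lambda: ({}, {}))
    # Map: replicate each element to every output cell (i, k) it contributes to.
    for (i, j), a_ij in A.items():
        for k in range(cols_in_B):
            grouped[(i, k)][0][j] = a_ij
    for (j, k), b_jk in B.items():
        for i in range(rows_in_A):
            grouped[(i, k)][1][j] = b_jk
    # Reduce: pair the values that share j, multiply them, and sum.
    return {
        key: sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
        for key, (a_vals, b_vals) in grouped.items()
    }


A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(one_pass_multiply(A, B, rows_in_A=2, cols_in_B=2))
```

Each reducer here receives a full row of the first matrix and a full column of the second, which is why this variant needs a larger reducer size than the two-step one.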
Map Reduce – Matrix Multiplication 
 Approach 3 
 Two steps again.
Map Reduce – Matrix Multiplication
• Communication cost, one pass:
(4n^4) / q
• Communication cost, two pass:
(4n^3) / sqrt(q)
• Two pass is cheaper whenever q < n^2.
Similarity - Shingling 
 “abcdef” -> [“abc”, “bcd”, “cde”…] 
 Jaccard similarity - > N(intersection) / N(union) 
 Problem? 
 Size
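Shingling and Jaccard similarity are short enough to write out directly. The helper names `shingles` and `jaccard` are made up for this sketch, and the second string in the example is an arbitrary choice.

```python
def shingles(text, k=3):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


print(sorted(shingles("abcdef")))  # ['abc', 'bcd', 'cde', 'def']
print(jaccard(shingles("abcdef"), shingles("abcdxy")))  # 2 shared out of 6 distinct
```

The size problem shows up quickly: a document of n characters yields up to n - k + 1 shingles, and comparing sets pairwise across many documents is expensive.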
Similarity - Minhashing
h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a
Problem?
Similarity – Minhash Signatures 
Problem? Still can’t find pairs with greatest similarity efficiently
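A minhash signature can be sketched as the minimum of several hash functions over a set. Everything named here is an assumption for illustration: random linear hashes modulo a prime stand in for random permutations, `zlib.crc32` turns shingles into integers, and the input sets are assumed to be non-empty.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2^31 - 1, used as the modulus for the stand-in hashes


def make_hash_params(count, seed=0):
    """Draw (a, b) pairs for hashes of the form (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(count)]


def minhash_signature(shingle_set, params):
    """One minimum per hash function; the list of minima is the signature."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimated_similarity(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


params = make_hash_params(100)
s1 = minhash_signature({"abc", "bcd", "cde", "def"}, params)
s2 = minhash_signature({"abc", "bcd", "cdx", "dxy"}, params)
print(estimated_similarity(s1, s2))  # an estimate of the Jaccard similarity
```

Signatures shrink each set to a fixed length, but every pair of signatures still has to be compared, which is the remaining problem the slide points at.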
Similarity – LSH for Minhash Signatures
Clustering – Hierarchical
Clustering – K Means 
1. Pick k points (centroids) 
2. Assign points to clusters 
3. Shift centroids to “centre”. 
4. Repeat
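The four steps map onto a short loop. This is a bare sketch: the function name `kmeans`, the fixed iteration count, and seeding the centroids with the first k points are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers; returns (centroids, clusters)."""
    # Step 1 (assumption): use the first k points as initial centroids.
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):  # Step 4: repeat
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```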
Clustering – BFR
• 3 sets – Discard, Compressed and Retained
• First two have summaries: N, sum per dimension, sum of squares per dimension
• High-dimensional Euclidean space
• Mahalanobis distance
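The three summary statistics (N, per-dimension sum, per-dimension sum of squares) are enough to recover a centroid and a variance, and from those an axis-aligned Mahalanobis distance. The class name `ClusterSummary` is hypothetical, and the sketch assumes every dimension has non-zero variance.

```python
import math


class ClusterSummary:
    """Keeps only N, SUM and SUMSQ per dimension for a set of points."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for d, x in enumerate(point):
            self.sum[d] += x
            self.sumsq[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var = E[x^2] - (E[x])^2, computed per dimension.
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        # Axis-aligned form: scale each dimension by its standard deviation.
        # Assumption: no dimension has zero variance.
        terms = zip(point, self.centroid(), self.variance())
        return math.sqrt(sum((x - c) ** 2 / v for x, c, v in terms))


summary = ClusterSummary(dims=2)
summary.add((0, 0))
summary.add((2, 4))
print(summary.centroid(), summary.variance(), summary.mahalanobis((2, 4)))
```

Because the summaries are plain sums, two clusters can be merged by adding their N, SUM and SUMSQ values without revisiting the points.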
Clustering – CURE
• Sample. Run clustering on sample.
• Pick “representatives” from each cluster.
• Move representatives about 20% or so towards the centre.
• Merge clusters if their representatives are close.
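The step of moving representatives towards the centre is a simple interpolation. The function name `shrink_representatives` is hypothetical, and the 0.2 default mirrors the "about 20%" figure on the slide.

```python
def shrink_representatives(representatives, centroid, fraction=0.2):
    """Move each representative point a fixed fraction of the way to the centroid."""
    return [
        tuple(x + fraction * (c - x) for x, c in zip(point, centroid))
        for point in representatives
    ]


moved = shrink_representatives([(0, 0), (10, 0)], centroid=(5, 0))
print(moved)  # each point ends up 20% closer to (5, 0)
```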
Dimensionality Reduction
Dimensionality Reduction - SVD
Dimensionality Reduction - CUR 
 SVD results in U and V being dense, even when M is sparse. 
 O(n^3)
Dimensionality Reduction - CUR 
 Choose r. 
 Choose r rows and r columns of M. 
 Intersection is W. 
 Run SVD on W (much smaller than M). W = XΣY’ 
 Compute Σ+, the Moore-Penrose pseudoinverse of Σ. 
 Then, U = Y * (Σ+)^2 * X’
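Written out, the steps above amount to the following, where C holds the chosen columns and R the chosen rows of M. The symbol σ_i for the diagonal entries of Σ is notation introduced here; the rest follows the slide.

```latex
W = X \Sigma Y^{\mathsf{T}}, \qquad
(\Sigma^{+})_{ii} =
\begin{cases}
  1/\sigma_i & \text{if } \sigma_i \neq 0 \\
  0          & \text{if } \sigma_i = 0
\end{cases}, \qquad
U = Y \, (\Sigma^{+})^{2} \, X^{\mathsf{T}}, \qquad
M \approx C \, U \, R
```

Only the small r x r matrix W goes through an SVD; C and R are actual columns and rows of M, so they stay sparse when M is sparse.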
Dimensionality Reduction – CUR 
Choosing Rows and Columns 
 Random, but with bias for importance. 
 (Frobenius Norm)^2 
 Probability of picking a row or column: 
Sum of squares for row or column / Sum of squares of all elements
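The selection probability can be computed directly from the definition. The helper names `row_probabilities` and `pick_rows` and the sample matrix are assumptions for this sketch; columns work the same way on the transposed matrix.

```python
import random


def row_probabilities(matrix):
    """Probability of each row: its sum of squares over the squared Frobenius norm."""
    total = sum(x * x for row in matrix for x in row)
    return [sum(x * x for x in row) / total for row in matrix]


def pick_rows(matrix, r, seed=0):
    """Draw r row indices with replacement, biased towards 'important' rows."""
    rng = random.Random(seed)
    weights = row_probabilities(matrix)
    return rng.choices(range(len(matrix)), weights=weights, k=r)


matrix = [[3, 4], [0, 0], [0, 5]]
print(row_probabilities(matrix))  # [0.5, 0.0, 0.5]
print(pick_rows(matrix, r=4))
```

Because the draw is with replacement, the same index can come back more than once, and an all-zero row is never chosen.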
Dimensionality Reduction – CUR 
Choosing Rows and Columns 
 Same row / column may get picked (selection with replacement). 
 Reduces rank. 
 Can be combined: multiply vector by sqrt(k) if it appears k times. 
 Compute pseudo-inverse as before, but transpose the result.
Thanks 
 Mining of Massive Datasets 
Leskovec, Rajaraman, Ullman 
Coursera / Stanford Course 
Book: http://www.mmds.org/ [free]
