PAGE1
www.exensa.com
PRESENTER: GUILLAUME PITEL, 2016 JUNE 9
Approximate counting for NLP
Count-Min Tree Sketch
Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane
[Figure: example tree of shared counters with two readings highlighted (b=2 / c=110 and b=4 / c=01011001) and a conflict between counters 4 and 7]
PAGE2
A bit of context
Why do we need to count?
Data analysis platform: eXenGine.
It processes different kinds of data (mostly text).
We need to create relevant cross-features, and to do that we need to count the occurrences of all possible cross-features. For text data, a particular kind of cross-feature is known as an n-gram.
There are many different measures for deciding whether an n-gram is interesting. All of them require counting the occurrences of the cross-feature and of the features themselves (for bigrams, this means counting both the bigrams and the words in them).
Exact counting is easy and distributable, but very slow because of memory usage. Holding the whole data structure containing the counts in memory is impossible, so one has to resort to huge map/reduce jobs with joins.
PAGE3
A bit of context
What kind of data are we talking about?
Google N-grams
  tokens                  1,024 billion
  sentences               95 billion
  1-grams (count > 200)   14 million
  2-grams (count > 40)    314 million
  3-grams                 977 million
  4-grams                 1.3 billion
  5-grams                 1.2 billion
PAGE4
A bit of context
What kind of data are we talking about?
Zipfian distribution
[Le Quan et al. 2003]
PAGE5
A bit of context
What kind of measures are we talking about?
PMI (pointwise mutual information), TF-IDF (term frequency, inverse document frequency), LLR (log-likelihood ratio)
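To make the dependence on counts concrete, here is a minimal sketch of PMI computed from raw counts. It uses the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))); the function name, its parameters and the toy counts are assumptions made for this example and do not come from the slides.

```python
import math

def pmi(count_xy, count_x, count_y, total_bigrams, total_words):
    """Pointwise mutual information (base 2) of a bigram, from raw counts."""
    p_xy = count_xy / total_bigrams
    p_x = count_x / total_words
    p_y = count_y / total_words
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts, chosen as powers of two to keep the arithmetic easy to follow.
score = pmi(count_xy=32, count_x=64, count_y=128, total_bigrams=1024, total_words=1024)
print(score)  # 2.0
```

Only the logarithm of each count enters the score, so the order of magnitude of a count matters more than its exact value.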
PAGE6
A bit of context
Summary / Goals
o Many counts: we need to store a large number of counts.
o Logarithms in measures: we care about the order of magnitude.
o Fast and memory controlled: we don't want a distributed memory for the counts.
o Zipfian counts: many very small counts that will be filtered out later.
PAGE7
A bit of context
Summary / Goals
We can use probabilistic structures.
PAGE8
Count-Min Sketch
A probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
PAGE9
Count-Min Sketch
A probabilistic data structure to store counts
Conservative Update: improve the CMS by incrementing only the counters that currently hold the minimum value.
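The following is a minimal sketch of a Count-Min Sketch with conservative update, written for illustration only. The class name, the default width and depth, and the use of salted BLAKE2b as the hash family are all assumptions; none of them comes from the slides or from the authors' implementation.

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch with optional conservative update."""

    def __init__(self, width=1024, depth=4, conservative=True):
        self.width = width
        self.depth = depth
        self.conservative = conservative
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One cell per row; the row index salts the hash (an arbitrary choice).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def query(self, key):
        # The estimate is the minimum over the cells the key maps to.
        return min(self.rows[r][c] for r, c in self._cells(key))

    def add(self, key):
        cells = list(self._cells(key))
        if self.conservative:
            # Raise only the cells that are below the new minimum.
            target = min(self.rows[r][c] for r, c in cells) + 1
            for r, c in cells:
                self.rows[r][c] = max(self.rows[r][c], target)
        else:
            for r, c in cells:
                self.rows[r][c] += 1

sketch = CountMinSketch()
for word in ["the", "cat", "the", "mat", "the"]:
    sketch.add(word)
print(sketch.query("the"))  # never less than the true count of 3
```

In both modes the estimate can only overcount. Conservative update reduces the overcount because colliding keys no longer inflate cells that are already above the minimum.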
PAGE10
Count-Min Log Sketch
A probabilistic data structure to store logarithmic counts
[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied to a Count-Min Sketch.
Instead of using regular 32-bit counters, we use 8- or 16-bit "Morris" counters, which count logarithmically.
Since counts are used in logs anyway, the error on PMI, TF-IDF, etc. is almost the same, but we can use more counters.
However, a count of 1 still uses the same amount of memory as a count of 10000. Also, at some point, the error stops improving with space (there is an inherent residual error).
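As a rough illustration of logarithmic counting, here is a generic Morris-style counter. This is the textbook scheme (increment the stored state with probability base^-state), not the authors' Count-Min Log code; the class name, the base of 1.08 and the fixed seed are assumptions chosen for the example.

```python
import random

class MorrisCounter:
    """Approximate counter that stores a small state instead of the count."""

    def __init__(self, base=1.08, bits=8, rng=None):
        self.base = base
        self.max_state = (1 << bits) - 1   # largest state that fits in `bits` bits
        self.state = 0
        self.rng = rng or random.Random(0)  # fixed seed so runs are repeatable

    def increment(self):
        # The state grows with probability base**-state, so it tracks log(count).
        if self.state < self.max_state and self.rng.random() < self.base ** -self.state:
            self.state += 1

    def estimate(self):
        # Standard estimate of the count for the current state.
        return (self.base ** self.state - 1) / (self.base - 1)

counter = MorrisCounter()
for _ in range(10_000):
    counter.increment()
print(counter.state, counter.estimate())
```

The state fits in one byte while the estimate can reach very large counts, at the price of a relative error that does not vanish when more counters are added.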
PAGE11
Count-Min Tree Sketch
A Count-Min Sketch with shared counters
Idea: use a hierarchical storage where the most significant bits are shared between counters.
Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
PAGE12
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8 counters structure
o A tree is made of three kinds of storage:
  o Counting bits
  o Barrier bits
  o Spire (not required except for performance)
o Several layers alternating counting and barrier bits.
o Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.
[Figure: the 8-counter tree, with labels for the barrier bits, counting bits, spire and base layer]
PAGE13
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8 counters structure
o 8 counters in 30 bits + spire
o Without a spire, n bits can count up to $3 \times 2^{1+\log_2 \frac{n}{4}}$
o Many small shared counters with spires are more efficient than a large shared counter
[Figure: the 8-counter tree, with labels for the barrier bits, counting bits, spire and base layer]
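The figures on this slide can be checked with a few lines of arithmetic. The sketch below takes the layer description <[(8,8),(4,4),(2,2),(1,1)],4> as given and evaluates the bound for n = 30. The second computation relies on an assumption: that a counting bit at layer k is worth 2^k and a barrier bit 2^(k+1), which is how the per-level values on the incrementing slides (1, 2, 2, 4, 4, 8) are read here.

```python
import math

# Layer description from the slide: (counting bits, barrier bits) per layer.
layers = [(8, 8), (4, 4), (2, 2), (1, 1)]
spire_bits = 4
num_counters = layers[0][0]

# Bits used by the tree itself, without the spire.
tree_bits = sum(counting + barrier for counting, barrier in layers)

# Bound from the slide: 3 * 2^(1 + log2(n / 4)), evaluated at n = tree_bits.
bound = 3 * 2 ** (1 + math.log2(tree_bits / 4))

# Assumed bit values: 2**k for a counting bit, 2**(k + 1) for a barrier bit at layer k.
max_from_bits = sum(2 ** k + 2 ** (k + 1) for k in range(len(layers)))

print(tree_bits, (tree_bits + spire_bits) / num_counters, round(bound), max_from_bits)
# 30 4.25 45 45
```

Under that assumption the two ways of computing the maximum agree, and the structure costs 4.25 bits per counter once the spire is included.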
PAGE14
Tree Shared Counters
Reading values
o A counter stops at the first ZERO barrier
o When two barrier paths meet, there is a conflict
o Barrier length (b) is evaluated in unary
o Counter bits (c) are evaluated in the usual binary way
[Figure: the 8-counter tree with two readings highlighted, b=2 / c=110 and b=4 / c=01011001, and a conflict between counters 4 and 7]
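A possible way to turn a (b, c) reading into a number is sketched below. The rule value = c + 2 × (2^b − 1) is not stated on the slides; it is an assumption inferred from the incrementing example for counter 5 and from the per-level bit values shown there. The function names are hypothetical.

```python
def barrier_length(barrier_bits):
    """Unary reading: number of set barrier bits before the first zero, base layer first."""
    length = 0
    for bit in barrier_bits:
        if bit == 0:
            break
        length += 1
    return length

def read_value(b, c):
    """Counter value for barrier length b and counting bits c (assumed rule)."""
    return c + 2 * (2 ** b - 1)

print(barrier_length([1, 1, 0, 1]))   # 2
print(read_value(2, 0b110))           # 12
print(read_value(4, 0b01011001))      # 119
```

The two calls to `read_value` use the readings shown in the figure; the results hold only if the assumed rule matches the authors' encoding.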
PAGE15
Tree Shared Counters
Incrementing (counter 5)
[Figure: three successive states of the tree as counter 5 takes the values 0, 1 and 2. At 0 every bit is zero; at 1 the base-layer counting bit of counter 5 is set; at 2 that counting bit is cleared and the base-layer barrier bit of counter 5 is set.]
PAGE16
Tree Shared Counters
Incrementing (counter 5)
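[Figure: three successive states of the tree as counter 5 takes the values 3, 4 and 5. At 3 both base-layer bits of counter 5 (counting and barrier) are set; at 4 the base-layer counting bit is cleared, the barrier bit stays set and the shared counting bit in the second layer is set; at 5 the base-layer counting bit is set again.]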
PAGE17
Tree Shared Counters
Incrementing (counter 5)
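[Figure: state of the tree when counter 5 holds the value 6. The base-layer barrier bit and the second-layer barrier bit are set, and all counting bits are zero.]
A bit at that level is worth: 1, 2, 2, 4, 4, 8 (from the base layer upward, alternating counting and barrier bits)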
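The sequence of states for counter 5 can be reproduced with a small encoder that splits a value into a barrier length b and counting bits c. The rule used here, value = c + 2 × (2^b − 1) with c written on b + 1 bits, is an assumption inferred from the figures on these slides, and `encode` is a hypothetical helper. It models a single counter in isolation and ignores sharing and conflicts.

```python
def encode(value, num_layers=4):
    """Split a value into (b, c): barrier length and counting bits (assumed rule)."""
    # Largest b such that 2 * (2**b - 1) <= value, capped at the number of layers.
    b = min((value + 2).bit_length() - 2, num_layers)
    c = value - 2 * (2 ** b - 1)
    return b, c

for value in range(7):
    b, c = encode(value)
    # c is shown on b + 1 bits, one per layer the counter reaches.
    print(value, b, format(c, "b").zfill(b + 1))
```

For the values 0 to 6 this prints barrier lengths 0, 0, 1, 1, 1, 1, 2 with counting bits 0, 1, 00, 01, 10, 11, 000, which matches the bits set in the figures.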
PAGE18
Count-Min Tree Sketches
Experiments
Results!
• 140M tokens from English Wikipedia*
• 14.7M words (unigrams + bigrams)
• Reference counts stored in an UnorderedMap: 815 MiB
Perfect storage size: suppose we have a perfect hash function and store the counts using 32-bit counters. For 14.7M words, this amounts to 59 MiB.
Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters performs on par with a native UnorderedMap.
We use 3-layer sketches (a good performance/precision tradeoff).
* We preferred to test our counters with a large number of parameters rather than with a large corpus, so we limited ourselves to 5% of Wikipedia.
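The perfect storage size is a simple product, sketched below with the rounded figures from the slide. The result lands close to the slide's figure of 59; the exact value depends on the unrounded word count and on whether binary or decimal units are meant. Variable names are made up for the example.

```python
words = 14_700_000          # unigrams + bigrams, rounded figure from the slide
bytes_per_counter = 4       # one 32-bit counter per word

total_bytes = words * bytes_per_counter
print(total_bytes / 1e6)        # size in millions of bytes
print(total_bytes / 2 ** 20)    # size in MiB

# Memory per entry in the reference map, from the 815 MiB reported on the slide.
reference_mib = 815
print(reference_mib * 2 ** 20 / words)
```

The last line shows how many bytes the exact map spends per stored word, which is the overhead that the sketches aim to avoid.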
PAGE19
Count-Min Tree Sketches
Average Relative Error
Results !
PAGE20
Count-Min Tree Sketches
RMSE
Results !
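The slides name the two error measures without defining them. A minimal sketch using the usual definitions is given below: average relative error as the mean of |estimate − true| / true, and RMSE as the square root of the mean squared difference. These definitions, the function names and the toy counts are assumptions, not taken from the authors' evaluation code.

```python
import math

def average_relative_error(true_counts, estimates):
    """Mean of |estimate - true| / true over all keys."""
    return sum(abs(estimates[k] - v) / v for k, v in true_counts.items()) / len(true_counts)

def rmse(true_counts, estimates):
    """Square root of the mean squared difference between estimates and true counts."""
    return math.sqrt(
        sum((estimates[k] - v) ** 2 for k, v in true_counts.items()) / len(true_counts)
    )

# Hypothetical counts: the sketch overestimates the two rare words by one.
true_counts = {"the": 100, "cat": 4, "mat": 1}
estimates = {"the": 100, "cat": 5, "mat": 2}
print(average_relative_error(true_counts, estimates))
print(rmse(true_counts, estimates))
```

With Zipfian data the relative error is dominated by the many rare items, while RMSE gives more weight to large absolute errors.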
PAGE21
Count-Min Tree Sketches
RMSE on PMI
Results !
PAGE22
Count-Min Tree Sketch
Question: are CMTS really useful in real life?
1 – CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words / bigrams?
2 – CMTS are better on average, but what happens quantile by quantile?
PAGE23
Count-Min Tree Sketches
PMI Error per quantile
(sketches at 50% of perfect size, evaluation limited to f > $10^{-7}$)
Results!
PAGE24
Count-Min Tree Sketches
Relative Error per log2-quantile
(sketches at 50% of perfect size, evaluation limited to f > $10^{-7}$)
Results!
PAGE25
Conclusion
Where are we?
CMTS significantly outperforms other methods for storing and updating Zipfian counts, and does so very efficiently.
Because most of the time spent in sketch accesses is due to memory access, its speed is on par with other methods.
• Main drawback: at very high (and impractical anyway) pressures (less than 10% of the perfect storage size), the error skyrockets.
• Other drawback: implementation is not straightforward. We have devised at least 4 different ways to increment the counters.
Merging (and thus distributing) is easy once you can read and set a counter.
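The merging remark can be illustrated with a generic cell-by-cell merge that needs nothing more than a read and a set operation. `ArrayCounters` is a hypothetical stand-in backed by a plain list, not a CMTS; the sketch also assumes both structures have the same size and were filled with the same hash functions.

```python
class ArrayCounters:
    """Stand-in for a sketch: any structure exposing read(i) and set(i, value)."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, i):
        return self.cells[i]

    def set(self, i, value):
        self.cells[i] = value

def merge_into(target, other, size):
    """Add every counter of `other` into `target`, one cell at a time."""
    for i in range(size):
        target.set(i, target.read(i) + other.read(i))

left = ArrayCounters(4)
right = ArrayCounters(4)
left.set(1, 3)
right.set(1, 2)
right.set(3, 7)
merge_into(left, right, 4)
print(left.cells)  # [0, 5, 0, 7]
```

Each worker can then fill its own sketch on a share of the corpus, and the partial sketches are combined at the end.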
PAGE26
Conclusion
Where are we going?
Dynamic: we are working on a CMTS version that can grow automatically (more layers added below).
Pressure control: when we detect that pressure becomes too high, we can divide and subsample to stop collisions from cascading.
An open-source Python package is on its way.

Más contenido relacionado

Destacado

Evolution du look & feel du web 0.0 au 2.0 - Printemps.com
Evolution du look & feel du web 0.0 au 2.0 - Printemps.comEvolution du look & feel du web 0.0 au 2.0 - Printemps.com
Evolution du look & feel du web 0.0 au 2.0 - Printemps.combenoit.rigaut
 
2016 06-30-deep-learning-archi
2016 06-30-deep-learning-archi2016 06-30-deep-learning-archi
2016 06-30-deep-learning-archiDaisuke Nagao
 
Modern Datacenter : de la théorie à la pratique
Modern Datacenter : de la théorie à la pratique Modern Datacenter : de la théorie à la pratique
Modern Datacenter : de la théorie à la pratique Microsoft Technet France
 
Les cabinets de recrutement spécialisés dans les métiers du numérique
Les cabinets de recrutement spécialisés dans les métiers du numériqueLes cabinets de recrutement spécialisés dans les métiers du numérique
Les cabinets de recrutement spécialisés dans les métiers du numériqueFrenchWeb.fr
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With RJahnab Kumar Deka
 
Web1, web2 and web 3
Web1, web2 and web 3Web1, web2 and web 3
Web1, web2 and web 3mercedeh37
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)fridolin.wild
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...craftworkz
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Web Development on Web Project Presentation
Web Development on Web Project PresentationWeb Development on Web Project Presentation
Web Development on Web Project PresentationMilind Gokhale
 
DocDoku - Mobile Monday Toulouse 1ère : la NFC
DocDoku - Mobile Monday Toulouse 1ère : la NFCDocDoku - Mobile Monday Toulouse 1ère : la NFC
DocDoku - Mobile Monday Toulouse 1ère : la NFCDocDoku
 
#MDSGAM : Etude Digital Trends Morocco 2015
#MDSGAM : Etude Digital Trends Morocco 2015#MDSGAM : Etude Digital Trends Morocco 2015
#MDSGAM : Etude Digital Trends Morocco 2015Othmane Ghailane
 
Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0Ghazal Hina
 
Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0tokey_sport
 

Destacado (18)

Evolution du look & feel du web 0.0 au 2.0 - Printemps.com
Evolution du look & feel du web 0.0 au 2.0 - Printemps.comEvolution du look & feel du web 0.0 au 2.0 - Printemps.com
Evolution du look & feel du web 0.0 au 2.0 - Printemps.com
 
2016 06-30-deep-learning-archi
2016 06-30-deep-learning-archi2016 06-30-deep-learning-archi
2016 06-30-deep-learning-archi
 
Modern Datacenter : de la théorie à la pratique
Modern Datacenter : de la théorie à la pratique Modern Datacenter : de la théorie à la pratique
Modern Datacenter : de la théorie à la pratique
 
Les cabinets de recrutement spécialisés dans les métiers du numérique
Les cabinets de recrutement spécialisés dans les métiers du numériqueLes cabinets de recrutement spécialisés dans les métiers du numérique
Les cabinets de recrutement spécialisés dans les métiers du numérique
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
Web1, web2 and web 3
Web1, web2 and web 3Web1, web2 and web 3
Web1, web2 and web 3
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
The Evolution of Web 3.0
The Evolution of Web 3.0The Evolution of Web 3.0
The Evolution of Web 3.0
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...
Craftworkz at InterConnect 2017 - Creating a Highly Scalable Chatbot in a Mic...
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Web Development on Web Project Presentation
Web Development on Web Project PresentationWeb Development on Web Project Presentation
Web Development on Web Project Presentation
 
DocDoku - Mobile Monday Toulouse 1ère : la NFC
DocDoku - Mobile Monday Toulouse 1ère : la NFCDocDoku - Mobile Monday Toulouse 1ère : la NFC
DocDoku - Mobile Monday Toulouse 1ère : la NFC
 
#MDSGAM : Etude Digital Trends Morocco 2015
#MDSGAM : Etude Digital Trends Morocco 2015#MDSGAM : Etude Digital Trends Morocco 2015
#MDSGAM : Etude Digital Trends Morocco 2015
 
Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0
 
Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0
 

Similar a Count-Min Tree Sketch : Approximate counting for NLP tasks

Computer data representation (integers, floating-point numbers, text, images,...
Computer data representation (integers, floating-point numbers, text, images,...Computer data representation (integers, floating-point numbers, text, images,...
Computer data representation (integers, floating-point numbers, text, images,...ArtemKovera
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaRedis Labs
 
CST-20363-Session 1-In the Bitginning
CST-20363-Session 1-In the BitginningCST-20363-Session 1-In the Bitginning
CST-20363-Session 1-In the Bitginningoudesign
 
Lesson 26. Optimization of 64-bit programs
Lesson 26. Optimization of 64-bit programsLesson 26. Optimization of 64-bit programs
Lesson 26. Optimization of 64-bit programsPVS-Studio
 
Counting (Notes)
Counting (Notes)Counting (Notes)
Counting (Notes)roshmat
 
CSF Tips and Tricks 8MS Webinar
CSF Tips and Tricks 8MS WebinarCSF Tips and Tricks 8MS Webinar
CSF Tips and Tricks 8MS WebinarAerialink
 
Manoch1raw 160512091436
Manoch1raw 160512091436Manoch1raw 160512091436
Manoch1raw 160512091436marangburu42
 
UNIT 4 -Data Representation.pptxfghfghhggh
UNIT 4 -Data Representation.pptxfghfghhgghUNIT 4 -Data Representation.pptxfghfghhggh
UNIT 4 -Data Representation.pptxfghfghhgghKwadjoOwusuAnsahQuar
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditMichael BENESTY
 
Development of a static code analyzer for detecting errors of porting program...
Development of a static code analyzer for detecting errors of porting program...Development of a static code analyzer for detecting errors of porting program...
Development of a static code analyzer for detecting errors of porting program...PVS-Studio
 
Decimal arithmetic in Processors
Decimal arithmetic in ProcessorsDecimal arithmetic in Processors
Decimal arithmetic in ProcessorsPeeyush Pashine
 
Lesson 13. Pattern 5. Address arithmetic
Lesson 13. Pattern 5. Address arithmeticLesson 13. Pattern 5. Address arithmetic
Lesson 13. Pattern 5. Address arithmeticPVS-Studio
 
Lecture 06 - CS-5040 - modern database systems
Lecture 06  - CS-5040 - modern database systemsLecture 06  - CS-5040 - modern database systems
Lecture 06 - CS-5040 - modern database systemsMichael Mathioudakis
 
Arithmetic for Computers.ppt
Arithmetic for Computers.pptArithmetic for Computers.ppt
Arithmetic for Computers.pptJEEVANANTHAMG6
 
Data representation in a computer
Data representation in a computerData representation in a computer
Data representation in a computerGirmachew Tilahun
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan:  the good, the bad and the ugly param...CBO choice between Index and Full Scan:  the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...Franck Pachot
 
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...HostedbyConfluent
 

Similar a Count-Min Tree Sketch : Approximate counting for NLP tasks (20)

Computer data representation (integers, floating-point numbers, text, images,...
Computer data representation (integers, floating-point numbers, text, images,...Computer data representation (integers, floating-point numbers, text, images,...
Computer data representation (integers, floating-point numbers, text, images,...
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif Nada
 
CST-20363-Session 1-In the Bitginning
CST-20363-Session 1-In the BitginningCST-20363-Session 1-In the Bitginning
CST-20363-Session 1-In the Bitginning
 
Lesson 26. Optimization of 64-bit programs
Lesson 26. Optimization of 64-bit programsLesson 26. Optimization of 64-bit programs
Lesson 26. Optimization of 64-bit programs
 
Counting (Notes)
Counting (Notes)Counting (Notes)
Counting (Notes)
 
CSF Tips and Tricks 8MS Webinar
CSF Tips and Tricks 8MS WebinarCSF Tips and Tricks 8MS Webinar
CSF Tips and Tricks 8MS Webinar
 
Manoch1raw 160512091436
Manoch1raw 160512091436Manoch1raw 160512091436
Manoch1raw 160512091436
 
UNIT 4 -Data Representation.pptxfghfghhggh
UNIT 4 -Data Representation.pptxfghfghhgghUNIT 4 -Data Representation.pptxfghfghhggh
UNIT 4 -Data Representation.pptxfghfghhggh
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax audit
 
C++
C++C++
C++
 
Development of a static code analyzer for detecting errors of porting program...
Development of a static code analyzer for detecting errors of porting program...Development of a static code analyzer for detecting errors of porting program...
Development of a static code analyzer for detecting errors of porting program...
 
Decimal arithmetic in Processors
Decimal arithmetic in ProcessorsDecimal arithmetic in Processors
Decimal arithmetic in Processors
 
PACE-IT: Basic Network Concepts (part 3)
PACE-IT: Basic Network Concepts (part 3)PACE-IT: Basic Network Concepts (part 3)
PACE-IT: Basic Network Concepts (part 3)
 
Number system
Number system Number system
Number system
 
Lesson 13. Pattern 5. Address arithmetic
Lesson 13. Pattern 5. Address arithmeticLesson 13. Pattern 5. Address arithmetic
Lesson 13. Pattern 5. Address arithmetic
 
Lecture 06 - CS-5040 - modern database systems
Lecture 06  - CS-5040 - modern database systemsLecture 06  - CS-5040 - modern database systems
Lecture 06 - CS-5040 - modern database systems
 
Arithmetic for Computers.ppt
Arithmetic for Computers.pptArithmetic for Computers.ppt
Arithmetic for Computers.ppt
 
Data representation in a computer
Data representation in a computerData representation in a computer
Data representation in a computer
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan:  the good, the bad and the ugly param...CBO choice between Index and Full Scan:  the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
 
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
 

Último

Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 

Último (20)

Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 

Count-Min Tree Sketch : Approximate counting for NLP tasks

  • 1. PAGE1 www.exensa.com www.exensa.com PRESENTER: GUILLAUME PITEL 2016 JUNE 9Approximate counting for NLP Count-Min Tree Sketch Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0101 b=2/c=110 b=4/c=01011001 conflict between counters 4 and 7
  • 2. PAGE2 www.exensa.com A bit of context Why do we need to count ? Data analysis platform : eXenGine. Processes different kind of data (mostly text). We need to create relevant cross-features : to do that we need to count occurrences of all possible cross-features. In the case of text data, a particular kind of cross-feature is known as n-grams. There are many different measures to decide if a n-gram is interesting. All require to count the occurrences of the cross-feature and the features themselves (i.e. count bigrams and words in bigrams) Counting exactly is easy, distributable, and very slow because of memory usage. Also, having the whole data structure containing the counts in memory is impossible, so one has to resort to using huge map/reduce with joins to do the job.
  • 3. A bit of context: what kind of data are we talking about? Google N-grams: tokens: 1024 billion; sentences: 95 billion; 1-grams (count > 200): 14 million; 2-grams (count > 40): 314 million; 3-grams: 977 million; 4-grams: 1.3 billion; 5-grams: 1.2 billion.
  • 4. A bit of context: what kind of data are we talking about? Zipfian distribution [Le Quan & al. 2003]
  • 5. A bit of context: what kind of measures are we talking about? PMI, TF-IDF, LLR
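To make concrete why these measures need both the bigram counts and the word counts, here is a minimal sketch of PMI using its standard textbook definition. The function name and parameters are hypothetical, not taken from the slides, and the normalisation (bigram count over total bigrams, word count over total tokens) is one common convention among several.

```python
import math


def pmi(count_xy, count_x, count_y, total_pairs, total_tokens):
    """Pointwise mutual information (in bits) of a bigram (x, y).

    Assumption: probabilities are estimated as plain relative frequencies.
    """
    p_xy = count_xy / total_pairs   # probability of the bigram
    p_x = count_x / total_tokens    # probability of the first word
    p_y = count_y / total_tokens    # probability of the second word
    return math.log2(p_xy / (p_x * p_y))


# A bigram seen 8 times more often than independence would predict.
score = pmi(count_xy=8, count_x=10, count_y=10, total_pairs=100, total_tokens=100)
```

Because the counts only enter through a logarithm, an error on the order of magnitude of a count matters much more than a small absolute error.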
  • 6. A bit of context: summary / goals. Many counts: we need to store a large number of counts. Logarithms in measures: we care about the order of magnitude. Fast and memory controlled: we don’t want a distributed memory for the counts. Zipfian counts: many very small counts that will be filtered out later.
  • 7. A bit of context: summary / goals. Many counts: we need to store a large number of counts. Logarithms in measures: we care about the order of magnitude. Fast and memory controlled: we don’t want a distributed memory for the counts. Zipfian counts: many very small counts that will be filtered out later. Conclusion: we can use probabilistic structures.
  • 8. Count-Min Sketch: a probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
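As a reminder of how the structure works, here is a minimal sketch of a plain Count-Min Sketch. It is not the authors' implementation: the class name, the use of salted BLAKE2b as the per-row hash, and the sizes are assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Plain Count-Min Sketch: `depth` rows of `width` counters, one hash per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: salting BLAKE2b with the row number stands in for
        # a family of independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Every row gets the increment, so collisions can only inflate a cell.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def query(self, item):
        # The least inflated cell is the best available estimate.
        return min(self.rows[row][self._index(item, row)] for row in range(self.depth))


cms = CountMinSketch(width=64, depth=3)
for word in ["the", "the", "cat", "the"]:
    cms.update(word)
```

The estimate returned by `query` never falls below the true count; it can only exceed it when every row of an item collides with other items.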
  • 9. Count-Min Sketch: a probabilistic data structure to store counts. Conservative Update: improves the CMS by incrementing only the cells that currently hold the minimum value.
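The conservative update rule can be sketched as follows, assuming the sketch is stored as a list of rows and that `cell_indices` (a hypothetical helper, not from the slides) gives one cell per row. Only cells below the new estimate are raised, so cells already inflated by collisions are left alone.

```python
import hashlib


def cell_indices(item, width, depth):
    """One cell index per row (assumption: salted BLAKE2b as the hash family)."""
    indices = []
    for row in range(depth):
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        indices.append(int.from_bytes(digest, "little") % width)
    return indices


def conservative_update(rows, item, count=1):
    indices = cell_indices(item, len(rows[0]), len(rows))
    # New estimate for the item: current minimum plus the increment.
    target = min(rows[r][i] for r, i in enumerate(indices)) + count
    for r, i in enumerate(indices):
        # Raise only the cells that are below the new estimate.
        if rows[r][i] < target:
            rows[r][i] = target


def estimate(rows, item):
    indices = cell_indices(item, len(rows[0]), len(rows))
    return min(rows[r][i] for r, i in enumerate(indices))


# Two rows of width 1: the first cell is already inflated to 5.
rows = [[5], [0]]
conservative_update(rows, "x")
```

In the small example, the inflated cell stays at 5 while the minimum cell moves from 0 to 1, whereas a plain update would have pushed the first cell to 6.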
  • 10. Count-Min Log Sketch: a probabilistic data structure to store logarithmic counts [Pitel & Fouquier, 2015]. It applies the same idea as [Talbot, 2009] to a Count-Min Sketch. Instead of regular 32-bit counters, we use 8- or 16-bit “Morris” counters, which count logarithmically. Since counts are used inside logarithms anyway, the error on PMI/TF-IDF/… is almost the same, but we can use more counters. However, a count of 1 still uses the same amount of memory as a count of 10000. Also, at some point, the error stops improving with space (there is an inherent residual error).
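A Morris-style logarithmic counter can be sketched as below. This is a generic version, not the Count-Min Log code: the class name, the base of 1.08 and the 8-bit width are assumptions chosen for illustration. The counter stores a small state and increments it with a probability that shrinks as the state grows.

```python
import random


class LogCounter:
    """Morris-style counter: the stored state grows like the log of the count."""

    def __init__(self, base=1.08, bits=8, rng=None):
        self.base = base                    # hypothetical base, trades range for precision
        self.max_state = (1 << bits) - 1    # largest state that fits in `bits` bits
        self.state = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Increase the state with probability base ** -state (always at state 0).
        if self.state < self.max_state and self.rng.random() < self.base ** -self.state:
            self.state += 1

    def estimate(self):
        # With this increment rule the estimate is unbiased for the true count.
        return (self.base ** self.state - 1) / (self.base - 1)


counter = LogCounter(rng=random.Random(0))
counter.increment()
first_estimate = counter.estimate()
for _ in range(10_000):
    counter.increment()
```

The residual error mentioned on the slide is visible here: even with no collisions at all, the estimate after many increments is a random variable around the true count.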
  • 11. Count-Min Tree Sketch: a Count-Min Sketch with shared counters. Idea: use a hierarchical storage where the most significant bits are shared between counters. Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
  • 12. Tree Shared Counters: sharing most significant bits. Or: how can we store counts with an average approaching 4 bits per counter? An 8-counter structure. A tree is made of three kinds of storage: counting bits, barrier bits, and a spire (not required except for performance). Several layers alternate counting and barrier bits, starting from the base layer. Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter. [Figure: 8-counter tree showing the barrier bits, the counting bits, the spire and the base layer]
  • 13. Tree Shared Counters: sharing most significant bits. Or: how can we store counts with an average approaching 4 bits per counter? An 8-counter structure holds 8 counters in 30 bits + spire. Without a spire, n bits can count up to 3 × 2^(1 + log2(n/4)). Many small shared counters with spires are more efficient than a large shared counter. [Figure: 8-counter tree showing the barrier bits, the counting bits, the spire and the base layer]
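The "8 counters in 30 bits + spire" figure can be checked with a few lines of arithmetic. The sketch below assumes that each pair in the `<[(8,8),(4,4),(2,2),(1,1)],4>` notation gives the number of counting bits and barrier bits of one layer, and that the trailing 4 is the spire size in bits; the function name is hypothetical.

```python
def tree_counter_bits(layers, spire_bits):
    """Return (bits without spire, bits with spire) for one tree shared counter."""
    shared = sum(counting + barrier for counting, barrier in layers)
    return shared, shared + spire_bits


layers = [(8, 8), (4, 4), (2, 2), (1, 1)]   # (counting bits, barrier bits) per layer
shared, total = tree_counter_bits(layers, spire_bits=4)
base_counters = layers[0][0]                # 8 counters in the base layer
bits_per_counter = total / base_counters
```

Under these assumptions the tree uses 30 bits without the spire (3.75 bits per counter) and 34 bits with it (4.25 bits per counter), which brackets the 4 bits per counter quoted on the slide.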
  • 14. Tree Shared Counters: reading values. A counter stops at the first ZERO barrier. When two barrier paths meet, there is a conflict. The barrier length (b) is evaluated in unary. The counter bits (c) are evaluated in a more classical way. [Figure: 8-counter tree with two example readings, b=2/c=110 and b=4/c=01011001, and a conflict between counters 4 and 7]
  • 15. Tree Shared Counters: incrementing (counter 5). [Figure: bit states of the tree, starting from all zeros, when counter 5 holds the values 0, 1 and 2]
  • 16. Tree Shared Counters: incrementing (counter 5). [Figure: bit states of the tree when counter 5 holds the values 3, 4 and 5]
  • 17. Tree Shared Counters: incrementing (counter 5). [Figure: bit state of the tree when counter 5 holds the value 6. Going up from the base layer, a bit at each level is worth 1, 2, 2, 4, 4, 8, …]
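The reading rule and the bit weights from the slides can be combined into a small decoding sketch. It rests on assumptions that the transcript does not spell out: the weights 1, 2, 2, 4, 4, 8 are taken to alternate counting bit, barrier bit, counting bit, and so on from the base layer up; a counter with barrier length b is taken to own b + 1 counting bits; and the string c is taken to be written most significant bit first. The function name is hypothetical and the spire is left out.

```python
def decode(barrier_length, counting_bits):
    """Value of a counter given its barrier length b and its counting bits c.

    Assumptions: barrier bits are worth 2, 4, 8, ... from the base layer up,
    and `counting_bits` is a binary string, most significant bit first.
    """
    barrier_value = sum(2 ** (level + 1) for level in range(barrier_length))
    counting_value = int(counting_bits, 2)
    return barrier_value + counting_value


# States (b, c) that reproduce the increment sequence 0..6 under these assumptions.
sequence = [decode(0, "0"), decode(0, "1"), decode(1, "00"), decode(1, "01"),
            decode(1, "10"), decode(1, "11"), decode(2, "000")]
example = decode(2, "110")   # the b=2/c=110 reading from the slides
```

With these weights, `sequence` walks through the values 0 to 6 and the b=2/c=110 reading comes out as 12. The barrier bits act as a unary prefix that both extends the counter upward and adds a fixed offset of 2 × (2^b − 1).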
  • 18. Count-Min Tree Sketches: experiments and results. • 140M tokens from English Wikipedia* • 14.7M words (unigrams + bigrams) • Reference counts stored in an UnorderedMap → 815MiB. Perfect storage size: suppose we have a perfect hash function and store the counts using 32-bit counters. For 14.7M words, it amounts to 59MiB. Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters is equivalent to native UnorderedMap performance. We use 3-layer sketches (good performance/precision tradeoff). * We preferred to test our counters with a large number of parameters rather than with a large corpus, so we limited the corpus to 5% of Wikipedia.
  • 22. Count-Min Tree Sketch. Question: are CMTS really useful in real life? 1. CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words / bigrams? 2. CMTS are better on average, but what happens quantile by quantile?
  • 23. Count-Min Tree Sketches: results. PMI error per quantile (sketches at 50% of the perfect size, evaluation limited to f > 10^-7).
  • 24. Count-Min Tree Sketches: results. Relative error per log2-quantile (sketches at 50% of the perfect size, evaluation limited to f > 10^-7).
  • 25. Conclusion: where are we? CMTS significantly outperforms other methods for storing and updating Zipfian counts efficiently. Because most of the time spent in sketch accesses is due to memory access, its speed is on par with other methods. • Main drawback: at very high (and impractical anyway) pressures (less than 10% of the perfect storage size), the error skyrockets. • Other drawback: the implementation is not straightforward. We have devised at least 4 different ways to increment the counters. Merging (and thus distributing) is easy once you can read and set a counter.
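The remark about merging can be illustrated with a generic sketch: if each counter can be read and set, two sketches are merged by summing them position by position. This is an illustration under assumptions, not the authors' procedure: it supposes that both sketches share the same layout and hash functions, and the function and parameter names are hypothetical. Plain lists stand in for the sketches.

```python
def merge_counters(read_a, read_b, set_merged, n_counters):
    """Merge two sketches counter by counter using only read and set accessors."""
    for i in range(n_counters):
        # Counts for the same position add up across the two partial sketches.
        set_merged(i, read_a(i) + read_b(i))


# Plain lists stand in for two sketches built on two halves of a corpus.
part_a = [1, 2, 3]
part_b = [10, 0, 5]
merged = [0] * 3
merge_counters(part_a.__getitem__, part_b.__getitem__, merged.__setitem__, 3)
```

Going through read and set accessors, rather than copying raw bits, is what keeps the merge independent of how a counter is encoded internally.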
  • 26. Conclusion: where are we going? Dynamic: we are working on a CMTS version that can grow automatically (more layers added below). Pressure control: when we detect that the pressure becomes too high, we can divide and subsample to stop collisions from cascading. An open source Python package is on its way.