The Count-Min Tree sketch is a variant of the Count-Min Sketch tailored for Zipfian (power-law) data distributions. With a memory footprint 4 to 8 times smaller than other variants, and performance on par with native exact counting, the Count-Min Tree sketch can be used in many time-critical situations. It is developed by eXenSa (www.exensa.com).
www.exensa.com
A bit of context
Why do we need to count?
Data analysis platform: eXenGine.
It processes different kinds of data (mostly text).
We need to create relevant cross-features: to do that, we need to count the occurrences of all possible cross-features. For text data, a particular kind of cross-feature is the n-gram.
There are many different measures to decide whether an n-gram is interesting. All require counting the occurrences of both the cross-feature and the features themselves (i.e. counting bigrams and the words in bigrams).
Counting exactly is easy and distributable, but very slow because of memory usage. Keeping the whole count data structure in memory is impossible, so one has to resort to huge map/reduce jobs with joins.
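The exact-counting baseline can be sketched in a few lines (illustrative code, not eXenGine's implementation): a plain hash map is correct and easy to distribute, but it must hold one entry per distinct n-gram, which is exactly what blows up memory at corpus scale.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Exact n-gram counting with one hash-map entry per distinct n-gram."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the cat sat on the mat".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
# The map grows with the number of DISTINCT n-grams, not with corpus size.
assert unigrams[("the",)] == 2
assert bigrams[("the", "cat")] == 1
```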
A bit of context
What kind of data are we talking about?
Google N-grams
  tokens                   1024 billion
  sentences                95 billion
  1-grams (count > 200)    14 million
  2-grams (count > 40)     314 million
  3-grams                  977 million
  4-grams                  1.3 billion
  5-grams                  1.2 billion
A bit of context
What kind of data are we talking about?
Zipfian distribution [Le Quan et al. 2003]
A bit of context
Summary / Goals
• Many counts: we need to store a large amount of counts.
• Logarithms in measures: we care about the order of magnitude.
• Fast and memory controlled: we don’t want a distributed memory for the counts.
• Zipfian counts: many very small counts that will be filtered out later.
⇒ We can use probabilistic structures.
Count-Min Log Sketch
A probabilistic data structure to store logarithmic counts
[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied inside a Count-Min Sketch.
Instead of regular 32-bit counters, we use 8- or 16-bit “Morris” counters that count logarithmically.
Since counts are used in logarithms anyway, the error on the PMI/TF-IDF/… is almost the same, but we can fit more counters in the same space.
However, a count of 1 still uses the same amount of memory as a count of 10000. Also, at some point, the error stops improving with space (there is an inherent residual error).
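A Morris-style counter of the kind used above can be sketched as follows (a minimal illustration, with hypothetical names; the sketch's real counters also handle the small-count regime more carefully): a small register c estimates n ≈ (base^c − 1)/(base − 1) by incrementing with probability base^(−c), so a large count fits in a handful of bits.

```python
import random

class MorrisCounter:
    """Probabilistic logarithmic counter: c grows like log_base(n)."""
    def __init__(self, base=2.0):
        self.base = base
        self.c = 0

    def increment(self):
        # Increment the register with probability base**(-c).
        if random.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Unbiased estimator of the true count for this update rule.
        return (self.base ** self.c - 1) / (self.base - 1)

random.seed(0)
counter = MorrisCounter(base=2.0)
for _ in range(10000):
    counter.increment()
# After 10000 increments the register stays near log2(10000) ≈ 13,
# so it fits comfortably in 8 bits, at the cost of estimation variance.
```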
Count-Min Tree Sketch
A Count-Min Sketch with shared counters
Idea: use a hierarchical storage where the most significant bits are shared between counters.
Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
Tree Shared Counters
Sharing most significant bits. Or: how can we store counts with an average approaching 4 bits / counter?
8-counter structure:
o A tree is made of three kinds of storage:
  o Counting bits
  o Barrier bits
  o Spire (not required, except for performance)
o Several layers alternate counting and barrier bits.
o Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.
[Figure: an 8-counter tree structure, base layer at the bottom, counting bits and barrier bits at each layer, 4-bit spire on top]
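The bit budget of the <[(8,8),(4,4),(2,2),(1,1)],4> structure can be checked with a few lines (illustrative arithmetic only): each pair is (counting bits, barrier bits) for one layer, and the spire sits on top.

```python
# Layers of the <[(8,8),(4,4),(2,2),(1,1)],4> tree shared counter:
# (counting bits, barrier bits) per layer, plus a 4-bit spire.
layers = [(8, 8), (4, 4), (2, 2), (1, 1)]
spire_bits = 4

tree_bits = sum(counting + barrier for counting, barrier in layers)
total_bits = tree_bits + spire_bits
counters = layers[0][0]  # one counter per counting bit in the base layer

print(tree_bits)              # 30: "8 counters in 30 bits + spire"
print(total_bits / counters)  # 4.25 bits per counter on average
```

This matches the slide: 8 counters in 30 bits, and with the spire included the average is 34/8 = 4.25 bits per counter, approaching the 4 bits/counter goal.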
Tree Shared Counters
Sharing most significant bits. Or: how can we store counts with an average approaching 4 bits / counter?
8-counter structure:
o 8 counters in 30 bits + spire.
o Without a spire, n bits can count up to 3 × 2^(1 + log2(n/4)).
o Many small shared counters with spires are more efficient than one large shared counter.
[Figure: the same 8-counter tree structure as on the previous slide]
Tree Shared Counters
Reading values
o A counter stops at the first zero barrier.
o When two barrier paths meet, there is a conflict.
o The barrier length (b) is evaluated in unary.
o The counter bits (c) are evaluated in the classical binary way.
[Figure: reading two counters in the tree, e.g. b=2/c=110 and b=4/c=01011001, with a conflict between counters 4 and 7]
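The reading rule can be illustrated with a toy decoder (a simplified sketch, not the exact CMTS decoding; the layer contents below are hypothetical): starting from a base-layer position, climb while the barrier bit is 1, collecting one counting bit per visited layer, and stop at the first zero barrier.

```python
def read_counter(layers, index):
    """Walk up the tree from a base-layer position.

    layers: list of (counting_bits, barrier_bits) per layer, bottom first.
    Returns the unary barrier length b and the collected counting bits c.
    """
    barrier_len = 0
    count_bits = []
    for counting, barrier in layers:
        count_bits.append(counting[index])
        if barrier[index] == 0:   # first zero barrier: the counter ends here
            break
        barrier_len += 1
        index //= 2               # parent position in the next layer up
    return barrier_len, count_bits

# Hypothetical 4-counter structure, for demonstration only.
layers = [
    ([1, 0, 1, 1], [1, 0, 1, 0]),  # base layer
    ([0, 1],       [1, 0]),        # middle layer
    ([1],          [0]),           # top layer
]
b, c = read_counter(layers, 0)
print(b, c)  # 2 [1, 0, 1]: barrier length 2, three counting bits
```

Note how b = 2 comes with three counting bits, matching the b=2/c=110 shape in the slide's figure: the counter at the stopping layer still contributes its counting bit.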
Count-Min Tree Sketches
Experiments
Results!
• 140M tokens from English Wikipedia*
• 14.7M words (unigrams + bigrams)
• Reference counts stored in an UnorderedMap: 815 MiB
Perfect storage size: suppose we have a perfect hash function and store the counts with 32-bit counters. For 14.7M words, this amounts to 59 MiB.
Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters matches native UnorderedMap performance.
We use 3-layer sketches (a good performance/precision tradeoff).
* We preferred to test our counters with a large number of parameters rather than with a large corpus, so we limited ourselves to 5% of Wikipedia.
Count-Min Tree Sketch
Question: are CMTS really useful in real life?
1 – CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words/bigrams?
2 – CMTS are better on average, but what happens quantile by quantile?
Conclusion
Where are we?
CMTS significantly outperforms other methods at storing and updating Zipfian counts efficiently.
Because most of the time in sketch accesses is spent on memory access, its performance is on par with other methods.
• Main drawback: at very high (and impractical anyway) pressures (less than 10% of the perfect storage size), the error skyrockets.
• Other drawback: the implementation is not straightforward. We have devised at least 4 different ways to increment the counters.
Merging (and thus distributing) is easy once you can read and set a counter.
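The merging claim is easy to see on plain Count-Min sketches (a minimal sketch with hypothetical arrays, not the CMTS implementation): two sketches built with the same hash functions merge by element-wise addition, and the same idea carries over to CMTS once each tree counter can be read and set.

```python
import numpy as np

def merge(sketch_a, sketch_b):
    """Merge two Count-Min sketches that share the same hash functions."""
    assert sketch_a.shape == sketch_b.shape
    # Each cell holds a sum of increments, so merging is just addition.
    return sketch_a + sketch_b

# Two tiny 2x3 sketches standing in for counts from two shards.
a = np.array([[1, 0, 2], [0, 3, 1]])
b = np.array([[2, 1, 0], [1, 0, 4]])
merged = merge(a, b)
print(merged)  # [[3 1 2]
               #  [1 3 5]]
```

For a CMTS, the merge loop would instead read each logical counter from both trees, add the decoded values, and write the sum back, which is why read/set access is the prerequisite the slide mentions.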
Conclusion
Where are we going?
Dynamic: we are working on a CMTS version that can grow automatically (more layers added below).
Pressure control: when we detect that the pressure becomes too high, we can divide and subsample to stop collisions from cascading.
An open-source Python package is on its way.