2. Everybody (Click & Play)
Business Analysts (Excel)
IT / DBAs (SQL, Python)
Data Hackers (MapReduce)
People who implement their own infrastructure
3. Everybody (Click & Play)
Business Analysts (Excel)
IT / DBAs (SQL, Python)
Data Hackers (MapReduce)
People who implement their own infrastructure
Disco
12. [Figure: three panels, one each for Customer A, Customer B and Customer C, with axes labeled Users and Daily Activity. Questions posed alongside the panels: what makes some users very active? how to reduce churn? why some users return?]
19. from discodb import DiscoDB
FILES = ['a.txt', 'b.txt', 'c.txt']
def extract_words():
    # yield one (word, filename) pair per word occurrence
    for fname in FILES:
        for word in open(fname).read().split():
            yield word, fname
db = DiscoDB(extract_words())
db['dog']            # files that mention 'dog'
db.keys()            # all distinct words
db.unique_values()   # all distinct filenames
db.items()           # all (word, iter(fname)) pairs
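For readers without discodb installed, the same word-to-files mapping can be sketched with a plain dictionary. This is only an in-memory stand-in, not DiscoDB itself: the `DOCS` contents and the `build_index` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical file contents standing in for a.txt, b.txt and c.txt.
DOCS = {
    "a.txt": "the dog barks",
    "b.txt": "the cat sleeps",
    "c.txt": "dog and cat",
}

def build_index(docs):
    """Map each word to the sorted list of files that mention it."""
    index = defaultdict(set)
    for fname, text in docs.items():
        for word in text.split():
            index[word].add(fname)
    return {word: sorted(fnames) for word, fnames in index.items()}

index = build_index(DOCS)
files_with_dog = index["dog"]      # files that mention 'dog'
distinct_words = sorted(index)     # all distinct words
distinct_files = sorted({f for fnames in index.values() for f in fnames})
```

Unlike this dictionary, a DiscoDB is immutable once built and is stored in a compact, compressed form.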
20. Hash Map: hash(Key) → Key ID
Value Map: Key ID → [Value ID, ...]
Keys: Key ID → Key
Values: Value ID → Value
DiscoDB Chunk
21. Hash Map: hash(Key) → Key ID
Perfect hashing by CMPH, guaranteed O(1) lookup
Value Map: Key ID → [Value ID, ...]
The list of Value IDs is delta-encoded
Keys: Key ID → Key
Values: Value ID → Value
Values are compressed with a global Huffman codebook
DiscoDB Chunk
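Delta encoding of a Value ID list can be illustrated with a short sketch. It shows the general technique only, assuming the IDs are kept sorted; it does not reproduce DiscoDB's actual on-disk bit layout, and the function names are hypothetical.

```python
def delta_encode(value_ids):
    """Store the first ID, then the gap between each pair of neighbours."""
    ids = sorted(value_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def delta_decode(deltas):
    """Rebuild the sorted ID list by accumulating the gaps."""
    ids, total = [], 0
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids

encoded = delta_encode([1000, 1003, 1004, 1010])   # [1000, 3, 1, 6]
decoded = delta_decode(encoded)
```

The gaps are usually much smaller than the IDs themselves, so they fit into fewer bits when packed.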
22. [Diagram: Node 1, Node 2, ..., Node N. Each node is a Disco Node running a Python Worker. Other labels: DDFS, and multiple DiscoDB Chunks distributed over the nodes.]
23. A → [Apple, Orange, Banana]
B → [Apple, Banana]
C → [Banana, Melon]
Q("A & B") → Apple, Banana
Q("A | B") → Apple, Orange, Banana
Q("(A | B) & C") → Banana
DiscoDB
from discodb.query import Q
Querying with Conjunctive Normal Form
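What a conjunctive normal form query computes can be sketched with plain Python sets: an OR inside a clause is a union, and the AND across clauses is an intersection. The `query_cnf` helper and its list-of-lists clause format are assumptions for illustration, not discodb's API.

```python
# The three-key fruit index, with each key mapping to a set of values.
INDEX = {
    "A": {"Apple", "Orange", "Banana"},
    "B": {"Apple", "Banana"},
    "C": {"Banana", "Melon"},
}

def query_cnf(index, clauses):
    """Evaluate an AND of OR-clauses; each clause is a list of keys."""
    result = None
    for clause in clauses:
        # OR within a clause: union of the value sets of its keys.
        matched = set().union(*(index.get(key, set()) for key in clause))
        # AND across clauses: keep only values matched by every clause.
        result = matched if result is None else result & matched
    return result if result is not None else set()

a_and_b = query_cnf(INDEX, [["A"], ["B"]])             # A & B
a_or_b = query_cnf(INDEX, [["A", "B"]])                # A | B
a_or_b_and_c = query_cnf(INDEX, [["A", "B"], ["C"]])   # (A | B) & C
```

Each inner list is one OR-clause, so `[["A", "B"], ["C"]]` reads as (A | B) & C.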
24. Model:
Event → Users
Query (sequence of events):
Q("Event A & Event B & ...")
Funnel
https://github.com/tuulos/bd3-mixpanel-funnel
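The funnel model can be sketched with sets: each event maps to the users who triggered it, and every extra step ANDs in one more event. The event names, user IDs and the `funnel` helper below are invented for illustration and are not taken from the linked repository.

```python
# Hypothetical Event → Users mapping.
EVENTS = {
    "visit": {"u1", "u2", "u3", "u4"},
    "signup": {"u1", "u2", "u3"},
    "purchase": {"u2", "u3", "u5"},
}

def funnel(events, steps):
    """Return (step, users remaining) after AND-ing in each step in turn."""
    remaining = None
    counts = []
    for step in steps:
        users = events.get(step, set())
        remaining = set(users) if remaining is None else remaining & users
        counts.append((step, len(remaining)))
    return counts

funnel_counts = funnel(EVENTS, ["visit", "signup", "purchase"])
```

A plain AND only checks that a user triggered every event; it does not verify the order in which the events happened.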
25. Model:
Day N → Users
Query (weekly cohorts):
Q("(dayN | dayN+1 | ...) & (dayM | dayM+1 | ...)")
Cohort Analysis
https://github.com/tuulos/bd3-mixpanel-cohort
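A weekly cohort query of this shape can be sketched with sets: the OR over the days of a week gives that week's active users, and the AND of two weeks gives the users seen in both. The day numbers, user IDs and helper names are made up for illustration, and a week is assumed to be seven consecutive day keys.

```python
# Hypothetical Day N → Users mapping; days without activity are absent.
DAYS = {
    0: {"u1", "u2"},
    3: {"u3"},
    7: {"u1"},
    9: {"u3", "u4"},
}

def active_in(days, start, length=7):
    """Users seen on any day in [start, start + length): an OR over days."""
    return set().union(*(days.get(d, set()) for d in range(start, start + length)))

def retained(days, cohort_start, later_start, length=7):
    """Users active in both periods: (OR over period 1) AND (OR over period 2)."""
    return active_in(days, cohort_start, length) & active_in(days, later_start, length)

week0 = active_in(DAYS, 0)
week0_back_in_week1 = retained(DAYS, 0, 7)
```

Dividing `len(week0_back_in_week1)` by `len(week0)` would give the retention rate for that cohort.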
26. Model:
Day N → Users
Query (one time series):
[Q(Day K) for K in range(start, end)]
Time Series
https://github.com/tuulos/bd3-mixpanel-trends
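The time series case amounts to one lookup per day followed by a count. A minimal sketch with a plain dictionary, where the data and the `daily_active` helper are hypothetical:

```python
# Hypothetical Day N → Users mapping; day 2 has no activity.
DAYS = {
    0: {"u1", "u2"},
    1: {"u1"},
    3: {"u2", "u3", "u4"},
}

def daily_active(days, start, end):
    """Number of users per day for each day in [start, end)."""
    return [len(days.get(day, set())) for day in range(start, end)]

series = daily_active(DAYS, 0, 4)
```

Each element of `series` corresponds to one single-key query, so the queries are independent and could be run in parallel.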