5. Unique visitors over a range
[Chart: pageviews per hour for hours 1 through 4, each pageview labeled with the visiting UserID (A through E)]
uniques for just hour 1 = 3
uniques for hours 1 and 2 = 3
uniques for hours 1 to 3 = 5
uniques for hours 2 to 4 = 4
6. Notes:
• Not limiting ourselves to current tooling
• Reasonable variations of existing tooling
are acceptable
• Interested in what’s fundamentally possible
8. Approach #1
• Use Key->Set database
• Key = [URL, hour bucket]
• Value = Set of UserIDs
9. Approach #1
• Queries:
• Get all sets for all hours in range of
query
• Union sets together
• Compute count of merged set
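rough sketch of this query in code; the KeyToSetStore interface here is hypothetical, not any particular database's API:

import java.util.HashSet;
import java.util.Set;

// Hypothetical interface for the Key->Set database; key is [URL, hour bucket].
interface KeyToSetStore {
    Set<String> getSet(String url, long hourBucket); // empty set if the key is absent
}

class UniquesQuery {
    // Count unique UserIDs for a URL over an inclusive range of hour buckets.
    static long uniques(KeyToSetStore db, String url, long startHour, long endHour) {
        Set<String> merged = new HashSet<>();
        for (long hour = startHour; hour <= endHour; hour++) {
            merged.addAll(db.getSet(url, hour));   // one database lookup per hour
        }
        return merged.size();                      // count of the unioned set
    }
}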
10. Approach #1
• Lots of database lookups for large ranges
• Potentially a lot of items in sets, so lots of
work to merge/count
• Database will use a lot of space
12. interface HyperLogLog {
      boolean add(Object o);
      long size();
      HyperLogLog merge(HyperLogLog... otherSets);
    }
roughly 1 KB of space can estimate set sizes up to 1B with only about 2% error
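usage sketch of the interface; HyperLogLogImpl is a made-up implementation name, not a real library class:

// HyperLogLogImpl is hypothetical; any implementation of the interface above would do.
class HllUsageExample {
    static void example() {
        HyperLogLog hour1 = new HyperLogLogImpl();
        hour1.add("userA");
        hour1.add("userB");
        hour1.add("userA");                      // re-adding an element doesn't change the estimate

        HyperLogLog hour2 = new HyperLogLogImpl();
        hour2.add("userB");
        hour2.add("userC");

        HyperLogLog both = hour1.merge(hour2);   // estimates the size of the union of the two sets
        long approxUniques = both.size();        // ~3, within the structure's error bounds
    }
}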
13. Approach #2
• Use Key->HyperLogLog database
• Key = [URL, hour bucket]
• Value = HyperLogLog structure
it’s not a stretch to imagine a database that supports HyperLogLog natively, so updates don’t
require fetching the entire structure
14. Approach #2
• Queries:
• Get all HyperLogLog structures for all
hours in range of query
• Merge structures together
• Retrieve count from merged structure
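same query shape as Approach #1, sketched against a hypothetical KeyToHllStore instead of a Key->Set database:

// Hypothetical interface for the Key->HyperLogLog database; key is [URL, hour bucket].
interface KeyToHllStore {
    HyperLogLog getHll(String url, long hourBucket);
}

class UniquesQueryHll {
    // Estimate unique UserIDs for a URL over an inclusive range of hour buckets.
    static long uniques(KeyToHllStore db, String url, long startHour, long endHour) {
        HyperLogLog merged = db.getHll(url, startHour);
        for (long hour = startHour + 1; hour <= endHour; hour++) {
            merged = merged.merge(db.getHll(url, hour));  // merging fixed-size structures is cheap
        }
        return merged.size();                             // estimated count, ~2% error
    }
}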
15. Approach #2
• Much more efficient use of storage
• Less work at query time
• Mild accuracy tradeoff
17. Approach #3
• Use Key->HyperLogLog database
• Key = [URL, bucket, granularity]
• Value = HyperLogLog structure
it’s not a stretch to imagine a database that supports HyperLogLog natively, so updates don’t
require fetching the entire structure
18. Approach #3
• Queries:
• Compute minimal number of database
lookups to satisfy range
• Get all HyperLogLog structures in range
• Merge structures together
• Retrieve count from merged structure
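sketch of planning the lookups, assuming fixed-size buckets that start at multiples of their own size (e.g. a 720-hour "month"); it's a greedy decomposition, so an approximation of the truly minimal plan, and the granularity sizes are illustrative:

import java.util.ArrayList;
import java.util.List;

class BucketPlanner {
    // Granularities in hours, coarsest first: month, week, day, hour (simplified fixed sizes).
    static final long[] GRANULARITIES = {720, 168, 24, 1};

    // One lookup into the Key->HyperLogLog database: which granularity, and where the bucket starts.
    record Lookup(long granularityHours, long startHour) {}

    // Cover the half-open hour range [startHour, endHour) with few, coarse buckets.
    static List<Lookup> plan(long startHour, long endHour) {
        List<Lookup> lookups = new ArrayList<>();
        long cur = startHour;
        while (cur < endHour) {
            for (long g : GRANULARITIES) {
                if (cur % g == 0 && cur + g <= endHour) { // bucket is aligned and fits in the range
                    lookups.add(new Lookup(g, cur));
                    cur += g;
                    break;                                // try the coarsest granularity again
                }
            }
        }
        return lookups;
    }
}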
19. Approach #3
• All benefits of #2
• Minimal number of lookups for any range,
so less variation in latency
• Minimal increase in storage
• Requires more work at write time
example: in 1 month there are ~720 hours, 30 days, 4 weeks, and 1 month... adding all
granularities makes 755 stored values total instead of 720 values, roughly a 5% increase in
storage
29. Approach #1
• [URL, hour] -> Set of PersonIDs
• PersonID -> Set of buckets
• Indexes to incrementally normalize
UserIDs into PersonIDs
will get back to incrementally updating userids
30. Approach #1
• Getting complicated
• Large indexes
• Operations require a lot of work
will get back to incrementally updating userids
31. Approach #2
• [URL, bucket] -> Set of UserIDs
• Like Approach #1, incrementally normalize
UserIDs
• UserID -> PersonID
offload a lot of the work to read time
32. Approach #2
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to
produce new set
• Get count of new set
this is still an insane amount of work at read time
overall
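sketch of that read path, with hypothetical interfaces standing in for the two indexes; note the one PersonID lookup per UserID:

import java.util.HashSet;
import java.util.Set;

// Hypothetical indexes; names are made up for illustration.
interface UserSetStore {
    Set<String> getUserIds(String url, long bucket);   // [URL, bucket] -> Set of UserIDs
}
interface PersonIdIndex {
    String toPersonId(String userId);                   // UserID -> PersonID, one lookup each
}

class UniquePersonsQuery {
    static long uniquePersons(UserSetStore sets, PersonIdIndex persons,
                              String url, long startBucket, long endBucket) {
        Set<String> userIds = new HashSet<>();
        for (long b = startBucket; b <= endBucket; b++) {
            userIds.addAll(sets.getUserIds(url, b));    // retrieve and merge all UserID sets
        }
        Set<String> personIds = new HashSet<>();
        for (String u : userIds) {
            personIds.add(persons.toPersonId(u));       // normalize each UserID to a PersonID
        }
        return personIds.size();                        // count of distinct people
    }
}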
33. Approach #3
• [URL, bucket] -> Set of sampled UserIDs
• Like Approaches #1 and #2, incrementally
normalize UserIDs
• UserID -> PersonID
offload a lot of the work to read time
34. Approach #3
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to
produce new set
• Get count of new set
• Divide count by sample rate
36. Approach #3
• Sample the user ids using hash sampling
• Divide by the sample rate at end to
approximate the unique count
can’t just do straight random sampling of pageviews (imagine you only have 4 user ids that each visit thousands of
times... a 50% sample of pageviews will almost certainly contain all 4 user ids, so after dividing by the sample rate you’d give the answer 8 instead of 4)
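sketch of hash sampling: keep a UserID if its hash falls under a threshold, so a given user is either always in the sample or never in it, which is what makes dividing by the sample rate valid; the sample rate here is illustrative:

class HashSampler {
    static final double SAMPLE_RATE = 0.01;   // keep ~1% of UserIDs; value is illustrative

    // Deterministic: the same UserID always gets the same decision.
    static boolean inSample(String userId) {
        // hashCode() is a stand-in; a real system would use a stronger hash function.
        int h = userId.hashCode() & 0x7fffffff;            // force a non-negative value
        return h < SAMPLE_RATE * Integer.MAX_VALUE;
    }

    // Estimate total uniques from the number of distinct sampled UserIDs.
    static long estimateUniques(long sampledUniqueCount) {
        return Math.round(sampledUniqueCount / SAMPLE_RATE);
    }
}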
37. Approach #3
• Still need complete UserID -> PersonID index
• Still requires about 100 lookups into
UserID -> PersonID index to resolve queries
• Error rate 3-5x worse than HyperLogLog for
same space usage
• Requires SSDs for reasonable throughput
39. Attempt 1:
• Maintain index from UserID -> PersonID
• When receive A <-> B:
• Find what they’re each normalized to, and
transitively normalize all reachable IDs to
the “smallest” value
41. Attempt 2:
• UserID -> PersonID
• PersonID -> Set of UserIDs
• When receive A <-> B
• Find what they’re each normalized to, and
choose one for both to be normalized to
• Update all UserIDs in both normalized sets
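sketch of Attempt 2's update logic, with in-memory maps standing in for the two indexes; a real system keeps these in a database and needs the locking discussed on the next slide:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class EquivIndex {
    final Map<String, String> personOf = new HashMap<>();       // UserID -> PersonID
    final Map<String, Set<String>> membersOf = new HashMap<>(); // PersonID -> Set of UserIDs

    // A UserID we've never seen is its own PersonID at first.
    private String personFor(String userId) {
        String p = personOf.get(userId);
        if (p == null) {
            personOf.put(userId, userId);
            membersOf.put(userId, new HashSet<>(Set.of(userId)));
            return userId;
        }
        return p;
    }

    // Handle an equiv A <-> B.
    void addEquiv(String a, String b) {
        String pa = personFor(a);
        String pb = personFor(b);
        if (pa.equals(pb)) return;                    // already normalized together
        // Choose one PersonID for both; here, the lexicographically smaller one.
        String keep = pa.compareTo(pb) < 0 ? pa : pb;
        String drop = keep.equals(pa) ? pb : pa;
        for (String u : membersOf.get(drop)) {
            personOf.put(u, keep);                    // repoint every UserID in the dropped set
        }
        membersOf.get(keep).addAll(membersOf.remove(drop));  // merge the two sets
    }
}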
43. Challenges
• Fault-tolerance / ensuring consistency
between indexes
• Concurrency challenges
if you're using a distributed database to store the indexes and computing everything concurrently:
when you receive equivs for 4<->3 and 3<->1 at the same time, you need some sort of locking so
they don't step on each other
44. General challenges with
traditional architectures
• Redundant storage of information
(“denormalization”)
• Brittle to human error
• Operational challenges of enormous
installations of very complex databases
e.g. granularities, the 2 indexes for user id normalization... we know it’s a bad idea to store the same thing in multiple places... opens up
possibility of them getting out of sync if you don’t handle every case perfectly
If you have a bug that accidentally sets the second value of all equivs to 1, you’re in trouble
even the version without equivs suffers from these problems
50. Real World Example
2 functions: produce water of a certain strength, and produce water of a certain temperature
faucet on left gives you “hot” and “cold” inputs which each affect BOTH outputs - complex to
use
faucet on right gives you independent “heat” and “strength” inputs, so SIMPLE to use
neither is very complicated
52. Real World Example #2
I have to use two devices for the same task!!! temperature control!
wouldn’t it be SIMPLER if I had just one device that could regulate temperature from 0 degrees to 500 degrees?
then i could have more features, like “450 for 20 minutes then refrigerate”
who objects to this?
... talk about how mixing them can create REAL COMPLEXITY...
- sometimes you need MORE PIECES to AVOID COMPLEXITY and CREATE SIMPLICITY
- we’ll come back to this... this same situation happens in software all the time
- people want one tool with a million features... but it turns out these features interact with each other and create COMPLEXITY
53. Normalization vs Denormalization
Normalized schema

Users
ID  Name    Location ID
1   Sally   3
2   George  1
3   Bob     3

Locations
Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M
so just a quick overview of denormalization, here’s a schema that stores user information and location information
each is in its own table, and a user’s location is a reference to a row in the location table
this is pretty standard relational database stuff
now let’s say a really common query is getting the city and state a person lives in
to do this you have to join the tables together as part of your query
54. Join is too expensive, so
denormalize...
you might find joins are too expensive, they use too many resources
55. Denormalized schema

Users
ID  Name    Location ID  City      State
1   Sally   3            Chicago   IL
2   George  1            New York  NY
3   Bob     3            Chicago   IL

Locations
Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M
so you denormalize the schema for performance
you redundantly store the city and state in the users table to make that query faster, cause now it doesn’t require a join
now obviously, this sucks. the same data is now stored in multiple places, which we all know is a bad idea
whenever you need to change something about a location you need to change it everywhere it’s stored
but since people make mistakes, inevitably things become inconsistent
but you have no choice, you want to normalize, but you have to denormalize for performance
56. Complexity: forced to choose between a
robust data model and query performance
you have to choose which one you’re going to suck at
72. Conclusions
• Easy to understand and implement
• Scalable
• Concurrency / fault-tolerance easily
abstracted away from you
• Great query performance
77. Implementing realtime
layer
• Isn’t this the exact same problem we faced
before we went down the path of batch
computation?
i hope you are looking at this and asking the question...
still have to compute uniques over time and deal with the equivs problem
how are we better off than before?
78. Approach #1
• Use the exact same approach as we did in
fully incremental implementation
• Query performance only degraded for
recent buckets
• e.g.,“last month” range computes vast
majority of query from efficient batch
indexes
79. Approach #1
• Relatively small number of buckets in
realtime layer
• So not that much effect on storage costs
80. Approach #1
• Complexity of realtime layer is softened by
existence of batch layer
• Batch layer continuously overrides realtime
layer, so mistakes are auto-fixed
81. Approach #1
• Still going to be a lot of work to implement
this realtime layer
• Recent buckets with lots of uniques will still
cause bad query performance
• No way to apply recent equivs to batch
views without restructuring batch views
82. Approach #2
• Approximate!
• Ignore realtime equivs
options for taking different approaches to the problem without having to sacrifice too much
93. Black box fallacy
people say it does “key/value”, so I can use it when I need key/value operations... and they
stop there
can’t treat it as a black box, that doesn’t tell the full story
94. Online compaction
• Databases write to write-ahead log before
modifying disk and memory indexes
• Need to occasionally compact the log and
indexes
95. [Diagram: memory and disk indexes holding keys A, B, D, and F with their values, alongside a write-ahead log containing "Write A, Value 3" and "Write F, Value 4"]
96. [Diagram: the same memory and disk indexes after another write; the write-ahead log now contains "Write A, Value 3", "Write F, Value 4", and "Write A, Value 5", and key A's in-memory value has been updated to Value 5]
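minimal sketch of that write path: append to the log first, then update the in-memory index; the class and file format are made up:

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class WalKeyValueStore {
    final Map<String, String> memoryIndex = new HashMap<>();
    final FileWriter log;

    WalKeyValueStore(String logPath) throws IOException {
        this.log = new FileWriter(logPath, true);   // append-only write-ahead log
    }

    void put(String key, String value) throws IOException {
        // 1. Record the write in the log first (a real database would fsync, not just flush)...
        log.write("Write " + key + ", " + value + "\n");
        log.flush();
        // 2. ...then apply it to the in-memory index. The log grows without bound,
        //    which is why the log and on-disk indexes need periodic compaction.
        memoryIndex.put(key, value);
    }
}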
98. Online compaction
• Notorious for causing huge, sudden
changes in performance
• Machines can seem locked up
• Necessitated by random writes
• Extremely complex to deal with
102. G-counter
[Diagram: three replicas of a G-counter, each starting with the state {replica 1: 10, replica 2: 7, replica 3: 18}. A network partition splits replicas 1 and 3 from replica 2. The replica 1/3 side increments its slots to {replica 1: 11, replica 2: 7, replica 3: 21}, while replica 2 increments its own slot to {replica 1: 10, replica 2: 13, replica 3: 18}. When the partition heals, the states merge by taking the per-replica maximum: {replica 1: 11, replica 2: 13, replica 3: 21}.]
things get much more complicated for things like sets
think about this - you wanted to just keep a count... and you have to deal with all this!
- online compaction plus complexity from CAP plus slow and complicated solutions
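sketch of the G-counter shown above: each replica increments only its own slot, the counter's value is the sum of all slots, and merge takes the per-slot max, so replicas converge after the partition heals:

import java.util.HashMap;
import java.util.Map;

class GCounter {
    final String replicaId;
    final Map<String, Long> counts = new HashMap<>();   // replica id -> count from that replica

    GCounter(String replicaId) {
        this.replicaId = replicaId;
    }

    // A replica only ever increments its own slot.
    void increment() {
        counts.merge(replicaId, 1L, Long::sum);
    }

    // The counter's value is the sum over all replicas' slots.
    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge with another replica's state: per-slot max. Commutative, associative, and
    // idempotent, which is what lets replicas converge regardless of delivery order.
    void merge(GCounter other) {
        other.counts.forEach((id, c) -> counts.merge(id, c, Math::max));
    }
}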
103. Complexity leads to bugs
• “Call Me Maybe” blog posts found data loss
problems in many popular databases
• Redis
• Cassandra
• ElasticSearch
some of his tests saw over 30% data loss during partitions
105. [Diagram: New Data flows into both the Master Dataset and the Realtime views; Batch views are computed from the Master Dataset; a Query reads from the Batch views and the Realtime views. No random writes!]
major operational simplification to not require random writes
i’m not saying you can’t make a database that does online compaction and deals with the other complexities of random writes well, but it’s clearly
a fundamental complexity, and i feel it’s better to not have to deal with it at all
remember, we’re talking about what’s POSSIBLE, not what currently exists
my experience with elephantdb
how this architecture massively eases the problems of dealing with CAP