Box's /events API powers our desktop sync experience and provides users with a realtime, guaranteed-delivery event stream. To do that, we use HBase to store and serve a separate message queue for each of 30+ million users. Learn how we implemented queue semantics, how we replicate our queues between clusters to enable transparent client failover, and why we chose to build a queueing system on top of HBase.
What is the /events API?
• Realtime stream of all activity happening within a user’s account
• GET /events?stream_position=1234&stream_type=all
• Persistent and re-playable
[Diagram: a client consuming a user’s event stream at positions 1–5]
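A minimal polling sketch of the call above (Java, assuming the public api.box.com endpoint, a bearer token in a BOX_TOKEN environment variable, and the documented entries/next_stream_position response fields; JSON parsing omitted):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EventsPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long streamPosition = 1234;  // last position this client persisted

        // Fetch all events at or after the saved stream position.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.box.com/2.0/events"
                        + "?stream_position=" + streamPosition
                        + "&stream_type=all"))
                .header("Authorization", "Bearer " + System.getenv("BOX_TOKEN"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body carries the event entries plus a next_stream_position
        // to persist and pass on the next call.
        System.out.println(response.body());
    }
}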
Why did we build it?
• Main use case was sync’s switch from batch to incremental diffs
• Several requirements arose from the sync use case:
‒ Guaranteed delivery
‒ Clients can be offline for days at a time
‒ Arbitrary number of clients consuming each user’s stream
[Diagram: these requirements map to the persistence and re-playability of the stream]
How is it implemented?
• Each user assigned a separate section of the HBase key-space
• Messages are stored in order from oldest to newest within a user’s section of the key-space
• Reads map directly to scans from the provided position to the user’s end key
• Row key structure: <pseudo-random prefix>_<user_id>_<position>
‒ <pseudo-random prefix>: the first 2 bytes of the SHA-1 of user_id
‒ <position>: millisecond timestamp
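A minimal sketch of this key layout and read path (Java, assuming the HBase 2.x client API and a table named “events”; the fixed-width binary fields stand in for the “_” separators above):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EventQueueKeys {
    // <pseudo-random prefix>_<user_id>_<position>: the 2-byte prefix comes
    // from the SHA-1 of the user id; the position is a millisecond timestamp.
    static byte[] rowKey(long userId, long positionMs) throws Exception {
        byte[] sha1 = MessageDigest.getInstance("SHA-1").digest(Bytes.toBytes(userId));
        return Bytes.add(new byte[] {sha1[0], sha1[1]},
                         Bytes.toBytes(userId),
                         Bytes.toBytes(positionMs));
    }

    // A read maps directly to a scan from the client's stream position
    // to the user's end key.
    static ResultScanner scanFrom(Connection conn, long userId, long position)
            throws Exception {
        Table table = conn.getTable(TableName.valueOf("events"));
        Scan scan = new Scan()
                .withStartRow(rowKey(userId, position))
                .withStopRow(rowKey(userId, Long.MAX_VALUE));
        return table.getScanner(scan);
    }
}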
Using a timestamp as a queue position
• Pro: Allows for allocating roughly monotonically increasing positions with no co-ordination between write requests
• Con: Isn’t sufficient to guarantee append-only semantics in the presence of parallel writes (a slow write stamped with an earlier timestamp can become visible after a reader has already advanced past that position)
[Diagram: two parallel writes; a read issued between them can advance past the position of the slower, not-yet-visible write]
Time-bounding and Back-scanning
• Need to ensure that clients don’t advance their stream positions past writes that will eventually succeed
‒ But clients do need to advance position eventually
‒ How do we know when it’s safe?
• Solution: time-bound writes and back-scan reads (see the sketch after this list)
‒ Time-bounding: every write to HBase must complete within a fixed time-bound to be considered successful
‒ No guaranteed delivery for unsuccessful writes.
‒ Clients should retry failed writes at higher stream positions.
‒ Back-scanning: clients cannot advance their stream positions further than (current time – back-scan interval)
‒ Back-scan interval >= write time-bound
• Provides guaranteed delivery, but at the cost of duplicate events
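A minimal sketch of both rules in Java (the time-bound and back-scan values are illustrative assumptions; the only hard requirement from the slide is back-scan interval >= write time-bound):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.LongConsumer;

public class QueueSemantics {
    // Illustrative values; only the ordering constraint below is required.
    static final long WRITE_TIME_BOUND_MS = 5_000;
    static final long BACK_SCAN_INTERVAL_MS = 10_000;  // >= WRITE_TIME_BOUND_MS

    // Time-bounding: a write only counts as delivered if it completes within
    // the bound; otherwise it is retried at a higher position (fresh timestamp).
    static long timeBoundedWrite(ExecutorService pool, LongConsumer putAtPosition)
            throws Exception {
        while (true) {
            long position = System.currentTimeMillis();
            Future<?> f = pool.submit(() -> putAtPosition.accept(position));
            try {
                f.get(WRITE_TIME_BOUND_MS, TimeUnit.MILLISECONDS);
                return position;  // committed within the bound: delivery guaranteed
            } catch (TimeoutException e) {
                f.cancel(true);   // no delivery guarantee; retry at a higher position
            }
        }
    }

    // Back-scanning: a reader never advances its stream position beyond
    // (now - back-scan interval), so any time-bounded write still in flight
    // lands at a position the reader has not yet passed.
    static long maxSafePosition() {
        return System.currentTimeMillis() - BACK_SCAN_INTERVAL_MS;
    }
}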
Replication
• Master/slave architecture
‒ One cluster per DC
‒ Master cluster handles all reads and writes
‒ Slave clusters are passive replicas
• On promotion, clients transparently fail over to the new master cluster
• Can’t use native HBase replication directly
‒ Could cause clients to miss events when failing over to a lagging cluster
[Diagram: writes replicate from the master cluster to the slave; the master holds events 1 and 2 while only 1 has replicated, so after failover a client reading from position 3 would miss event 2]
Replication Contd.
• Replication system needs to be aware of master/slave failovers
‒ Stop exactly replicating messages; start appending messages to the current ends of the queues (see the sketch below)
• Currently, we use a client-level replication system piggybacking on MySQL replication
• Plan to switch to a system that hooks into HBase replication by configuring itself as a slave HBase cluster
[Diagram: after failover, messages that had not yet replicated are appended to the end of the new master’s queue (positions 3 and 4) rather than written at their original positions, so readers still reach them]
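A minimal sketch of that failover-aware switch, using a hypothetical destination-queue interface for illustration:

public class QueueReplicator {
    // Hypothetical destination-queue interface, for illustration only.
    interface Queue {
        void putAt(long position, byte[] message);  // write at an exact position
        long endPosition();                         // fresh position past the current end
    }

    private final Queue destination;
    private volatile boolean failedOver = false;

    QueueReplicator(Queue destination) { this.destination = destination; }

    // Called when the destination cluster is promoted to master.
    void onFailover() { failedOver = true; }

    // Before failover: copy each message at its source position, keeping the
    // slave an exact replica. After failover: the destination queue may have
    // taken new writes of its own, so append the message at the current end
    // of the queue instead; readers that already passed the source position
    // will still reach it.
    void replicate(long sourcePosition, byte[] message) {
        if (!failedOver) {
            destination.putAt(sourcePosition, message);
        } else {
            destination.putAt(destination.endPosition(), message);
        }
    }
}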
Why HBase?
• Closest off-the-shelf queuing system is Kafka
‒ Developed at LinkedIn. Open sourced in 2011.
‒ Originally built to power LinkedIn’s analytics pipeline
‒ Very similar model built around “ordered commit logs”
‒ Allow for easy addition of new subscribers
‒ Allow for varying subscriber consumption patterns; slow subscribers don’t back up the pipeline
Why HBase and not Kafka?
• Better consistency vs. availability tradeoffs
‒ No automatic rack aware replica placement
‒ No automatic replica re-assignment upon replica failure
‒ On replica failure, no fast failover of new writes to new replicas.
‒ Can’t require a minimum replication factor for new writes without significantly impacting availability on replica failure
• Replication support
‒ Not enough control over Kafka queue positions to implement transparent client failovers between replica clusters
• Unable to scale to millions of topics
‒ Currently tops out in the tens of thousands of topics.
‒ Design requires very granular topic tracking, a barrier to scale.
In conclusion…
• We were able to leverage HBase to store millions of guaranteed-delivery message queues, each of which was:
‒ replicated between data centers
‒ independently consumable by an arbitrary number of clients
• Cluster metrics:
‒ ~30 nodes per cluster
‒ 15K writes/sec at peak, with bursts of up to 40K writes/sec.
‒ 50K-60K requests/sec at peak.