2. Overview
● Who is this guy?
● Mate1
○ quick intro
○ some of the features
○ technology stack
● Activity feed
○ take 1
○ take 2
● What’s next?
3. Who is this guy?
● Linux user and developer since 1996
● Started out hacking on Enlightenment
○ X11 window manager
● Worked with OpenBSD
○ building embedded network gear
● Did a whole lot of C followed by Ruby
● Working with the JVM since 2007
github: mardambey
twitter: codewarrior
4. Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially team of 3, around 40 now
● Engineering team has 13 geeks / geekettes
● We own and run our own hardware
○ fun!
○ mostly…
○ LXC is a life (hardware resource?) saver (=
https://github.com/mate1
5. Some of our features...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
6. And what we use for them...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
… all glued together by…
7. Programming languages…
● Scala, Java -> back-end services, business logic, “controllers”
● What makes us =(
○ Struts2 -> XML, painful, want to dump it
○ Hibernate -> not as an ORM, mainly to map
● What makes us (=
○ Play! -> in prod, migrating to it… need “non-blocking” db layer
○ Akka -> simplifies concurrency, network transparency
8. Programming languages…
● JavaScript -> front end, mobile and desktop
○ Sencha Touch + Apache Cordova -> cross-platform
● PHP -> quick / temporary work
○ registration funnels
○ transient marketing pages
● Perl -> seriously? yes!
○ pre 2007, entire system was in Perl
○ now, customer service system, marketing tools, etc.
○ and! new email delivery service
9. At some point… activity feed
● Gather user activity and events
● Sometimes inject system events
● As low latency as possible
● Grouped into “tiers” (or types)
● Supports “roll-ups”
● Maintain counters for different event types
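As a rough illustration only, here is a minimal Scala sketch of what such an event model could look like; the names (Tier, ActivityEvent, RollUp) are hypothetical, not our actual classes.

// Hypothetical event model: every activity belongs to a "tier" (type),
// carries a timestamp for ordering, and can be rolled up per user.
object Tier extends Enumeration {
  val View, Like, Email, Upload = Value
}

case class ActivityEvent(
  tier: Tier.Value,      // which channel / feed type this event belongs to
  actorId: Long,         // the user who performed the action
  recipientId: Long,     // the user whose feed receives the event
  timestamp: Long = System.currentTimeMillis()
)

// A roll-up groups many events of the same tier into one feed entry,
// e.g. "5 people viewed your profile today".
case class RollUp(tier: Tier.Value, recipientId: Long, count: Int, latest: Long)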
10. Take 1: fan-out on read (pull)
● Activity occurs...
○ A views B -> insert into views uid=B viewer_uid=A
○ C likes B -> insert into likes uid=C likee_uid=B
○ D emails B -> insert into emails uid=B sender_uid=D
■ refer to these as channels
■ based on legacy features and legacy data
● B asks for their activity feed
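When B asks for their feed, the pull model assembles it at read time by querying every channel and merging by timestamp. A hedged sketch, with illustrative channel and type names:

// Fan-out on read: nothing is precomputed, so every feed request
// queries all the channels and merges the results by time.
case class FeedItem(channel: String, actorId: Long, timestamp: Long)

trait Channel {
  def name: String
  // e.g. backed by "select … from views where uid = ?" on the legacy tables
  def eventsFor(userId: Long, limit: Int): Seq[FeedItem]
}

object PullFeed {
  def activityFeed(userId: Long, channels: Seq[Channel], limit: Int = 50): Seq[FeedItem] =
    channels
      .flatMap(_.eventsFor(userId, limit)) // hit every channel on every read
      .sortBy(-_.timestamp)                // newest first
      .take(limit)
}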
11. Take 1: fan-out on read (pull)
[Diagram: app servers query all channels (MySQL tables for messages, lists, images, and users, plus memcached), aggregate the results, cache them, all done!]
so far so good! … or is it?
12. Take 1: fan-out on read (pull)
● Several channels piggybacked off existing features
○ no uniformity in data structure, not always optimal
● Time constraints on queries
○ can’t go back in time, databases suffered
● Activity feeds slowed down…
● Temporary solutions?
○ slash a bunch of channels -> sucks!
○ aggregate multiple channels into a single table -> hack!
○ had to rethink how we’re doing this
13. Take 2: fan-out on write (push)
● Had to change approach entirely
● Needed to store data more efficiently
● Writes can be queued up
● Roll-ups should be persistent
● More channels != slower performance
● Built as a scalable service end-to-end
14. More efficient storage
● Don’t piggyback on old features
● Pre-aggregated user activity feeds
● Ideally store roll-ups in the same store
● Always sorted by time
● Minimal updates or deletes required
● Avoid counting
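A rough sketch of the push idea under these constraints: append each event to the recipient’s pre-aggregated, time-sorted feed at write time. PreAggregatedFeedStore and FeedWriter are hypothetical names, not our actual code.

// Fan-out on write: when an event arrives, append it to the recipient's
// pre-aggregated feed so reads become a single, already-sorted lookup.
trait PreAggregatedFeedStore {
  def append(recipientId: Long, feedType: String, payload: Array[Byte]): Unit
  def read(recipientId: Long, feedType: String, limit: Int): Seq[Array[Byte]]
}

class FeedWriter(store: PreAggregatedFeedStore) {
  def handle(feedType: String, actorId: Long, recipientId: Long, payload: Array[Byte]): Unit = {
    // one small write per recipient; the store keeps entries sorted by time,
    // so no counting, sorting, or multi-channel querying on the read path
    store.append(recipientId, feedType, payload)
  }
}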
15. Writes can be queued up
● Push all activity / events into message queue
● Process and persist as soon as possible
○ don’t cause back-pressure
○ needs to be durable
○ must be able to easily scale consumption
● Lots of message queue technologies
○ experience with RabbitMQ (web server logs)
○ and Redis (pubsub, monitoring)
○ tested out Flume (non-ng), buggy at the time
17. Event manager
● Wanted to be able to publish events
● Started off needing minimal information
○ event type, timestamp, user ids, etc.
○ soon after, needed much more data per event
● Was one of our first Scala libraries!
○ admittedly, needs clean-up now (=
● Provides sync and async publishing
● Also provides a callback-based consumer API
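The deck doesn’t show the library’s actual interface; a hedged sketch of what sync/async publishing plus a callback-based consumer API could look like (all names and signatures here are assumptions):

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical event-manager API: callers can publish synchronously or
// asynchronously, and consumers register a callback per event type.
case class Event(eventType: String, timestamp: Long, userIds: Seq[Long], data: Map[String, String])

trait EventPublisher {
  def publish(event: Event): Unit                                   // synchronous / blocking
  def publishAsync(event: Event)(implicit ec: ExecutionContext): Future[Unit]
}

trait EventConsumer {
  // the callback is invoked for every event of the given type
  def subscribe(eventType: String)(callback: Event => Unit): Unit
}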
19. Why Kafka?
● Durable by design
● Very high throughput! and scalable
● Supports consumer groups
● Native Scala API
● Supports consumer “replay”
● Per topic data retention and partitioning
● Integrates with Hadoop
○ Kafka <-> Hadoop via Camus and Camus2Kafka
● Grabbed it from LinkedIn’s SVN
○ never looked back! we love it!
○ moving to 0.8.1 at the time of this writing
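As a rough illustration of publishing an event, a minimal sketch assuming the 0.8-era Scala producer API; the broker list, topic name, and payload are placeholders:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Minimal publish example against the Kafka 0.8-era Scala producer API.
object PublishExample extends App {
  val props = new Properties()
  props.put("metadata.broker.list", "broker1:9092,broker2:9092")
  props.put("serializer.class", "kafka.serializer.StringEncoder")

  val producer = new Producer[String, String](new ProducerConfig(props))

  // keying by user id means a given user's events land in the same partition
  producer.send(new KeyedMessage[String, String](
    "activity-events", "user-42", """{"type":"view","actor":7,"recipient":42}"""))

  producer.close()
}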
21. Consumers
● Implement a simple API
● Subscribe to Kafka topics
● Process events, can do “anything”
○ For activity feeds
■ All interesting events are published
■ and consumed (views, likes, emails, uploads…)
■ then stored into the data store
○ We can also
■ maintain counts & stats, send notifications
● Can fail, with a certain tolerance
○ otherwise they stop and alarms are raised
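The consumer API itself isn’t spelled out in the slides; a hedged sketch of the “simple API plus failure tolerance” idea, with hypothetical names:

// Hypothetical consumer contract: each consumer names the Kafka topics it
// subscribes to, processes one event at a time, and tolerates a bounded
// number of failures before it stops and an alarm is raised.
trait FeedConsumer {
  def topics: Seq[String]
  def maxFailures: Int = 10

  // return true on success; false counts toward the failure tolerance
  def process(topic: String, payload: Array[Byte]): Boolean
}

class ActivityFeedConsumer extends FeedConsumer {
  val topics = Seq("views", "likes", "emails", "uploads")

  def process(topic: String, payload: Array[Byte]): Boolean = {
    // decode the event, append it to the recipient's feed,
    // bump the relevant counters, maybe send a notification
    true
  }
}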
23. Cassandra
● Activity feeds fit well into C*’s data model
● TimeUUID ordering means no sorting
● Having lots of writes is not a problem
○ works to our advantage
○ we want to push data to users’ activity feeds
● Supports counters
● Can add nodes as needed
● Gave each user multiple rows
○ each row is a feed type
○ one row is the “roll-up” row
○ roll-ups done in background, or on demand
○ each user has multiple counters and a few “lists”
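A sketch of how that per-user, wide-row layout might be modelled; the key scheme and store interface below are guesses for illustration, not our actual schema:

import java.util.UUID

// Illustrative layout: one wide row per (user, feed type), columns keyed by
// TimeUUID so entries come back already time-sorted, plus a dedicated
// roll-up row and counters per user.
case class FeedRowKey(userId: Long, feedType: String) {
  def key: String = s"$userId:$feedType" // e.g. "42:views", "42:likes", "42:rollup"
}

case class FeedEntry(id: UUID, payload: Array[Byte]) // id is a TimeUUID

trait ActivityFeedStore {
  def append(row: FeedRowKey, entry: FeedEntry): Unit
  def slice(row: FeedRowKey, limit: Int): Seq[FeedEntry] // newest first, no sorting needed
  def incrementCounter(userId: Long, counter: String, by: Long = 1): Unit
}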
24. How are the feeds read?
● Cassandra nodes don’t get read from directly
26. How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
27. How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
● Netty!
○ We have 3 readers with Varnish
○ we want to port this to Play!
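A minimal sketch of the shape of this read layer: a thin service in front of C* that mostly reads, caches responses, supports cache busting, and returns JSON. The names and the in-process cache here are illustrative; Varnish does the real caching in front of Netty.

import scala.collection.mutable

// Hypothetical read layer: a small HTTP service (Netty in our case, with
// Varnish in front) exposing a few read endpoints that return JSON.
trait FeedReader {
  def feedAsJson(userId: Long, feedType: String, limit: Int): String
}

class CachingFeedReader(underlying: FeedReader,
                        cache: mutable.Map[String, String] = mutable.Map.empty) extends FeedReader {

  def feedAsJson(userId: Long, feedType: String, limit: Int): String = {
    val key = s"$userId:$feedType:$limit"
    cache.getOrElseUpdate(key, underlying.feedAsJson(userId, feedType, limit))
  }

  // "busting" the cache just drops the key so the next read hits C* again
  def bust(userId: Long, feedType: String, limit: Int): Unit = {
    cache.remove(s"$userId:$feedType:$limit")
  }
}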
30. So how did all of this work out?
● Pretty fantastic!
● Kafka, top notch high performance queue
● Netty, fast, uses CPU very efficiently
● Cassandra, data model fit well
○ wide rows rock!
○ can’t live without counters anymore
○ eventual consistency, or why C* owns for feeds
● Issues?
○ C* consistency matters, must tune
○ Big batch reads from C* live cluster can be painful
○ not much else really (=
31. What else are you up to?
● Want to push more lists into C*
● Want to push our on-site inbox into C*
● Experimenting with Spark and C*
● Need to get data from MySQL -> C*
○ working on a tool to feed MySQL’s replication stream into Kafka via Avro binary serialization
■ can use it to keep MySQL and C* in sync for some tables, or to maintain basic counts
■ or as a data source for Spark
■ or pump into Hadoop
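For the MySQL-to-Kafka idea, a hedged sketch of what one Avro-serialized row-change event might look like; the schema and field names are made up for illustration, not the tool’s actual format:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

object RowChangeExample extends App {
  // Illustrative Avro schema for a row change read from MySQL's replication
  // stream; the serialized bytes would then be published to a Kafka topic.
  val schema = new Schema.Parser().parse(
    """{"type": "record", "name": "RowChange", "fields": [
      |  {"name": "table", "type": "string"},
      |  {"name": "op", "type": "string"},
      |  {"name": "timestamp", "type": "long"}
      |]}""".stripMargin)

  val record = new GenericData.Record(schema)
  record.put("table", "users")
  record.put("op", "insert")
  record.put("timestamp", System.currentTimeMillis())

  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericData.Record](schema).write(record, encoder)
  encoder.flush()

  val bytes = out.toByteArray // hand these bytes to a Kafka producer
}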