2. Overview
● Who is this guy?
● Mate1
○ quick intro
○ some of the features
○ technology stack
● Activity feed
○ take 1
○ take 2
● What’s next?
3. Who is this guy?
● Linux user and developer since 1996
● Started out hacking on Enlightenment
○ X11 window manager
● Worked with OpenBSD
○ building embedded network gear
● Did a whole lot of C followed by Ruby
● Working with the JVM since 2007
github: mardambey
twitter: codewarrior
4. Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially team of 3, around 40 now
● Engineering team has 13 geeks / geekettes
● We own and run our own hardware
○ fun!
○ mostly…
○ LXC is a life (hardware resource?) saver (=
https://github.com/mate1
5. Some of our features...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
6. And what we use for them...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
… all glued together by…
7. Programming languages…
● Scala, Java -> back-end services, business logic, “controllers”
● What makes us =(
○ Struts2 -> XML, painful, want to dump it
○ Hibernate -> not as an ORM, mainly to map
● What makes us (=
○ Play! -> in prod, migrating to it… need “non-blocking” db layer
○ Akka -> simplifies concurrency, network transparency
8. Programming languages…
● JavaScript -> front end, mobile and desktop
○ Sencha Touch + Apache Cordova -> cross-platform
● PHP -> quick / temporary work
○ registration funnels
○ transient marketing pages
● Perl -> seriously? yes!
○ pre 2007, entire system was in Perl
○ now, customer service system, marketing tools, etc.
○ and! new email delivery service
9. At some point… activity feed
● Gather user activity and events
● Sometimes inject system events
● As low latency as possible
● Grouped into “tiers” (or types)
● Supports “roll-ups”
● Maintain counters for different event types
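As a rough illustration only, here is a minimal Scala sketch of what such an event model could look like; the names (Tier, ActivityEvent, RollUp) are hypothetical, not our actual classes.

// Hypothetical event model: every activity belongs to a "tier" (type),
// carries a timestamp for ordering, and can be rolled up per user.
object Tier extends Enumeration {
  val View, Like, Email, Upload = Value
}

case class ActivityEvent(
  tier: Tier.Value,      // which channel / feed type this event belongs to
  actorId: Long,         // the user who performed the action
  recipientId: Long,     // the user whose feed receives the event
  timestamp: Long = System.currentTimeMillis()
)

// A roll-up groups many events of the same tier into one feed entry,
// e.g. "5 people viewed your profile today".
case class RollUp(tier: Tier.Value, recipientId: Long, count: Int, latest: Long)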
10. Take 1: fan-out on read (pull)
● Activity occurs...
○ A views B -> insert into views uid=B viewer_uid=A
○ C likes B -> insert into likes uid=C likee_uid=B
○ D emails B -> insert into emails uid=B sender_uid=D
■ refer to these as channels
■ based on legacy features and legacy data
● B asks for their activity feed
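When B asks for their feed, the pull model assembles it at read time by querying every channel and merging by timestamp. A hedged sketch, with illustrative channel and type names:

// Fan-out on read: nothing is precomputed, so every feed request
// queries all the channels and merges the results by time.
case class FeedItem(channel: String, actorId: Long, timestamp: Long)

trait Channel {
  def name: String
  // e.g. backed by "select … from views where uid = ?" on the legacy tables
  def eventsFor(userId: Long, limit: Int): Seq[FeedItem]
}

object PullFeed {
  def activityFeed(userId: Long, channels: Seq[Channel], limit: Int = 50): Seq[FeedItem] =
    channels
      .flatMap(_.eventsFor(userId, limit)) // hit every channel on every read
      .sortBy(-_.timestamp)                // newest first
      .take(limit)
}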
11. Take 1: fan-out on read (pull)
[Diagram: app servers query all channels (MySQL tables for messages, lists, images, and users, plus memcached), aggregate the results, cache them, all done!]
so far so good! … or is it?
12. Take 1: fan-out on read (pull)
● Several channels piggybacked off existing features
○ no uniformity in data structure, not always optimal
● Time constraints on queries
○ can’t go back in time, databases suffered
● Activity feeds slowed down…
● Temporary solutions?
○ slash a bunch of channels -> sucks!
○ aggregate multiple channels into a single table -> hack!
○ had to rethink how we’re doing this
13. Take 2: fan-out on write (push)
● Had to change approach entirely
● Needed to store data more efficiently
● Writes can be queued up
● Roll-ups should be persistent
● More channels != slower performance
● Built as a scalable service end-to-end
14. More efficient storage
● Don’t piggyback on old features
● Pre-aggregated user activity feeds
● Ideally store roll-ups in the same store
● Always sorted by time
● Minimal updates or deletes required
● Avoid counting
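A rough sketch of the push idea under these constraints: append each event to the recipient’s pre-aggregated, time-sorted feed at write time. PreAggregatedFeedStore and FeedWriter are hypothetical names, not our actual code.

// Fan-out on write: when an event arrives, append it to the recipient's
// pre-aggregated feed so reads become a single, already-sorted lookup.
trait PreAggregatedFeedStore {
  def append(recipientId: Long, feedType: String, payload: Array[Byte]): Unit
  def read(recipientId: Long, feedType: String, limit: Int): Seq[Array[Byte]]
}

class FeedWriter(store: PreAggregatedFeedStore) {
  def handle(feedType: String, actorId: Long, recipientId: Long, payload: Array[Byte]): Unit = {
    // one small write per recipient; the store keeps entries sorted by time,
    // so no counting, sorting, or multi-channel querying on the read path
    store.append(recipientId, feedType, payload)
  }
}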
15. Writes can be queued up
● Push all activity / events into message queue
● Process and persist as soon as possible
○ don’t cause back-pressure
○ needs to be durable
○ must be able to easily scale consumption
● Lots of message queue technologies
○ experience with RabbitMQ (web server logs)
○ and Redis (pubsub, monitoring)
○ tested out Flume (non-ng), buggy at the time
17. Event manager
● Wanted to be able to publish events
● Started off needing minimal information
○ event type, timestamp, user ids, etc.
○ soon after, needed much more data per event
● Was one of our first Scala libraries!
○ admittedly, needs clean-up now (=
● Provides sync and async publishing
● Also provides a callback-based consumer API
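The deck doesn’t show the library’s actual interface; a hedged sketch of what sync/async publishing plus a callback-based consumer API could look like (all names and signatures here are assumptions):

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical event-manager API: callers can publish synchronously or
// asynchronously, and consumers register a callback per event type.
case class Event(eventType: String, timestamp: Long, userIds: Seq[Long], data: Map[String, String])

trait EventPublisher {
  def publish(event: Event): Unit                                   // synchronous / blocking
  def publishAsync(event: Event)(implicit ec: ExecutionContext): Future[Unit]
}

trait EventConsumer {
  // the callback is invoked for every event of the given type
  def subscribe(eventType: String)(callback: Event => Unit): Unit
}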
19. Why Kafka?
● Durable by design
● Very high throughput! and scalable
● Supports consumer groups
● Native Scala API
● Supports consumer “replay”
● Per topic data retention and partitioning
● Integrates with Hadoop
○ Kafka <-> Hadoop via Camus and Camus2Kafka
● Grabbed it from LinkedIn’s SVN
○ never looked back! we love it!
○ moving to 0.8.1 at the time of this writing
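As a rough illustration of publishing an event, a minimal sketch assuming the 0.8-era Scala producer API; the broker list, topic name, and payload are placeholders:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Minimal publish example against the Kafka 0.8-era Scala producer API.
object PublishExample extends App {
  val props = new Properties()
  props.put("metadata.broker.list", "broker1:9092,broker2:9092")
  props.put("serializer.class", "kafka.serializer.StringEncoder")

  val producer = new Producer[String, String](new ProducerConfig(props))

  // keying by user id means a given user's events land in the same partition
  producer.send(new KeyedMessage[String, String](
    "activity-events", "user-42", """{"type":"view","actor":7,"recipient":42}"""))

  producer.close()
}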
21. Consumers
● Implement a simple API
● Subscribe to Kafka topics
● Process events, can do “anything”
○ For activity feeds
■ All interesting events are published
■ and consumed (views, likes, emails, uploads…)
■ then stored into the data store
○ We can also
■ maintain counts & stats, send notifications
● Can fail, with a certain tolerance
○ otherwise they stop and alarms are raised
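The consumer API itself isn’t spelled out in the slides; a hedged sketch of the “simple API plus failure tolerance” idea, with hypothetical names:

// Hypothetical consumer contract: each consumer names the Kafka topics it
// subscribes to, processes one event at a time, and tolerates a bounded
// number of failures before it stops and an alarm is raised.
trait FeedConsumer {
  def topics: Seq[String]
  def maxFailures: Int = 10

  // return true on success; false counts toward the failure tolerance
  def process(topic: String, payload: Array[Byte]): Boolean
}

class ActivityFeedConsumer extends FeedConsumer {
  val topics = Seq("views", "likes", "emails", "uploads")

  def process(topic: String, payload: Array[Byte]): Boolean = {
    // decode the event, append it to the recipient's feed,
    // bump the relevant counters, maybe send a notification
    true
  }
}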
23. Cassandra
● Activity feeds fit well into C*’s data model
● TimeUUID ordering means no sorting
● Having lots of writes is not a problem
○ works to our advantage
○ we want to push data to users’ activity feeds
● Supports counters
● Can add nodes as needed
● Gave each user multiple rows
○ each row is a feed type
○ one row is the “roll-up” row
○ roll-ups done in background, or on demand
○ each user has multiple counters and a few “lists”
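A sketch of how that per-user, wide-row layout might be modelled; the key scheme and store interface below are guesses for illustration, not our actual schema:

import java.util.UUID

// Illustrative layout: one wide row per (user, feed type), columns keyed by
// TimeUUID so entries come back already time-sorted, plus a dedicated
// roll-up row and counters per user.
case class FeedRowKey(userId: Long, feedType: String) {
  def key: String = s"$userId:$feedType" // e.g. "42:views", "42:likes", "42:rollup"
}

case class FeedEntry(id: UUID, payload: Array[Byte]) // id is a TimeUUID

trait ActivityFeedStore {
  def append(row: FeedRowKey, entry: FeedEntry): Unit
  def slice(row: FeedRowKey, limit: Int): Seq[FeedEntry] // newest first, no sorting needed
  def incrementCounter(userId: Long, counter: String, by: Long = 1): Unit
}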
24. How are the feeds read?
● Cassandra nodes don’t get read from directly
26. How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
27. How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
● Netty!
○ We have 3 readers with Varnish
○ we want to port this to Play!
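A minimal sketch of the shape of this read layer: a thin service in front of C* that mostly reads, caches responses, supports cache busting, and returns JSON. The names and the in-process cache here are illustrative; Varnish does the real caching in front of Netty.

import scala.collection.mutable

// Hypothetical read layer: a small HTTP service (Netty in our case, with
// Varnish in front) exposing a few read endpoints that return JSON.
trait FeedReader {
  def feedAsJson(userId: Long, feedType: String, limit: Int): String
}

class CachingFeedReader(underlying: FeedReader,
                        cache: mutable.Map[String, String] = mutable.Map.empty) extends FeedReader {

  def feedAsJson(userId: Long, feedType: String, limit: Int): String = {
    val key = s"$userId:$feedType:$limit"
    cache.getOrElseUpdate(key, underlying.feedAsJson(userId, feedType, limit))
  }

  // "busting" the cache just drops the key so the next read hits C* again
  def bust(userId: Long, feedType: String, limit: Int): Unit = {
    cache.remove(s"$userId:$feedType:$limit")
  }
}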
30. So how did all of this work out?
● Pretty fantastic!
● Kafka, top notch high performance queue
● Netty, fast, uses CPU very efficiently
● Cassandra, data model fit well
○ wide rows rock!
○ can’t live without counters anymore
○ eventual consistency, or why C* owns for feeds
● Issues?
○ C* consistency matters, must tune
○ Big batch reads from C* live cluster can be painful
○ not much else really (=
31. What else are you up to?
● Want to push more lists into C*
● Want to push our on-site inbox into C*
● Experimenting with Spark and C*
● Need to get data from MySQL -> C*
○ working on a tool to feed MySQL’s replication stream into Kafka via Avro binary serialization
■ can use it to keep MySQL and C* in sync for some tables, or to maintain basic counts
■ or as a data source for Spark
■ or pump into Hadoop
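For the MySQL-to-Kafka idea, a hedged sketch of what one Avro-serialized row-change event might look like; the schema and field names are made up for illustration, not the tool’s actual format:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

object RowChangeExample extends App {
  // Illustrative Avro schema for a row change read from MySQL's replication
  // stream; the serialized bytes would then be published to a Kafka topic.
  val schema = new Schema.Parser().parse(
    """{"type": "record", "name": "RowChange", "fields": [
      |  {"name": "table", "type": "string"},
      |  {"name": "op", "type": "string"},
      |  {"name": "timestamp", "type": "long"}
      |]}""".stripMargin)

  val record = new GenericData.Record(schema)
  record.put("table", "users")
  record.put("op", "insert")
  record.put("timestamp", System.currentTimeMillis())

  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericData.Record](schema).write(record, encoder)
  encoder.flush()

  val bytes = out.toByteArray // hand these bytes to a Kafka producer
}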