Activity feeds (and more) at Mate1
Big Data Montreal
Tuesday April 8th 2014
Hisham Mardam-Bey
Overview
● Who is this guy?
● Mate1
○ quick intro
○ some of the features
○ technology stack
● Activity feed
○ take 1
○ take 2
● What’s next?
Who is this guy?
● Linux user and developer since 1996
● Started out hacking on Enlightenment
○ X11 window manager
● Worked with OpenBSD
○ building embedded network gear
● Did a whole lot of C followed by Ruby
● Working with the JVM since 2007
github: mardambey
twitter: codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially team of 3, around 40 now
● Engineering team has 13 geeks / geekettes
● We own and run our own hardware
○ fun!
○ mostly…
○ LXC is a life (hardware resource?) saver (=
https://github.com/mate1
Some of our features...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
And what we use for them...
… all glued together by…
Programming languages…
Scala, Java -> back-end services, business logic, “controllers”
● What makes us =(
○ Struts2 -> XML, painful, want to dump it
○ Hibernate -> not as an ORM, mainly to map
● What makes us (=
○ Play! -> in prod, migrating to it… need “non-blocking” db layer
○ Akka -> simplifies concurrency, network transparency
Programming languages…
● JavaScript -> front end, mobile and desktop
○ Sencha Touch + Apache Cordova -> cross-platform
● PHP -> quick / temporary work
○ registration funnels
○ transient marketing pages
● Perl -> seriously? yes!
○ pre 2007, entire system was in Perl
○ now, customer service system, marketing tools, etc.
○ and! new email delivery service
At some point… activity feed
● Gather user activity and events
● Sometimes inject system events
● As low latency as possible
● Grouped into “tiers” (or types)
● Supports “roll-ups”
● Maintain counters for different event types
Take 1: fan-out on read (pull)
● Activity occurs...
○ A views B -> insert into views uid=B viewer_uid=A
○ C likes B -> insert into likes uid=C likee_uid=B
○ D emails B -> insert into emails uid=B sender_uid=D
■ refer to these as channels
■ based on legacy features and legacy data
● B asks for their activity feed
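A minimal sketch of the pull model described above, assuming hypothetical in-memory lists in place of the per-feature MySQL tables. The real channels did not share a column layout; the uniform (timestamp, actor, target) rows here are a simplification for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the per-feature tables ("channels").
# Each row is (timestamp, actor_uid, target_uid).
CHANNELS: Dict[str, List[Tuple[int, str, str]]] = {
    "views": [(100, "A", "B")],   # A views B
    "likes": [(105, "C", "B")],   # C likes B
    "emails": [(110, "D", "B")],  # D emails B
}


def read_feed(uid: str, limit: int = 10) -> List[Tuple[int, str, str]]:
    """Pull model: query every channel at read time, then merge by time."""
    per_channel = []
    for name, rows in CHANNELS.items():
        hits = [(ts, name, actor) for ts, actor, target in rows if target == uid]
        per_channel.append(sorted(hits, reverse=True))
    # Each input is sorted newest first, so the merge is newest first too.
    merged = heapq.merge(*per_channel, reverse=True)
    return list(merged)[:limit]
```

Every read touches every channel, so the cost of one feed request grows with the number of channels.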
Take 1: fan-out on read (pull)
[Diagram: app servers in front of memcached and MySQL databases (messages, lists, images, users)]
[Flow: cached? -> query all channels -> aggregate -> cache -> all done!]
so far so good! … or is it?
Take 1: fan-out on read (pull)
● Several channels piggybacked off existing features
○ no uniformity in data structure, not always optimal
● Time constraints on queries
○ can’t go back in time, databases suffered
● Activity feeds slowed down…
● Temporary solutions?
○ slash a bunch of channels -> sucks!
○ aggregate multiple channels to a single table -> hack!
○ had to rethink how we’re doing this
Take 2: fan-out on write (push)
● Had to change approach entirely
● Needed to store data more efficiently
● Writes can be queued up
● Roll-ups should be persistent
● More channels != slower performance
● Built as a scalable service end-to-end
More efficient storage
● Don’t piggyback on old features
● Pre-aggregated user activity feeds
● Ideally store roll-ups in the same store
● Always sorted by time
● Minimal updates or deletes required
● Avoid counting
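A minimal sketch of the push model and the storage goals above, assuming hypothetical in-memory dictionaries in place of a persistent store. Each event is written once into the recipient's pre-aggregated feed, kept sorted by time, and a counter is incremented at write time so that reads never count.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Entry = Tuple[int, str, str]  # (timestamp, event_type, actor_uid)

# Hypothetical in-memory stores; a real system would persist these.
feeds: DefaultDict[str, List[Entry]] = defaultdict(list)
counters: DefaultDict[Tuple[str, str], int] = defaultdict(int)


def record_event(ts: int, event_type: str, actor: str, target: str) -> None:
    """Push model: do the work once, at write time."""
    bisect.insort(feeds[target], (ts, event_type, actor))  # stays time-sorted
    counters[(target, event_type)] += 1                    # no counting on read


def read_feed(uid: str, limit: int = 10) -> List[Entry]:
    """Reads touch one feed, however many event types exist."""
    return feeds.get(uid, [])[-limit:][::-1]  # newest first
```

Adding a new event type changes nothing on the read path, which is the sense in which more channels no longer means slower reads.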
Writes can be queued up
● Push all activity / events into message queue
● Process and persist as soon as possible
○ don’t cause back-pressure
○ needs to be durable
○ must be able to easily scale consumption
● Lots of message queue technologies
○ experience with RabbitMQ (web server logs)
○ and Redis (pubsub, monitoring)
○ tested out Flume (non-ng), buggy at the time
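The queued-write idea can be sketched with an in-process queue and a single worker thread. This is an illustration under stated assumptions only: Python's `queue.Queue` stands in for the message queue and a list stands in for the data store, and an in-process queue is not durable, so it shows the decoupling but not the durability requirement.

```python
import queue
import threading

events: queue.Queue = queue.Queue()  # stand-in for the message queue
persisted = []                       # stand-in for the data store


def publish(event: dict) -> None:
    """Producers enqueue and return at once; they never wait on the store."""
    events.put(event)


def consume() -> None:
    """Consumer loop: persist each event as soon as it arrives."""
    while True:
        event = events.get()
        persisted.append(event)
        events.task_done()


worker = threading.Thread(target=consume, daemon=True)
worker.start()

publish({"type": "view", "actor": "A", "target": "B"})
publish({"type": "like", "actor": "C", "target": "B"})
events.join()  # block until everything published so far has been persisted
```

Scaling consumption in this model would mean starting more workers; with more than one worker, the order in which events are persisted is no longer guaranteed.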
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3)]
Event manager
● Wanted to be able to publish events
● Started off needing minimal information
○ event type, timestamp, user ids, etc.
○ soon after, needed much more data per event
● Was one of our first Scala libraries!
○ admittedly, needs clean-up now (=
● Provides sync and async publishing
● Also provides callback based consumer API
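A rough sketch of what a library with sync publishing, async publishing, and a callback-based consumer API might look like. The class and method names are hypothetical and are not Mate1's actual Scala API; callbacks run in-process here rather than going through a message queue.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List

Callback = Callable[[dict], None]


class EventManager:
    """Hypothetical in-process stand-in for an event publishing library."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = {}
        self._pool = ThreadPoolExecutor(max_workers=1)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register a callback to be invoked for every event on a topic."""
        self._callbacks.setdefault(topic, []).append(callback)

    def publish(self, topic: str, event: dict) -> None:
        """Synchronous: returns after every callback has run."""
        for callback in self._callbacks.get(topic, []):
            callback(event)

    def publish_async(self, topic: str, event: dict) -> Future:
        """Asynchronous: returns a Future immediately."""
        return self._pool.submit(self.publish, topic, event)


seen: List[dict] = []
manager = EventManager()
manager.subscribe("views", seen.append)
manager.publish("views", {"ts": 100, "actor": "A", "target": "B"})
manager.publish_async("views", {"ts": 101, "actor": "C", "target": "B"}).result()
```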
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3) -> Kafka (x3), with ZK (x2)]
Why Kafka?
● Durable by design
● Very high throughput! and scalable
● Supports consumer groups
● Native Scala API
● Supports consumer “replay”
● Per topic data retention and partitioning
● Integrates with Hadoop
○ Kafka <-> Hadoop via Camus and Camus2Kafka
● Grabbed it from LinkedIn’s SVN
○ never looked back! we love it!
○ moving to 0.8.1 at the time of this writing
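Consumer groups and replay both follow from one idea: messages are retained in a log, and each group tracks its own offset into it. The toy model below illustrates that idea only; it is a single-partition, in-memory sketch with hypothetical names, not the Kafka client API.

```python
from typing import Dict, List


class TopicLog:
    """Toy single-partition log with one offset per consumer group."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer group -> next offset

    def append(self, message: str) -> None:
        """Messages are retained after being read, not removed."""
        self._messages.append(message)

    def poll(self, group: str) -> List[str]:
        """Return messages this group has not seen yet and advance its offset."""
        start = self._offsets.get(group, 0)
        self._offsets[group] = len(self._messages)
        return self._messages[start:]

    def rewind(self, group: str, offset: int = 0) -> None:
        """Replay: move a group's offset back so it re-reads old messages."""
        self._offsets[group] = offset
```

Because offsets are per group, a feed consumer and a stats consumer can read the same topic independently, and either one can be rewound without affecting the other.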
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3) -> Kafka (x3), with ZK (x2) -> Consumers (x2)]
Consumers
● Implement a simple API
● Subscribe to Kafka topics
● Process events, can do “anything”
○ For activity feeds
■ All interesting events are published
■ and consumed (views, liked, emails, uploads…)
■ then stored into the data store
○ We can also
■ maintain counts & stats, send notifications
● Can fail, with certain tolerance
○ otherwise they stop and alarms are raised
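The failure-tolerance behaviour can be sketched as a consumer loop with a failure budget. The slides do not say how tolerance was measured, so the count-based threshold, the function name, and the exception used as the "alarm" are all assumptions.

```python
from typing import Callable, Iterable, List


class TooManyFailures(RuntimeError):
    """Raised when a consumer exceeds its failure budget (the "alarm")."""


def run_consumer(events: Iterable[dict],
                 handle: Callable[[dict], None],
                 max_failures: int = 3) -> List[dict]:
    """Process events, skipping failures until the budget is spent.

    Returns the events that failed. Raises TooManyFailures, stopping the
    consumer, once more than max_failures events have failed.
    """
    failed: List[dict] = []
    for event in events:
        try:
            handle(event)
        except Exception as exc:
            failed.append(event)
            if len(failed) > max_failures:
                raise TooManyFailures(f"{len(failed)} events failed") from exc
    return failed
```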
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3) -> Kafka (x3), with ZK (x2) -> Consumers (x2) -> Store (x3)]
Cassandra
● Activity feeds fit well into C*’s data model
● TimeUUID ordering means no sorting
● Having lots of writes is not a problem
○ works to our advantage
○ we want to push data to users’ activity feeds
● Supports counters
● Can add nodes as needed
● Gave each user multiple rows
○ each row is feed type
○ one row is the “roll-up” row
○ roll-ups done in background, or on demand
○ each user has multiple counters and a few “lists”
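The slides do not define what a roll-up contains. One plausible reading, assumed here, is that a roll-up collapses a user's events into one entry per event type (for example "A and D viewed you"). The sketch below works on plain tuples rather than Cassandra rows, and the field names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Event = Tuple[int, str, str]  # (timestamp, event_type, actor_uid)


def roll_up(events: List[Event]) -> List[Dict[str, Any]]:
    """Collapse a feed into one entry per event type, newest type first."""
    groups: Dict[str, Dict[str, Any]] = {}
    for ts, event_type, actor in sorted(events, reverse=True):
        # The first event seen for a type is its newest one, so dict
        # insertion order leaves the groups ordered by latest activity.
        entry = groups.setdefault(
            event_type, {"type": event_type, "latest": ts, "actors": []}
        )
        if actor not in entry["actors"]:
            entry["actors"].append(actor)
    return list(groups.values())
```

Storing the result in its own row, as the slide describes, means the grouping is computed in the background or on demand rather than on every read.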
How are the feeds read?
● Cassandra nodes don’t get read from directly
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3) -> Kafka (x3), with ZK (x2) -> Consumers (x2) -> C* (x3) -> ??? (x2)]
How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
● Netty!
○ We have 3 readers with Varnish
○ we want to port this to Play!
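The read path above (specific access points, cached responses, cache busting, JSON output) can be sketched as a read-through cache. This is an in-process illustration with hypothetical names; in the setup described, caching was handled by Varnish in front of the Netty readers, not by a dictionary.

```python
import json
from typing import Callable, Dict


class FeedReader:
    """Read-through cache in front of a store; stand-in for the HTTP layer."""

    def __init__(self, load: Callable[[str], list]) -> None:
        self._load = load                 # hypothetical function reading the store
        self._cache: Dict[str, str] = {}  # uid -> cached JSON response

    def get(self, uid: str) -> str:
        """Return the user's feed as a JSON string, from cache when possible."""
        if uid not in self._cache:
            self._cache[uid] = json.dumps(self._load(uid))
        return self._cache[uid]

    def bust(self, uid: str) -> None:
        """Drop the cached response so the next read goes to the store."""
        self._cache.pop(uid, None)
```

A consumer that writes a new event into a user's feed would call `bust` for that user, so the next request sees fresh data.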
Architecture
[Diagram: app servers, web servers, other services -> Event Manager (x3) -> Kafka (x3), with ZK (x2) -> Consumers (x2) -> C* (x3) -> Netty (x2)]
What else?
[Diagram: the architecture above, plus SOLR, Redis, Ejabberd (x3), and APNS; labels: NRT search, geo-location, TTL flags, transient data]
So how did all of this work out?
● Pretty fantastic!
● Kafka, top notch high performance queue
● Netty, fast, uses CPU very efficiently
● Cassandra, data model fit well
○ wide rows rock!
○ can’t live without counters anymore
○ eventual consistency, or why C* owns for feeds
● Issues?
○ C* consistency matters, must tune
○ Big batch reads from C* live cluster can be painful
○ not much else really (=
What else are you up to?
● Want to push more lists into C*
● Want to push our on-site inbox into C*
● Experimenting with Spark and C*
● Need to get data from MySQL -> C*
○ working on a tool to feed MySQL’s replication stream into Kafka via Avro binary serialization
■ can use it to keep MySQL and C* in sync for some table, or to maintain basic counts
■ or as a data source for Spark
■ or pump into Hadoop
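As a sketch of the "maintain basic counts" use case, a consumer could fold row-change events into per-table counts. The event shape (a dict with "op" and "table" keys) is an assumption for illustration; it is not the format of MySQL's replication stream or of the tool mentioned above, and Avro decoding is left out.

```python
from collections import Counter
from typing import Dict, Iterable


def apply_changes(changes: Iterable[Dict[str, str]], counts: Counter) -> Counter:
    """Fold a stream of row-change events into per-table row counts."""
    for change in changes:
        if change["op"] == "insert":
            counts[change["table"]] += 1
        elif change["op"] == "delete":
            counts[change["table"]] -= 1
        # Updates leave the row count unchanged.
    return counts
```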
Fin!
That’s all, folks (=
Thanks!
Questions?
Oh, we’re hiring!
http://mate1inc.com/careers/

Powering Real-Time Decisions with Continuous Data Streams
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 

Activity feeds (and more) at Mate1

  • 1. Activity feeds and more at Mate1 Big Data Montreal Tuesday April 8th 2014 Hisham Mardam-Bey
  • 2. Overview ● Who is this guy? ● Mate1 ○ quick intro ○ some of the features ○ technology stack ● Activity feed ○ take 1 ○ take 2 ● What’s next?
  • 3. Who is this guy? ● Linux user and developer since 1996 ● Started out hacking on Enlightenment ○ X11 window manager ● Worked with OpenBSD ○ building embedded network gear ● Did a whole lot of C followed by Ruby ● Working with the JVM since 2007 github: mardambey twitter: codewarrior
  • 4. Mate1: quick intro ● Online dating, since 2003, based in Montreal ● Initially team of 3, around 40 now ● Engineering team has 13 geeks / geekettes ● We own and run our own hardware ○ fun! ○ mostly… ○ LXC is a life (hardware resource?) saver (= https://github.com/mate1
  • 5. Some of our features... ● Lots of communication, chatting, push notifs ● Search, matching, ranking, geo-location ● Lists, friends, blocks, people interested, more ● News & activity feeds, counters, contacts
  • 6. And what we use for them... ● Lots of communication, chatting, push notifs ● Search, matching, ranking, geo-location ● Lists, friends, blocks, people interested, more ● News & activity feeds, counters, contacts … all glued together by…
  • 7. Programming languages… Scala, Java -> back-end services, business logic, “controllers” ● What makes us =( ○ Struts2 -> XML, painful, want to dump it ○ Hibernate -> not as an ORM, mainly to map ● What makes us (= ○ Play! -> in prod, migrating to it… need “non-blocking” db layer ○ Akka -> simplifies concurrency, network transparency
  • 8. Programming languages… ● JavaScript -> front end, mobile and desktop ○ Sencha Touch + Apache Cordova -> cross-platform ● PHP -> quick / temporary work ○ registration funnels ○ transient marketing pages ● Perl -> seriously? yes! ○ pre 2007, entire system was in Perl ○ now, customer service system, marketing tools, etc. ○ and! new email delivery service
  • 9. At some point… activity feed ● Gather user activity and events ● Sometimes inject system events ● As low latency as possible ● Grouped into “tiers” (or types) ● Supports “roll-ups” ● Maintain counters for different event types
  • 10. Take 1: fan-out on read (pull) ● Activity occurs... ○ A views B -> insert into views uid=B viewer_uid=A ○ C likes B -> insert into likes uid=C likee_uid=B ○ D emails B -> insert into emails uid=B sender_uid=D ■ refer to these as channels ■ based on legacy features and legacy data ● B asks for their activity feed
  • 11. Take 1: fan-out on read (pull) [Diagram: app servers, memcached, and MySQL stores for messages, lists, images and users] cached? -> query all channels -> aggregate -> cache -> all done! so far so good! … or is it?
  • 12. Take 1: fan-out on read (pull) ● Several channels piggybacked off existing features ○ no uniformity in data structure, not always optimal ● Time constraints on queries ○ can’t go back in time, databases suffered ● Activity feeds slowed down… ● Temporary solutions? ○ slash a bunch of channels -> sucks! ○ aggregate multiple channels to a single table -> hack! ○ had to rethink how we’re doing this
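
To make the pull model concrete, here is a minimal sketch of fan-out on read. It is an illustration only, not Mate1's code: the `CHANNELS` and `CACHE` dictionaries are hypothetical in-memory stand-ins for the per-feature MySQL tables and memcached, and the uniform row shape is an assumption (the slides note the real tables were not uniform).

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the per-feature tables ("channels").
# Each row is (timestamp, actor_uid, recipient_uid).
CHANNELS: Dict[str, List[Tuple[int, str, str]]] = {
    "views": [(100, "A", "B"), (140, "E", "B")],
    "likes": [(120, "C", "B")],
    "emails": [(130, "D", "B")],
}

# Hypothetical stand-in for memcached: uid -> aggregated feed.
CACHE: Dict[str, List[Tuple[int, str, str]]] = {}


def read_feed(uid: str, limit: int = 10) -> List[Tuple[int, str, str]]:
    """Build uid's feed at read time: query every channel, merge, cache."""
    if uid in CACHE:
        return CACHE[uid]
    per_channel = []
    for name, rows in CHANNELS.items():
        # One query per channel, so read cost grows with the channel count.
        hits = [(ts, name, actor) for ts, actor, rcpt in rows if rcpt == uid]
        per_channel.append(sorted(hits, reverse=True))
    # Merge the newest-first lists into a single newest-first feed.
    merged = list(heapq.merge(*per_channel, reverse=True))[:limit]
    CACHE[uid] = merged
    return merged
```

Every cache miss touches every channel, which is why adding channels slowed the feed down in this design.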
  • 13. Take 2: fan-out on write (push) ● Had to change approach entirely ● Needed to store data more efficiently ● Writes can be queued up ● Roll-ups should be persistent ● More channels != slower performance ● Built as a scalable service end-to-end
  • 14. More efficient storage ● Don’t piggyback on old features ● Pre-aggregated user activity feeds ● Ideally store roll-ups in the same store ● Always sorted by time ● Minimal updates or deletes required ● Avoid counting
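
The push model can be sketched as follows. This is a hypothetical in-memory `FeedStore`, not the actual service: each event is written into the recipient's pre-aggregated feed in time order, and a counter is bumped at write time so that reads neither sort nor count.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Tuple

FeedEntry = Tuple[int, str, str]  # (timestamp, event_type, actor_uid)


class FeedStore:
    """Hypothetical pre-aggregated store: one time-sorted feed per user."""

    def __init__(self) -> None:
        self.feeds: DefaultDict[str, List[FeedEntry]] = defaultdict(list)
        self.counters: DefaultDict[Tuple[str, str], int] = defaultdict(int)

    def push(self, recipient: str, ts: int, event_type: str, actor: str) -> None:
        # Insert in timestamp order so reads never need to sort.
        bisect.insort(self.feeds[recipient], (ts, event_type, actor))
        # Bump the counter at write time so reads never need to count.
        self.counters[(recipient, event_type)] += 1

    def read(self, recipient: str, limit: int = 10) -> List[FeedEntry]:
        # A single lookup, however many event types exist; newest first.
        return self.feeds.get(recipient, [])[-limit:][::-1]
```

Adding a new event type here only adds writes; the read path stays a single lookup, which is the sense in which more channels need not mean slower reads.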
  • 15. Writes can be queued up ● Push all activity / events into message queue ● Process and persist as soon as possible ○ don’t cause back-pressure ○ needs to be durable ○ must be able to easily scale consumption ● Lots of message queue technologies ○ experience with RabbitMQ (web server logs) ○ and Redis (pubsub, monitoring) ○ tested out Flume (non-ng), buggy at the time
  • 17. Event manager ● Wanted to be able to publish events ● Started off needing minimal information ○ event type, timestamp, user ids, etc. ○ soon after, needed much more data per event ● Was one of our first Scala libraries! ○ admittedly, needs clean-up now (= ● Provides sync and async publishing ● Also provides callback based consumer API
  • 19. Why Kafka? ● Durable by design ● Very high throughput! and scalable ● Supports consumer groups ● Native Scala API ● Supports consumer “replay” ● Per topic data retention and partitioning ● Integrates with Hadoop ○ Kafka <-> Hadoop via Camus and Camus2Kafka ● Grabbed it from LinkedIn’s SVN ○ never looked back! we love it! ○ moving to 0.8.1 at the time of this writing
  • 21. Consumers ● Implement a simple API ● Subscribe to Kafka topics ● Process events, can do “anything” ○ For activity feeds ■ All interesting events are published ■ and consumed (views, liked, emails, uploads…) ■ then stored into the data store ○ We can also ■ maintain counts & stats, send notifications ● Can fail, with certain tolerance ○ otherwise they stop and alarms are raised
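
A rough sketch of the consumer idea: handlers subscribe to topics, process each event, and tolerate a bounded number of failures before stopping. The `Broker` class is an in-memory stand-in and does not reflect Kafka's API; `Consumer`, `ConsumerStopped` and `max_failures` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class ConsumerStopped(Exception):
    """Raised when a consumer exceeds its failure tolerance."""


class Consumer:
    """Hypothetical callback-based consumer with a bounded failure tolerance."""

    def __init__(self, handler: Handler, max_failures: int = 3) -> None:
        self.handler = handler
        self.max_failures = max_failures
        self.failures = 0

    def consume(self, event: Event) -> None:
        try:
            self.handler(event)
        except Exception:
            self.failures += 1
            if self.failures > self.max_failures:
                # Past the tolerance: stop so that an alarm can be raised.
                raise ConsumerStopped(f"{self.failures} failed events")


class Broker:
    """In-memory stand-in for a message queue; not Kafka's API."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Consumer]] = defaultdict(list)

    def subscribe(self, topic: str, consumer: Consumer) -> None:
        self.subscribers[topic].append(consumer)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every consumer subscribed to the topic.
        for consumer in self.subscribers[topic]:
            consumer.consume(event)
```

Because several consumers can subscribe to the same topic, one could write to the feed store while another maintains counts or sends notifications, without the publisher knowing about either.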
  • 23. Cassandra ● Activity feeds fit well into C*’s data model ● TimeUUID ordering means no sorting ● Having lots of writes is not a problem ○ works to our advantage ○ we want to push data to users’ activity feeds ● Supports counters ● Can add nodes as needed ● Gave each user multiple rows ○ each row is feed type ○ one row is the “roll-up” row ○ roll-ups done in background, or on demand ○ each user has multiple counters and a few “lists”
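
The row layout can be pictured with a small in-memory model. Everything here is an assumption made for illustration: integer timestamps stand in for TimeUUID column names, `WideRowStore` is a hypothetical class rather than a Cassandra client, and the roll-up rule (keep only the newest event per type and actor) is one plausible reading of "roll-up" that the slides do not spell out.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

RowKey = Tuple[str, str]  # (user_id, feed_type), e.g. ("B", "views")
Column = Tuple[int, str]  # (time-ordered column name, value)


class WideRowStore:
    """Hypothetical model of one wide row per user and feed type."""

    def __init__(self) -> None:
        self.rows: DefaultDict[RowKey, List[Column]] = defaultdict(list)

    def insert(self, user: str, feed_type: str, ts: int, actor: str) -> None:
        # Columns stay ordered by time, as TimeUUID column names would be.
        bisect.insort(self.rows[(user, feed_type)], (ts, actor))

    def roll_up(self, user: str, feed_types: List[str]) -> None:
        """Rebuild the roll-up row: newest event per (feed type, actor)."""
        latest: Dict[Tuple[str, str], int] = {}
        for feed_type in feed_types:
            for ts, actor in self.rows.get((user, feed_type), []):
                key = (feed_type, actor)
                latest[key] = max(ts, latest.get(key, ts))
        self.rows[(user, "rollup")] = sorted(
            (ts, f"{feed_type}:{actor}") for (feed_type, actor), ts in latest.items()
        )
```

The per-type rows keep every event, while the roll-up row is derived data that can be rebuilt in the background or on demand.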
  • 27. How are the feeds read? ● Cassandra nodes don’t get read from directly ● HTTP layer in front of C* ○ provides specific access points ○ mainly for reading, almost no writes ■ except for some counters ○ supports caching requests ■ and busting the cache ○ returns everything as JSON ● Netty! ○ We have 3 readers with Varnish ○ we want to port this to Play!
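
The caching behaviour of such a read layer can be sketched as a read-through cache with explicit busting. This is a hypothetical `CachedReader`, not the Netty service or Varnish configuration; the `load` callback stands in for a read from the data store, and the JSON shape is an assumption.

```python
import json
from typing import Callable, Dict


class CachedReader:
    """Hypothetical read-through cache in front of a feed store."""

    def __init__(self, load: Callable[[str], list]) -> None:
        self.load = load                 # fetches a user's feed from the store
        self.cache: Dict[str, str] = {}  # user_id -> cached JSON response
        self.loads = 0                   # number of trips to the store

    def get_feed(self, user_id: str) -> str:
        if user_id not in self.cache:
            # Cache miss: go to the store once and keep the JSON response.
            self.loads += 1
            body = {"user": user_id, "feed": self.load(user_id)}
            self.cache[user_id] = json.dumps(body)
        return self.cache[user_id]

    def bust(self, user_id: str) -> None:
        # Drop the cached response, e.g. after new activity is written.
        self.cache.pop(user_id, None)
```

Repeated reads are served from the cache until `bust` is called for that user, so the store only sees one read per user between writes.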
  • 30. So how did all of this work out? ● Pretty fantastic! ● Kafka, top notch high performance queue ● Netty, fast, uses CPU very efficiently ● Cassandra, data model fit well ○ wide rows rock! ○ can’t live without counters anymore ○ eventual consistency, or why C* owns for feeds ● Issues? ○ C* consistency matters, must tune ○ Big batch reads from C* live cluster can be painful ○ not much else really (=
  • 31. What else are you up to? ● Want to push more lists into C* ● Want to push our on-site inbox into C* ● Experimenting with Spark and C* ● Need to get data from MySQL -> C* ○ working on a tool to feed MySQL’s replication stream into Kafka via Avro binary serialization ■ can use it to keep MySQL and C* in sync for some table, or to maintain basic counts ■ or as a data source for Spark ■ or pump into Hadoop
  • 32. Fin! That’s all folks (= Thanks! Questions? Oh, we’re hiring! http://mate1inc.com/careers/