Presentation for Papers We Love at QCON NYC 17. I didn't write the paper, good people at Facebook did. But I sure enjoyed reading it and presenting it.
Slide 19
Decision #3 – Processing Semantics
Facebook Verdict: It depends on requirements
• Ranker writes to idempotent system – at least once
• Scuba can lose data, but not handle duplicates – at most once
• … Exactly once is REALLY HARD and requires transactions
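A minimal sketch of why an idempotent sink makes at-least-once delivery safe (all names here are illustrative, not from the paper): if every write is keyed by event ID, redelivering the same event after a retry cannot change the stored result, so the ranker gets exactly-once *results* without exactly-once *delivery*.

```python
# Illustrative only: an idempotent sink keyed by event ID.
# Duplicate deliveries (at-least-once) are no-ops, so retries are safe
# without a transactional protocol.

class IdempotentSink:
    def __init__(self):
        self.store = {}  # event_id -> value

    def write(self, event_id, value):
        # Writing the same (event_id, value) pair twice changes nothing.
        self.store[event_id] = value

sink = IdempotentSink()
sink.write("evt-1", 10)
sink.write("evt-1", 10)  # duplicate delivery after a retry
assert sink.store == {"evt-1": 10}
```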
Slide 20
Don’t miss the side-note on side-effects
• Exactly once means writing output + offsets to a transactional system
• This takes time
• Why just wait when you can deserialize? And maybe do other stateless stuff?
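The side-note above can be sketched as a pipeline: while the slow transactional commit of (output, offsets) for batch N is in flight, deserialize batch N+1. Stateless work has no side effects, so doing it before the previous commit lands is safe. This is my illustration of the idea, not the paper's code; the commit function is a stand-in.

```python
# Sketch: overlap stateless deserialization with a slow transactional
# commit. slow_transactional_commit is a stand-in for writing output +
# offsets to a transactional store.
import json
import time
from concurrent.futures import ThreadPoolExecutor

def slow_transactional_commit(outputs, offset):
    time.sleep(0.01)  # pretend round trip to the transactional store
    return True

raw_batches = [[b'{"n": 1}', b'{"n": 2}'], [b'{"n": 3}']]
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = None
    for offset, raw in enumerate(raw_batches):
        # Stateless: runs while the previous commit is still in flight.
        parsed = [json.loads(r) for r in raw]
        if pending:
            pending.result()  # only now wait for the previous commit
        pending = pool.submit(slow_transactional_commit, parsed, offset)
        results.extend(parsed)
    pending.result()
assert results == [{"n": 1}, {"n": 2}, {"n": 3}]
```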
Slide 21
Decision #4 – State Saving
• In-memory state with replication (Old VoltDB)
• Requires lots of hardware and network
• Local database (Samza, Kafka Streams API)
• Remote database (Millwheel)
• Upstream (i.e. replay everything on failure)
• Global consistent snapshot (Flink)
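The "local database" option above (Samza, Kafka Streams) can be sketched in a few lines: state lives in-process (Kafka Streams uses RocksDB; a dict stands in here), and periodic snapshots let a restarted worker reload state and replay from the checkpointed offset. The checkpoint interval and names are illustrative assumptions.

```python
# Sketch of local-database state saving: in-process state plus periodic
# checkpoints of (state, last offset). A dict stands in for RocksDB.
import copy

class LocalStateProcessor:
    def __init__(self):
        self.counts = {}
        self.checkpoint = ({}, -1)  # (state snapshot, last offset)

    def process(self, offset, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        if offset % 100 == 99:  # illustrative: checkpoint every 100 events
            self.checkpoint = (copy.deepcopy(self.counts), offset)

    def recover(self):
        # After a crash: reload the snapshot, replay from offset + 1.
        state, offset = self.checkpoint
        self.counts = copy.deepcopy(state)
        return offset + 1

proc = LocalStateProcessor()
for offset in range(150):
    proc.process(offset, "clicks")
# Crash here: only the snapshot taken at offset 99 survives.
resume_from = proc.recover()
assert resume_from == 100 and proc.counts["clicks"] == 100
```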
Slide 22
Decision #4 – State Saving
Facebook Verdict: It depends
Rhode Island vs. Alaska
Slide 23
Best Part of the Paper – by far
How to efficiently work with state in remote DB?
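One way to make remote state efficient (my sketch of the general pattern, not the paper's actual implementation) is to buffer updates locally and flush them as one batched merge, instead of doing a read-modify-write round trip per event. All class and method names below are made up for illustration.

```python
# Sketch: amortize remote-DB round trips by batching increments.
# FakeRemoteDB counts round trips so the saving is visible.

class FakeRemoteDB:
    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def multi_merge(self, deltas):
        # One network round trip applies a whole batch of deltas.
        self.round_trips += 1
        for key, delta in deltas.items():
            self.data[key] = self.data.get(key, 0) + delta

class BatchingWriter:
    def __init__(self, db, flush_every=100):
        self.db, self.flush_every = db, flush_every
        self.buffer, self.seen = {}, 0

    def increment(self, key):
        self.buffer[key] = self.buffer.get(key, 0) + 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        if self.buffer:
            self.db.multi_merge(self.buffer)
            self.buffer = {}

db = FakeRemoteDB()
writer = BatchingWriter(db)
for _ in range(1000):
    writer.increment("clicks")
writer.flush()
assert db.data["clicks"] == 1000 and db.round_trips == 10
```

Ten round trips instead of a thousand; the trade-off is that buffered updates are lost on a crash unless they can be replayed from the stream.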
Slide 24
Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
Slide 25
Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
Facebook Verdict:
SQL runs everywhere
And binary generation FTW
Slide 27
Lessons Learned!
The biggest win is pipelines composed of independent processors
• Mixing multiple systems let us move fast
• High level abstractions let us improve implementation
• Ease of debugging – Independent nodes and ability to replay
• Ease of deployment – Puma as-a-service
• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.
• In the future – auto-scale based on lag
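The lag metric above is simple to state: per partition, lag is the newest available offset minus the last processed offset. A minimal sketch (the threshold is an assumed example, not a Facebook number):

```python
# Sketch: per-partition consumer lag = latest offset - committed offset.

def consumer_lag(latest_offsets, committed_offsets):
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

lag = consumer_lag({"p0": 1500, "p1": 900}, {"p0": 1480, "p1": 900})
assert lag == {"p0": 20, "p1": 0}
assert max(lag.values()) < 100  # illustrative alert/auto-scale threshold
```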
Is it the best paper ever? Nope. Is it a world-changing seminal work? Nope. On the other hand, it is PACKED with cool ideas and patterns.
2016 is incredibly late for this kind of paper – Storm (Twitter), S4, Kafka + Samza (LinkedIn), Spark Streaming (Berkeley) and Heron (more Twitter) are already out.
By the time Facebook decided to describe their real-time data patterns, we already had evaluation criteria and patterns (decisions made in context).
I love this paper because of the trade-offs. So many other papers pretend to present the best solution ever and hide the complications and issues.
We also have an apology
These examples also demonstrate something I’ve increasingly noticed – the lines around ETL are blurring.
They have many different real-time pipelines, all built from a few components.
Facebook has lots of systems for everything. This is part of what makes following the paper so complicated. They have good reasons for having this complexity, but it is unlikely that these good reasons apply to you. Which is one good reason to avoid copying architectures from papers.
Puma – SQL. Fast to develop. Materialized views of simple aggregate queries and stateless transformations of streams
Swift – pipelines in Python. Stateless, low throughput. Not mentioned much.
Stylus – low-level processor API in C++.
Estimating low watermarks is not trivial. There are a bunch of possible solutions. It would have been amazing to know what they used and how it's working for them. Unfortunately, there is no Stylus paper.
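Since the paper doesn't say which scheme Stylus uses, here is one common low-watermark estimate for comparison: track the newest event time seen on each input, subtract an allowed-lateness slack, and take the minimum across inputs. The slack value is an assumption for illustration.

```python
# Sketch of one common low-watermark estimate (not necessarily what
# Stylus does): min over inputs of the newest event time seen, minus an
# allowed-lateness slack. Windows older than the watermark can be
# finalized and emitted.

def low_watermark(newest_event_time_per_input, slack_seconds=60):
    return min(newest_event_time_per_input.values()) - slack_seconds

wm = low_watermark({"mobile": 1000, "web": 1100, "backfill": 950})
assert wm == 890  # gated by the slowest input (950) minus 60s of slack
```

Note the weakness this sketch exposes: one slow or idle input holds back the watermark for everyone, which is exactly why the estimation problem is non-trivial.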
Laser – really fast and awesome KV store. Distributed RocksDB. Where results are made available.
Scuba – metrics store. Charity will tell you more.
Hive. YUUUGE data warehouse
Facebook discusses their decisions in the context of an example. Which is a great idea.
Lucky for us, this example looks exactly like all the other data pipelines I’ve worked with.
They highlight 5 key decisions. For each decision, they evaluated the common options and shared what Facebook chose.
Minimum latency???
First, who in the world talks about minimum latency? What’s your 99.9%ile?
Second, you are giving persistent data stores a bad name.
For example, when we develop a new metric and want to run it against old data
Facebook basically has a magical compiler that takes a Stylus C++ app and generates two binaries – batch and streaming.