Presentation for Papers We Love at QCON NYC 17. I didn't write the paper, good people at Facebook did. But I sure enjoyed reading it and presenting it.
Slide 19
Decision #3 – Processing Semantics
Facebook Verdict: It depends on requirements
• Ranker writes to idempotent system – at least once
• Scuba can lose data, but not handle duplicates – at most once
• … Exactly once is REALLY HARD and requires transactions
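A minimal sketch of why an idempotent sink makes at-least-once delivery safe (all names here are illustrative, not from the paper): if every write is keyed by event ID, redelivering the same event after a retry cannot change the stored result, so the ranker gets exactly-once *results* without exactly-once *delivery*.

```python
# Illustrative only: an idempotent sink keyed by event ID.
# Duplicate deliveries (at-least-once) are no-ops, so retries are safe
# without a transactional protocol.

class IdempotentSink:
    def __init__(self):
        self.store = {}  # event_id -> value

    def write(self, event_id, value):
        # Writing the same (event_id, value) pair twice changes nothing.
        self.store[event_id] = value

sink = IdempotentSink()
sink.write("evt-1", 10)
sink.write("evt-1", 10)  # duplicate delivery after a retry
assert sink.store == {"evt-1": 10}
```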
Slide 20
Don’t miss the side-note on side-effects
• Exactly once means writing output + offsets to a transactional system
• This takes time
• Why just wait when you can deserialize? And maybe do other stateless stuff?
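The side-note above can be sketched as a pipeline: while the slow transactional commit of (output, offsets) for batch N is in flight, deserialize batch N+1. Stateless work has no side effects, so doing it before the previous commit lands is safe. This is my illustration of the idea, not the paper's code; the commit function is a stand-in.

```python
# Sketch: overlap stateless deserialization with a slow transactional
# commit. slow_transactional_commit is a stand-in for writing output +
# offsets to a transactional store.
import json
import time
from concurrent.futures import ThreadPoolExecutor

def slow_transactional_commit(outputs, offset):
    time.sleep(0.01)  # pretend round trip to the transactional store
    return True

raw_batches = [[b'{"n": 1}', b'{"n": 2}'], [b'{"n": 3}']]
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = None
    for offset, raw in enumerate(raw_batches):
        # Stateless: runs while the previous commit is still in flight.
        parsed = [json.loads(r) for r in raw]
        if pending:
            pending.result()  # only now wait for the previous commit
        pending = pool.submit(slow_transactional_commit, parsed, offset)
        results.extend(parsed)
    pending.result()
assert results == [{"n": 1}, {"n": 2}, {"n": 3}]
```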
Slide 21
Decision #4 – State Saving
• In-memory state with replication (Old VoltDB)
• Requires lots of hardware and network
• Local database (Samza, Kafka Streams API)
• Remote database (Millwheel)
• Upstream (i.e. replay everything on failure)
• Global consistent snapshot (Flink)
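The "local database" option above (Samza, Kafka Streams) can be sketched in a few lines: state lives in-process (Kafka Streams uses RocksDB; a dict stands in here), and periodic snapshots let a restarted worker reload state and replay from the checkpointed offset. The checkpoint interval and names are illustrative assumptions.

```python
# Sketch of local-database state saving: in-process state plus periodic
# checkpoints of (state, last offset). A dict stands in for RocksDB.
import copy

class LocalStateProcessor:
    def __init__(self):
        self.counts = {}
        self.checkpoint = ({}, -1)  # (state snapshot, last offset)

    def process(self, offset, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        if offset % 100 == 99:  # illustrative: checkpoint every 100 events
            self.checkpoint = (copy.deepcopy(self.counts), offset)

    def recover(self):
        # After a crash: reload the snapshot, replay from offset + 1.
        state, offset = self.checkpoint
        self.counts = copy.deepcopy(state)
        return offset + 1

proc = LocalStateProcessor()
for offset in range(150):
    proc.process(offset, "clicks")
# Crash here: only the snapshot taken at offset 99 survives.
resume_from = proc.recover()
assert resume_from == 100 and proc.counts["clicks"] == 100
```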
Slide 22
Decision #4 – State Saving
Facebook Verdict: It depends
Rhode Island vs. Alaska
Slide 23
Best Part of the Paper – by far
How to efficiently work with state in remote DB?
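One way to make remote state efficient (my sketch of the general pattern, not the paper's actual implementation) is to buffer updates locally and flush them as one batched merge, instead of doing a read-modify-write round trip per event. All class and method names below are made up for illustration.

```python
# Sketch: amortize remote-DB round trips by batching increments.
# FakeRemoteDB counts round trips so the saving is visible.

class FakeRemoteDB:
    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def multi_merge(self, deltas):
        # One network round trip applies a whole batch of deltas.
        self.round_trips += 1
        for key, delta in deltas.items():
            self.data[key] = self.data.get(key, 0) + delta

class BatchingWriter:
    def __init__(self, db, flush_every=100):
        self.db, self.flush_every = db, flush_every
        self.buffer, self.seen = {}, 0

    def increment(self, key):
        self.buffer[key] = self.buffer.get(key, 0) + 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        if self.buffer:
            self.db.multi_merge(self.buffer)
            self.buffer = {}

db = FakeRemoteDB()
writer = BatchingWriter(db)
for _ in range(1000):
    writer.increment("clicks")
writer.flush()
assert db.data["clicks"] == 1000 and db.round_trips == 10
```

Ten round trips instead of a thousand; the trade-off is that buffered updates are lost on a crash unless they can be replayed from the stream.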
Slide 24
Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
Slide 25
Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
Facebook Verdict:
SQL runs everywhere
And binary generation FTW
Slide 27
Lessons Learned!
The biggest win is pipelines composed of independent processors
• Mixing multiple systems let us move fast
• High level abstractions let us improve implementation
• Ease of debugging – Independent nodes and ability to replay
• Ease of deployment – Puma as-a-service
• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.
• In the future – auto-scale based on lag
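The lag metric above is simple to state: per partition, lag is the newest available offset minus the last processed offset. A minimal sketch (the threshold is an assumed example, not a Facebook number):

```python
# Sketch: per-partition consumer lag = latest offset - committed offset.

def consumer_lag(latest_offsets, committed_offsets):
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

lag = consumer_lag({"p0": 1500, "p1": 900}, {"p0": 1480, "p1": 900})
assert lag == {"p0": 20, "p1": 0}
assert max(lag.values()) < 100  # illustrative alert/auto-scale threshold
```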
Is it the best paper ever? Nope. Is it a world-changing seminal work? Nope. On the other hand, it is PACKED with cool ideas and patterns.
2016 is incredibly late for this kind of paper – Storm (Twitter), S4, Kafka + Samza (LinkedIn), Spark Streaming (Berkeley) and Heron (more Twitter) are already out.
By the time Facebook decided to describe their real-time data patterns, we already had evaluation criteria and patterns (decisions made in context).
I love this paper because of the trade-offs. So many other papers pretend to present the best solution ever and hide the complications and issues.
We also have an apology
These examples also demonstrate something I’ve increasingly noticed – the lines around ETL are blurring.
They have many different real-time pipelines, all built from a few components.
Facebook has lots of systems for everything. This is part of what makes following the paper so complicated. They have good reasons for having this complexity, but it is unlikely that these good reasons apply to you. Which is one good reason to avoid copying architectures from papers.
Puma – SQL. Fast to develop. Materialized views of simple aggregate queries and stateless transformations of streams
Swift – pipelines in Python. Stateless, low throughput. Not mentioned much.
Stylus – low-level processor API in C++.
Estimating low watermarks is not trivial. There are a bunch of possible solutions. It would have been amazing to know what they used and how it's working for them. Unfortunately, there is no Stylus paper.
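Since the paper doesn't say which scheme Stylus uses, here is one common low-watermark estimate for comparison: track the newest event time seen on each input, subtract an allowed-lateness slack, and take the minimum across inputs. The slack value is an assumption for illustration.

```python
# Sketch of one common low-watermark estimate (not necessarily what
# Stylus does): min over inputs of the newest event time seen, minus an
# allowed-lateness slack. Windows older than the watermark can be
# finalized and emitted.

def low_watermark(newest_event_time_per_input, slack_seconds=60):
    return min(newest_event_time_per_input.values()) - slack_seconds

wm = low_watermark({"mobile": 1000, "web": 1100, "backfill": 950})
assert wm == 890  # gated by the slowest input (950) minus 60s of slack
```

Note the weakness this sketch exposes: one slow or idle input holds back the watermark for everyone, which is exactly why the estimation problem is non-trivial.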
Laser – really fast and awesome KV store. Distributed RocksDB. Where results are made available.
Scuba – metrics store. Charity will tell you more.
Hive. YUUUGE data warehouse
Facebook discusses their decisions in the context of an example. Which is a great idea.
Lucky for us, this example looks exactly like all the other data pipelines I’ve worked with.
They highlight 5 key decisions. For each decision, they evaluated the common options and shared what Facebook chose.
Minimum latency???
First, who in the world talks about minimum latency? What’s your 99.9%ile?
Second, you are giving persistent data stores a bad name.
For example, when we develop a new metric and want to run it against old data
Facebook basically has a magical compiler that takes a Stylus C++ app and generates two binaries – batch and streaming.