Video and slides synchronized; mp3 and slide download available at http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building great engineering cultures
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
datadog-cloud
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
17. What does our big data platform do?
WHOM: Data Engineers, Data Scientists
do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation
WITH: Spark, Hadoop (Pig), Python (Luigi)
35. S3 is flexible!
• Read data from as many clusters as you want
• Store unlimited stuff(*) with no management
• Rock solid: durability (99.999999999%), availability (99.99%)
• Access from any programming language
* Accepting the laws of physics and your credit card limit
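The "any programming language" point above means any S3 client can read the same objects. A minimal sketch in Python — bucket and key names are hypothetical, and the actual read call (via boto3) is shown commented since it needs AWS credentials:

```python
# Sketch: addressing the same S3 data from Python, one of many language
# clients. The bucket/key below are hypothetical examples.

def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: %r" % uri)
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

bucket, key = parse_s3_uri("s3://example-logs/2015/11/events.parquet")

# Any cluster (or laptop) with credentials can then read the same object:
# import boto3
# body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

The same object is equally reachable from Java, Pig, Spark, or a shell script, which is what makes S3 the shared data layer across clusters.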
41. How to fix slow listing
• Parallelize it
• Use bigger files
43. No way to quickly move data
HDFS: Task → write → Intermediate → atomic move → Final
S3: Task → write → Final (no atomic move)
45. No way to quickly move data
• Say goodbye to speculative execution
• Say hello to better task timeouts
46. But really: We 💜S3
This is a great system.
✓ Data accessible from many clusters
✓ Storage is easy to manage
✓ It’s a multi-language paradise up in here
52. No more waiting on loaded clusters
• Tailor each cluster to the work you want to do
• Scale up when you need results faster
• Data scientists and data engineers don’t have to wait
🕐🕓🕥
53. Pick the best hardware for each job
• c3 for CPU-bound jobs
• r3 for memory-bound jobs
• m1.xlarge if you don’t care (cheap!)
≈30% savings over general-purpose hardware
59. How we do spot clusters
In the big data platform:
• Bid the on-demand price, pay the spot price
• Fall back to on-demand instances if you can’t get spot
• Monitor everything: jobs, clusters, spot market
• ☞ Save up to 80% off the on-demand price
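The bidding strategy above can be sketched as a small decision function — illustrative prices, not an AWS API call:

```python
# Sketch of the spot strategy: bid the on-demand price, pay the going spot
# price, and fall back to on-demand capacity when spot is unavailable.
# Prices below are hypothetical examples.

def price_paid(on_demand_price, spot_price, spot_available):
    """Return (market, hourly_price) for one instance under this strategy."""
    if spot_available and spot_price <= on_demand_price:
        # Our bid equals the on-demand price, so we win whenever the spot
        # market clears at or below it -- and we pay the market price.
        return ("spot", spot_price)
    # Outbid or no spot capacity: fall back to on-demand.
    return ("on-demand", on_demand_price)

print(price_paid(0.70, 0.14, True))   # ('spot', 0.14) -- 80% off
print(price_paid(0.70, 0.14, False))  # ('on-demand', 0.7)
```

Capping the bid at the on-demand price means the worst case is simply paying what you would have paid anyway, while typical spot prices deliver the large discount.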
61. We like this strategy a lot!
No more oversubscribed cluster with everyone waiting in line to do their work, and no more expensive hardware sitting idle when everyone’s gone:
✓ No waiting for the cluster you need
✓ No waste from hardware sitting idle
✓ Spot clusters are affordable enough to use everywhere
72. Big Data Platform Architecture
USER: CLI, API Clients, Job Scheduler
WEB: Web API
WORKER: Pig Workers, Spark Workers, Luigi Workers
CLUSTER: EMR
STORAGE: Metadata DB, Queueing, Logs
DATA: Amazon S3
Datadog Monitoring across all layers
73. How to find the right cluster when clusters disappear?
81. Debugging in a Post-Cluster World
Send all logs to S3:
• HDFS
• YARN
• Pig
• Spark
Visualize the pipeline:
• Lipstick for Pig
• Spark History Server
• Luigi task flow
Preserve historical monitoring data: keep history, by tag, after the cluster disappears
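On EMR, shipping logs to S3 so they outlive the cluster is a launch-time setting. A minimal sketch of the relevant configuration — the bucket, cluster name, and tag values are hypothetical:

```python
# Sketch: cluster config so logs and monitoring history survive termination.
# On EMR, LogUri tells the service to copy HDFS/YARN/application logs
# (Pig, Spark, ...) to S3 periodically; tags let you query monitoring
# history by tag after the cluster is gone.

emr_cluster_config = {
    "Name": "nightly-analytics",                       # hypothetical name
    "LogUri": "s3://example-emr-logs/nightly-analytics/",
    "Tags": [{"Key": "pipeline", "Value": "nightly-analytics"}],
}

# A real launch would pass this (plus instance/application settings) to:
# boto3.client("emr").run_job_flow(**emr_cluster_config, ...)
```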
93. Recommendations
for Cloud Big Data
• Use S3 for permanent data, not HDFS
• Start from EMR if building yourself
• Look into a PaaS: Netflix Genie, Qubole, Databricks
• Tag your clusters for dynamic monitoring
• Design for failure with a workflow tool (Luigi, Airflow)
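"Design for failure" with a workflow tool rests on one core idea: a task is skipped if its output already exists, so after a crash you just re-run the whole pipeline and only the missing pieces execute. A plain-Python illustration of that idea (not the Luigi API itself):

```python
# Sketch of the idempotent-rerun idea behind tools like Luigi: each task
# checks whether its output exists before doing work, so re-running a
# failed pipeline only executes what's missing.

def run_pipeline(tasks, completed, run_log):
    """tasks: ordered task names; completed: set of outputs that exist."""
    for task in tasks:
        if task in completed:
            continue  # output exists -> skip on re-run
        run_log.append(task)  # the actual work would happen here
        completed.add(task)

completed, first_run, second_run = set(), [], []
run_pipeline(["extract", "transform", "load"], completed, first_run)

# Simulate a crash that lost the final output, then just re-run everything:
completed.discard("load")
run_pipeline(["extract", "transform", "load"], completed, second_run)
print(second_run)  # ['load'] -- only the missing step re-executes
```

In Luigi this check is the task's `output()` target (e.g. an S3 path); Airflow offers the same recover-by-rerun model for cloud pipelines where clusters and spot instances can vanish mid-job.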
94. Thanks!
Want to work with us on Spark, Hadoop,
Kafka, Parquet, and more?
jobs.datadoghq.com
DM me @ddaniels888 or doug@datadoghq.com