This document discusses Pinterest's data architecture and its use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily, drawn from 60 billion Pins and 1 billion boards, on a 2,000-node Hadoop cluster. Kafka, Secor, and Singer ingest event data. Pinball manages workflows at a scale of hundreds of workflows and thousands of jobs, with 500+ jobs in some workflows. It provides simple abstractions, extensibility, reliability, debuggability, and horizontal scalability for workflow execution.
4. Data at Pinterest
• 60 billion Pins
• 1 billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2,000-node Hadoop cluster
• 250 engineers
15. Why Qubole?
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
17. Scale of Processing
• Scale:
  o 60 billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily
• Support:
  o Hadoop, Cascading, Hive, Spark, …
[Figure labels: job, workflow]
18. Why Pinball?
• Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded without aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…
• Options
  o Apache Oozie, Azkaban, Luigi
20. Workflow Model
• Workflow
  o A directed graph of nodes called jobs
• Edge
  o A run-after dependency
• Node
  o Each node is a job
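The workflow model above (jobs as nodes, run-after dependencies as edges) can be sketched in a few lines. This is a minimal illustration, not Pinball's own code: the function name `execution_order`, the `run_after` mapping, and the job names are all hypothetical. It uses a standard topological sort to derive one order in which the jobs could run.

```python
from collections import deque


def execution_order(run_after):
    """Return one valid job order for a workflow.

    run_after maps each job name to the set of jobs it must run after.
    (Hypothetical representation chosen for this sketch.)
    """
    # Count unmet dependencies per job and index the reverse edges.
    pending = {job: len(deps) for job, deps in run_after.items()}
    dependents = {job: [] for job in run_after}
    for job, deps in run_after.items():
        for dep in deps:
            dependents[dep].append(job)

    # Jobs with no unmet dependencies are runnable right away.
    ready = deque(sorted(job for job, n in pending.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Finishing a job may unblock the jobs that run after it.
        for nxt in sorted(dependents[job]):
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(run_after):
        raise ValueError("workflow contains a cycle")
    return order


workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
}
print(execution_order(workflow))
```

A job becomes runnable only once every job it runs after has finished, which is why `report` comes last in this example.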
21. Job State
• Job state is captured in a token
• Tokens are named hierarchically
Job Token (held by the Master):
  version: 123
  name: /workflow/w1/job
  owner: worker_0
  expiration: 1234567
  data: JobTemplate(....)
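To make the token fields concrete, here is a small sketch that mirrors the fields shown on the slide. The `Token` dataclass and the `tokens_under` helper are hypothetical names for illustration and are not Pinball's actual classes. The helper shows one thing hierarchical names make easy: selecting every token that belongs to a given workflow by name prefix.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Token:
    # Field names follow the slide; the types are assumptions.
    name: str
    version: int
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Any = None  # the slide shows a JobTemplate here


def tokens_under(tokens, prefix):
    """Return the tokens whose hierarchical name falls under prefix."""
    # The trailing slash keeps /workflow/w1 from matching /workflow/w10.
    prefix = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(name="/workflow/w1/job", version=123,
          owner="worker_0", expiration=1234567),
    Token(name="/workflow/w1/other_job", version=7),
    Token(name="/workflow/w2/job", version=1),
]
print([t.name for t in tokens_under(tokens, "/workflow/w1")])
```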
23. Master Worker Interaction
• Master keeps the state
• Workers claim and execute tasks
• Horizontally scalable
[Diagram: Worker, Master, Persistent Store. 1: request (Worker to Master), 2: update (Master to Persistent Store), 3: ack (Master to Worker)]
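The claim step can be sketched as follows. This is an illustrative toy under stated assumptions, not Pinball's implementation: the class `LeaseMaster` and its methods are hypothetical, and treating the token's `expiration` as a lease that lets another worker take over an abandoned job is an assumption based on the `owner` and `expiration` fields of the job token.

```python
import time


class LeaseMaster:
    """Toy in-memory master that hands out tokens to workers."""

    def __init__(self):
        self.tokens = {}  # token name -> {"owner": ..., "expiration": ...}

    def add(self, name):
        self.tokens[name] = {"owner": None, "expiration": None}

    def claim(self, worker, lease_sec, now=None):
        """Give the worker one unowned or expired token, or None."""
        now = time.time() if now is None else now
        for name in sorted(self.tokens):
            tok = self.tokens[name]
            expired = (tok["expiration"] is not None
                       and tok["expiration"] <= now)
            if tok["owner"] is None or expired:
                tok["owner"] = worker
                tok["expiration"] = now + lease_sec
                return name
        return None

    def finish(self, worker, name):
        """Remove a token once its current owner reports completion."""
        if self.tokens[name]["owner"] != worker:
            raise PermissionError(f"{worker} does not own {name}")
        del self.tokens[name]


master = LeaseMaster()
master.add("/workflow/w1/job_a")
master.add("/workflow/w1/job_b")
print(master.claim("worker_0", lease_sec=60, now=0))
print(master.claim("worker_1", lease_sec=60, now=0))
print(master.claim("worker_2", lease_sec=60, now=10))   # nothing free yet
print(master.claim("worker_2", lease_sec=60, now=100))  # lease has expired
```

Because workers hold no state of their own in this sketch, adding more of them only adds claim requests, which is one way to read the "horizontally scalable" bullet.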
24. Master
• Entire state is kept in memory
• Each state update is synchronously persisted before master replies to client
• Master runs on a single thread, so there are no concurrency issues
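The persist-before-reply rule can be sketched like this. Everything here is a hypothetical stand-in: `PersistentStore` is an append-only JSON log on local disk rather than whatever store Pinball really uses, and `DurableMaster` is a toy. The point is only the ordering: the update reaches durable storage before the in-memory state changes and before the caller gets its ack, so a restarted master can rebuild the same state.

```python
import json
import os
import tempfile


class PersistentStore:
    """Append-only log on disk; a stand-in for a real database."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # block until the write is on disk

    def replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


class DurableMaster:
    """Single-threaded toy master: one request at a time, no locks."""

    def __init__(self, store):
        self.store = store
        self.state = {}
        # Rebuild the in-memory state from the log after a restart.
        for rec in store.replay():
            self.state[rec["name"]] = rec["value"]

    def update(self, name, value):
        self.store.append({"name": name, "value": value})  # persist first
        self.state[name] = value                           # then apply
        return "ack"                                       # then reply


path = os.path.join(tempfile.mkdtemp(), "tokens.log")
master = DurableMaster(PersistentStore(path))
print(master.update("/workflow/w1/job", "RUNNABLE"))

restarted = DurableMaster(PersistentStore(path))
print(restarted.state)
```

Handling requests one at a time on a single thread is what lets `update` skip locking entirely. The trade-off is that every request waits for the disk write.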