As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UnSilo on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.
2. Automating Big Data Workflows
Automating Compute with Amazon SWF (workers, activities, deciders)
Automating Data with AWS Data Pipeline (data nodes, workers)
3. Amazon SWF – Your Distributed State Machine in the Cloud
[Diagram: workflow starters, activity workers, and deciders exchange tasks through Amazon SWF, which records the execution history and is visible in the AWS Management Console]
SWF helps you scale your business logic
31. Transform extracted data on S3 into “Feature Matrix”
using Cascading/Hadoop on Amazon Elastic MapReduce
100-instance EMR cluster
32. A Feature Matrix is just a text file.
Sparse vector file line format, one line per user.
<user_id>[ <feature_id>:<feature_value>]...
Example:
123 12:0.237 18:1 101:0.578
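The deck shows only the format; as a minimal Python sketch (not from the talk) of reading one line of this sparse-vector file:

def parse_feature_line(line):
    # "<user_id> <feature_id>:<feature_value> ..." -> (user_id, {feature_id: value})
    fields = line.split()
    user_id = int(fields[0])
    features = {}
    for pair in fields[1:]:
        feature_id, feature_value = pair.split(":")
        features[int(feature_id)] = float(feature_value)
    return user_id, features

print(parse_feature_line("123 12:0.237 18:1 101:0.578"))
# -> (123, {12: 0.237, 18: 1.0, 101: 0.578})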
37. SWF provides a distributed application model
Decider processes make discrete workflow
decisions
Independent task lists (queues) are processed by
task list-affined worker processes
(thus coupling task types to provisioned resource types)
38. SWF provides a distributed application model
Allows deciders and workers to be implemented
in any language.
We used Ruby
with ML calculations done by Python, R, or C.
39. SWF provides a distributed application model
Rich web interface via
the AWS Management Console.
Flexible API for control and monitoring.
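The deck contains no code for this model; as a rough illustration of the worker side, here is a minimal Python sketch of an activity worker poll loop using boto3. The speakers used Ruby, and the domain, task list, and do_prediction names below are placeholders, not Change.org's actual implementation.

import boto3

swf = boto3.client("swf", region_name="us-east-1")

def do_prediction(payload):
    # Placeholder for the real scoring code (Python, R, or C in the talk).
    return "ok"

def run_worker(domain="ml-domain", task_list="predict-tasks"):
    while True:
        # Long-poll SWF for the next task on this worker's task list.
        task = swf.poll_for_activity_task(
            domain=domain, taskList={"name": task_list}, identity="worker-1")
        if not task.get("taskToken"):
            continue  # poll timed out with no work; poll again
        result = do_prediction(task.get("input", ""))
        swf.respond_activity_task_completed(
            taskToken=task["taskToken"], result=result)

A decider process follows the same pattern with poll_for_decision_task and respond_decision_task_completed, which is how task types stay coupled to the task lists (and thus the resource types) described above.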
40. Resource Provisioning with EC2
Our EC2 instances each provide service
via Simple Workflow Service
for a single Feature Matrix file.
41. Simplifying Assumption:
Full feature matrix file fits on disk of an m1.medium EC2 instance
(although we compute it with a 100-instance EMR cluster)
46. Amazon SWF and EC2
allowed us to build a
common reliable scaffold
for R&D and production
Machine Learning
systems.
47. Provisioning in R&D for Training
• Used 100 small EC2 instances to explore the
Support Vector Machine (SVM) algorithm
to repeatedly brute-force search a 1000-combination
parameter space
• Used a 32-core on-premises box
to explore a Random Forest implementation in
multithreaded Python
48. Provisioning in Production
Start n m3.2xlarge EC2 instances on-demand
for each campaign in the sample group
• Train with a single SWF worker using multiple cores
(Python multithreaded Random Forest)
• Predict with 8 SWF workers — 1 per core, 4 cores per instance
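As a hedged sketch of the provisioning step above, starting n on-demand m3.2xlarge instances for a campaign's worker group with boto3 (the AMI ID, user data path, and region are assumptions, not the speakers' setup):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def start_campaign_workers(n, campaign_id):
    resp = ec2.run_instances(
        ImageId="ami-12345678",          # placeholder worker AMI
        InstanceType="m3.2xlarge",
        MinCount=n,
        MaxCount=n,
        # User data could bootstrap the SWF workers for this campaign.
        UserData="#!/bin/bash\n/opt/workers/start.sh %s\n" % campaign_id,
    )
    return [i["InstanceId"] for i in resp["Instances"]]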
53. Forward scale
for 10x users
• Run more EMR instances
to build Feature Matrix
• Run more SWF predict workers
per campaign
54. Forward scale
for 10x campaigns
• already automatically start a SWF worker
group per campaign
• "user generated campaigns" require no
campaigner time and are targeted
automatically
55. Forward scale
for 2x+ campaigners
• system eliminates mass email targeting
contention, so team can scale
56. Win for our Campaigners... and Users.
Our user base can now be automatically segmented
across a wide pool of campaigns, even internationally.
30%+ conversion boost over manual targeting.
58. Do you build systems like these?
Do you want to?
We'd love to talk.
(And yes, we're hiring.)
64. Big Data Challenges
4.5 million USPTO granted patents
12 million scientific articles
Heterogeneous processing pipeline
(multiple steps, variable times)
75. Job Loading
• Content loaded by
traversing S3 buckets
• Reprocessing by
traversing tables on
DynamoDB
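The slides give no code for this step; as a rough boto3 sketch of job loading by traversing an S3 bucket and starting one workflow execution per document (the bucket, prefix, domain, and workflow type names are illustrative assumptions):

import boto3

s3 = boto3.client("s3")
swf = boto3.client("swf")

def load_jobs(bucket="unsilo-content", domain="processing"):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="uspto/"):
        for obj in page.get("Contents", []):
            swf.start_workflow_execution(
                domain=domain,
                # workflowId must avoid "/" etc., so derive it from the key
                workflowId=obj["Key"].replace("/", "-"),
                workflowType={"name": "process-doc", "version": "1"},
                taskList={"name": "decider-tasks"},
                input=obj["Key"],
            )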
76. Decision Workers
• Crawls Workflow History
for Decision Tasks
• Schedules new Activity
Tasks
77. Activity Workers
• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed
between workflow steps
• Specialized workers
78. Best practice
Use DynamoDB for content status
Index on different columns (local indexes)
More efficient content status queries
Give me all the items that completed step X
Elastic service!
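A minimal sketch of the "all the items that completed step X" query with boto3. Since a local secondary index shares the table's partition key, this assumes items are keyed by a batch id with the per-step status as the index sort key; all names here are illustrative, not UnSilo's schema.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content-status")

def items_completed(batch_id, step="step-x"):
    resp = table.query(
        IndexName="status-index",    # LSI on (batch_id, status)
        KeyConditionExpression=Key("batch_id").eq(batch_id)
        & Key("status").eq("%s-completed" % step),
    )
    return resp["Items"]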
79. Key to scalability
File organization on S3 for scalability
– 50 req/s naïve approach
– >1500 req/s
logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...
43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
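A small sketch of the naming trick shown above: reverse the timestamp and move it to the front of the key so the leftmost characters vary, which spreads keys across S3 partitions (per the linked S3 performance post). The helper name and suffix are illustrative.

def scalable_key(timestamp, suffix):
    # "logs/2013-11-14T23:01:34/..." -> "43:10:32T41-11-3102/logs/..."
    return "%s/logs/%s" % (timestamp[::-1], suffix)

print(scalable_key("2013-11-14T23:01:34", "part-0001"))
# -> 43:10:32T41-11-3102/logs/part-0001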
87. Key SWF Takeaways
Flexibility
– Room for experimentation
Transparency
– Easy to adapt
Growing with the system
– Not constrained by the framework
[Diagram: workers and a decider coordinating through Amazon SWF]
100. Data @ Kickstarter
• We have many different data sources
• Some relational data, like MySQL on Amazon RDS
• Other unstructured data like JSON stored in a
third-party service like Mixpanel
• What if we want to JOIN between them in Amazon
Redshift?
101. Case study: Find the users that have Page View A but
not User Action B
• Page View A is instrumented in Mixpanel, a third-party
service whose API we have access to:
{ "Page View A", { user_uid : 1231567, ... } }
• But User Action B is just the existence of a timestamp
in a MySQL row:
6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL
102. Redshift to the Rescue!
SELECT
  users.id,
  COUNT(DISTINCT CASE
          WHEN user_actions.timestamp IS NOT NULL
          THEN user_actions.id
          ELSE NULL
        END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
 AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
103. How do we automate the
data flow to keep it fresh
daily?
110. Pipeline 1: RDS to Redshift - Step 1
First, we run Sqoop on
Elastic MapReduce to
extract MySQL tables into
CSVs.
111. Pipeline 1: RDS to Redshift - Step 2
Then we run another Elastic
MapReduce streaming job
to convert NULLs into
empty strings for Redshift.
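A minimal sketch of such a streaming mapper in Python; the comma delimiter and the \N null token are assumptions about the Sqoop export settings, not Kickstarter's actual job.

#!/usr/bin/env python
# Hadoop streaming mapper: rewrite the null token to an empty string
# so Redshift's COPY can load the resulting CSVs.
import sys

NULL_TOKEN = r"\N"   # assumed Sqoop null marker

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    fields = ["" if f == NULL_TOKEN else f for f in fields]
    sys.stdout.write(",".join(fields) + "\n")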
112. Pipeline 1: RDS to Redshift - Transfer to S3
• 150–200 GB
• New DB every day, drop old
tables
• Using AWS Data Pipeline's 1-day 'now' schedule
113. Pipeline 1: RDS to Redshift Again
Run a similar pipeline
job in parallel for our
other database.
114. Pipeline 2: Mixpanel to Redshift - Step 1
Spin up an EC2 instance
to download the day's
data from Mixpanel.
115. Pipeline 2: Mixpanel to Redshift - Step 2
Use Elastic MapReduce to
transform Mixpanel's
unstructured JSON into CSVs.
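A rough sketch of a streaming mapper that flattens one Mixpanel event per line into a CSV row; the property names and layout are illustrative assumptions, not Kickstarter's actual schema.

#!/usr/bin/env python
# Hadoop streaming mapper: one JSON event per input line -> one CSV row.
import json
import sys

PROPS = ["distinct_id", "time"]   # assumed Mixpanel properties to keep

for line in sys.stdin:
    try:
        evt = json.loads(line)
    except ValueError:
        continue                  # skip malformed lines
    props = evt.get("properties", {})
    row = [str(evt.get("event", ""))] + [str(props.get(p, "")) for p in PROPS]
    sys.stdout.write(",".join(row) + "\n")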
116. Pipeline 2: Mixpanel to Redshift - Transfer to S3
• 9–10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days
117. AWS Data Pipeline
Best Practices
• JSON / CLI tools are crucial
• Build scripts to generate JSON
• ShellCommandActivity is powerful
• Really invest time to understand
scheduling
• Use S3 as the “transport” layer
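As a sketch of the "build scripts to generate JSON" practice: a small Python script that emits a pipeline definition with a daily schedule and a ShellCommandActivity, which can then be loaded with the AWS CLI's put-pipeline-definition command. The object names, roles, and command are assumptions, not the Kickstarter pipeline.

import json

def make_pipeline(command):
    return {
        "objects": [
            {"id": "Default", "name": "Default",
             "scheduleType": "cron", "schedule": {"ref": "DailySchedule"},
             "pipelineLogUri": "s3://my-bucket/logs/",
             "role": "DataPipelineDefaultRole",
             "resourceRole": "DataPipelineDefaultResourceRole"},
            {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
             "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
            {"id": "WorkerBox", "name": "WorkerBox", "type": "Ec2Resource",
             "instanceType": "m1.medium", "terminateAfter": "2 Hours"},
            {"id": "ExtractStep", "name": "ExtractStep",
             "type": "ShellCommandActivity", "command": command,
             "runsOn": {"ref": "WorkerBox"},
             "schedule": {"ref": "DailySchedule"}},
        ]
    }

with open("pipeline.json", "w") as f:
    json.dump(make_pipeline("echo extract-and-copy-to-s3"), f, indent=2)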
118. AWS Data Pipeline Takeaways for Kickstarter
15 years ago: $1 million or more
5 years ago: Open source + staff & infrastructure
Now: ~$80 a month on AWS
120. Automating Big Data Workflows
Automating Compute with Amazon SWF (workers, activities, deciders)
Automating Data with AWS Data Pipeline (data nodes, workers)
122. Big Thank You to
Customer Speakers!
Jinesh Varia
@jinman
123. More Sessions on SWF and AWS Data Pipeline
SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and
Automation (Next Up in this room)
BDT207 - Orchestrating Big Data Integration and Analytics Data Flows
with AWS Data Pipeline (Next Up in Sao Paulo 3406)
124. Please give us your feedback on this
presentation
SVC201
As a thank you, we will select prize
winners daily for completed surveys!
Editor's notes
Welcome to the Application Services track, SVC201, Automating Big Data Workflows. When it comes to the cloud, it's all about automation. Automating data-driven workflows is one of the most important tenets of any cloud-native application. I am super excited because, while our customer speakers are data engineers, I call them data alchemists: they are the people who turn data into gold with help from AWS.
There are a number of different ways one can automate data-driven workflows on AWS. I am going to discuss two aspects in this talk: how you can automate compute using SWF and automate data using AWS Data Pipeline.
Simple Workflow Service is one of the most powerful building-block services in the AWS umbrella of products. It's an orchestration service that has the power to scale your business logic: it maintains distributed application state, tracks workflow executions and gives visibility into them, ensures consistency of the execution history, and provides tasks, timers, and signals. There are really only three things to know when you are designing an SWF-based application: 1) workflow starters, 2) activity workers, 3) deciders. Workflow starters kick off the workflow. A decider is an implementation of a workflow's coordination logic.
- Recent startup based in Denmark
- Silos between domain-specific knowledge. Find information across different disciplines.
Concepts are not keywords, but patterns to be recognized across documents. Pattern matching poses challenges to the way we handle our data. Full NLP: more computational power and more time required.
Fuel our search engine. Goal to achieve full coverage in all IP and science.
Linear pipeline gluing scripts together. Testing scalability.
How do we handle this large amount of data?
And more importantly, how do we get there fast? Not dealing with additional infrastructure. Cut corners and focus on core competences.
- Started talking with Mario
But more importantly: work independently.
- Brief overview of the system
One EC2 instance fetching data, transforming it into our internal format, and uploading it to S3. One bucket per data source: uspto, medline, etc.
Creates a new decision event in the Event History. Workflow started: new content from S3, re-processing content from DynamoDB, or a subset of content from a file or DB.
Natural approach. Different computing ratios. Debugging/fixing on the fly. Minimal AMIs. EC2 user data scripts.
Rather than using the Event History for this: SWF history is more expensive. Elastic, ramp up provisioning when we need it.
Internal partitioning of S3
Start up slowly. Decide on a gearing ratio. Ramp up. I/O-bound task problems.
DynamoDB with local indexes to keep track of workers and instances so that we can make various custom queries for monitoring and managing the instances. SWF activity task is rescheduled.
Account for eventual consistency. "Back off and try again" logic. Cloud issues: throttling errors; AWS support to raise limits.
Debugging from your dev environment. Inspect intermediate results. Local and remote workers/deciders. Automated integration tests. DynamoDB Local.
150k docs/hour. 8,500 EC2 cores. 1,500 m3.xlarge.
We are iterating on our algorithms quite fast. Delivering value to the user.
Thanks to Christopher Wright & Erik Kastner; without them, we wouldn't be using Data Pipeline.
Sqoop requires auto-increment IDs, and can’t handle tables named “public”