As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UnSilo on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.
2. Automating Big Data Workflows
Automating Compute with Amazon SWF (workers, activities, deciders)
Automating Data with AWS Data Pipeline (data nodes, workers)
3. Amazon SWF – Your Distributed State Machine in the Cloud
[Diagram: workflow starters, activity workers, and deciders exchange tasks through Amazon SWF, which records the execution history and is visible in the AWS Management Console]
SWF helps you scale your business logic
31. Transform extracted data on S3 into “Feature Matrix”
using Cascading/Hadoop on Amazon Elastic MapReduce
100-instance EMR cluster
32. A Feature Matrix is just a text file.
Sparse vector file line format, one line per user.
<user_id>[ <feature_id>:<feature_value>]...
Example:
123 12:0.237 18:1 101:0.578
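The deck shows only the format; as a minimal Python sketch (not from the talk) of reading one line of this sparse-vector file:

def parse_feature_line(line):
    # "<user_id> <feature_id>:<feature_value> ..." -> (user_id, {feature_id: value})
    fields = line.split()
    user_id = int(fields[0])
    features = {}
    for pair in fields[1:]:
        feature_id, feature_value = pair.split(":")
        features[int(feature_id)] = float(feature_value)
    return user_id, features

print(parse_feature_line("123 12:0.237 18:1 101:0.578"))
# -> (123, {12: 0.237, 18: 1.0, 101: 0.578})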
37. SWF provides a distributed application model
Decider processes make discrete workflow
decisions
Independent task lists (queues) are processed by
task list-affined worker processes
(thus coupling task types to provisioned resource types)
38. SWF provides a distributed application model
Allows deciders and workers to be implemented
in any language.
We used Ruby
with ML calculations done by Python, R, or C.
39. SWF provides a distributed application model
Rich web interface via
the AWS Management Console.
Flexible API for control and monitoring.
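The deck contains no code for this model; as a rough illustration of the worker side, here is a minimal Python sketch of an activity worker poll loop using boto3. The speakers used Ruby, and the domain, task list, and do_prediction names below are placeholders, not Change.org's actual implementation.

import boto3

swf = boto3.client("swf", region_name="us-east-1")

def do_prediction(payload):
    # Placeholder for the real scoring code (Python, R, or C in the talk).
    return "ok"

def run_worker(domain="ml-domain", task_list="predict-tasks"):
    while True:
        # Long-poll SWF for the next task on this worker's task list.
        task = swf.poll_for_activity_task(
            domain=domain, taskList={"name": task_list}, identity="worker-1")
        if not task.get("taskToken"):
            continue  # poll timed out with no work; poll again
        result = do_prediction(task.get("input", ""))
        swf.respond_activity_task_completed(
            taskToken=task["taskToken"], result=result)

A decider process follows the same pattern with poll_for_decision_task and respond_decision_task_completed, which is how task types stay coupled to the task lists (and thus the resource types) described above.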
40. Resource Provisioning with EC2
Our EC2 instances each provide service
via Simple Workflow Service
for a single Feature Matrix file.
41. Simplifying Assumption:
Full feature matrix file fits on disk of an m1.medium EC2 instance
(although we compute it with a 100-instance EMR cluster)
46. Amazon SWF and EC2
allowed us to build a
common reliable scaffold
for R&D and production
Machine Learning
systems.
47. Provisioning in R&D for Training
• Used 100 small EC2 instances to explore the
Support Vector Machine (SVM) algorithm
to repeatedly brute-force search a 1000-combination
parameter space
• Used a 32-core on-premises box
to explore a Random Forest implementation in
multithreaded Python
48. Provisioning in Production
Start n m3.2xlarge EC2 instances on-demand
for each campaign in the sample group
• Train with a single SWF worker using multiple cores
(Python multithreaded Random Forest)
• Predict with 8 SWF workers — 1 per core, 4 cores per instance
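As a hedged sketch of the provisioning step above, starting n on-demand m3.2xlarge instances for a campaign's worker group with boto3 (the AMI ID, user data path, and region are assumptions, not the speakers' setup):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def start_campaign_workers(n, campaign_id):
    resp = ec2.run_instances(
        ImageId="ami-12345678",          # placeholder worker AMI
        InstanceType="m3.2xlarge",
        MinCount=n,
        MaxCount=n,
        # User data could bootstrap the SWF workers for this campaign.
        UserData="#!/bin/bash\n/opt/workers/start.sh %s\n" % campaign_id,
    )
    return [i["InstanceId"] for i in resp["Instances"]]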
53. Forward scale
for 10x users
• Run more EMR instances
to build Feature Matrix
• Run more SWF predict workers
per campaign
54. Forward scale
for 10x campaigns
• already automatically start a SWF worker
group per campaign
• "user generated campaigns" require no
campaigner time and are targeted
automatically
55. Forward scale
for 2x+ campaigners
• system eliminates mass email targeting
contention, so team can scale
56. Win for our Campaigners... and Users.
Our user base can now be automatically segmented
across a wide pool of campaigns, even internationally.
30%+ conversion boost over manual targeting.
58. Do you build systems like these?
Do you want to?
We'd love to talk.
(And yes, we're hiring.)
64. Big Data Challenges
4.5 million USPTO granted patents
12 million scientific articles
Heterogeneous processing pipeline
(multiple steps, variable times)
75. Job Loading
• Content loaded by
traversing S3 buckets
• Reprocessing by
traversing tables on
DynamoDB
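The slides give no code for this step; as a rough boto3 sketch of job loading by traversing an S3 bucket and starting one workflow execution per document (the bucket, prefix, domain, and workflow type names are illustrative assumptions):

import boto3

s3 = boto3.client("s3")
swf = boto3.client("swf")

def load_jobs(bucket="unsilo-content", domain="processing"):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="uspto/"):
        for obj in page.get("Contents", []):
            swf.start_workflow_execution(
                domain=domain,
                # workflowId must avoid "/" etc., so derive it from the key
                workflowId=obj["Key"].replace("/", "-"),
                workflowType={"name": "process-doc", "version": "1"},
                taskList={"name": "decider-tasks"},
                input=obj["Key"],
            )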
76. Decision Workers
• Crawls Workflow History
for Decision Tasks
• Schedules new Activity
Tasks
77. Activity Workers
• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed
between workflow steps
• Specialized workers
78. Best practice
Use DynamoDB for content status
Index on different columns (local indexes)
More efficient content status queries
Give me all the items that completed step X
Elastic service!
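A minimal sketch of the "all the items that completed step X" query with boto3. Since a local secondary index shares the table's partition key, this assumes items are keyed by a batch id with the per-step status as the index sort key; all names here are illustrative, not UnSilo's schema.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content-status")

def items_completed(batch_id, step="step-x"):
    resp = table.query(
        IndexName="status-index",    # LSI on (batch_id, status)
        KeyConditionExpression=Key("batch_id").eq(batch_id)
        & Key("status").eq("%s-completed" % step),
    )
    return resp["Items"]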
79. Key to scalability
File organization on S3 for scalability
– 50 req/s naïve approach
– >1500 req/s
logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...
43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
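A small sketch of the naming trick shown above: reverse the timestamp and move it to the front of the key so the leftmost characters vary, which spreads keys across S3 partitions (per the linked S3 performance post). The helper name and suffix are illustrative.

def scalable_key(timestamp, suffix):
    # "logs/2013-11-14T23:01:34/..." -> "43:10:32T41-11-3102/logs/..."
    return "%s/logs/%s" % (timestamp[::-1], suffix)

print(scalable_key("2013-11-14T23:01:34", "part-0001"))
# -> 43:10:32T41-11-3102/logs/part-0001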
87. Key SWF Takeaways
Flexibility
– Room for experimentation
Transparency
– Easy to adapt
Growing with the system
– Not constrained by the framework
[Diagram: workers and a decider coordinating through Amazon SWF]
100. Data @ Kickstarter
• We have many different data sources
• Some relational data, like MySQL on Amazon RDS
• Other unstructured data like JSON stored in a
third-party service like Mixpanel
• What if we want to JOIN between them in Amazon
Redshift?
101. Case study: Find the users that have Page View A but
not User Action B
• Page View A is instrumented in Mixpanel, a third-party
service whose API we have access to:
{ "Page View A", { user_uid : 1231567, ... } }
• But User Action B is just the existence of a timestamp
in a MySQL row:
6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL
102. Redshift to the Rescue!
SELECT
  users.id,
  COUNT(DISTINCT CASE
          WHEN user_actions.timestamp IS NOT NULL
          THEN user_actions.id
          ELSE NULL
        END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
 AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
103. How do we automate the
data flow to keep it fresh
daily?
110. Pipeline 1: RDS to Redshift - Step 1
First, we run Sqoop on
Elastic MapReduce to
extract MySQL tables into
CSVs.
111. Pipeline 1: RDS to Redshift - Step 2
Then we run another Elastic
MapReduce streaming job
to convert NULLs into
empty strings for Redshift.
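A minimal sketch of such a streaming mapper in Python; the comma delimiter and the \N null token are assumptions about the Sqoop export settings, not Kickstarter's actual job.

#!/usr/bin/env python
# Hadoop streaming mapper: rewrite the null token to an empty string
# so Redshift's COPY can load the resulting CSVs.
import sys

NULL_TOKEN = r"\N"   # assumed Sqoop null marker

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    fields = ["" if f == NULL_TOKEN else f for f in fields]
    sys.stdout.write(",".join(fields) + "\n")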
112. Pipeline 1: RDS to Redshift - Transfer to S3
• 150–200 GB
• New DB every day, drop old
tables
• Using AWS Data Pipeline's 1-day 'now' schedule
113. Pipeline 1: RDS to Redshift Again
Run a similar pipeline
job in parallel for our
other database.
114. Pipeline 2: Mixpanel to Redshift - Step 1
Spin up an EC2 instance
to download the day's
data from Mixpanel.
115. Pipeline 2: Mixpanel to Redshift - Step 2
Use Elastic MapReduce to
transform Mixpanel's
unstructured JSON into CSVs.
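A rough sketch of a streaming mapper that flattens one Mixpanel event per line into a CSV row; the property names and layout are illustrative assumptions, not Kickstarter's actual schema.

#!/usr/bin/env python
# Hadoop streaming mapper: one JSON event per input line -> one CSV row.
import json
import sys

PROPS = ["distinct_id", "time"]   # assumed Mixpanel properties to keep

for line in sys.stdin:
    try:
        evt = json.loads(line)
    except ValueError:
        continue                  # skip malformed lines
    props = evt.get("properties", {})
    row = [str(evt.get("event", ""))] + [str(props.get(p, "")) for p in PROPS]
    sys.stdout.write(",".join(row) + "\n")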
116. Pipeline 2: Mixpanel to Redshift - Transfer to S3
• 9–10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days
117. AWS Data Pipeline
Best Practices
• JSON / CLI tools are crucial
• Build scripts to generate JSON
• ShellCommandActivity is powerful
• Really invest time to understand
scheduling
• Use S3 as the “transport” layer
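As a sketch of the "build scripts to generate JSON" practice: a small Python script that emits a pipeline definition with a daily schedule and a ShellCommandActivity, which can then be loaded with the AWS CLI's put-pipeline-definition command. The object names, roles, and command are assumptions, not the Kickstarter pipeline.

import json

def make_pipeline(command):
    return {
        "objects": [
            {"id": "Default", "name": "Default",
             "scheduleType": "cron", "schedule": {"ref": "DailySchedule"},
             "pipelineLogUri": "s3://my-bucket/logs/",
             "role": "DataPipelineDefaultRole",
             "resourceRole": "DataPipelineDefaultResourceRole"},
            {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
             "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
            {"id": "WorkerBox", "name": "WorkerBox", "type": "Ec2Resource",
             "instanceType": "m1.medium", "terminateAfter": "2 Hours"},
            {"id": "ExtractStep", "name": "ExtractStep",
             "type": "ShellCommandActivity", "command": command,
             "runsOn": {"ref": "WorkerBox"},
             "schedule": {"ref": "DailySchedule"}},
        ]
    }

with open("pipeline.json", "w") as f:
    json.dump(make_pipeline("echo extract-and-copy-to-s3"), f, indent=2)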
118. AWS Data Pipeline Takeaways for Kickstarter
15 years ago: $1 million or more
5 years ago: Open source + staff & infrastructure
Now: ~$80 a month on AWS
120. Automating Big Data Workflows
Automating Compute with Amazon SWF (workers, activities, deciders)
Automating Data with AWS Data Pipeline (data nodes, workers)
122. Big Thank You to
Customer Speakers!
Jinesh Varia
@jinman
123. More Sessions on SWF and AWS Data Pipeline
SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and
Automation (Next Up in this room)
BDT207 - Orchestrating Big Data Integration and Analytics Data Flows
with AWS Data Pipeline (Next Up in Sao Paulo 3406)
124. Please give us your feedback on this
presentation
SVC201
As a thank you, we will select prize
winners daily for completed surveys!
Editor's notes
Welcome to the Application Services track, SVC201, Automating Big Data Workflows. When it comes to the cloud, it's all about automation. Automating data-driven workflows is one of the most important tenets of any cloud-native application. I am super excited because, while our customer speakers are data engineers, I call them data alchemists: they are the people who turn data into gold with help from AWS.
There are a number of different ways one can automate data-driven workflows on AWS. I am going to discuss two aspects in this talk: how you can automate compute using SWF and automate data using AWS Data Pipeline.
Simple Workflow Service is one of the most powerful building-block services in the AWS umbrella of products. It's an orchestration service that has the power to scale your business logic: it maintains distributed application state, tracks workflow executions and gives visibility into them, ensures consistency of the execution history, and provides tasks, timers, and signals. There are really only three things to know when you are designing an SWF-based application: 1) workflow starters, 2) activity workers, 3) deciders. Workflow starters kick off the workflow. A decider is an implementation of a workflow's coordination logic.
- Recent startup based in Denmark
- Silos between domain-specific knowledge. Find information across different disciplines.
Concepts are not keywords, but patterns to be recognized across documents. Pattern matching poses challenges to the way we handle our data. Full NLP: more computational power and more time required.
Fuel our search engine. Goal to achieve full coverage in all IP and science.
Linear pipeline gluing scripts together. Testing scalability.
How do we handle this large amount of data?
And more importantly, how do we get there fast? Not dealing with additional infrastructure. Cut corners and focus on core competences.
- Started talking with Mario
But more importantly: work independently.
- Brief overview of the system
One EC2 instance fetching data, transforming it into our internal format, and uploading it to S3. One bucket per data source: uspto, medline, etc.
Creates a new decision event in the Event History. Workflow started: new content from S3, re-processing content from DynamoDB, or a subset of content from a file or DB.
Natural approach. Different computing ratios. Debugging/fixing on the fly. Minimal AMIs. EC2 user data scripts.
Rather than using the Event History for this: SWF history is more expensive. Elastic, ramp up provisioning when we need it.
Internal partitioning of S3
Start up slowly. Decide on a gearing ratio. Ramp up. I/O-bound task problems.
DynamoDB with local indexes to keep track of workers and instances so that we can make various custom queries for monitoring and managing the instances. SWF activity task is rescheduled.
Account for eventual consistency. "Back off and try again" logic. Cloud issues: throttling errors; AWS support to raise limits.
Debugging from your dev environment. Inspect intermediate results. Local and remote workers/deciders. Automated integration tests. DynamoDB Local.
150k docs/hour. 8,500 EC2 cores. 1,500 m3.xlarge.
We are iterating on our algorithms quite fast. Delivering value to the user.
Thanks to Christopher Wright & Erik Kastner; without them, we wouldn't be using Data Pipeline.
Sqoop requires auto-increment IDs, and can’t handle tables named “public”