2. Welcome
Sheri Sullivan
Senior Marketing Manager
Global SI Ecosystem
Amazon Web Services
3. Webinar Overview
• Submit Your Questions using the Q/A tool.
• A copy of today’s presentation will be made available on:
• AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/
• AWS YouTube Channel: http://www.youtube.com/user/AmazonWebServices
Special Note: Today’s Webinar is being recorded.
4. What We’ll Cover
• Intro to AWS Database and Big Data Services
• Customer Use Cases and Solutions
• Delivering Cross-Media Analytics
• MarketShare Planner Platform
5. John Gannon
AWS Business Development Manager
jgannon@amazon.com
6. Big Data and Databases on AWS
Managed services designed to reduce administration, accelerate
deployment, and minimize the cost of analysis and experimentation
DynamoDB
Schema-less data store that enables fast deployment of new applications
without the burden of database administration
Relational Database Service (RDS)
Manage existing database applications without the effort required to
provision, upgrade, backup and scale highly available instances
ElastiCache
Accelerate data retrieval performance by caching data in memory and
avoiding slower disk-based systems
Elastic MapReduce (EMR)
Hadoop-based infrastructure service enabling the parallel processing of
massive amounts of data
7. Amazon Relational Database Service
RDS is a fully managed relational database service that is
simple to deploy, easy to scale, reliable and cost-effective
Choice of Database Engines
Fully Managed Service
Push Button Scalability
Fault Tolerance with Multi-AZ
Works with EC2 & ElastiCache
8. Amazon DynamoDB
DynamoDB is a fully managed NoSQL database
service that provides extremely fast and
predictable performance with seamless scalability
Authors of NoSQL
Zero Administration
Low-Latency SSDs
Unlimited Potential Storage and Throughput
9. AMAZON ELASTIC MAPREDUCE
Reduces complexity & cost of Hadoop Management
Integrates with AWS Services and 3rd Party vendors
Highly customizable
11. Amazon EMR is the #1 Enterprise Hadoop Solution
AWS is “the most prominent Hadoop cloud service provider” and “leads the pack (of Leaders) due to its proven, feature-rich Elastic MapReduce service…”
- The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
12. Success Story
Business Challenge
Needed a real-time analytics tool to determine dynamic live event pricing during the
ticket sales life cycle
Optimize event ticket pricing, improve yield management & generate incremental
revenue
AWS Services
• Elastic Load Balancer
• Amazon Elastic MapReduce
• Amazon SimpleDB
• Amazon Simple Email Service (SES)
• Amazon CloudWatch
Business Benefits
Ease of use, reducing developers’ infrastructure management time by 3 hours per day
Estimated 80% cost reduction annually, compared to fixed service costs
15. Who we are
The global marketer partner of choice for understanding, optimizing and driving revenue
Products: MarketShare Planner™, MarketShare Price™, MarketShare 360™, MarketShare Optimizer™
MarketShare Platform: Cloud modeling | SaaS infrastructure | Data connectors
• Recognized industry leader
• Cloud-based software solutions
• Over half the Fortune 100
• Strong media and agency partnerships
• Global presence
[Chart: categories Risky Bets, Contenders, Strong Performers, Leaders; axes Current Offering (Weak to Strong) and Strategy (Weak to Strong)]
16. Scale Complex Modeling
Terabytes per customer | 1000+ variables | 100+ Customers | 100+ data sources
[Diagram labels: Client Data, FTP, ETL, Reporting, Modeling, Sim-Opt; Data Architect, Modeler, Engineer; Modeling Stack, Sim-Opts Stack, Simulation Tool Stack, Production Application; Tables]
17. [Diagram: marketing data sources feeding the pipeline stages FTP, ETL, Reporting, Modeling, Simulation, Application]
Earned media: Organic search (Google, Bing), WOM, Blogs, Social media (Twitter, Facebook), PR
Owned media: Website, Content, Commerce
Paid media: Paid Search (Google, Bing), Display (Banner Ads, Video Ads), Print (Magazine, Newspaper), Broadcast (TV, Radio), Outdoor (Signs, Digital signage), Direct (Catalog, Direct mail, Mobile, email)
Controllable: Brand, Product (Innovation, Quality), Events (Conferences, Trade shows), Sales (Training), Service (Support), In store (Displays, Shelf space), Promotions (Discounts, Bundles, Coupons), Offering, Pricing
Non-controllable: Competition, Seasonality, Interest rates, Stock market, Economy
Other labels: Awareness, Consideration, Purchase
22. Many applications in production
• Marketing Efficiency
• Attribution
• Dynamic Pricing
23. The Technology That Makes It Possible
Elastic Cloud™ on AWS
[Diagram components: Elastic Load Balancer; Amazon EC2 Permanent Instances (Web Server, App Server); Amazon EC2 On-Demand Instances; Amazon Elastic MapReduce; Managed Storage (RDS Database Instance, Amazon Simple Storage Service (S3))]
30. Summary
Design your data pipeline for a multi-cluster environment
• Write configurable ETL that runs as independent, partitioned
workflows
• A cluster that stays up the entire month is not elastic
Save your intermediate results in low cost storage
• Think about compression
• Do not underestimate schema complexity
Loosely coupled architecture has failure points
• Save state obsessively
• Build restart-ability into your architecture
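The last two points (save state obsessively, build in restart-ability) can be illustrated with a minimal Python sketch. It is not MarketShare's implementation: the stage names, the JSON state file and the `run_pipeline` helper are all hypothetical, and a real pipeline would more likely keep its state in durable storage such as S3.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(stages, state_path):
    """Run (name, func) stages in order, skipping stages already recorded as done."""
    state_file = Path(state_path)
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    executed = []
    for name, func in stages:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        func()
        done.append(name)
        # Persist state after every stage so a crash loses at most one stage.
        state_file.write_text(json.dumps(done))
        executed.append(name)
    return executed


def demo():
    """Simulate one failed run followed by a restart (hypothetical stages)."""
    calls = []
    attempts = {"n": 0}

    def flaky_model():
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise RuntimeError("simulated cluster failure")
        calls.append("model")

    stages = [
        ("extract", lambda: calls.append("extract")),
        ("clean", lambda: calls.append("clean")),
        ("model", flaky_model),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "state.json"
        try:
            run_pipeline(stages, state)  # fails in the "model" stage
        except RuntimeError:
            pass
        resumed = run_pipeline(stages, state)  # restart: only "model" runs
    return calls, resumed
```

In this sketch the restart repeats only the failed stage, because the two earlier stages were recorded as complete before the failure.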
31. Programs to help you get started with Big Data on AWS
• Big Data Discovery Workshop: Identify and prioritize target Big Data use cases
• EMR Bootcamp: Deploy a sample use case with real customer data
• EMR Training: 3 day intensive developer training
32. EMR Training Schedule
• Los Angeles, CA – 10/16-10/18
• Boston, MA – 10/30-11/1
• Mountain View, CA – 11/13-11/15
• Dallas, TX – 11/27-11/29
• New York, NY – 12/11-12/13
Visit http://bit.ly/AWS_EMR_Training for class details and registration
We’ve been operating the service for over 3 years now, and in the last year alone we’ve operated over 2 million Hadoop clusters.
The Forrester Wave report named Amazon EMR the #1 enterprise Hadoop solution because of its integration with various data stores, its ecosystem of vendors and the number of customers the service supports.
Hi, my name is Anupam Singh. I am the Vice President of Technology at MarketShare.
MarketShare builds solutions for marketing organizations at Fortune 100 companies. Our customers provide us data and we provide cloud-based analytic applications to improve the efficiency of our customers’ marketing.
So, what are the big challenges that we face? Our entire business is based on scaling complex data modeling. Our scaling challenges are across 4 major dimensions. Each customer has tens of terabytes of data. The data comes from hundreds of data sources. This data has thousands of variables to analyze. And we need to do this for hundreds of customers. Let us look at the various stages to build a solution that scales.
The first stage is bringing the data together. Today’s marketing organization is faced with hundreds of data sources. Consider this picture where we bring together data from the customer’s website, the advertising logs from their vendors, revenue data from the ERP systems, variables like Seasonality & Economy. As you can see, we have to gather more than 40 data sources in this single picture. Just managing the storage for daily, weekly and monthly updates is a challenge.
A lot of this data is machine generated, and it is not ready for analytics. Each data source has to be scrubbed and cleaned through an ETL pipeline before doing analytics. Our ETL pipelines have 20-30 main stages with hundreds of sub-stages. Scheduling these and correcting data errors is one of our biggest technical challenges. We will dive deeper into this later. Once the data has been cleaned, it is ready for analytics.
Many of our customers have never seen these data sources in a single dashboard. Even before running the data through our proprietary modeling platform, we can help our customers get dashboards on what were previously data black holes.
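As a rough illustration of the scheduling problem described here, the Python sketch below orders a handful of ETL stages by their dependencies and groups them into waves that could run in parallel. The stage names and the dependency map are invented for the example; the talk does not list the real stages or say how they are scheduled.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key maps to the set of stages it depends on.
STAGE_DEPENDENCIES = {
    "parse_ad_logs": set(),
    "parse_web_logs": set(),
    "scrub_ad_logs": {"parse_ad_logs"},
    "scrub_web_logs": {"parse_web_logs"},
    "join_sources": {"scrub_ad_logs", "scrub_web_logs"},
    "load_reporting_tables": {"join_sources"},
}


def execution_waves(dependencies):
    """Group stages into waves; stages within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(STAGE_DEPENDENCIES), start=1):
        print(number, wave)
```

With the assumed dependencies, the two parse stages form the first wave, the two scrub stages the second, and the join and load stages follow one at a time.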
The term data scientist has been in vogue lately. At MarketShare, we have a large team of modelers who run modeling on the cloud. As the data has been cleaned up, the modelers run thousands of different equations. Many analytic applications stop their cloud usage at reporting. At MarketShare, we believe that reporting is not enough to answer the questions. Building a predictive model is key to answering business questions on terabytes of data. We use the cloud to build custom models for each one of our customers. We use the power of distributed systems to validate these models for accuracy.
Once the models have been prepared, they are deployed in an easy-to-use application. It should be noted that reducing big data should not mean that the user is lost in a forest of reports. At MarketShare, we believe in simplifying access to Big Data. We hide the model complexity behind easy-to-use applications that let our users build many different scenarios for their business.
So, what does all this give our customers? We have been able to release many different applications on top of this analytics pipeline. The first one is marketing efficiency. The second application is Attribution. The third one is Dynamic Pricing.
So, what makes this pipeline run? Our entire analytics workflow is built using various services from Amazon as building blocks. Our applications are deployed behind the Elastic Load Balancer service. The data is stored in storage services like S3 and RDS, and we are trying out DynamoDB. Our analytics jobs are executed on dynamic clusters provided by Elastic MapReduce.
So, let us quickly go under the hood. 3 years ago, we started with a Hadoop cluster to store all our data. Very quickly we noticed two important things with the cluster. The first observation was that however big we made the cluster, jobs kept running into each other. Try as we might, the cluster would get hot for some time when many different stages would start executing at the same time. The second observation was how unused the cluster was for large periods of time. So, while we were spending a lot of dollars on this large cluster, our customers were still unhappy with the response times!
So, what was our solution? We rewrote our entire data pipeline to run on many different clusters. So,
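The second observation can be put into numbers with a small Python sketch. Every figure in it (a 20-node cluster, a 720-hour month, one 4-hour job per day) is a made-up assumption for illustration, not MarketShare data.

```python
def node_hours_always_on(nodes, hours_in_period):
    """Node-hours consumed by a cluster that stays up for the whole period."""
    return nodes * hours_in_period


def node_hours_transient(jobs):
    """Node-hours consumed when each (nodes, hours) job gets its own short-lived cluster."""
    return sum(nodes * hours for nodes, hours in jobs)


def utilization(jobs, nodes, hours_in_period):
    """Fraction of the always-on cluster's capacity that the jobs actually use."""
    return node_hours_transient(jobs) / node_hours_always_on(nodes, hours_in_period)


if __name__ == "__main__":
    # Hypothetical workload: one 4-hour job per day on 20 nodes for 30 days.
    monthly_jobs = [(20, 4)] * 30
    print(node_hours_always_on(20, 720))      # capacity of the permanent cluster
    print(node_hours_transient(monthly_jobs))  # capacity the jobs really need
    print(utilization(monthly_jobs, 20, 720))
```

Under these assumed numbers the permanent cluster is busy for only about one sixth of its node-hours, which is the kind of gap that per-job clusters are meant to close.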
Big Data Discovery Workshop
• Brainstorm pilot use cases
• Identify data sources and formats
• Review business and financial drivers
• Recommended use cases
• Roadmap for data migration and production rollout
• Reference architecture
• Estimated pilot cost
• Next steps
EMR Bootcamp
• Interactive onsite workshop (is not classroom training)
• Work w/customer to architect, install, and config EMR
• Run and debug production job flows
• Customer’s dataset(s) must be on S3