4. Until now, the questions you asked drove the data model.
The new model is to collect as much data as possible: the “Data-First Philosophy”
5. Data is the new raw material for any business, on par with capital, people, and labor
6. Big Data
The collection and analysis of large amounts of data to create a competitive advantage
7. Big Data + Big Compute = Big Insights
[Diagram: ingest your Big Data from different sources, analyze the data in parallel, and get Big Insights]
8. Big Data
#1 Why does Big Data matter today?
#2 How does AWS address Big Data challenges?
9. Big Data Use cases
Media Transcoding
Retail Log Analysis
Web Analytics
Data Warehousing
Genome Sequencing
Bioinformatics
Digital Advertising
Financial Modeling
10. Big Data Analytics in the AWS Cloud
[Diagram: your Big Data flows from different sources into storage on Amazon S3]
11. Ingesting large amounts of data to the cloud
[Chart: data velocity (hours to days) vs. data volume and size (GBs to TBs); for a one-time upload with constant delta updates, transfer to S3 over the Internet (multi-threaded/multi-part)]
13. Ingesting large amounts of data to the cloud
[Chart: same axes as slide 11; as volumes grow into TBs, AWS Import/Export joins multi-threaded/multi-part Internet transfer to S3 as an ingestion option]
14. AWS Import/Export
[Diagram: ship drives (eSATA, USB 2.0, SATA) via AWS Import/Export into Amazon Simple Storage Service (S3), then process with Amazon Elastic Compute Cloud (EC2)]

Available Internet Connection    Theoretical min. days to transfer    When to consider
                                 1 TB at 80% network utilization      AWS Import/Export?
T1 (1.544 Mbps)                  82 days                              100 GB or more
10 Mbps                          13 days                              600 GB or more
T3 (44.736 Mbps)                 3 days                               2 TB or more
100 Mbps                         1 to 2 days                          5 TB or more
1000 Mbps                        Less than 1 day                      60 TB or more
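The table's break-even numbers are easy to sanity-check. Here is a quick back-of-the-envelope sketch in plain Python, assuming 1 TB = 2^40 bytes and 80% sustained utilization as the table does; it reproduces the middle column:

```python
# Back-of-the-envelope check of the transfer times in the table above.
# Assumes 1 TB = 2**40 bytes and 80% sustained network utilization.
SECONDS_PER_DAY = 86_400

def days_to_transfer(terabytes: float, link_mbps: float,
                     utilization: float = 0.8) -> float:
    """Theoretical minimum days to move `terabytes` over a `link_mbps` line."""
    bits = terabytes * (2 ** 40) * 8              # payload size in bits
    bits_per_second = link_mbps * 1e6 * utilization
    return bits / bits_per_second / SECONDS_PER_DAY

for name, mbps in [("T1", 1.544), ("10 Mbps", 10), ("T3", 44.736),
                   ("100 Mbps", 100), ("1000 Mbps", 1000)]:
    print(f"{name:>9}: {days_to_transfer(1, mbps):6.1f} days per TB")
```

T1 comes out to roughly 82 days and 10 Mbps to roughly 13, matching the table.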
15. Ingesting large amounts of data to the cloud
[Chart: the complete picture; transfer to Amazon S3 over the Internet (multi-threaded/multi-part) for GB-scale data, UDP transfer software (Aspera, Tsunami…) for one-time uploads with constant delta updates measured in hours, and AWS Import/Export for TB-scale volumes]
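As a concrete illustration of the multi-threaded/multi-part option, here is a minimal sketch using today's boto3 SDK, which postdates this deck; the bucket and file names are hypothetical:

```python
# Minimal sketch: multi-threaded, multi-part upload to Amazon S3 with boto3.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Split objects larger than 64 MB into 16 MB parts, uploaded on 10 threads.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=10,
    use_threads=True,
)

s3.upload_file(
    "clickstream-2012-04-22.log.gz",        # local file (hypothetical)
    "my-big-data-bucket",                   # target bucket (hypothetical)
    "logs/clickstream-2012-04-22.log.gz",   # object key
    Config=config,
)
```

With use_threads enabled, the parts upload in parallel, which is what pulls GB-scale transfers out of the "days" region of the chart.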
16. Big Data Analytics in the AWS Cloud
[Diagram: data from different sources lands in storage (Amazon S3), is processed by compute and analytics (Amazon EMR with Hadoop, Amazon EC2), and is served from databases (Amazon RDS, Amazon DynamoDB); optimize with Amazon EC2 Spot Instances and by expanding/shrinking the running cluster; real-time access to analytical reports]
17. Hadoop + Amazon Elastic MapReduce
[Diagram: upload large datasets or log files directly from the data source to Amazon S3 (data input and data output); Amazon Elastic MapReduce runs a Hadoop cluster in which a name node coordinates core nodes (HDFS) and task nodes; mapper/reducer code and scripts (HiveQL, Pig Latin, Cascading) run as multiple JobFlow steps; query output lands in Amazon S3 and Amazon DynamoDB, and BI apps connect over JDBC/ODBC]
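To make the JobFlow idea concrete, here is a hedged sketch that launches an EMR cluster and runs a single Hive step, using today's boto3 SDK rather than the elastic-mapreduce CLI of the deck's era; the cluster sizing, role names, and S3 script path are hypothetical:

```python
# Sketch: launch an Amazon EMR (Hadoop) cluster and run one Hive step.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="clickstream-analysis",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",     # the name node
        "SlaveInstanceType": "m5.xlarge",      # core/task nodes
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "hive-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-big-data-bucket/scripts/aggregate.q"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

Hadoop Streaming, Pig, and custom JAR steps follow the same shape; only the Args list changes.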
18. This is where the cloud really shines
[Diagram: the same architecture as slide 16: Amazon S3 storage; Amazon EMR (Hadoop) and Amazon EC2 compute; Amazon RDS and Amazon DynamoDB databases; optimized with Amazon EC2 Spot Instances and by expanding/shrinking the running cluster, with real-time access to analytical reports]
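The expand/shrink box is a single API call in practice. A minimal boto3 sketch; the cluster and instance-group IDs are hypothetical placeholders:

```python
# Sketch: resize one instance group of a running EMR cluster.
import boto3

emr = boto3.client("emr")

def resize_group(cluster_id: str, group_id: str, count: int) -> None:
    """Request a new target size for an instance group on a live cluster."""
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": group_id,
                         "InstanceCount": count}],
    )

resize_group("j-EXAMPLE", "ig-COREEXAMPLE", 100)  # expand for the nightly run
resize_group("j-EXAMPLE", "ig-COREEXAMPLE", 10)   # shrink when the job is done
```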
19. Big Data
#1 Why does Big Data matter today?
#2 How does AWS address Big Data challenges?
#3 What are enterprises doing today?
20. #1 Reduced Time To Market
1 instance for 500 hours = 500 instances for 1 hour
You choose where to balance cost against time
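The arithmetic behind the slide, as a tiny sketch; the $0.10/hour rate is a hypothetical example, not a quoted AWS price:

```python
# Under pure hourly, on-demand pricing the bill is identical for any split
# of 500 instance-hours; only the wall-clock time to results changes.
RATE_USD_PER_INSTANCE_HOUR = 0.10  # hypothetical on-demand rate

for instances, hours in [(1, 500), (50, 10), (500, 1)]:
    cost = instances * hours * RATE_USD_PER_INSTANCE_HOUR
    print(f"{instances:3d} instances x {hours:3d} h = ${cost:.2f}, "
          f"results in {hours} hour(s)")
```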
21. Bank – Monte Carlo Simulations
23 hours down to 20 minutes
“The AWS platform was a good fit for its unlimited and flexible computational power to our risk-simulation process requirements. With AWS, we now have the power to decide how fast we want to obtain simulation results, and, more importantly, we have the ability to run simulations not possible before due to the large amount of infrastructure required.” – Castillo, Director, Bankinter
22. #2 Now every employee in your company can have their own supercomputer
23. Recommendation Engine for Investment Bankers
[Diagram: S&P Capital IQ feeds clicks, key developments, and company profiles from Microsoft SQL Server and Amazon S3 into Amazon Elastic MapReduce jobs (compute user selectivity, compute key developments, join & score), writing “Companies You May Be Interested In” results back to Amazon S3]
“We see continued value in using the AWS cloud because of the flexibility and the scalability. We have a long queue of projects and we envision using AWS to help us get there.” – Jeff Sternberg, Data Science Lead, Capital IQ / Standard & Poor’s
24. #3 Elasticity is one of the fundamental
properties of the cloud that drives many of its
economic benefits
25. When you turn off your cloud resources,
you actually stop paying for them
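In API terms, turning resources off is one call. A minimal boto3 sketch (the instance ID is a placeholder); note that a stopped EC2 instance stops accruing instance-hour charges, though attached EBS storage still bills:

```python
# Sketch: stop an EC2 instance outside business hours, start it again later.
import boto3

ec2 = boto3.client("ec2")

ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])   # stop paying for compute
ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])  # resume for the next run
```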
26. Elasticity on Wall Street & Amazon EC2
[Graph: number of EC2 instances for one firm’s risk-management processes, Wednesday 4/22/2009 through Tuesday 4/28/2009, ramping to 3,000 CPUs on weeknights and dropping to 300 CPUs on weekends]
28. Clickstream log analysis
Daily batch processing requirement: 3.5 billion records, 71 million unique cookies, and 1.7 million targeted ads required per day
[Diagram: several TBs of clickstream logs per day feed a daily online ad-spend analysis; results are compiled to optimize the next day’s ad spend]
29. Clickstream Log Analysis
Example query: a user recently purchased a sports movie and is searching for video games; analyze the clickstream logs, derive patterns from similar users’ purchase behavior, and serve a targeted ad (1.7 million per day).
Old way: SAN storage; 30 servers for compute; 3 high-end SQL servers
New way: cloud services; Hadoop and Cascading; “ad serving” integration
Business results, old way: upfront CapEx ~$500K; significant recurring OpEx; procurement time of 2 months; processing time of 2 days per job
Business results, new way: upfront CapEx $0; recurring OpEx $13K/mo.; zero procurement time; processing time of 8 hours per job
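Razorfish's pipeline was built on Hadoop and Cascading (Java); purely as an illustration of the pattern, here is the mapper half of an equivalent Hadoop Streaming job in Python. The tab-separated log layout is a hypothetical stand-in:

```python
# Illustration only: mapper for a clickstream aggregation as a Hadoop
# Streaming job. Assumes hypothetical tab-separated records of
# cookie_id, timestamp, ad_category.
import sys

def main() -> None:
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records rather than failing the job
        cookie_id, _timestamp, ad_category = fields[:3]
        # Emit cookie -> category; a reducer keyed on cookie_id then counts
        # pairs to build the purchase-behavior profile used for targeting.
        print(f"{cookie_id}\t{ad_category}")

if __name__ == "__main__":
    main()
```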
30. Cloud Accelerates Big Data Analytics
500%
Increase in Return on Ad Spend from last year
31. 3 Takeaways
#1 Why does Big Data matter today?
The Data-First Philosophy and Big Data Analytics
#2 How does AWS address Big Data challenges?
Amazon EMR, Amazon EC2, AWS Import/Export, Amazon DynamoDB
#3 What are enterprises doing today?
Capital IQ, Bankinter, Razorfish
32. Big Thank you!
Jinesh Varia
jvaria@amazon.com | Twitter: @jinman
Editor's notes
In the old days, you knew beforehand what questions you were going to ask. Those questions drove the data model, the data model usually drove how you stored the data, and it also drove how you collected the data.
Now the philosophy around data has changed. The philosophy is to collect as much data as possible before you know what questions you are going to ask, and, most importantly, before you know which algorithms you will run, because you don't know what questions you might need to answer in the future. The ultimate mantra: collect and measure everything. How you will refine those algorithms, how much data, how much processing power, how many resources you will need: you really don't know. Big Data is what clouds are for; Big Data analysis and cloud computing are a perfect marriage. If you are really serious about this new style of data analysis, you should not be worried about the amount of computation available. You should be completely free from that constraint. Collect and store without limits. Compute and analyze without limits. Visualize without limits.
Data is the next industrial revolution. Today, the core of any successful company is the data it manages and its ability to effectively model, analyze, and process that data quickly, almost in real time, so that it can make the right decisions faster and rise to the top.
Big Data is all about storing, processing, analyzing, sharing, distributing and visualizing massive amounts of data so that companies can distill knowledge from it, derive valuable business insights from that knowledge, and make better business decisions, all as quickly as possible.
Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of their clients. The bank runs at least 400,000 simulations to get realistic results. Through the use of AWS, Bankinter brought average time-to-solution down from 23 hours to 20 minutes and dramatically reduced processing time, with the ability to reduce it even further when required.
Cloud is highly cost-effective because you can turn resources off and stop paying for them when you don't need them or your users are not accessing them. Build websites that sleep at night.
Only happens in the cloud
This is a real usage graph from one of our financial services customers during the last week of April (they have asked to remain anonymous for competitive reasons). Firms on Wall Street are finding EC2 an ideal environment for many of their daily mission-critical grid computing and CPU-bound applications, for a couple of key reasons: (1) flexibility, since the ability to instantly access hundreds or thousands of cores increases the amount of data they can process, improving the overall quality of their models; and (2) cost efficiency, since they can complete more of their processing for less total spend (not paying for infrastructure during times of day and on weekends when it is not needed). This Wall Street firm in particular has a nightly business process in which they upload the day's market trading data into S3 and then run proprietary risk-management algorithms. This lasts about 10 hours on weeknights, during which they ramp up to the equivalent of 3,000 m1.smalls. During the day and on weekends, they maintain a base of roughly 300 cores to handle their always-on workloads.
The first story is about the cost of storing and analyzing Big Data. A large retailer went to Razorfish to analyze massive amounts of clickstream logs from their website. They analyze massive datasets of clickstream logs and provide patterns to their ad-serving and cross-selling engines so that they can show a targeted ad. While clickstream log analysis is not new in our industry, what I learned from this story is that the cost of storing and analyzing big data has been significantly reduced.