1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:Invent
ABD329 - A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale
Jeff Carter, VP, Big Data Technologies, Amazon.com
2.
Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.
3.
The Battle for the Future
VS.
4.
Big Data Example – The Smart Trashcan
6.
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
7.
The Industry Problem
[Chart: Growth in Data (mostly Unstructured) & Analytics vs. Average Growth in Traditional DW Data vs. Average IT Budget]
9.
What is Amazon?
10.
Our vision is to be earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.
13.
Amazon Data Warehouse
14.
The Amazon Enterprise Data Warehouse
The Good!
Helps to Run the Amazon Business
• Most Comprehensive Set of Cleansed and Curated Business Data
• Feeds Many Downstream Systems and Processes
• Batch Processing, Reporting and Ad Hoc
• 500k+ Data Loads/Transformations Each Day
• 200k+ Queries/Extracts Each Day
• 20k+ Active Tables
• 10B+ Rows Loaded Daily
Our Data is Big!
• Core Data Set: 5+ PB of Compressed Data (primarily limited by Legacy Technology)
• Total Storage (Multiple Systems): 35+ PB compressed
• Quote from Executive at Legacy DW Vendor:
• ~1000x Larger than any other DW Customer (from that Vendor)
Significant and Increasing Use of Redshift and EMR
• 1000s of Redshift and EMR Systems, Ranging in size from:
• Individual Contributor - Project Based, to
• Running Multi-Billion Dollar Business inside Amazon
15.
Example of an Amazon DW “Customer” Team
• Who are we?
• Analytics on the “Marketplace”
• Analytics Spokes: Pricing, B2B, Seller Support, Lending …
• Business Scale:
• 235MM monthly CPU Minutes on Legacy ODW
• 2K upstream tables
• Users:
• Supports 170 teams
• 1000 users with 9527 profiles (Parameterized Queries)
• 20K unique job runs per month
• 2800 datasets (800 TB)
• BI Tool Users:
• 3000+ Users, 650 non-tech
• 600+ “Dashboards”
• Hundreds of thousands of queries each month
17.
“Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License
18.
What is the Goal?
To Provide an analytic ecosystem that Scales with the Amazon Business
To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers
To Provide Choice and Options in New Analytic Technologies
• Provide a SQL-based solution
• Increasingly Focus on Enabling new analytic approaches, including Machine Learning and Programmatic Data Analysis
• Enable both “Bring Your Own Cluster” and “Bring Your Own Query” Approaches
19.
“Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
20.
Amazon EMR (running Hive, Pig, Spark, Presto, etc.)
Amazon DynamoDB
Amazon Machine Learning
Amazon QuickSight
Amazon RDS
Amazon Elasticsearch Service
Amazon Redshift
Amazon Athena
Amazon SQS
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon S3
Amazon Kinesis
Open-source tools (e.g. for ML, data science)
Commercial tools
21.
Moving Forward - AWS
• S3 / EDX - Separate Storage from Compute by leveraging a parallel file system as a global data exchange
• Redshift - Preferred platform for SQL-based Analysis and traditional Data Warehouse Data
• Focus is “Business Users”
• EMR – Scalable “Do Everything” Platform - Enable Teams who have chosen EMR by providing Curated Data
• Focus is “Programmatic Access”
22.
The Amazon “Data Lake” – Project Name “Andes”
The Goal: “THE” Place for Data at Amazon
• Source teams (Data Producers) put their Public Data there to give access to Analytic teams (Data Consumers) and to share private data within their team
• EMR Can Directly Access the Data in Parallel from Andes
• Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in Parallel with Spectrum
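The two Redshift access paths named on this slide can be sketched as the SQL a team might issue against S3-backed data. This is a minimal illustration, not Andes itself: the bucket, catalog database, table, and IAM role names are hypothetical, and the Parquet file format is an assumption the slides do not state.

```python
# Sketch only: every name below (bucket, role ARN, table, schema) is hypothetical.

def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY that bulk-loads files under an S3 prefix into a local table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"  # assumed file format
    )


def spectrum_schema_statement(schema: str, catalog_db: str, iam_role: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA so Redshift Spectrum can query S3 data in place."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"

# Path 1: load the data into Redshift-managed storage.
load_sql = copy_statement("orders", "s3://example-data-lake/orders/", ROLE)

# Path 2: leave the data in S3 and query it through an external schema.
spectrum_sql = spectrum_schema_statement("lake", "example_catalog_db", ROLE)

print(load_sql)
print(spectrum_sql)
```

Loading trades extra storage and a copy step for local query performance, while the external schema avoids the copy and reads the lake directly.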
23.
An Architecture that Scales with the Business
“Datamarts”
Number of Teams using the DW: ~2300
Number of Tables Used per Team:
• Max: 598
• Min: 1
• Average: 49
Ad-Hoc (any data any time) can be achieved via:
• EMR, which can access the Data in Andes Directly
• Redshift, which can load data into the Redshift file system, or use the Spectrum Feature to directly access the Data in Andes
Amazon Internal Team (132 Tables)
24.
Putting The Pieces Together
The Analytic Architecture of the Future
[Diagram labels: Source Systems; The Data Lake “Andes”; Big Data Systems; Data Warehouses; “Bring Your Own Cluster” and “Bring Your Own Query”; Services and Users]
[Diagram components: PostgreSQL instance; Amazon Redshift (shown three times); Amazon Kinesis; AWS Glue; Amazon QuickSight; Amazon Athena; Amazon Machine Learning]
25.
The Battle for the Future
The Data Lake becomes the common source for all data:
• The DW becomes the compute engine for traditional structured data (Redshift)
• EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases
• Both become a form of a Dependent data mart with the data coming from the Data Lake
Vs.
AND
27.
Purchase
Contract
seller buyer
28.
Table Subscriptions - The Vision
29.
Subscription
“Big Data Technologies” Team
producer consumer
31.
Data Value Chain
Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by
Dinosoft Labs;
COLLECT STORE DISCOVER SUBSCRIBE DELIVER ANALYZE
32.
Data Value Chain: COLLECT
Producers only need to integrate their datasets once with the data lake
• Simplified onboarding process
• One-time integration
Ingest from various source systems:
• Relational databases – e.g., Amazon Aurora/RDS Postgres
• Non-relational databases – e.g., Amazon DynamoDB
• Streams – e.g., Amazon Kinesis
• Flat files – e.g., files in Amazon S3
33.
Data Value Chain: STORE
Secure and scalable data lake:
• Highly durable S3-based storage
• Scalable since it’s built on AWS technologies
• Permissions are strictly enforced
Data quality:
• Certified with data quality checks
• Schemas are validated
34.
Data Value Chain: DISCOVER
Company-wide data search index
• Consumers can quickly find what they’re looking for
• Useful information about the datasets is shown
Clear communication:
• Producers can communicate expectations around data quality and SLAs
• Consumers can contact producers
35.
Data Value Chain: SUBSCRIBE
Easy process to subscribe to data:
• Find a dataset of interest
• Click “Subscribe”
• Choose the destination compute platform
Rapidly populate data marts, for example:
• Use AWS CloudFormation to provision a Redshift cluster
• Use subscriptions to load datasets to the cluster
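As a rough illustration of the "provision a Redshift cluster with CloudFormation" step, the sketch below builds a minimal template as a Python dictionary and serializes it to JSON. The resource name, database name, user name, node type, and node count are all assumptions chosen for the example; the slides do not show the templates Amazon teams actually use.

```python
import json


def redshift_cluster_template(node_type: str = "dc2.large", nodes: int = 2) -> dict:
    """Return a minimal CloudFormation template for one Redshift cluster (hypothetical values)."""
    properties = {
        "ClusterType": "multi-node" if nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": "datamart",                  # hypothetical database name
        "MasterUsername": "datamart_admin",    # hypothetical user name
        # The password is supplied at deploy time instead of being stored in the template.
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    if nodes > 1:
        # NumberOfNodes only applies to multi-node clusters.
        properties["NumberOfNodes"] = nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example data mart cluster (illustrative sketch)",
        "Parameters": {
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


template = redshift_cluster_template()
template_json = json.dumps(template, indent=2)
print(template_json)
```

The resulting JSON could be supplied as the template body when creating a stack; the subscription step that then loads datasets into the cluster is internal to Amazon and is not modeled here.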
36.
Data Value Chain: DELIVER
Subscriptions mechanism:
• Makes data available to the compute platform where it can be analyzed
• Keeps the compute platform in sync with any data updates
• Users can monitor the sync status of their subscriptions
Synchronizations can be either:
• Full data copy
• Metadata-only sync
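One way to picture the two synchronization modes and the sync status is the small model below. It is a hypothetical sketch for illustration only: the class names, the version counter, and the status strings are assumptions, not the interface of Amazon's internal subscription service.

```python
from dataclasses import dataclass
from enum import Enum


class SyncMode(Enum):
    FULL_COPY = "full_copy"          # data is copied into the compute platform
    METADATA_ONLY = "metadata_only"  # only table definitions are synced; data stays in the lake


@dataclass
class Subscription:
    dataset: str
    target: str              # destination compute platform, e.g. a cluster name
    mode: SyncMode
    synced_version: int = 0  # last dataset version applied to the target (assumed concept)

    def is_in_sync(self, latest_version: int) -> bool:
        """Report whether the target already reflects the latest dataset version."""
        return self.synced_version >= latest_version

    def sync(self, latest_version: int) -> str:
        """Bring the target up to date and describe the action that was taken."""
        if self.is_in_sync(latest_version):
            return "up to date"
        action = "copied data" if self.mode is SyncMode.FULL_COPY else "refreshed metadata"
        self.synced_version = latest_version
        return f"{action} for {self.dataset} v{latest_version} on {self.target}"


full = Subscription("orders", "redshift-cluster-a", SyncMode.FULL_COPY)
meta = Subscription("orders", "emr-cluster-b", SyncMode.METADATA_ONLY)

print(full.sync(3))  # first sync performs a full copy
print(meta.sync(3))  # metadata-only sync leaves the data in the lake
print(full.sync(3))  # nothing to do the second time
```

A full copy suits platforms that keep their own storage, while a metadata-only sync suits engines that read the lake in place.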
37.
Data Value Chain: ANALYZE
Teams can use the right tools for the job, e.g.:
• Amazon Redshift for interactive analytics or batch scheduled jobs
• Amazon EMR for machine learning and data science
• Amazon QuickSight for business analytics and visualizations
Compute resources can be scaled independently of the data lake in order to:
• Process more/bigger/faster jobs
• Optimize costs
• Meet business SLAs
• Scale to meet high peak workloads
38.
Data Value Chain
COLLECT STORE DISCOVER SUBSCRIBE DELIVER ANALYZE
39.
What is the Goal?
To Provide an analytic ecosystem that Scales with the Amazon Business
To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers
To Provide Choice and Options in New Analytic Technologies
• Provide a SQL-based solution
• Increasingly Focus on Enabling new analytic approaches, including Machine Learning and Programmatic Data Analysis
• Enable both “Bring Your Own Cluster” and “Bring Your Own Query” Approaches
40.
Andes – Current State
• We have the data!
• 20k+ Tables maintained in Andes – All Active Tables have been Sourced from the Enterprise Data Warehouse
• Many teams are adding new data sets!
• Have Onboarded 900+ Redshift and EMR systems to Subscriptions
• 20,000+ tables being synchronized
• Usage off the Legacy DW
• Three years (2014-2016) to grow from 0 to 100k Jobs each Day
• In 2017, has grown from 100k to 300k Jobs each Day
Amazon.com Big Data Technologies
41.
Data producers
(Amazon teams that want to share
data with other teams)
"Big Data Marketplace"
42.
THANK YOU!