© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
How Amazon.com uses AWS Analytics
Bill Baldwin
Technical Evangelist
bbaldwin@amazon.com
Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.
The Battle for the Future
VS.
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
The Industry Problem
[Chart: Growth in Data (mostly Unstructured) & Analytics vs. Average Growth in Traditional DW Data vs. Average IT Budget]
What is Amazon?
Our vision is to be Earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.
Amazon Data Warehouse
The Amazon Enterprise Data Warehouse
The Good!
Helps to Run the Amazon Business
• Most Comprehensive Set of Cleansed and Curated Business Data
• Feeds Many Downstream Systems and Processes
• Batch Processing, Reporting and Ad Hoc
• 500k+ Data Loads/Transformations Each Day
• 200k+ Queries/Extracts Each Day
• 20k+ Active Tables
• 10B++ Rows Loaded Daily
Our Data is Big!
• Core Data Set: 5+PB of Compressed Data (primarily limited by Legacy Technology)
• Total Storage (Multiple Systems): 35+ PB compressed
• Quote from Executive at Legacy DW Vendor:
• ~1000x Larger than any other DW Customer (from that Vendor)
Significant and Increasing Use of Redshift and EMR
• 1000’s of Redshift and EMR Systems, Range in size from:
• Individual Contributor - Project Based, to
• Running Multi-Billion Dollar Business inside Amazon
Example of an Amazon DW “Customer” Team
• Who are we?
  • Analytics on the “Marketplace”
  • Analytics Spokes: Pricing, B2B, Seller Support, Lending …
• Business Scale:
  • 235MM monthly CPU Minutes on Legacy ODW
  • 2K upstream tables
• Users:
  • Supports 170 teams
  • 1000 users with 9527 profiles (Parameterized Queries)
  • 20K unique job runs per month
  • 2800 (800 TB) datasets
• BI Tool Users:
  • 3000+ Users, 650 non-tech
  • 600+ “Dashboards”
  • 100k’s of queries each month
“Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License
What is the Goal?
To Provide an analytic ecosystem that Scales with the
Amazon Business
To Leverage AWS Technologies and to help Improve these
technologies for all Amazon Customers
To Provide Choice and Options in New Analytic Technologies
• Provide an SQL based solution
• Increasingly Focus on Enabling new analytic approaches
including Machine Learning and Programmatic Data
Analysis
• Enable both “Bring Your Own Cluster” and “Bring your
Own Query” Approaches
“Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
Amazon EMR (running Hive, Pig, Spark, Presto, etc.)
Amazon DynamoDB
Amazon Machine Learning
Amazon QuickSight
Amazon RDS
Amazon Elasticsearch Service
Amazon Redshift
Amazon Athena
Amazon SQS
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon S3
Amazon Kinesis
Open-source tools (e.g. for ML, data science)
Commercial tools
Moving Forward - AWS
• S3 / EDX - Separate Storage from Compute by leveraging a parallel file system as a global data exchange
• Redshift - Preferred platform for SQL based Analysis and traditional Data Warehouse Data
  • Focus is “Business Users”
• EMR - Scalable “Do Everything” Platform - Enable Teams who have chosen EMR by providing Curated Data
  • Focus is “Programmatic Access”
Amazon
Redshift
The Amazon “Data Lake” – Project Name “Andes”
The Goal: “THE” Place for Data at Amazon
• Source teams (Data Producers) put their Public Data there to give access to Analytic
teams (Data Consumers) and to share private data within their team
• EMR Can Directly Access the Data in Parallel from Andes
• Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in
Parallel with Spectrum
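The two Redshift access paths above (loading a copy versus querying in place with Spectrum) can be sketched as the SQL each one would issue. This is a minimal illustration, not Andes itself: the slides do not show how Andes lays out its data, so the S3 prefix, IAM role ARN, schema, table, and column names below are all hypothetical placeholders, and Parquet is assumed as the file format. The statements follow standard Redshift `COPY`, `CREATE EXTERNAL SCHEMA`, and `CREATE EXTERNAL TABLE` syntax.

```python
from typing import Dict, List


def copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY that loads Parquet files under an S3 prefix.

    Redshift splits the files under the prefix across its slices, which is
    what makes the load parallel.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def spectrum_statements(
    schema: str,
    catalog_db: str,
    table: str,
    columns: Dict[str, str],
    s3_prefix: str,
    iam_role_arn: str,
) -> List[str]:
    """Build the two statements that expose the same files through Spectrum.

    No data is copied: the external table only points at the S3 location.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} ({column_list})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_prefix}';"
    )
    return [create_schema, create_table]


if __name__ == "__main__":
    # All identifiers below are made-up examples.
    role = "arn:aws:iam::123456789012:role/example-redshift-role"
    prefix = "s3://example-data-lake/orders/"
    print(copy_statement("orders", prefix, role))
    for stmt in spectrum_statements(
        "lake", "example_catalog", "orders",
        {"order_id": "BIGINT", "amount": "DOUBLE PRECISION"}, prefix, role,
    ):
        print(stmt)
```

The trade-off is the usual one: `COPY` gives local-storage query performance at the cost of keeping a second copy in sync, while the external table always reads the current files in the lake.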
“Datamarts”
Number of Teams using the DW: ~2300
Number of Tables Used per Team:
• Max: 598
• Min 1
• Average: 49
Ad-Hoc (any data, any time) can be achieved via:
• EMR, which can access the Data in Andes directly
• Redshift, which can load data into the Redshift file system or use the Spectrum feature to directly access the Data in Andes
An Architecture that Scales with the Business
Amazon Internal Team (132 Tables)
Putting The Pieces Together
The Analytic Architecture of the Future
Source Systems
The Data Lake “Andes”
Big Data Systems
Data Warehouses
“Bring Your Own Cluster” and “Bring Your Own Query”
Services and Users
[Diagram labels: PostgreSQL instance; Amazon Redshift (three clusters); Amazon Kinesis; AWS Glue; Amazon QuickSight; Amazon Athena; Amazon Machine Learning]
The Battle for the Future
The Data Lake becomes the common source for all data:
• The DW becomes the compute engine for traditional structured data (Redshift)
• EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases
• Both become a form of a Dependent data mart with the data coming from the Data Lake
Vs.
AND
[Diagram: a seller and a buyer linked by a Purchase Contract]
Table Subscriptions - The Vision
[Diagram: a producer and a consumer linked by a Subscription; “Big Data Technologies” Team]
Data Value Chain
Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by
Dinosoft Labs;
COLLECT → STORE → DISCOVER → SUBSCRIBE → DELIVER → ANALYZE
Producers only need to integrate their datasets once
with the data lake
• Simplified onboarding process
• One-time integration
Ingest from various source systems:
• Relational databases – e.g., Amazon Aurora/RDS
Postgres
• Non-relational databases – e.g., Amazon DynamoDB
• Streams – e.g., Amazon Kinesis
• Flat files – e.g., files in Amazon S3
Secure and scalable data lake:
• Highly durable S3-based storage
• Scalable since it’s built on AWS technologies
• Permissions are strictly enforced
Data quality:
• Certified with data quality checks
• Schemas are validated
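As an illustration of what "schemas are validated" could mean in practice, here is a minimal sketch of checking a batch of rows against a declared schema before it is accepted into the lake. The slides do not describe the actual checks, so the function name `validate_rows`, the schema representation (column name mapped to a Python type), and the error messages are all assumptions made for this example.

```python
from typing import Any, Dict, List


def validate_rows(schema: Dict[str, type], rows: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none).

    Checks three things per row: missing columns, unexpected columns,
    and values whose Python type does not match the declared type.
    """
    errors: List[str] = []
    for i, row in enumerate(rows):
        for col in sorted(set(schema) - set(row)):
            errors.append(f"row {i}: missing column '{col}'")
        for col in sorted(set(row) - set(schema)):
            errors.append(f"row {i}: unexpected column '{col}'")
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(
                    f"row {i}: column '{col}' expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return errors


if __name__ == "__main__":
    # Hypothetical dataset schema and rows.
    schema = {"order_id": int, "amount": float}
    rows = [{"order_id": 1, "amount": 9.5}, {"order_id": "x"}]
    for problem in validate_rows(schema, rows):
        print(problem)
```

A producer-side check like this would run once at publish time, so every consumer can rely on the same guarantee instead of re-validating the data themselves.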
Company-wide data search index
• Consumers can quickly find what they’re looking
for
• Useful information about the datasets is shown
Clear communication:
• Producers can communicate expectations
around data quality and SLAs
• Consumers can contact producers
Easy process to subscribe to data:
• Find a dataset of interest
• Click “Subscribe”
• Choose the destination compute platform
Rapidly populate data marts, for example:
• Use AWS CloudFormation to provision Redshift
cluster
• Use subscriptions to load datasets to the cluster
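The provisioning step of the data mart example could look roughly like the following sketch, which builds a small CloudFormation template for a Redshift cluster as a Python dictionary. The slides do not show the template Amazon uses; the function name, the logical resource ID `DataMartCluster`, the database name, and the default node type are assumptions for illustration. The resource type `AWS::Redshift::Cluster` and the properties used are part of the public CloudFormation resource specification.

```python
import json
from typing import Any, Dict


def redshift_datamart_template(
    db_name: str, node_type: str = "dc2.large", nodes: int = 2
) -> Dict[str, Any]:
    """Build a CloudFormation template (as a dict) for one Redshift cluster.

    Credentials are template parameters so they are not hard-coded;
    NoEcho keeps the password out of console and API output.
    """
    properties: Dict[str, Any] = {
        "ClusterType": "multi-node" if nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": db_name,
        "MasterUsername": {"Ref": "MasterUsername"},
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    # NumberOfNodes is only valid for multi-node clusters.
    if nodes > 1:
        properties["NumberOfNodes"] = nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "MasterUsername": {"Type": "String"},
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    # "sales_mart" is a made-up database name.
    print(json.dumps(redshift_datamart_template("sales_mart"), indent=2))
```

The printed JSON can be saved to a file and passed to CloudFormation to create the stack; loading the subscribed datasets into the new cluster would then be a separate step.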
Subscriptions mechanism:
• Makes data available to the compute platform where
it can be analyzed
• Keeps the compute platform in sync with any data
updates
• Users can monitor the sync status of their
subscriptions
Synchronizations can be either:
• Full data copy
• Metadata-only sync
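The two synchronization modes can be pictured with a small model. This is a hypothetical sketch only: the slides do not describe the internal design of the subscription service, so the class names, the version counter used to detect updates, and the action strings are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SyncMode(Enum):
    FULL_COPY = "full_copy"          # data is copied onto the compute platform
    METADATA_ONLY = "metadata_only"  # only table definitions are refreshed


@dataclass
class Subscription:
    dataset: str
    target: str              # e.g. the name of a Redshift cluster or EMR system
    mode: SyncMode
    synced_version: int = 0  # last dataset version delivered to the target


def sync_status(sub: Subscription, latest_version: int) -> str:
    """Report whether the target has the latest published version."""
    return "in sync" if sub.synced_version >= latest_version else "behind"


def plan_sync(sub: Subscription, latest_version: int) -> List[str]:
    """List the actions needed to bring the target up to date."""
    if sub.synced_version >= latest_version:
        return []
    if sub.mode is SyncMode.FULL_COPY:
        return [
            f"copy {sub.dataset} v{latest_version} to {sub.target}",
            f"update catalog entry for {sub.dataset} on {sub.target}",
        ]
    return [f"update catalog entry for {sub.dataset} on {sub.target}"]


if __name__ == "__main__":
    # Dataset and target names are made up.
    sub = Subscription("orders", "example-redshift-cluster", SyncMode.FULL_COPY)
    print(sync_status(sub, latest_version=3))
    for action in plan_sync(sub, latest_version=3):
        print(action)
```

In this model a metadata-only subscription never moves data, which matches a platform that reads the lake in place, while a full copy suits a platform that queries its own local storage.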
Teams can use the right tools for the job, e.g.:
• Amazon Redshift for interactive analytics or batch
scheduled jobs
• Amazon EMR for machine learning and data
science
• QuickSight for Business analytics and visualizations
Compute resources can be scaled independently
of the data lake in order to:
• Process more/bigger/faster jobs
• Optimize costs
• Meet business SLAs
• Scale to meet high peak workloads
Andes – Current State
• We have the data!
• 20k+ Tables maintained in Andes – All Active Tables
have been Sourced from the Enterprise Data
Warehouse
• Many teams are adding new data sets!
• Have Onboarded 900+ Redshift and EMR systems to
Subscriptions
• 20,000+ tables being synchronized
• Usage off the Legacy DW
• Three years (2014-2016) to grow from 0 to 100k Jobs
each Day
• In 2017, has grown from 100k to 300k Jobs each Day
Amazon.com Big Data Technologies
Data producers
(Amazon teams that want to share
data with other teams)
"Big Data Marketplace"
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

How Amazon.com uses AWS Analytics

  • 1. How Amazon.com uses AWS Analytics. Bill Baldwin, Technical Evangelist, bbaldwin@amazon.com. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Traditional Data Warehousing. Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.
  • 3. The Battle for the Future: VS.
  • 5. https://www.promptcloud.com https://john-popelaars.blogspot.com https://ww.signiant.com https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
  • 6. The Industry Problem. Chart labels: Growth in Data (mostly Unstructured) & Analytics; Average Growth in Traditional DW Data; Average IT Budget
  • 8. What is Amazon?
  • 9. Our vision is to be Earth’s most customer-centric company; to build a place where people can come to find and discover anything they might want to buy online.
  • 12. Amazon Data Warehouse
  • 13. The Amazon Enterprise Data Warehouse
      The Good! Helps to Run the Amazon Business
        • Most Comprehensive Set of Cleansed and Curated Business Data
        • Feeds Many Downstream Systems and Processes
        • Batch Processing, Reporting and Ad Hoc
        • 500k+ Data Loads/Transformations Each Day
        • 200k+ Queries/Extracts Each Day
        • 20k+ Active Tables
        • 10B++ Rows Loaded Daily
      Our Data is Big!
        • Core Data Set: 5+ PB of Compressed Data (primarily limited by Legacy Technology)
        • Total Storage (Multiple Systems): 35+ PB compressed
        • Quote from Executive at Legacy DW Vendor: ~1000x Larger than any other DW Customer (from that Vendor)
      Significant and Increasing Use of Redshift and EMR
        • 1000’s of Redshift and EMR Systems, ranging in size from Individual Contributor (Project Based) to Running a Multi-Billion Dollar Business inside Amazon
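To put the daily volumes on slide 13 into per-second terms, here is a rough back-of-the-envelope sketch. It assumes the work is spread evenly across 24 hours (real workloads are bursty) and treats the slide's "+" figures as lower bounds; the function and dictionary names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_count: float) -> float:
    """Average events per second, assuming an even spread across the day."""
    return daily_count / SECONDS_PER_DAY


# Lower-bound daily figures quoted on the slide.
DAILY_FIGURES = {
    "rows loaded": 10_000_000_000,
    "data loads/transformations": 500_000,
    "queries/extracts": 200_000,
}

if __name__ == "__main__":
    for name, count in DAILY_FIGURES.items():
        print(f"{name}: at least {per_second(count):,.1f} per second on average")
```

Under these assumptions, the rows-loaded figure works out to more than 100,000 rows per second, sustained around the clock.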
  • 14. Example of an Amazon DW “Customer” Team
        • Who are we?
            • Analytics on the “Marketplace”
            • Analytics Spokes: Pricing, B2B, Seller Support, Lending …
        • Business Scale:
            • 235MM monthly CPU Minutes on Legacy ODW
            • 2K upstream tables
        • Users:
            • Supports 170 teams
            • 1000 users with 9527 profiles (Parameterized Queries)
            • 20K unique job runs per month
            • 2800 (800 TB) datasets
        • BI Tool Users:
            • 3000+ Users, 650 non-tech
            • 600+ “Dashboards”
            • 100k’s of queries each month
  • 16. “Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/ Image used with permissions under Creative Commons license 2.0, Attribution Generic License
  • 17. What is the Goal?
        • To Provide an analytic ecosystem that Scales with the Amazon Business
        • To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers
        • To Provide Choice and Options in New Analytic Technologies
            • Provide a SQL-based solution
            • Increasingly Focus on Enabling new analytic approaches including Machine Learning and Programmatic Data Analysis
            • Enable both “Bring Your Own Cluster” and “Bring Your Own Query” Approaches
  • 18. “Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/ Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
  • 19. Amazon EMR (running Hive, Pig, Spark, Presto, etc.); Amazon DynamoDB; Amazon Machine Learning; Amazon QuickSight; Amazon RDS; Amazon Elasticsearch Service; Amazon Redshift; Amazon Athena; Amazon SQS; Amazon Kinesis Analytics; Amazon Kinesis Firehose; Amazon S3; Amazon Kinesis; Open-source tools (e.g. for ML, data science); Commercial tools
  • 20. Moving Forward
        • AWS S3 / EDX: Separate Storage from Compute by leveraging a parallel file system as a global data exchange
        • Redshift: Preferred platform for SQL-based Analysis and traditional Data Warehouse Data. Focus is “Business Users”
        • EMR: Scalable “Do Everything” Platform. Enable Teams who have chosen EMR by providing Curated Data. Focus is “Programmatic Access”
  • 21. The Amazon “Data Lake”, Project Name “Andes”
      The Goal: “THE” Place for Data at Amazon
        • Source teams (Data Producers) put their Public Data there to give access to Analytic teams (Data Consumers) and to share private data within their team
        • EMR Can Directly Access the Data in Parallel from Andes
        • Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in Parallel with Spectrum
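The two Redshift access paths on slide 21 (load a copy, or query in place with Spectrum) can be sketched with the SQL each one typically involves. This is a minimal illustration, not Andes itself: Andes is internal to Amazon, so the sketch assumes the data sits in S3 as Parquet and is registered in an external data catalog. The helper names, table, bucket, and role ARN are all hypothetical, and the string interpolation is for illustration only (it is not injection-safe).

```python
def copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Path 1: load a full copy of the data into a local Redshift table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def external_schema_statement(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Path 2: expose catalog tables to Redshift Spectrum without copying data."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # hypothetical
    print(copy_statement("sales.orders", "s3://example-bucket/orders/", role))
    print(external_schema_statement("lake", "example_catalog_db", role))
```

Loading trades storage and sync effort for local query speed; the external schema leaves the data in the lake and reads it at query time.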
  • 22. “Datamarts”: An Architecture that Scales with the Business
        • Number of Teams using the DW: ~2300
        • Number of Tables Used per Team: Max 598, Min 1, Average 49
        • Ad-Hoc access (any data, any time) can be achieved via EMR, which can access the Data in Andes Directly
        • Redshift can load data into the Redshift file system, or it can use the Spectrum Feature to directly access the Data in Andes
        • Diagram label: Amazon Internal Team (132 Tables)
  • 23. Putting The Pieces Together: The Analytic Architecture of the Future. Diagram labels: Source Systems; The Data Lake “Andes”; Big Data Systems; Data Warehouses; “Bring Your Own Cluster” and “Bring Your Own Query”; Services and Users; PostgreSQL instance; Amazon Redshift (three instances); Amazon Kinesis; AWS Glue; Amazon QuickSight; Amazon Athena; Amazon Machine Learning
  • 24. The Battle for the Future
      The Data Lake becomes the common source for all data:
        • The DW becomes the compute engine for traditional structured data (Redshift)
        • EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases
        • Both become a form of a Dependent data mart with the data coming from the Data Lake
      Vs. AND
  • 26. Purchase Contract: seller, buyer
  • 27. Table Subscriptions - The Vision
  • 28. Subscription: producer, consumer (“Big Data Technologies” Team)
  • 30. Data Value Chain: COLLECT, STORE, DELIVER, ANALYZE, SUBSCRIBE, DISCOVER. Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by Dinosoft Labs
  • 31. Data Value Chain
      Producers only need to integrate their datasets once with the data lake
        • Simplified onboarding process
        • One-time integration
      Ingest from various source systems:
        • Relational databases, e.g., Amazon Aurora/RDS Postgres
        • Non-relational databases, e.g., Amazon DynamoDB
        • Streams, e.g., Amazon Kinesis
        • Flat files, e.g., files in Amazon S3
  • 32. Data Value Chain
      Secure and scalable data lake:
        • Highly durable S3-based storage
        • Scalable since it’s built on AWS technologies
        • Permissions are strictly enforced
      Data quality:
        • Certified with data quality checks
        • Schemas are validated
  • 33. Data Value Chain
      Company-wide data search index:
        • Consumers can quickly find what they’re looking for
        • Useful information about the datasets is shown
      Clear communication:
        • Producers can communicate expectations around data quality and SLAs
        • Consumers can contact producers
  • 34. Data Value Chain
      Easy process to subscribe to data:
        • Find a dataset of interest
        • Click “Subscribe”
        • Choose the destination compute platform
      Rapidly populate data marts, for example:
        • Use AWS CloudFormation to provision a Redshift cluster
        • Use subscriptions to load datasets to the cluster
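As an illustration of the "provision a Redshift cluster with CloudFormation" step on slide 34, the sketch below builds a minimal template as a Python dictionary and prints it as JSON. It is an assumption-laden example rather than the template Amazon uses: the logical resource name `DataMartCluster`, the default database name, and the node type are placeholders, and a real deployment would add networking, encryption, and IAM settings.

```python
import json


def redshift_template(db_name: str = "datamart",
                      node_type: str = "dc2.large",
                      number_of_nodes: int = 2) -> dict:
    """Build a minimal CloudFormation template for one Redshift cluster."""
    properties = {
        "ClusterType": "multi-node" if number_of_nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": db_name,
        # Credentials come from stack parameters, never hard-coded values.
        "MasterUsername": {"Ref": "MasterUsername"},
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    if number_of_nodes > 1:
        # NumberOfNodes only applies to multi-node clusters.
        properties["NumberOfNodes"] = number_of_nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative data mart cluster (hypothetical example)",
        "Parameters": {
            "MasterUsername": {"Type": "String"},
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(redshift_template(), indent=2))
```

The printed JSON could be saved to a file and passed to CloudFormation; loading datasets through subscriptions is an Amazon-internal step and is not modeled here.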
  • 35. Data Value Chain
      Subscriptions mechanism:
        • Makes data available to the compute platform where it can be analyzed
        • Keeps the compute platform in sync with any data updates
        • Users can monitor the sync status of their subscriptions
      Synchronizations can be either:
        • Full data copy
        • Metadata-only sync
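The difference between the two synchronization modes on slide 35 can be pictured with a small toy model. Everything here is hypothetical: the slides do not describe how Andes subscriptions are implemented, so the class names, the integer version counter, and the step descriptions are assumptions made purely to illustrate "full data copy" versus "metadata-only sync" and the idea of a monitorable sync status.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SyncMode(Enum):
    FULL_COPY = "full_copy"          # data files are copied to the target
    METADATA_ONLY = "metadata_only"  # target only learns schema/locations


@dataclass
class Subscription:
    dataset: str
    target: str
    mode: SyncMode
    synced_version: int = 0  # last dataset version applied to the target


def plan_sync(sub: Subscription, latest_version: int) -> List[str]:
    """Return the steps needed to bring the target up to the latest version."""
    if sub.synced_version >= latest_version:
        return []  # already in sync, nothing to do
    steps = [f"update metadata for {sub.dataset} on {sub.target}"]
    if sub.mode is SyncMode.FULL_COPY:
        steps.append(f"copy data for {sub.dataset} v{latest_version} to {sub.target}")
    return steps


def apply_sync(sub: Subscription, latest_version: int) -> List[str]:
    """Plan the sync and record the new version once steps are 'applied'."""
    steps = plan_sync(sub, latest_version)
    if steps:
        sub.synced_version = latest_version
    return steps


def status(sub: Subscription, latest_version: int) -> str:
    """Sync status a user could monitor for a subscription."""
    return "in sync" if sub.synced_version >= latest_version else "behind"
```

In this model a metadata-only subscription suits engines that read the lake in place, while a full copy suits engines that need local data.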
  • 36. Data Value Chain
      Teams can use the right tools for the jobs, e.g.:
        • Amazon Redshift for interactive analytics or batch scheduled jobs
        • Amazon EMR for machine learning and data science
        • QuickSight for Business analytics and visualizations
      Compute resources can be scaled independently of the data lake in order to:
        • Process more/bigger/faster jobs
        • Optimize costs
        • Meet business SLAs
        • Scale to meet high peak workloads
  • 37. Data Value Chain: COLLECT, STORE, DELIVER, ANALYZE, SUBSCRIBE, DISCOVER. Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by Dinosoft Labs
  • 38. What is the Goal?
        • To Provide an analytic ecosystem that Scales with the Amazon Business
        • To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers
        • To Provide Choice and Options in New Analytic Technologies
            • Provide a SQL-based solution
            • Increasingly Focus on Enabling new analytic approaches including Machine Learning and Programmatic Data Analysis
            • Enable both “Bring Your Own Cluster” and “Bring Your Own Query” Approaches
  • 39. Andes - Current State (Amazon.com Big Data Technologies)
        • We have the data!
        • 20k+ Tables maintained in Andes. All Active Tables have been Sourced from the Enterprise Data Warehouse
        • Many teams are adding new data sets!
        • Have Onboarded 900+ Redshift and EMR systems to Subscriptions
        • 20,000+ tables being synchronized
        • Usage off the Legacy DW
        • Three years (2014-2016) to grow from 0 to 100k Jobs each Day
        • In 2017, has grown from 100k to 300k Jobs each Day
  • 40. “Big Data Marketplace”: Data producers (Amazon teams that want to share data with other teams)
  • 41. aws.amazon.com/activate: Everything and Anything Startups Need to Get Started on AWS