© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
ABD329 - A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale
Jeff Carter, VP, Big Data Technologies, Amazon.com
Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.
The Battle for the Future
VS.
Big Data Example – The Smart Trashcan
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
The Industry Problem
[Chart comparing three trends: Growth in Data (mostly Unstructured) & Analytics; Average Growth in Traditional DW Data; Average IT Budget]
What is Amazon?
Our vision is to be earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.
Amazon Data Warehouse
The Amazon Enterprise Data Warehouse
The Good!
Helps to Run the Amazon Business
• Most Comprehensive Set of Cleansed and Curated Business Data
• Feeds Many Downstream Systems and Processes
• Batch Processing, Reporting and Ad Hoc
• 500k+ Data Loads/Transformations Each Day
• 200k+ Queries/Extracts Each Day
• 20k+ Active Tables
• 10B++ Rows Loaded Daily
Our Data is Big!
• Core Data Set: 5+ PB of Compressed Data (primarily limited by Legacy Technology)
• Total Storage (Multiple Systems): 35+ PB Compressed
• Quote from Executive at Legacy DW Vendor:
  • ~1000x Larger than any other DW Customer (from that Vendor)
Significant and Increasing Use of Redshift and EMR
• 1000s of Redshift and EMR Systems, Ranging in Size from:
  • Individual Contributor, Project Based, to
  • Running a Multi-Billion Dollar Business inside Amazon
Example of an Amazon DW “Customer” Team
• Who are we?
  • Analytics on the “Marketplace”
  • Analytics Spokes: Pricing, B2B, Seller Support, Lending …
• Business Scale:
  • 235MM monthly CPU Minutes on Legacy ODW
  • 2K upstream tables
• Users:
  • Supports 170 teams
  • 1000 users with 9527 profiles (Parameterized Queries)
  • 20K unique job runs per month
  • 2800 datasets (800 TB)
• BI Tool Users:
  • 3000+ Users, 650 non-tech
  • 600+ “Dashboards”
  • 100k’s of queries each month
“Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License
What is the Goal?
To Provide an analytic ecosystem that Scales with the
Amazon Business
To Leverage AWS Technologies and to help Improve these
technologies for all Amazon Customers
To Provide Choice and Options in New Analytic Technologies
• Provide an SQL-based solution
• Increasingly focus on enabling new analytic approaches, including Machine Learning and Programmatic Data Analysis
• Enable both “Bring Your Own Cluster” and “Bring Your Own Query” approaches
“Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
Amazon EMR (running Hive, Pig, Spark, Presto, etc.)
Amazon DynamoDB
Amazon Machine Learning
Amazon QuickSight
Amazon RDS
Amazon Elasticsearch Service
Amazon Redshift
Amazon Athena
Amazon SQS
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon S3
Amazon Kinesis
Open-source tools (e.g. for ML, data science)
Commercial tools
Moving Forward - AWS
• S3 / EDX - Separate Storage from Compute by leveraging a parallel file system as a global data exchange
• Redshift - Preferred platform for SQL-based Analysis and traditional Data Warehouse Data
  • Focus is “Business Users”
• EMR - Scalable “Do Everything” Platform - Enable Teams who have chosen EMR by providing Curated Data
  • Focus is “Programmatic Access”
The Amazon “Data Lake” – Project Name “Andes”
The Goal: “THE” Place for Data at Amazon
• Source teams (Data Producers) put their Public Data there to give access to Analytic teams (Data Consumers) and to share private data within their team
• EMR can directly access the data in parallel from Andes
• Redshift can load the data in parallel from Andes, or it can directly access the data in parallel with Spectrum
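The two Redshift access paths on this slide can be sketched as the SQL each one would issue. This is a minimal illustration and not Andes's actual mechanism, which the slides do not describe: the S3 prefix, table, schema, catalog database, and IAM role are all hypothetical, Parquet is assumed as the file format, and the statements are only built as strings, not executed.

```python
# Hypothetical role ARN; the slides do not say how Andes grants access.
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"


def copy_statement(table: str, s3_prefix: str) -> str:
    """Path 1: load the files into Redshift's own storage with COPY.

    COPY reads the files under the prefix in parallel across the cluster.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{IAM_ROLE}'\n"
        f"FORMAT AS PARQUET;"
    )


def spectrum_statement(schema: str, catalog_db: str) -> str:
    """Path 2: query the files in place through Redshift Spectrum.

    An external schema maps a data catalog database into Redshift; its
    tables are read from S3 at query time instead of being loaded.
    """
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{IAM_ROLE}';"
    )


if __name__ == "__main__":
    print(copy_statement("orders", "s3://example-data-lake/orders/"))
    print(spectrum_statement("lake", "example_catalog_db"))
```

The first path trades load time and cluster storage for local query speed; the second leaves the data in the lake and needs only metadata on the cluster.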
“Datamarts”
Number of Teams using the DW: ~2300
Number of Tables Used per Team:
• Max: 598
• Min: 1
• Average: 49
Ad-Hoc (any data, any time) can be achieved via:
• EMR can access the data in Andes directly
• Redshift can load data into the Redshift file system, or it can use the Spectrum feature to directly access the data in Andes
An Architecture that Scales with the Business
[Diagram: Amazon Internal Team (132 Tables)]
Putting The Pieces Together
The Analytic Architecture of the Future
[Architecture diagram. Labels: Source Systems; The Data Lake “Andes”; Big Data Systems; Data Warehouses; “Bring Your Own Cluster” and “Bring Your Own Query”; Services and Users. Icons: PostgreSQL instance, Amazon Redshift (three), Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Athena, Amazon Machine Learning]
The Battle for the Future
The Data Lake becomes the common source for all data:
• The DW (Redshift) becomes the compute engine for traditional structured data
• EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases
• Both become a form of dependent data mart, with the data coming from the Data Lake
Vs.
AND
[Diagram labels: Purchase Contract; seller; buyer]
Table Subscriptions - The Vision
[Diagram labels: Subscription; “Big Data Technologies” Team; producer; consumer]
Data Value Chain
Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by
Dinosoft Labs;
COLLECT · STORE · DISCOVER · SUBSCRIBE · DELIVER · ANALYZE
Producers only need to integrate their datasets once with the data lake
• Simplified onboarding process
• One-time integration
Ingest from various source systems:
• Relational databases – e.g., Amazon Aurora/RDS Postgres
• Non-relational databases – e.g., Amazon DynamoDB
• Streams – e.g., Amazon Kinesis
• Flat files – e.g., files in Amazon S3
Secure and scalable data lake:
• Highly durable S3-based storage
• Scalable since it’s built on AWS technologies
• Permissions are strictly enforced
Data quality:
• Certified with data quality checks
• Schemas are validated
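As a rough idea of what schema validation and a data quality check could look like at ingest time, here is a small sketch. The slides do not describe how Andes performs these checks, so the schema format, column names, and the null-fraction metric are all assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical schema: column name -> expected Python type.
ORDER_SCHEMA: dict[str, type] = {"order_id": int, "seller": str, "amount": float}


def validate_schema(row: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for column, expected in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(f"{column}: expected {expected.__name__}")
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


def null_fraction(rows: list[dict[str, Any]], column: str) -> float:
    """A simple data quality metric: the share of rows where the column is None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)
```

A dataset could then be marked as certified only if every sampled row validates and each metric stays under a threshold agreed with the producer.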
Company-wide data search index:
• Consumers can quickly find what they’re looking for
• Useful information about the datasets is shown
Clear communication:
• Producers can communicate expectations around data quality and SLAs
• Consumers can contact producers
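A dataset search index of this kind can be illustrated with a tiny in-memory inverted index. This is only a sketch of the idea: the catalog fields (description, owner contact, SLA) are assumed from the bullets above, and nothing here reflects how the real index is built.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    description: str
    owner_contact: str  # lets consumers reach the producer
    sla: str            # the producer's stated freshness expectation


def build_index(entries: list[DatasetEntry]) -> dict[str, set[str]]:
    """Map each lowercase word in a name or description to dataset names."""
    index: dict[str, set[str]] = defaultdict(set)
    for entry in entries:
        for word in f"{entry.name} {entry.description}".lower().split():
            index[word].add(entry.name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the datasets that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

Each hit can then be shown with its description, contact, and SLA, which covers the "useful information" and "clear communication" points on the slide.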
Easy process to subscribe to data:
• Find a dataset of interest
• Click “Subscribe”
• Choose the destination compute platform
Rapidly populate data marts, for example:
• Use AWS CloudFormation to provision a Redshift cluster
• Use subscriptions to load datasets to the cluster
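The provisioning step could look roughly like the CloudFormation template generated below. It uses the standard `AWS::Redshift::Cluster` resource type; the logical ID, master username, database name, and node type are placeholder assumptions, and a real deployment would also need networking and security settings that are omitted here.

```python
import json


def redshift_cluster_template(db_name: str, node_type: str, nodes: int) -> dict:
    """Build a minimal CloudFormation template for one Redshift cluster."""
    properties = {
        "ClusterType": "multi-node" if nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": db_name,
        "MasterUsername": "awsuser",  # placeholder name
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    if nodes > 1:
        # NumberOfNodes applies only to multi-node clusters.
        properties["NumberOfNodes"] = nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # NoEcho keeps the password out of console and API output.
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {  # hypothetical logical ID
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(redshift_cluster_template("datamart", "dc2.large", 2), indent=2))
```

Once the stack exists, the team's subscriptions would take care of loading the chosen datasets into the new cluster.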
Subscriptions mechanism:
• Makes data available to the compute platform where it can be analyzed
• Keeps the compute platform in sync with any data updates
• Lets users monitor the sync status of their subscriptions
Synchronizations can be either:
• Full data copy
• Metadata-only sync
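The subscription behaviour described above can be modelled in a few lines. This is a toy model under stated assumptions: the version counter, the status strings, and the idea that a full copy loads data while a metadata-only sync just updates table definitions are illustrative guesses, not details from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class SyncMode(Enum):
    FULL_COPY = "full data copy"
    METADATA_ONLY = "metadata-only sync"


@dataclass
class Subscription:
    dataset: str
    target: str              # hypothetical compute platform identifier
    mode: SyncMode
    synced_version: int = 0  # last dataset version applied to the target

    def status(self, latest_version: int) -> str:
        """What a user monitoring the subscription would see."""
        return "in sync" if self.synced_version >= latest_version else "behind"

    def sync(self, latest_version: int) -> str:
        """Bring the target up to date and describe the action taken."""
        if self.status(latest_version) == "in sync":
            return "nothing to do"
        if self.mode is SyncMode.FULL_COPY:
            action = f"copy {self.dataset} v{latest_version} into {self.target}"
        else:
            action = f"update table metadata for {self.dataset} on {self.target}"
        self.synced_version = latest_version
        return action
```

For example, a `FULL_COPY` subscription from a dataset to a Redshift data mart reports "behind" after the producer publishes a new version, and "in sync" again once `sync` has run.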
Teams can use the right tool for the job, e.g.:
• Amazon Redshift for interactive analytics or scheduled batch jobs
• Amazon EMR for machine learning and data science
• Amazon QuickSight for business analytics and visualizations
Compute resources can be scaled independently of the data lake in order to:
• Process more/bigger/faster jobs
• Optimize costs
• Meet business SLAs
• Scale to meet high peak workloads
Andes – Current State
• We have the data!
  • 20k+ tables maintained in Andes – all Active Tables have been sourced from the Enterprise Data Warehouse
  • Many teams are adding new data sets!
• Have onboarded 900+ Redshift and EMR systems to Subscriptions
  • 20,000+ tables being synchronized
• Usage off the Legacy DW
  • Three years (2014-2016) to grow from 0 to 100k jobs each day
  • In 2017, has grown from 100k to 300k jobs each day
Data producers
(Amazon teams that want to share
data with other teams)
"Big Data Marketplace"
THANK YOU!

A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale - ABD329 - re:Invent 2017

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT AB D 329 - A L ook Under the Hood – How Amazon.com Uses AWS Services for Analy tics at Massive S cale J e f f C a r t e r , V P , B i g D a t a T e c h n o l o g i e s , A m a z o n . c o m
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Traditional Data Warehousing Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Battle for the Future VS.
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Example – The Smart Trashcan
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://www.promptcloud.com https://john-popelaars.blogspot.com https://ww.signiant.com https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Industry Problem Growth in Data (mostly Unstructured) & Analytics Average Growth in Traditional DW Data Average IT Budget
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Amazon? 9
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Our vision is to be earth’s most customer-centric company; to build a place where people can come to find and discover anything they might want to buy online. 10
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 11
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 12
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Data Warehouse
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Amazon Enterprise Data Warehouse The Good! Helps to Run the Amazon Business • Most Comprehensive Set of Cleansed and Curated Business Data • Feeds Many Downstream Systems and Processes • Batch Processing, Reporting and Ad Hoc • 500k+ Data Loads/Transformations Each Day • 200k+ Queries/Extracts Each Day • 20k+ Active Tables • 10B++ Rows Loaded Daily Our Data is Big! • Core Data Set: 5+PB of Compressed Data (primarily limited by Legacy Technology) • Total Storage (Multiple Systems): 35+ PB compressed • Quote from Executive at Legacy DW Vendor: • ~1000x Larger than any other DW Customer (from that Vendor) Significant and Increasing Use of Redshift and EMR • 1000’s of Redshift and EMR Systems, Range in size from: • Individual Contributor - Project Based, to • Running Multi-Billion Dollar Business inside Amazon
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Who are we? • Analytics on the “Marketplace” • Analytics Spokes: Pricing, B2B, Seller Support, Lending … • Business Scale: • 235MM monthly CPU Minutes on Legacy ODW • 2K upstream tables • Users: • Supports 170 teams • 1000 users with 9527 profiles (Parameterized Queries) • 20K unique job runs per month • 2800 (800 TB) datasets • BI Tool Users: • 3000+ Users, 650 non-tech • 600+ ”Dashboards” • 100k’s of queries each month Example of an Amazon DW “Customer” Team
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Swiss Army” by by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/ Image used with permissions under Creative Commons license 2.0, Attribution Generic License
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is the Goal? To Provide an analytic ecosystem that Scales with the Amazon Business To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers To Provide Choice and Options in New Analytic Technologies • Provide an SQL based solution • Increasingly Focus on Enabling new analytic approaches including Machine Learning and Programmatic Data Analysis • Enable both “Bring Your Own Cluster” and “Bring your Own Query” Approaches
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/ Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
  • 20. Amazon EMR (running Hive, Pig, Spark, Presto, etc.), Amazon DynamoDB, Amazon Machine Learning, Amazon QuickSight, Amazon RDS, Amazon Elasticsearch Service, Amazon Redshift, Amazon Athena, Amazon SQS, Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon S3, Amazon Kinesis, Open-source tools (e.g. for ML, data science), Commercial tools
  • 21. Moving Forward - AWS S3 / EDX - Separate Storage from Compute by leveraging a parallel file system as a global data exchange • Redshift - Preferred platform for SQL-based Analysis and traditional Data Warehouse Data • Focus is “Business Users” • EMR – Scalable “Do Everything” Platform - Enable Teams who have chosen EMR by providing Curated Data • Focus is “Programmatic Access”
  • 22. The Amazon “Data Lake” – Project Name “Andes”. The Goal: “THE” Place for Data at Amazon • Source teams (Data Producers) put their Public Data there to give access to Analytic teams (Data Consumers) and to share private data within their team • EMR Can Directly Access the Data in Parallel from Andes • Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in Parallel with Spectrum
  • 23. An Architecture that Scales with the Business: “Datamarts”. Number of Teams using the DW: ~2300. Number of Tables Used per Team: • Max: 598 • Min: 1 • Average: 49. Ad-Hoc access (any data, any time) can be achieved via EMR, which can access the Data in Andes Directly. Redshift can load data into the Redshift file system, or it can use the Spectrum Feature to directly access the Data in Andes. Amazon Internal Team (132 Tables)
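The two Redshift access paths named on this slide (loading data into the cluster versus querying it in place with Spectrum) might look roughly like the sketch below. It is an illustration only, not Amazon's internal tooling: the table, schema, bucket, catalog database, and IAM role names are hypothetical, and it assumes the data lake files are Parquet and registered in an external data catalog, neither of which the talk states.

```python
def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY that loads Parquet files from S3 into a local table."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


def spectrum_schema_statement(schema: str, catalog_db: str, iam_role: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA so Spectrum can query the files in place."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_db}' "
        f"IAM_ROLE '{iam_role}';"
    )


if __name__ == "__main__":
    # Hypothetical identifiers, used for illustration only.
    role = "arn:aws:iam::123456789012:role/example-redshift-role"
    print(copy_statement("sales.orders", "s3://example-bucket/orders/", role))
    print(spectrum_schema_statement("lake", "example_catalog_db", role))
```

The first statement copies data into the cluster's own storage; the second leaves the data in S3 and only registers where to find it, which is the trade-off the slide describes.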
  • 24. Putting The Pieces Together: The Analytic Architecture of the Future. [Diagram labels: Source Systems; The Data Lake “Andes”; Big Data Systems; Data Warehouses; “Bring Your Own Cluster” and “Bring Your Own Query”; Services and Users. Icons: PostgreSQL instance, Amazon Redshift, Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Athena, Amazon Machine Learning]
  • 25. The Battle for the Future. The Data Lake becomes the common source for all data: • The DW becomes the compute engine for traditional structured data (Redshift) • EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases • Both become a form of a Dependent data mart with the data coming from the Data Lake. [Graphic labels: Vs., AND]
  • 27. Purchase Contract seller buyer
  • 28. Table Subscriptions - The Vision
  • 29. Subscription “Big Data Technologies” Team producer consumer
  • 31. Data Value Chain: COLLECT, STORE, DISCOVER, SUBSCRIBE, DELIVER, ANALYZE. Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by Dinosoft Labs;
  • 32. Data Value Chain: Producers only need to integrate their datasets once with the data lake • Simplified onboarding process • One-time integration. Ingest from various source systems: • Relational databases – e.g., Amazon Aurora/RDS Postgres • Non-relational databases – e.g., Amazon DynamoDB • Streams – e.g., Amazon Kinesis • Flat files – e.g., files in Amazon S3
  • 33. Data Value Chain: Secure and scalable data lake: • Highly durable S3-based storage • Scalable since it’s built on AWS technologies • Permissions are strictly enforced. Data quality: • Certified with data quality checks • Schemas are validated
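The talk does not say how schemas are validated. As a minimal sketch of the idea, assuming a schema is simply a mapping from column name to expected Python type (the `validate_record` helper and the column names are hypothetical), a check at ingestion time could look like this:

```python
from typing import Any

# Hypothetical schema representation: column name -> expected Python type.
Schema = dict[str, type]


def validate_record(record: dict[str, Any], schema: Schema) -> list[str]:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for column, expected in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            errors.append(f"{column}: expected {expected.__name__}, got {actual}")
    # Columns the producer sent that the schema does not declare.
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors


if __name__ == "__main__":
    orders_schema: Schema = {"order_id": int, "amount": float}
    print(validate_record({"order_id": 1, "amount": 9.5}, orders_schema))
    print(validate_record({"order_id": "1", "note": "x"}, orders_schema))
```

Returning every violation instead of failing on the first one lets a producer fix a bad dataset in a single pass.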
  • 34. Data Value Chain: Company-wide data search index • Consumers can quickly find what they’re looking for • Useful information about the datasets is shown. Clear communication: • Producers can communicate expectations around data quality and SLAs • Consumers can contact producers
  • 35. Data Value Chain: Easy process to subscribe to data: • Find a dataset of interest • Click “Subscribe” • Choose the destination compute platform. Rapidly populate data marts, for example: • Use AWS CloudFormation to provision a Redshift cluster • Use subscriptions to load datasets to the cluster
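To make the "provision a Redshift cluster with CloudFormation" step concrete, the sketch below builds a minimal template as a Python dictionary and prints it as JSON. The resource type `AWS::Redshift::Cluster` and its properties are standard CloudFormation; the logical name `DataMartCluster`, the node type, the node count, and the database name are assumptions chosen for illustration, not values from the talk.

```python
import json


def redshift_template(node_type: str = "dc2.large", nodes: int = 2,
                      db_name: str = "datamart") -> dict:
    """Build a minimal CloudFormation template for one Redshift cluster."""
    properties = {
        "ClusterType": "multi-node" if nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": db_name,
        # Credentials are passed as stack parameters, not hard-coded.
        "MasterUsername": {"Ref": "MasterUsername"},
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    if nodes > 1:
        # NumberOfNodes is only meaningful for multi-node clusters.
        properties["NumberOfNodes"] = nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "MasterUsername": {"Type": "String"},
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(redshift_template(), indent=2))
```

Once a stack created from such a template is up, the subscription step on the slide would populate the new cluster; that part is internal to Amazon and is not modeled here.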
  • 36. Data Value Chain: Subscriptions mechanism: • Makes data available to the compute platform where it can be analyzed • Keeps the compute platform in sync with any data updates • Users can monitor the sync status of their subscriptions. Synchronizations can be either: • Full data copy • Metadata-only sync
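The two synchronization modes can be modeled in a few lines. This is a toy sketch under stated assumptions, not the Andes implementation: the `Dataset`, `Subscription`, and `SyncMode` names, the version counter, and the recorded "actions" are all hypothetical stand-ins for whatever the real service does.

```python
from dataclasses import dataclass, field
from enum import Enum


class SyncMode(Enum):
    FULL_COPY = "full_copy"          # data is copied into the compute platform
    METADATA_ONLY = "metadata_only"  # only table definitions are registered


@dataclass
class Dataset:
    name: str
    version: int
    location: str


@dataclass
class Subscription:
    dataset: Dataset
    mode: SyncMode
    synced_version: int = 0
    actions: list = field(default_factory=list)

    def in_sync(self) -> bool:
        """Sync status a user could monitor."""
        return self.synced_version == self.dataset.version

    def sync(self) -> None:
        """Bring the compute platform up to date with the latest dataset version."""
        if self.in_sync():
            return
        if self.mode is SyncMode.FULL_COPY:
            self.actions.append(f"copy {self.dataset.location} into cluster storage")
        else:
            self.actions.append(f"register external table at {self.dataset.location}")
        self.synced_version = self.dataset.version


if __name__ == "__main__":
    orders = Dataset("orders", version=1, location="s3://example-bucket/orders/")
    sub = Subscription(orders, SyncMode.METADATA_ONLY)
    sub.sync()
    orders.version = 2  # the producer publishes an update
    print(sub.in_sync())  # the subscription is stale until the next sync
    sub.sync()
    print(sub.actions)
```

A full copy pays the transfer cost on every update but keeps queries local to the cluster, while a metadata-only sync is cheap to refresh because the data never moves.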
  • 37. Data Value Chain: Teams can use the right tools for the job, e.g.: • Amazon Redshift for interactive analytics or batch scheduled jobs • Amazon EMR for machine learning and data science • QuickSight for Business analytics and visualizations. Compute resources can be scaled independently of the data lake in order to: • Process more/bigger/faster jobs • Optimize costs • Meet business SLAs • Scale to meet high peak workloads
  • 39. What is the Goal? To Provide an analytic ecosystem that Scales with the Amazon Business. To Leverage AWS Technologies and to help Improve these technologies for all Amazon Customers. To Provide Choice and Options in New Analytic Technologies: • Provide an SQL-based solution • Increasingly Focus on Enabling new analytic approaches including Machine Learning and Programmatic Data Analysis • Enable both “Bring Your Own Cluster” and “Bring Your Own Query” Approaches
  • 40. Andes – Current State • We have the data! • 20k+ Tables maintained in Andes – All Active Tables have been Sourced from the Enterprise Data Warehouse • Many teams are adding new data sets! • Have Onboarded 900+ Redshift and EMR systems to Subscriptions • 20,000+ tables being synchronized • Usage off the Legacy DW • Three years (2014-2016) to grow from 0 to 100k Jobs each Day • In 2017, has grown from 100k to 300k Jobs each Day Amazon.com Big Data Technologies
  • 41. Data producers (Amazon teams that want to share data with other teams) “Big Data Marketplace”
  • 42. THANK YOU!