SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Randy Ridgley
Solutions Architect, Amazon Web Services
BDA305
Build Data Lakes & Analytics on AWS:
Patterns & Best Practices
VisualizationVariability
Big data is defined many different ways
Volume Velocity Variety Veracity Value
Data is changing → Analytics are adopting
Capture and store
new data at PB-EB scale
Do new type of analytics in
a cost-effective way
• Machine learning
• Big data processing
• Real-time analytics
• Full text search
New types of
analytics
AWS data lake helps address this
Quickly ingest and store any
type of data
Insights and security,
together …
Run the right tool for the right
job without manually copying
data around
Data lakes from AWS
Analytics
Machine
learning
Real-time dataTraditional
Data lake
on AWS
movementdata movement
Ingestion
Intelligence
Storage
catalog
Variety of
ingestion tools
Decoupled
analytics from
storage / catalog
Managed ML service
Deep learning AMIs
Video and image recognition
Conversational interfaces
Deep-learning video camera
Natural language processing
Language translation
Speech recognition
Text-to-speech
Interactive analysis
Hadoop & Spark
Data warehousing
Full-text search
Real-time analytics
Dashboards & visualizations
Dedicated network connection
Secure appliances
Ruggedized shipping container
Database migration
Connect devices to AWS
Real-time data streams
Real-time video streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services
Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Many ways to transfer data into the data lake
Open and comprehensive
• Data movement from on-premises data centers
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS Storage Gateway
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data movement from
real-time sources
Data movement from your
data centers
Amazon S3
Amazon Glacier
AWS Glue
Amazon
Kinesis Data
Firehose
Real-time data movement and data lakes on AWS
AWS Glue
Data Catalog
Amazon
S3 data
Data lake
on AWS
Amazon
Kinesis Data
Streams
Data definitionAmazon Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
Amazon S3
Amazon Glacier
AWS Glue
Ingest data in its raw form …
Open and comprehensive
CSV
ORC
Grok
Avro
Parquet
JSON
Store the data in its raw form:
BEFORE
Transforming
Analyzing
Manipulating
Doing … anything … to it
This becomes your source of truth you can always go back to
…
Lifecycle policies allow you to shift it to warm and cold
storage
Zillow uses AWS to build personalized website
• Ingests data public data
(property records, recent sales)
into Amazon Kinesis
• Spark on Amazon EMR performs
ML to provide real-time home
estimates
• Store data into Amazon S3
data lake
• Use data to do personalization,
advertising optimization,
and recommendations
for website
Zestimate home valuation
estimates
Personalization, advertising
optimization, and recommendations
for website
Public property records, home
tax assessments,
sales transactions, images,
video, MLS-listing data, and
user-provided data
Amazon
Kinesis
Amazon
S3
Spark
on Amazon
EMR
Public
data
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo:
Ingesting data into your data lake
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Gartner:
“Through 2018, 80% of data lakes will not include effective
metadata management capabilities, making them inefficient."
What data do I have?
Data lake
on AWS
Storage | Archival Storage | Data Catalog
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Catalog
ETL job
authoring
Discover data and
extract schema
Auto-generates
customizable ETL code
in Python and Spark
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
AWS Glue – Data Catalog
M a ke d a ta d i s c o v e ra b l e
IAM Role
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC connection
Object connection
Built-in classifiers
MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & JSONPath
AWS CloudTrail
BSON
Logs
(Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis,
and many others)
Delimited
(comma, pipe, tab, semicolon)
< ALWAYS GROWING …>
What can crawlers discover?
Create additional custom
classifiers
Amazon
DynamoDB
NoSQL connection
But I have my own data formats …?
− There is a custom classifier for that …
Row-based
GROK classifier
A grok pattern is a
named set of regular
expressions (regex)
that are used to match
data one line at a time.
XML
XML classifier
XML tag that defines a
table row in the XML
document.
JSON
JSON classifier
JSON path to the
object, array, or value
that defines a row of
the table being
created. Type the
name in either dot or
bracket JSON syntax
using operators
supported by AWS
Glue.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Raw data stored in data lake:
Preparation:
No rmalized
Partitio ned
Co mpressed
S to rage o ptimized
Extract – Transform – Load
Preparing raw data for consumption
Data lake
on AWS
Raw
ingestion
Curated
datasets
Data Catalog
ETL
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo:
Let’s discover data in our data lake
Other ways of populating the catalog
Call the AWS Glue CreateTable API
Create table manually Run Hive DDL statement
Apache Hive
Metastore
AWS GLUE ETL AWS GLUE
DATA CATALOG
Import from Apache Hive Metastore
How do I drive value?
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Different tools for different users … solving different problems
Business
reporting
Data scientists
Data engineer
IDE
Data
Catalog
Central
storage
Amazon
SageMaker
Machine Learning/Deep Learning
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
$ SQL
Query instantly
Zero setup cost; just
point to Amazon S3
and start querying
Pay per query
Pay only for queries run;
save 30%–90% on per-
query costs through
compression
Open
ANSI SQL interface,
JDBC/ODBC drivers, multiple
formats, compression types,
and complex joins and data
types
Easy
Serverless: Zero
infrastructure, zero
administration
Integrated with Amazon
QuickSight
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
$
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Low cost
Flexible billing with per-
second billing, Amazon
EC2 Spot, Reserved
Instances, and Auto
Scaling to reduce costs
50%-80%
Use Amazon S3 storage
Process data directly in
the Amazon S3 data lake
securely with high
performance using the
EMRFS connector
Easy
Launch fully managed
Hadoop & Spark in minutes;
no cluster setup, node
provisioning, cluster tuning
Data Lake
100110000100101011100
1010101110010101000
00111100101100101
010001100001
Hadoop / Spark Analytics on AWS
YARN (Hadoop ResourceManager)
NoSQLMachine
learning
Real-timeInteractiveScriptBatch
Data lake
on AWS
Amazon S3
Amazon EMR
Managed Hadoop / Spark
Object storage
Data warehouse …
Amazon Redshift data warehouse
Relational data
Gigabytes to petabytes scale
Reporting and analysis
Schema defined prior to data load
AWS
Glue ETL
On premises
Amazon QuickSight
Existing or new
BI tool
Redshift
COPY
Data lake
on AWS
Complementary to EDW (not replacement) Data lake can be source for EDW
Schema on read (no predefined schemas) Schema on write (predefined schemas)
Structured/semi-structured / unstructured data Structured data only
Fast ingestion of new data/content Time consuming to introduce new content
Data Science + Prediction / Advanced Analytics + BI
use cases
BI use cases
Data at low level of detail / granularity Data at summary / aggregated level of detail
Loosely defined SLAs Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced
analytics)
Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled
Explicitly sized environments, compute and storage
scaled in linearly
A data lake is not an enterprise data warehouse (EDW)
Data lake EDW
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon S3
data lake
Amazon
Redshift data
Redshift Spectrum
query engine • Exabyte Amazon Redshift SQL queries against Amazon S3
• Join data across Amazon Redshift and Amazon S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
Amazon Redshift
Spectrum
Q u er y you r d ata lake
Amazon
Redshift
JDBC / ODBC
...
1 2 3 4 N
Redshift Spectrum
Scale-out serverless compute
AWS Glue Data Catalog
COPY
commands
Hot data
Query directly
on data lake
Data lakes extend the traditional approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo:
Query the data lake with multiple
analytics engines
Machine learning on your data lake
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Amazon Polly
Amazon
Transcribe
Amazon Rekognition
Amazon Rekognition
Video
Amazon Translate
Amazon
Comprehend
Amazon Lex
VISION SPEECH LANGUAGE CHATBOT
SERVICES
GPUINFRASTRUCTURE CPU
IoT
(AWS Greengrass)
Mobile FPGAServerless
MXNET
FRAMEWORKS
TensorFlow
Caffe2
& Caffe
Gluon KerasCNTKPyTorch
DEEP LEARNING AMI
Amazon SageMakerPLATFORMS Amazon ML Spark &
Amazon EMR
Amazon
Mechanical Turk
AWS DeepLens
ML in the hands of every developer
Data Visualization &
Analysis
Business Problem –
ML problem framing Data Collection
Data Integration
Data Preparation &
Cleaning
Feature Engineering
Model Training &
Parameter Tuning
Model Evaluation
Are Business
Goals met?
Model Deployment
Monitoring &
Debugging
– Predictions
YesNo
DataAugmentation
Feature
Augmentation
Integration: The Data Architecture
Retraining
Build the data platform:
• Amazon S3
• AWS Glue
• Amazon Athena
• Amazon EMR
• Amazon Redshift Spectrum
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
1 2 3 4
I I I I
Notebook Instances Algorithms ML Training Service ML Hosting Service
Data Visualization &
Analysis
Business Problem –
ML problem framing Data Collection
Data Integration
Data Preparation &
Cleaning
Feature Engineering
Model Training &
Parameter Tuning
Model Evaluation
Are Business
Goals met?
Model Deployment
Monitoring &
Debugging
– Predictions
YesNo
DataAugmentation
Feature
Augmentation
Integration: The Data Architecture
Retraining
• Setup and manage Model
Inference Clusters
• Manage and Scale Model
Inference APIs
• Monitor and Debug Model
Predictions
• Models versioning and
performance tracking
• Automate New Model
version promotion to
production (A/B testing)
DigitalGlobe – Amazon SageMaker
By using Amazon SageMaker,
DigitalGlobe’s cache rate
improved by more than a factor
of two, often being around 83%
and sometimes trending to 90%
cache hit. This allowed them to
also cut their cloud storage cost
in half by better utilizing their
Amazon S3-optimized cache
and retrieving less from their
100+ PB archive.
Agility and innovation are key
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Core tenants
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Future-proof your analytics
• Choose the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Submit session feedback
1. Tap the Schedule icon.
2. Select the session you
attended.
3. Tap Session Evaluation to
submit your feedback.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thank you!

Más contenido relacionado

La actualidad más candente

Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSightAmazon Web Services
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsAmazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsAmazon Web Services
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudAmazon Web Services
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Amazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Amazon Web Services
 
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...Amazon Web Services
 

La actualidad más candente (20)

Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS Analytics
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
 
AWS & Database Analytics
AWS & Database AnalyticsAWS & Database Analytics
AWS & Database Analytics
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
 
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
 

Similar a Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Anaheim AWS Summit

Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...AWS Riyadh User Group
 
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
Big Data Meets AI - Driving Insights and Adding Intelligence to Your SolutionsAmazon Web Services
 
Serverless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsServerless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsKristana Kane
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 

Similar a Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Anaheim AWS Summit (20)

Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the Cloud
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWSConstruindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
 
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 
Serverless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsServerless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data Analytics
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Anaheim AWS Summit

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Randy Ridgley Solutions Architect, Amazon Web Services BDA305 Build Data Lakes & Analytics on AWS: Patterns & Best Practices
  • 2. VisualizationVariability Big data is defined many different ways Volume Velocity Variety Veracity Value
  • 3. Data is changing → Analytics are adopting Capture and store new data at PB-EB scale Do new type of analytics in a cost-effective way • Machine learning • Big data processing • Real-time analytics • Full text search New types of analytics
  • 4. AWS data lake helps address this Quickly ingest and store any type of data Insights and security, together … Run the right tool for the right job without manually copying data around
  • 5. Data lakes from AWS Analytics Machine learning Real-time dataTraditional Data lake on AWS movementdata movement Ingestion Intelligence Storage catalog Variety of ingestion tools Decoupled analytics from storage / catalog
  • 6. Managed ML service Deep learning AMIs Video and image recognition Conversational interfaces Deep-learning video camera Natural language processing Language translation Speech recognition Text-to-speech Interactive analysis Hadoop & Spark Data warehousing Full-text search Real-time analytics Dashboards & visualizations Dedicated network connection Secure appliances Ruggedized shipping container Database migration Connect devices to AWS Real-time data streams Real-time video streams Data lake on AWS Storage | Archival Storage | Data Catalog AnalyticsMachine learning Real-time dataOn-premises movementdata movement Data lakes, analytics, and IoT portfolio from AWS Broadest, deepest set of analytic services
  • 7. Data lakes, analytics, and IoT portfolio from AWS Broadest, deepest set of analytic services Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Service Amazon Kinesis Amazon QuickSight AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Data lake on AWS Storage | Archival Storage | Data Catalog AnalyticsMachine learning Real-time dataOn-premises movementdata movement
  • 8. Many ways to transfer data into the data lake Open and comprehensive • Data movement from on-premises data centers • Dedicated network connection • Secure appliances • Ruggedized shipping container • Database migration • Gateway that lets applications write to the cloud • Data movement from real-time sources • Connect devices to AWS • Real-time data streams • Real-time video streams AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS Storage Gateway AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Data movement from real-time sources Data movement from your data centers Amazon S3 Amazon Glacier AWS Glue
  • 9. Amazon Kinesis Data Firehose Real-time data movement and data lakes on AWS AWS Glue Data Catalog Amazon S3 data Data lake on AWS Amazon Kinesis Data Streams Data definitionAmazon Kinesis Agent Apache Kafka AWS SDK LOG4J Flume Fluentd AWS Mobile SDK Kinesis Producer Library
  • 10. Amazon S3 Amazon Glacier AWS Glue Ingest data in its raw form … Open and comprehensive CSV ORC Grok Avro Parquet JSON Store the data in its raw form: BEFORE Transforming Analyzing Manipulating Doing … anything … to it This becomes your source of truth you can always go back to … Lifecycle policies allow you to shift it to warm and cold storage
  • 11. Zillow uses AWS to build personalized website • Ingests data public data (property records, recent sales) into Amazon Kinesis • Spark on Amazon EMR performs ML to provide real-time home estimates • Store data into Amazon S3 data lake • Use data to do personalization, advertising optimization, and recommendations for website Zestimate home valuation estimates Personalization, advertising optimization, and recommendations for website Public property records, home tax assessments, sales transactions, images, video, MLS-listing data, and user-provided data Amazon Kinesis Amazon S3 Spark on Amazon EMR Public data
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo: Ingesting data into your data lake
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient." What data do I have? Data lake on AWS Storage | Archival Storage | Data Catalog
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Catalog ETL job authoring Discover data and extract schema Auto-generates customizable ETL code in Python and Spark • Automatically discovers data and stores schema • Catalog makes data searchable, and available for ETL • Catalog contains table and job definitions • Computes statistics to make queries efficient AWS Glue – Data Catalog M a ke d a ta d i s c o v e ra b l e
  • 15. IAM Role AWS Glue Crawler Databases Amazon Redshift Amazon S3 JDBC connection Object connection Built-in classifiers MySQL MariaDB PostgreSQL Amazon Aurora Oracle Amazon Redshift Avro Parquet ORC XML JSON & JSONPath AWS CloudTrail BSON Logs (Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis, and many others) Delimited (comma, pipe, tab, semicolon) < ALWAYS GROWING …> What can crawlers discover? Create additional custom classifiers Amazon DynamoDB NoSQL connection
  • 16. But I have my own data formats …? − There is a custom classifier for that … Row-based GROK classifier A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. XML XML classifier XML tag that defines a table row in the XML document. JSON JSON classifier JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using operators supported by AWS Glue.
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Raw data stored in data lake: Preparation: No rmalized Partitio ned Co mpressed S to rage o ptimized Extract – Transform – Load Preparing raw data for consumption Data lake on AWS Raw ingestion Curated datasets Data Catalog ETL
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo: Let’s discover data in our data lake
  • 19. Other ways of populating the catalog Call the AWS Glue CreateTable API Create table manually Run Hive DDL statement Apache Hive Metastore AWS GLUE ETL AWS GLUE DATA CATALOG Import from Apache Hive Metastore
  • 20. How do I drive value? Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Service Amazon Kinesis Amazon QuickSight AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Data lake on AWS Storage | Archival Storage | Data Catalog AnalyticsMachine learning Real-time dataOn-premises movementdata movement
  • 21. Different tools for different users … solving different problems Business reporting Data scientists Data engineer IDE Data Catalog Central storage Amazon SageMaker Machine Learning/Deep Learning
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena – interactive analysis Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Ability to run SQL queries on data archived in Amazon Glacier (coming soon) $ SQL Query instantly Zero setup cost; just point to Amazon S3 and start querying Pay per query Pay only for queries run; save 30%–90% on per- query costs through compression Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: Zero infrastructure, zero administration Integrated with Amazon QuickSight
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR – big data processing Analytics and ML at scale 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more Enterprise-grade security $ Latest versions Updated with the latest open source frameworks within 30 days of release Low cost Flexible billing with per- second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80% Use Amazon S3 storage Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Data Lake 100110000100101011100 1010101110010101000 00111100101100101 010001100001
  • 24. Hadoop / Spark Analytics on AWS YARN (Hadoop ResourceManager) NoSQLMachine learning Real-timeInteractiveScriptBatch Data lake on AWS Amazon S3 Amazon EMR Managed Hadoop / Spark Object storage
  • 25. Data warehouse … Amazon Redshift data warehouse Relational data Gigabytes to petabytes scale Reporting and analysis Schema defined prior to data load AWS Glue ETL On premises Amazon QuickSight Existing or new BI tool Redshift COPY Data lake on AWS
  • 26. Complementary to EDW (not replacement) Data lake can be source for EDW Schema on read (no predefined schemas) Schema on write (predefined schemas) Structured/semi-structured / unstructured data Structured data only Fast ingestion of new data/content Time consuming to introduce new content Data Science + Prediction / Advanced Analytics + BI use cases BI use cases Data at low level of detail / granularity Data at summary / aggregated level of detail Loosely defined SLAs Tight SLAs (production schedules) Flexibility in tools (open source/tools for advanced analytics) Limited flexibility in tools (SQL only) Elastic storage and compute capacity – decoupled Explicitly sized environments, compute and storage scaled in linearly A data lake is not an enterprise data warehouse (EDW) Data lake EDW
  • 27. Amazon Redshift Spectrum Extend the data warehouse to exabytes of data in Amazon S3 data lake Amazon S3 data lake Amazon Redshift data Redshift Spectrum query engine • Exabyte Amazon Redshift SQL queries against Amazon S3 • Join data across Amazon Redshift and Amazon S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Grok, Avro, & Parquet data formats • Pay only for the amount of data scanned
  • 28. Amazon Redshift Spectrum Q u er y you r d ata lake Amazon Redshift JDBC / ODBC ... 1 2 3 4 N Redshift Spectrum Scale-out serverless compute AWS Glue Data Catalog COPY commands Hot data Query directly on data lake
  • 29. Data lakes extend the traditional approach Data warehouse Business intelligence OLTP ERP CRM LOB • Relational and nonrelational data • TBs–EBs scale • Diverse analytical engines • Low-cost storage & analytics Devices Web Sensors Social Data lake Big data processing, real-time, machine learning
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo: Query the data lake with multiple analytics engines
  • 31. Machine learning on your data lake Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Service Amazon Kinesis Amazon QuickSight AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Data lake on AWS Storage | Archival Storage | Data Catalog AnalyticsMachine learning Real-time dataOn-premises movementdata movement
  • 32. Amazon Polly Amazon Transcribe Amazon Rekognition Amazon Rekognition Video Amazon Translate Amazon Comprehend Amazon Lex VISION SPEECH LANGUAGE CHATBOT SERVICES GPUINFRASTRUCTURE CPU IoT (AWS Greengrass) Mobile FPGAServerless MXNET FRAMEWORKS TensorFlow Caffe2 & Caffe Gluon KerasCNTKPyTorch DEEP LEARNING AMI Amazon SageMakerPLATFORMS Amazon ML Spark & Amazon EMR Amazon Mechanical Turk AWS DeepLens ML in the hands of every developer
  • 33. Data Visualization & Analysis Business Problem – ML problem framing Data Collection Data Integration Data Preparation & Cleaning Feature Engineering Model Training & Parameter Tuning Model Evaluation Are Business Goals met? Model Deployment Monitoring & Debugging – Predictions YesNo DataAugmentation Feature Augmentation Integration: The Data Architecture Retraining Build the data platform: • Amazon S3 • AWS Glue • Amazon Athena • Amazon EMR • Amazon Redshift Spectrum
  • 34. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker 1 2 3 4 I I I I Notebook Instances Algorithms ML Training Service ML Hosting Service
  • 35. Data Visualization & Analysis Business Problem – ML problem framing Data Collection Data Integration Data Preparation & Cleaning Feature Engineering Model Training & Parameter Tuning Model Evaluation Are Business Goals met? Model Deployment Monitoring & Debugging – Predictions YesNo DataAugmentation Feature Augmentation Integration: The Data Architecture Retraining • Setup and manage Model Inference Clusters • Manage and Scale Model Inference APIs • Monitor and Debug Model Predictions • Models versioning and performance tracking • Automate New Model version promotion to production (A/B testing)
  • 36. DigitalGlobe – Amazon SageMaker By using Amazon SageMaker, DigitalGlobe’s cache rate improved by more than a factor of two, often being around 83% and sometimes trending to 90% cache hit. This allowed them to also cut their cloud storage cost in half by better utilizing their Amazon S3-optimized cache and retrieving less from their 100+ PB archive.
  • 37. Agility and innovation are key Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Service Amazon Kinesis Amazon QuickSight AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Data lake on AWS Storage | Archival Storage | Data Catalog AnalyticsMachine learning Real-time dataOn-premises movementdata movement
  • 38. Core tenants • Data lakes and data warehouses complement each other • Loose coupling, but highly performant • Storage, analytics, metadata management, etc. • Future-proof your analytics • Choose the best tool for the job • Elasticity and multiple clusters for dedicated purposes • Replace capacity planning with a consumption model • Don’t forget metadata management
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Submit session feedback 1. Tap the Schedule icon. 2. Select the session you attended. 3. Tap Session Evaluation to submit your feedback.
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you!