Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Anaheim AWS Summit

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Randy Ridgley
Solutions Architect, Amazon Web Services

Big data is defined many different ways
Volume, Velocity, Variety, Veracity, Value, Visualization, Variability

Data is changing → Analytics are adopting
Capture and store new data at PB-EB scale
Do new types of analytics in a cost-effective way
New types of analytics:
• Machine learning
• Big data processing
• Real-time analytics
• Full text search

AWS data lake helps address this
Quickly ingest and store any
type of data
Insights and security,
together …
Run the right tool for the right
job without manually copying
data around

Data lakes from AWS
Analytics
Machine learning
Real-time data movement
Traditional data movement
Data lake on AWS
Ingestion
Intelligence
Storage / catalog
Variety of ingestion tools
Decoupled analytics from storage / catalog

Managed ML service
Deep learning AMIs
Video and image recognition
Conversational interfaces
Deep-learning video camera
Natural language processing
Language translation
Speech recognition
Text-to-speech
Interactive analysis
Hadoop & Spark
Data warehousing
Full-text search
Real-time analytics
Dashboards & visualizations
Dedicated network connection
Secure appliances
Ruggedized shipping container
Database migration
Connect devices to AWS
Real-time data streams
Real-time video streams
Data lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services

Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake
on AWS

Many ways to transfer data into the data lake
Open and comprehensive
• Data movement from on-premises data centers
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Storage Gateway
AWS IoT Core
Data movement from
real-time sources
Data movement from your
data centers
Amazon S3
Amazon Glacier
AWS Glue
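
Choosing between a network connection and a shipped appliance often comes down to simple arithmetic. The sketch below estimates how long a dataset would take to move over a link. The 100 TB dataset, the 1 Gbps link, and the 80% sustained utilization are illustrative assumptions, not figures from the talk.

```python
def transfer_days(dataset_bytes: float, link_bits_per_sec: float,
                  utilization: float = 0.8) -> float:
    """Estimate the days needed to push a dataset over a network link."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = (dataset_bytes * 8) / (link_bits_per_sec * utilization)
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte

# Hypothetical scenario: 100 TB over a 1 Gbps link at 80% utilization.
days = transfer_days(100 * TB, 1e9)
print(f"{days:.1f} days")  # prints "11.6 days"
```

When the estimate runs to weeks or months, a secure appliance or shipping container is likely the more practical route.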

Amazon
Kinesis Data
Firehose
Real-time data movement and data lakes on AWS
AWS Glue
Data Catalog
Amazon
S3 data
Data lake
on AWS
Amazon
Kinesis Data
Streams
Data definition
Amazon Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
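
Whichever producer is used, records usually reach the delivery stream in batches. Below is a minimal sketch of client-side batching, assuming newline-delimited JSON events. The two limits are the `PutRecordBatch` limits as commonly documented (500 records and 4 MiB per request) and should be checked against current AWS documentation. The event fields are hypothetical.

```python
import json
from typing import Iterable, Iterator, List

# Assumed PutRecordBatch limits; verify against the current service docs.
MAX_RECORDS_PER_BATCH = 500
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024


def to_firehose_batches(events: Iterable[dict]) -> Iterator[List[dict]]:
    """Group events into lists shaped like the Records argument of PutRecordBatch."""
    batch, batch_bytes = [], 0
    for event in events:
        data = (json.dumps(event) + "\n").encode("utf-8")  # one JSON object per line
        too_many = len(batch) >= MAX_RECORDS_PER_BATCH
        too_big = batch_bytes + len(data) > MAX_BYTES_PER_BATCH
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data})
        batch_bytes += len(data)
    if batch:
        yield batch


# Hypothetical sensor events.
events = ({"sensor_id": i, "reading": i * 0.5} for i in range(1200))
batches = list(to_firehose_batches(events))
print([len(b) for b in batches])  # prints [500, 500, 200]
```

With boto3, each batch could then be passed as `Records` to `boto3.client("firehose").put_record_batch(...)` together with a `DeliveryStreamName` of your own.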

Amazon S3
Amazon Glacier
AWS Glue
Ingest data in its raw form …
Open and comprehensive
CSV
ORC
Grok
Avro
Parquet
JSON
Store the data in its raw form:
BEFORE
Transforming
Analyzing
Manipulating
Doing … anything … to it
This becomes your source of truth you can always go back to
…
Lifecycle policies allow you to shift it to warm and cold
storage
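
A lifecycle policy of this kind can be expressed as a small configuration document. The sketch below builds one in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The `raw/` prefix and the 30-day and 90-day thresholds are assumptions chosen for illustration.

```python
import json


def raw_zone_lifecycle(prefix: str, warm_after_days: int, cold_after_days: int) -> dict:
    """Build a lifecycle configuration that tiers raw objects to warm, then cold storage."""
    if cold_after_days <= warm_after_days:
        raise ValueError("cold tier must come after warm tier")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    }


# Hypothetical prefix and thresholds.
config = raw_zone_lifecycle("raw/", warm_after_days=30, cold_after_days=90)
print(json.dumps(config, indent=2))
```

With boto3, this dictionary could be passed as `LifecycleConfiguration` to `boto3.client("s3").put_bucket_lifecycle_configuration(...)` for a bucket of your own.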

Zillow uses AWS to build personalized website
• Ingests public data
(property records, recent sales)
into Amazon Kinesis
• Spark on Amazon EMR performs
ML to provide real-time home
estimates
• Store data into Amazon S3
data lake
• Use data to do personalization,
advertising optimization,
and recommendations
for website
Zestimate home valuation
estimates
Personalization, advertising
optimization, and recommendations
for website
Public property records, home
tax assessments,
sales transactions, images,
video, MLS-listing data, and
user-provided data
Amazon
Kinesis
Amazon
S3
Spark
on Amazon
EMR
Public
data

Demo:
Ingesting data into your data lake

Gartner:
“Through 2018, 80% of data lakes will not include effective
metadata management capabilities, making them inefficient.”
What data do I have?
Data lake
on AWS

Data Catalog
ETL job
authoring
Discover data and
extract schema
Auto-generates
customizable ETL code
in Python and Spark
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
AWS Glue – Data Catalog
Make data discoverable
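
To make "discovers data and stores schema" concrete, here is a toy sketch of schema inference over a CSV sample. It is not how AWS Glue crawlers are implemented. It only illustrates the idea of sampling values and choosing the narrowest type that fits. The column names and type labels are assumptions.

```python
import csv
import io


def infer_type(values) -> str:
    """Pick the narrowest type that fits every non-empty sample value."""
    def fits(cast) -> bool:
        try:
            for value in values:
                if value != "":
                    cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(csv_text: str) -> dict:
    """Map each CSV column name to an inferred type label."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("need at least one data row to infer a schema")
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


# Hypothetical sample of property records.
sample = "property_id,price,city\n101,350000.50,Anaheim\n102,410000,Irvine\n"
schema = infer_schema(sample)
print(schema)  # {'property_id': 'bigint', 'price': 'double', 'city': 'string'}
```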

IAM Role
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC connection
Object connection
Built-in classifiers
MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & JSONPath
AWS CloudTrail
BSON
Logs
(Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis,
and many others)
Delimited
(comma, pipe, tab, semicolon)
< ALWAYS GROWING …>
What can crawlers discover?
Create additional custom
classifiers
Amazon
DynamoDB
NoSQL connection

But I have my own data formats …?
− There is a custom classifier for that …
Row-based
GROK classifier
A grok pattern is a
named set of regular
expressions (regex)
that are used to match
data one line at a time.
XML
XML classifier
XML tag that defines a
table row in the XML
document.
JSON
JSON classifier
JSON path to the
object, array, or value
that defines a row of
the table being
created. Type the
name in either dot or
bracket JSON syntax
using operators
supported by AWS
Glue.
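
The "named set of regular expressions" idea can be sketched in a few lines. The example expands `%{PATTERN:field}` tokens into named regex groups and matches a log one line at a time. The four-entry pattern library and the log format are hypothetical, and real grok libraries ship many more patterns and features.

```python
import re

# Hypothetical mini pattern library.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
    "URIPATH": r"/\S*",
}


def grok_to_regex(grok: str) -> "re.Pattern[str]":
    """Expand %{PATTERN:field} tokens into named regex groups."""
    def expand(match):
        pattern_name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[pattern_name]})"

    return re.compile("^" + re.sub(r"%\{(\w+):(\w+)\}", expand, grok) + "$")


line_pattern = grok_to_regex("%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}")

log = "10.0.0.7 GET /listings/42 200\n10.0.0.9 POST /search 404"
# Each matching line becomes one table row; non-matching lines are skipped.
rows = [m.groupdict() for m in map(line_pattern.match, log.splitlines()) if m]
print(rows)
```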

Raw data stored in data lake:
Preparation:
Normalized
Partitioned
Compressed
Storage optimized
Extract – Transform – Load
Preparing raw data for consumption
Data lake
on AWS
Raw
ingestion
Curated
datasets
Data Catalog
ETL

Demo:
Let’s discover data in our data lake

Other ways of populating the catalog
Call the AWS Glue CreateTable API
Create table manually
Run Hive DDL statement
Apache Hive Metastore
AWS Glue ETL
AWS Glue Data Catalog
Import from Apache Hive Metastore

How do I drive value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data lake
on AWS

Different tools for different users … solving different problems
Business
reporting
Data scientists
Data engineer
IDE
Data
Catalog
Central
storage
Amazon
SageMaker
Machine Learning/Deep Learning

Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query instantly: Zero setup cost; just point to Amazon S3 and start querying
Pay per query: Pay only for queries run; save 30%–90% on per-query costs through compression
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy: Serverless: zero infrastructure, zero administration; integrated with Amazon QuickSight
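
Because billing follows the bytes scanned, compression translates directly into savings. A worked sketch follows, assuming a price of $5 per TB scanned and a 4:1 compression ratio. Both numbers are illustrative, so check current pricing and measure your own ratio.

```python
TB = 1024**4  # bytes per terabyte (binary)


def query_cost(bytes_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Cost of one query when billing is proportional to data scanned."""
    return bytes_scanned / TB * usd_per_tb


raw_bytes = 2 * TB                # hypothetical uncompressed scan
compressed_bytes = raw_bytes / 4  # hypothetical 4:1 compression

raw_cost = query_cost(raw_bytes)
compressed_cost = query_cost(compressed_bytes)
saving = 1 - compressed_cost / raw_cost

print(f"${raw_cost:.2f} -> ${compressed_cost:.2f} ({saving:.0%} saved)")
# prints "$10.00 -> $2.50 (75% saved)"
```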

Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release
Low cost: Flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%–80%
Use Amazon S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning

Hadoop / Spark Analytics on AWS
YARN (Hadoop ResourceManager)
NoSQL | Machine learning | Real-time | Interactive | Script | Batch
Data lake
on AWS
Amazon S3
Amazon EMR
Managed Hadoop / Spark
Object storage

Data warehouse …
Amazon Redshift data warehouse
Relational data
Gigabytes to petabytes scale
Reporting and analysis
Schema defined prior to data load
AWS
Glue ETL
On premises
Amazon QuickSight
Existing or new
BI tool
Redshift
COPY
Data lake
on AWS

A data lake is not an enterprise data warehouse (EDW)
Data lake | EDW
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured / semi-structured / unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction / Advanced Analytics + BI use cases | BI use cases
Data at low level of detail / granularity | Data at summary / aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source / tools for advanced analytics) | Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled | Explicitly sized environments, compute and storage scaled in linearly
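
The schema-on-read versus schema-on-write distinction can be illustrated with a toy example. In the sketch below, the warehouse-style loader validates every record against a predefined schema at load time, while the lake-style reader leaves raw lines untouched and applies structure only when a query runs. The schema and the records are hypothetical.

```python
import json

# Hypothetical predefined warehouse table schema.
WAREHOUSE_SCHEMA = {"property_id": int, "price": float}


def load_schema_on_write(records):
    """EDW style: validate against the predefined schema at load time; reject misfits."""
    loaded, rejected = [], []
    for rec in records:
        try:
            if set(rec) != set(WAREHOUSE_SCHEMA):
                raise ValueError("unexpected or missing columns")
            loaded.append({k: cast(rec[k]) for k, cast in WAREHOUSE_SCHEMA.items()})
        except (ValueError, TypeError):
            rejected.append(rec)
    return loaded, rejected


def query_schema_on_read(raw_lines, field, cast):
    """Data lake style: keep raw lines as stored; apply structure only when reading."""
    for line in raw_lines:
        rec = json.loads(line)
        if field in rec:
            yield cast(rec[field])


raw_lines = [
    '{"property_id": 1, "price": "350000"}',
    '{"property_id": 2, "price": "410000", "photo_url": "s3://example-bucket/2.jpg"}',
]

loaded, rejected = load_schema_on_write(json.loads(line) for line in raw_lines)
prices = list(query_schema_on_read(raw_lines, "price", float))
photos = list(query_schema_on_read(raw_lines, "photo_url", str))

print(len(loaded), len(rejected))  # prints "1 1"
print(prices, photos)
```

The record with the new `photo_url` field is rejected by the warehouse-style load until its schema changes, whereas the raw copy can be queried for that field immediately.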

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon S3
data lake
Amazon Redshift data
Redshift Spectrum query engine
• Exabyte-scale Amazon Redshift SQL queries against Amazon S3
• Join data across Amazon Redshift and Amazon S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned

Amazon Redshift
Spectrum
Query your data lake
Amazon
Redshift
JDBC / ODBC
...
1 2 3 4 N
Redshift Spectrum
Scale-out serverless compute
AWS Glue Data Catalog
COPY
commands
Hot data
Query directly
on data lake

Data lakes extend the traditional approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning

Demo:
Query the data lake with multiple
analytics engines

Machine learning on your data lake
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data lake
on AWS

ML in the hands of every developer
SERVICES
VISION: Amazon Rekognition, Amazon Rekognition Video
SPEECH: Amazon Polly, Amazon Transcribe
LANGUAGE: Amazon Translate, Amazon Comprehend
CHATBOT: Amazon Lex
PLATFORMS
Amazon SageMaker, Amazon ML, Spark & Amazon EMR, Amazon Mechanical Turk, AWS DeepLens
FRAMEWORKS
MXNet, TensorFlow, Caffe2 & Caffe, Gluon, Keras, CNTK, PyTorch
DEEP LEARNING AMI
INFRASTRUCTURE
GPU, CPU, IoT (AWS Greengrass), Mobile, FPGA, Serverless

Integration: The Data Architecture
Business Problem → ML problem framing → Data Collection → Data Integration → Data Preparation & Cleaning → Data Visualization & Analysis → Feature Engineering → Model Training & Parameter Tuning → Model Evaluation → Are Business Goals met?
Yes: Model Deployment → Monitoring & Debugging → Predictions
No: Data Augmentation, Feature Augmentation
Retraining
Build the data platform:
• Amazon S3
• AWS Glue
• Amazon Athena
• Amazon EMR
• Amazon Redshift Spectrum

Amazon SageMaker
1. Notebook Instances
2. Algorithms
3. ML Training Service
4. ML Hosting Service

• Set up and manage Model
Inference Clusters
• Manage and Scale Model
Inference APIs
• Monitor and Debug Model
Predictions
• Model versioning and
performance tracking
• Automate New Model
version promotion to
production (A/B testing)
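
A/B promotion can be sketched as weighted routing between two model versions plus a promotion rule. This is a toy illustration and not the SageMaker API. The variant names, the 90/10 split, the error metrics, and the `min_gain` threshold are all assumptions.

```python
import random
from collections import Counter


def pick_variant(weights: dict, rng: random.Random) -> str:
    """Choose a model variant with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def should_promote(current_error: float, candidate_error: float,
                   min_gain: float = 0.01) -> bool:
    """Promote the candidate only if it beats the current model by a clear margin."""
    return current_error - candidate_error >= min_gain


# Hypothetical variants: 90% of traffic to the current model, 10% to the candidate.
weights = {"model-v1": 0.9, "model-v2": 0.1}
rng = random.Random(42)  # seeded so the simulation is repeatable
traffic = Counter(pick_variant(weights, rng) for _ in range(10_000))
print(traffic)

# Hypothetical error rates observed for each variant during the test.
print(should_promote(current_error=0.120, candidate_error=0.095))  # prints True
```

Keeping the candidate on a small share of traffic limits the impact of a bad model version while enough predictions accumulate to compare the two.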

DigitalGlobe – Amazon SageMaker
By using Amazon SageMaker,
DigitalGlobe’s cache rate
improved by more than a factor
of two, often being around 83%
and sometimes trending to 90%
cache hit. This allowed them to
also cut their cloud storage cost
in half by better utilizing their
Amazon S3-optimized cache
and retrieving less from their
100+ PB archive.

Agility and innovation are key
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data lake
on AWS

Core tenets
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Future-proof your analytics
• Choose the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management

Submit session feedback
1. Tap the Schedule icon.
2. Select the session you
attended.
3. Tap Session Evaluation to
submit your feedback.

Thank you!
