In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
3. Data is changing → Analytics are adapting
Capture and store new data at PB-EB scale
Do new types of analytics in a cost-effective way
New types of analytics:
• Machine learning
• Big data processing
• Real-time analytics
• Full text search
4. AWS data lake helps address this
• Quickly ingest and store any type of data
• Insights and security, together …
• Run the right tool for the right job without manually copying data around
5. Data lakes from AWS
Data lake on AWS
Analytics | Machine learning
Real-time data movement | Traditional data movement
Ingestion | Intelligence | Storage | Catalog
• Variety of ingestion tools
• Decoupled analytics from storage / catalog
6. Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services
Machine learning
• Managed ML service
• Deep learning AMIs
• Video and image recognition
• Conversational interfaces
• Deep-learning video camera
• Natural language processing
• Language translation
• Speech recognition
• Text-to-speech
Analytics
• Interactive analysis
• Hadoop & Spark
• Data warehousing
• Full-text search
• Real-time analytics
• Dashboards & visualizations
On-premises data movement
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
Real-time data movement
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
Data lake on AWS
Storage | Archival Storage | Data Catalog
7. Data lakes, analytics, and IoT portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data lake on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
8. Many ways to transfer data into the data lake
Open and comprehensive
• Data movement from on-premises data centers
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS Storage Gateway
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data movement from real-time sources
Data movement from your data centers
Amazon S3
Amazon Glacier
AWS Glue
9. Real-time data movement and data lakes on AWS
• Amazon Kinesis Agent
• Apache Kafka
• AWS SDK
• Log4j
• Flume
• Fluentd
• AWS Mobile SDK
• Kinesis Producer Library
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Data lake on AWS
Amazon S3 data
AWS Glue Data Catalog (data definition)
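As a rough illustration of what a producer on this slide does before handing data to Kinesis, the sketch below serializes events as newline-delimited JSON and groups them into batches. The event fields and the `to_batches` helper are hypothetical; the batch size of 500 mirrors the per-call record limit of Firehose `PutRecordBatch`, and the actual send is shown only as a comment so the block runs without AWS access.

```python
import json
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # per-call record limit for Firehose PutRecordBatch


def to_record(event: Dict) -> Dict[str, bytes]:
    """Serialize one event as a newline-terminated JSON record."""
    line = json.dumps(event, sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def to_batches(
    events: Iterable[Dict], size: int = MAX_RECORDS_PER_BATCH
) -> Iterator[List[Dict[str, bytes]]]:
    """Group serialized records into lists of at most `size` records."""
    batch: List[Dict[str, bytes]] = []
    for event in events:
        batch.append(to_record(event))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical sensor events, used only for illustration.
events = [{"device_id": i % 3, "reading": i} for i in range(1200)]
batches = list(to_batches(events))
batch_sizes = [len(b) for b in batches]

# Each batch could then be sent with a call such as (not executed here):
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="example-stream", Records=batch)
print(batch_sizes)
```

Newline-terminated records keep the objects that land in Amazon S3 line-oriented, which makes them easier for crawlers and query engines to split into rows.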
10. Ingest data in its raw form …
Open and comprehensive
Amazon S3 | Amazon Glacier | AWS Glue
Formats: CSV, ORC, Grok, Avro, Parquet, JSON
Store the data in its raw form BEFORE:
• Transforming
• Analyzing
• Manipulating
• Doing … anything … to it
This becomes your source of truth you can always go back to …
Lifecycle policies allow you to shift it to warm and cold storage
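The lifecycle idea above can be sketched as a configuration plus a small helper that reports which tier an object would have reached at a given age. The `raw/` prefix, the rule ID, and the 30/90-day thresholds are assumptions chosen for illustration; the dictionary follows the shape of an S3 lifecycle configuration (`Rules`, `Filter`, `Transitions`), and `storage_class_for` is a hypothetical helper, not an AWS API.

```python
from typing import Dict

# Assumed prefix and thresholds; adjust to your own retention needs.
lifecycle_configuration: Dict = {
    "Rules": [
        {
            "ID": "raw-zone-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 90, "StorageClass": "GLACIER"},      # cold
            ],
        }
    ]
}


def storage_class_for(
    key: str, age_days: int, config: Dict = lifecycle_configuration
) -> str:
    """Return the storage class an object would have reached at a given age."""
    current = "STANDARD"
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled":
            continue
        if not key.startswith(rule["Filter"]["Prefix"]):
            continue
        # Apply transitions in ascending order of age threshold.
        for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
            if age_days >= transition["Days"]:
                current = transition["StorageClass"]
    return current


for age in (10, 45, 400):
    print(age, storage_class_for("raw/2018/events.json", age))
```

A dictionary of this shape is what one would pass as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration` in boto3; the raw data stays addressable under the same key while its storage tier changes.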
11. Zillow uses AWS to build personalized website
• Ingests public data (property records, recent sales) into Amazon Kinesis
• Spark on Amazon EMR performs ML to provide real-time home estimates
• Stores data in an Amazon S3 data lake
• Uses data for personalization, advertising optimization, and recommendations for the website
Public data: public property records, home tax assessments, sales transactions, images, video, MLS-listing data, and user-provided data
Services: Amazon Kinesis | Spark on Amazon EMR | Amazon S3
Results: Zestimate home valuation estimates | Personalization, advertising optimization, and recommendations for website
15. What can crawlers discover?
AWS Glue Crawler (IAM Role)
• JDBC connection: Databases, Amazon Redshift
• Object connection: Amazon S3
• NoSQL connection: Amazon DynamoDB
Built-in classifiers
• MySQL
• MariaDB
• PostgreSQL
• Amazon Aurora
• Oracle
• Amazon Redshift
• Avro
• Parquet
• ORC
• XML
• JSON & JSONPath
• AWS CloudTrail
• BSON
• Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• < ALWAYS GROWING … >
Create additional custom classifiers
16. But I have my own data formats …?
− There is a custom classifier for that …
• Row-based: GROK classifier. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
• XML: XML classifier. XML tag that defines a table row in the XML document.
• JSON: JSON classifier. JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using operators supported by AWS Glue.
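To make the row-based classifier idea concrete, here is a minimal sketch of grok-style matching in plain Python: `%{NAME:field}` references are expanded into named regex groups and applied one line at a time. The four-entry pattern library, the log layout, and the helper names are assumptions for illustration; this is not the AWS Glue implementation, and real grok libraries ship many more named patterns.

```python
import re
from typing import Dict, Optional

# A tiny, illustrative library of named patterns (assumed, not exhaustive).
NAMED_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
    "URIPATH": r"/\S*",
}


def compile_grok(pattern: str) -> re.Pattern:
    """Expand %{NAME:field} references into named regex groups."""

    def expand(match: re.Match) -> str:
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{NAMED_PATTERNS[name]})"

    expanded = re.sub(r"%\{(\w+):(\w+)\}", expand, pattern)
    return re.compile("^" + expanded + "$")


def classify_line(line: str, grok: re.Pattern) -> Optional[Dict[str, str]]:
    """Return the extracted columns for one line, or None if it does not match."""
    m = grok.match(line)
    return m.groupdict() if m else None


# Hypothetical access-log layout: client IP, HTTP method, path, status code.
ACCESS_LOG = compile_grok("%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}")

row = classify_line("10.0.0.7 GET /index.html 200", ACCESS_LOG)
miss = classify_line("not a log line", ACCESS_LOG)
print(row)
print(miss)
```

Each named group becomes a column, which is how a pattern that matches one line at a time can yield a table schema.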
19. Other ways of populating the catalog
• Call the AWS Glue CreateTable API
• Create table manually
• Run Hive DDL statement
• Import from Apache Hive Metastore (Apache Hive Metastore | AWS Glue ETL | AWS Glue Data Catalog)
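Two of the options above, the CreateTable API and a Hive DDL statement, describe the same table metadata in different forms. The sketch below builds both for a hypothetical Parquet table; the database, table, column names, and S3 location are invented for illustration, and nothing is sent to AWS. The dictionary follows the `TableInput` shape accepted by the Glue `create_table` call.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, Hive type)


def glue_table_input(
    name: str, location: str, columns: List[Column], partitions: List[Column]
) -> Dict:
    """Build a TableInput-shaped dict for a Parquet external table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partitions],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def hive_ddl(
    database: str, name: str, location: str,
    columns: List[Column], partitions: List[Column],
) -> str:
    """Render the equivalent Hive DDL statement for the same table."""
    cols = ", ".join(f"`{n}` {t}" for n, t in columns)
    parts = ", ".join(f"`{n}` {t}" for n, t in partitions)
    return (
        f"CREATE EXTERNAL TABLE `{database}`.`{name}` ({cols}) "
        f"PARTITIONED BY ({parts}) STORED AS PARQUET LOCATION '{location}'"
    )


# Hypothetical table definition.
columns = [("listing_id", "bigint"), ("price", "double")]
partitions = [("dt", "string")]
location = "s3://example-bucket/curated/listings/"

table_input = glue_table_input("listings", location, columns, partitions)
ddl = hive_ddl("datalake", "listings", location, columns, partitions)

# With credentials configured, the API route would look like (not executed here):
#   boto3.client("glue").create_table(DatabaseName="datalake", TableInput=table_input)
print(ddl)
```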
20. How do I drive value?
21. Different tools for different users … solving different problems
• Business reporting
• Data scientists
• Data engineer
• IDE
• Amazon SageMaker
• Machine Learning/Deep Learning
Data Catalog
Central storage
24. Hadoop / Spark Analytics on AWS
Amazon EMR: Managed Hadoop / Spark
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
YARN (Hadoop ResourceManager)
Amazon S3: Object storage
Data lake on AWS
25. Data warehouse …
Amazon Redshift data warehouse
• Relational data
• Gigabytes to petabytes scale
• Reporting and analysis
• Schema defined prior to data load
AWS Glue ETL
Redshift COPY
On premises
Data lake on AWS
Amazon QuickSight
Existing or new BI tool
26. A data lake is not an enterprise data warehouse (EDW)
Data lake | EDW
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured / semi-structured / unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction / Advanced Analytics + BI use cases | BI use cases
Data at low level of detail / granularity | Data at summary / aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled | Explicitly sized environments, compute and storage scaled linearly
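The schema-on-read versus schema-on-write row in the comparison above can be illustrated with a small sketch. The predefined schema, the sample records, and both helper functions are hypothetical; they only show that a warehouse-style load rejects rows that do not fit, while a lake-style query applies structure to raw data at read time.

```python
import json
from typing import Callable, Dict, List

# Hypothetical predefined warehouse schema.
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}


def load_schema_on_write(rows: List[Dict], schema: Dict = WAREHOUSE_SCHEMA) -> List[Dict]:
    """Accept only rows whose columns and types match the predefined schema."""
    accepted = []
    for row in rows:
        same_columns = set(row) == set(schema)
        if same_columns and all(isinstance(row[c], t) for c, t in schema.items()):
            accepted.append(row)
    return accepted


def query_schema_on_read(raw_lines: List[str], field: str, cast: Callable) -> List:
    """Keep raw lines untouched; apply structure only when a query runs."""
    values = []
    for line in raw_lines:
        record = json.loads(line)
        if field in record:
            values.append(cast(record[field]))
    return values


# Raw events as they might land in the lake, including new and odd fields.
raw_lines = [
    '{"user_id": 1, "amount": 9.5}',
    '{"user_id": 2, "amount": "12.25", "coupon": "SPRING"}',
    '{"user_id": 3, "clicks": 4}',
]

warehouse_rows = load_schema_on_write([json.loads(line) for line in raw_lines])
lake_amounts = query_schema_on_read(raw_lines, "amount", float)
lake_clicks = query_schema_on_read(raw_lines, "clicks", int)

print(warehouse_rows)
print(lake_amounts, lake_clicks)
```

Only the first record fits the predefined schema, yet all three remain queryable from the raw lines, including the `clicks` field that nobody planned for.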
27. Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon Redshift data | Redshift Spectrum query engine | Amazon S3 data lake
• Exabyte-scale Amazon Redshift SQL queries against Amazon S3
• Join data across Amazon Redshift and Amazon S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
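As a sketch of the "join data across Amazon Redshift and Amazon S3" bullet, the helpers below render the two SQL statements typically involved: registering a Data Catalog database as an external schema, then joining a local table with an external one. The schema, database, table, and column names and the role ARN are placeholders invented for illustration, and the functions only build strings; they do not connect to a cluster.

```python
def external_schema_sql(schema: str, catalog_database: str, iam_role_arn: str) -> str:
    """Render a CREATE EXTERNAL SCHEMA statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role_arn}';"
    )


def join_query_sql(local_table: str, external_table: str, key: str) -> str:
    """Render a join between a Redshift table and an external (S3) table."""
    return (
        f"SELECT c.{key}, COUNT(*) AS events "
        f"FROM {local_table} AS c "
        f"JOIN {external_table} AS e ON e.{key} = c.{key} "
        f"GROUP BY c.{key};"
    )


# Placeholder identifiers; string formatting like this is for trusted,
# hard-coded names only, never for user-supplied input.
schema_sql = external_schema_sql(
    "lake", "datalake", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
query_sql = join_query_sql("public.customers", "lake.clickstream", "customer_id")

print(schema_sql)
print(query_sql)
```

Because billing is based on data scanned, selecting only the needed columns from columnar formats such as Parquet or ORC generally reduces the bytes read by a query like this.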
28. Amazon Redshift Spectrum
Query your data lake
JDBC / ODBC
Amazon Redshift
• Hot data
• COPY commands
Redshift Spectrum
• Scale-out serverless compute (1 2 3 4 ... N)
• Query directly on data lake
AWS Glue Data Catalog
29. Data lakes extend the traditional approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing, real-time, machine learning
31. Machine learning on your data lake
32. ML in the hands of every developer
SERVICES
• Vision: Amazon Rekognition, Amazon Rekognition Video
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Translate, Amazon Comprehend
• Chatbot: Amazon Lex
PLATFORMS
• Amazon SageMaker
• Amazon ML
• Spark & Amazon EMR
• Amazon Mechanical Turk
• AWS DeepLens
FRAMEWORKS (Deep Learning AMI)
• MXNet
• TensorFlow
• Caffe2 & Caffe
• Gluon
• Keras
• CNTK
• PyTorch
INFRASTRUCTURE
• GPU
• CPU
• IoT (AWS Greengrass)
• Mobile
• FPGA
• Serverless
33. Integration: The Data Architecture
ML process:
• Business Problem – ML problem framing
• Data Collection
• Data Integration
• Data Preparation & Cleaning
• Data Visualization & Analysis
• Feature Engineering
• Model Training & Parameter Tuning
• Model Evaluation
• Are Business Goals met? (Yes / No)
• Model Deployment
• Monitoring & Debugging
• Predictions
• Data Augmentation
• Feature Augmentation
• Retraining
Build the data platform:
• Amazon S3
• AWS Glue
• Amazon Athena
• Amazon EMR
• Amazon Redshift Spectrum
35. Integration: The Data Architecture
• Set up and manage model inference clusters
• Manage and scale model inference APIs
• Monitor and debug model predictions
• Model versioning and performance tracking
• Automate new model version promotion to production (A/B testing)
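The last bullet, automated promotion through A/B testing, can be sketched as weighted traffic routing between two model versions plus a simple promotion rule. The variant names, the 90/10 weights, the metrics, and the thresholds are all assumptions for illustration; this is a generic sketch of the idea, not how Amazon SageMaker implements it.

```python
import random
from collections import Counter
from typing import Dict


def pick_variant(weights: Dict[str, float], rng: random.Random) -> str:
    """Route one request to a model variant in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def should_promote(
    metrics: Dict[str, Dict[str, float]],
    champion: str,
    challenger: str,
    min_requests: int = 1000,
    min_gain: float = 0.01,
) -> bool:
    """Promote only when the challenger has enough traffic and a clear gain."""
    if metrics[challenger]["requests"] < min_requests:
        return False
    gain = metrics[challenger]["accuracy"] - metrics[champion]["accuracy"]
    return gain >= min_gain


# Hypothetical 90/10 split between the current and the new model version.
weights = {"model-v1": 0.9, "model-v2": 0.1}
rng = random.Random(42)  # fixed seed so the simulation is repeatable
routed = Counter(pick_variant(weights, rng) for _ in range(10_000))

# Hypothetical observed metrics after the experiment.
metrics = {
    "model-v1": {"requests": 9000, "accuracy": 0.91},
    "model-v2": {"requests": 1000, "accuracy": 0.93},
}
promote = should_promote(metrics, "model-v1", "model-v2")

print(dict(routed), promote)
```

Requiring a minimum number of challenger requests before comparing metrics guards against promoting a version on the strength of a handful of lucky predictions.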
36. DigitalGlobe – Amazon SageMaker
By using Amazon SageMaker, DigitalGlobe improved its cache hit rate by more than a factor of two, often reaching around 83% and sometimes trending to 90%. This also allowed them to cut their cloud storage cost in half by better utilizing their Amazon S3-optimized cache and retrieving less from their 100+ PB archive.
37. Agility and innovation are key
38. Core tenets
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Future-proof your analytics
• Choose the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management