Building-a-Data-Lake-on-AWS

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
SUMMIT
Building a Data Lake on AWS
Rahul Bhartia
Principal Big Data Architect - AWS
rbhartia@

There are more people accessing data:
Data Scientists, Analysts, Business Users, Applications
And more requirements for making data available:
Secure, Real time, Flexible, Scalable

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

Building a Data Lake on AWS
Amazon S3
AWS Glue
AWS Snowball
AWS DataSync
AWS Database Migration Service
AWS Storage Gateway
Amazon Kinesis
Crawler
Job
Data Catalog
AWS Lake Formation
Amazon Athena
Amazon EMR
Amazon QuickSight
Amazon Redshift
Amazon SageMaker

Data lakes made serverless
Amazon S3, AWS Glue, Amazon Athena, Amazon QuickSight, Amazon Kinesis, Amazon SageMaker
Serverless. Zero infrastructure. Zero administration
Never pay for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage
Devices, Web, Sensors, Social

Amazon S3—Object Storage
Security and compliance
• 3 different forms of encryption at rest, plus encryption in transit
• Log and monitor with CloudTrail; discover and protect data with Macie
Flexible management
• Classify, report, and visualize data usage trends
• Use tags for consumption, cost, and security
• Build lifecycle policies to automate tiering and retention
Durability, availability & scalability
• Built for eleven nines of durability
• Data distributed across 3 facilities within a region
• Global replication capabilities
Query-in-Place
• Run analytics & ML on data without moving it
• Retrieve a subset of data with S3 Select, improving performance by up to 400%
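The lifecycle-policy bullet above can be made concrete. The following is a minimal sketch, assuming the request shape that boto3's `put_bucket_lifecycle_configuration` accepts; the prefix, day counts, rule ID format, and bucket name are hypothetical values chosen for illustration and do not come from the talk.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build a lifecycle configuration that tiers objects, then expires them.

    The returned dict follows the shape accepted by boto3's
    s3.put_bucket_lifecycle_configuration(LifecycleConfiguration=...).
    All values passed in are illustrative assumptions.
    """
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("expected ia_after_days < glacier_after_days < expire_after_days")
    return {
        "Rules": [
            {
                # Hypothetical naming scheme for the rule ID.
                "ID": "tier-and-expire-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Tiering: move to cheaper storage classes as data ages.
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Retention: delete objects after the retention period.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_configuration("raw/", 30, 90, 365)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config)
```

Building the configuration as plain data keeps the policy reviewable and testable before anything is sent to S3.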

AWS Glue Data Catalog
Unified metadata repository for data in
• Amazon S3
• Amazon DynamoDB
• Relational databases: Amazon RDS, Amazon Redshift
Query your data from Amazon Athena, Amazon Redshift Spectrum, or Amazon EMR
Augment technical metadata with business metadata for tables
Schema evolution using versioning
Central and searchable view of your data assets

AWS Glue Crawlers
Automatically catalog your data
Crawlers automatically build your Data Catalog and keep it in sync.
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
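To illustrate what "Hive-style partitions" means for objects in Amazon S3, here is a small sketch that extracts `name=value` path segments from an object key. It only demonstrates the naming convention; it is not the crawler's actual detection logic, and the function name and sample key are made up for the example.

```python
def parse_hive_partitions(key):
    """Return (partitions, filename) for an S3 object key.

    Folder segments of the form name=value are treated as Hive-style
    partition columns. Other folder segments are ignored, and the final
    segment is taken as the file name.
    """
    *folders, filename = key.split("/")
    partitions = {}
    for segment in folders:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions, filename


partitions, filename = parse_hive_partitions(
    "raw/life/year=2019/month=05/day=14/events-0001.json"  # hypothetical key
)
```

A catalog table built over such a layout can expose `year`, `month`, and `day` as partition columns, so queries that filter on them read only the matching prefixes.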

AWS Lake Formation
Build a secure data lake in days
Identify, ingest, clean, and transform data
Enforce security policies across multiple services
Gain and manage new insights

Easily load data to your data lake
Logs and Events
Databases
Amazon S3
Blueprints
AWS Lake Formation
Amazon RDS
Amazon Aurora
Amazon Kinesis Data Firehose
AWS CloudTrail
Full-load
Incremental

Blueprints build on AWS Glue

Easily de-duplicate your data with ML transforms

Secure once, access in multiple ways
Access Control
Admin
Amazon S3
AWS Lake Formation
Amazon Athena
Amazon EMR
Amazon QuickSight
Amazon Redshift

Security permissions in AWS Lake Formation
Control data access with simple grant and revoke permissions
Specify permissions on tables and columns rather than on buckets and objects
Easily view policies granted to a particular user
Audit all data access in one place
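A column-level grant of the kind described above might look like the following sketch. It assumes the request shape of boto3's Lake Formation `grant_permissions` call; the role ARN, database, table, and column names are hypothetical.

```python
def build_column_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Build keyword arguments for a table-and-column-level grant.

    The dict is shaped for boto3's lakeformation.grant_permissions(**grant);
    the same shape can be passed to revoke_permissions to undo the grant.
    """
    if not columns:
        raise ValueError("at least one column is required for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


grant = build_column_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",  # hypothetical role
    "sales_db",                                    # hypothetical database
    "orders",                                      # hypothetical table
    ["order_id", "order_date", "total"],
)

# With boto3 installed and credentials configured:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

The grant names tables and columns from the catalog rather than bucket paths, which is the point the slide makes.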

Data movement from on-premises
AWS Snowball
Petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud
AWS DataSync
Automate moving data between on-premises storage and Amazon S3 using the Network File System (NFS) protocol, at speeds up to 10 times faster than open-source tools
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism, bandwidth management, and a local cache
AWS Database Migration Service
Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications

Change Data Capture (CDC) to Amazon S3
AWS Database Migration Service
Source database
Crawlers
Data Catalog
Snapshot
Data
AWS Glue
Amazon Athena
Amazon EMR
New!
• Support for Parquet
• Support for S3 encryption with KMS
Amazon Redshift
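To show how a snapshot plus change records can be combined downstream, here is a minimal sketch. It assumes CDC rows arrive as CSV with a leading operation flag (`I` for insert, `U` for update, `D` for delete) followed by the column values; the column names and sample rows are invented. In practice this merge would more likely run in Athena, AWS Glue, or Amazon EMR than in a single Python process.

```python
import csv
import io


def apply_cdc(snapshot, cdc_csv, key_column, columns):
    """Apply CDC rows to a snapshot held as {key: row_dict}.

    Each CDC row is assumed to be: operation flag (I, U, or D) followed by
    one value per entry in `columns`. The input snapshot is not modified.
    """
    current = dict(snapshot)
    for record in csv.reader(io.StringIO(cdc_csv)):
        if not record:
            continue  # skip blank lines
        op, values = record[0], record[1:]
        row = dict(zip(columns, values))
        key = row[key_column]
        if op in ("I", "U"):
            current[key] = row      # insert or overwrite with the new image
        elif op == "D":
            current.pop(key, None)  # delete if present
        else:
            raise ValueError(f"unknown operation flag: {op!r}")
    return current


columns = ["id", "name", "city"]
snapshot = {
    "1": {"id": "1", "name": "Ana", "city": "Lima"},
    "2": {"id": "2", "name": "Raj", "city": "Pune"},
}
changes = "I,3,Chen,Oslo\nU,1,Ana,Quito\nD,2,Raj,Pune\n"
latest = apply_cdc(snapshot, changes, "id", columns)
```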

Data movement in real-time
Amazon Kinesis Video Streams
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Amazon Kinesis Data Firehose
Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
Amazon Kinesis Data Streams
Build custom, real-time applications that process data streams using popular stream processing frameworks
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Fully managed open-source platform for building real-time streaming data pipelines and applications

Real-time events to Amazon S3
Prefix: raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
Buffer: Up to 128 MB or 15 minutes
Kinesis Data Streams
Kinesis Data Firehose
Lambda Transformation
Aggregated JSON Data
Aggregated Parquet Data
Amazon Athena
Crawlers
Save as Parquet
Data Catalog
AWS Glue ETL
New!

Transforming data using AWS Glue
Amazon S3 (Raw data)
Amazon S3 (Processed data)
AWS Glue Data Catalog
AWS Glue Crawler
AWS Glue Crawler
AWS Glue job
Lambda Function
Amazon S3 (Enriched data)
AWS Glue Crawler
AWS Glue job
File Put Event
Trigger
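The "File Put Event" to "Lambda Function" to "AWS Glue job" path in this diagram could be wired roughly as follows. This sketch only translates an S3 event notification into keyword arguments for boto3's Glue `start_job_run`; the job name, the `--source_path` argument, and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def glue_job_runs_for_event(event, job_name):
    """Translate an S3 event notification into Glue start_job_run kwargs.

    Only object-created records are considered. Object keys in S3 event
    notifications are URL-encoded, so they are decoded before use.
    """
    runs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        runs.append(
            {
                "JobName": job_name,
                # "--source_path" is a hypothetical job argument name.
                "Arguments": {"--source_path": f"s3://{bucket}/{key}"},
            }
        )
    return runs


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-data-lake"},
                "object": {"key": "raw/year%3D2019/file+1.json"},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "my-data-lake"},
                "object": {"key": "raw/old.json"},
            },
        },
    ]
}
runs = glue_job_runs_for_event(sample_event, "raw-to-processed")

# Inside a Lambda handler, each entry could then be submitted with:
#   glue = boto3.client("glue")
#   for run in runs:
#       glue.start_job_run(**run)
```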

Analytics
Amazon QuickSight
Pay only for what you use; scale to tens of thousands of users; embedded analytics; build end-to-end BI solutions
New! ML Insights
Amazon Athena
Run interactive queries to easily analyze data in Amazon S3 using standard SQL; no infrastructure to set up or manage and no data to load
New! Workgroups
Amazon EMR
Flexible, open source choice for Hadoop and Spark; lower cost than on-premises with autoscaling; security with encryption, authentication, and authorization
New! Multi-master
Amazon Redshift
Cost-effective and up to 10x faster than traditional data warehouses; easy to set up, deploy, and manage; scale on demand for large data volume and high query concurrency
New! Concurrency scaling

Automated reports using Amazon Athena
athena.startQueryExecution("SELECT * FROM business_view")
1. Schedule query
2. Track QueryID for status
3. Query results to Amazon S3
4. New file trigger
5. Job complete notification
Lambda Function
Athena Query
S3 Bucket
Lambda Function
SNS Topic
DynamoDB Table
SNS
Queue
Email notification
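Steps 1 and 2 of this flow (schedule the query, then track its QueryID) can be sketched as below. The request dict follows the shape of boto3's Athena `start_query_execution`; the database name and output location are hypothetical, and the status handling is one possible design, not the speaker's implementation.

```python
def build_query_request(sql, database, output_location):
    """Build keyword arguments for athena.start_query_execution(**request)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def next_action(state):
    """Map an Athena query state to what a tracking function should do next."""
    if state == "SUCCEEDED":
        return "done"
    if state in ("FAILED", "CANCELLED"):
        return "alert"
    if state in ("QUEUED", "RUNNING"):
        return "poll_again"
    raise ValueError(f"unexpected query state: {state!r}")


request = build_query_request(
    "SELECT * FROM business_view",
    "reports",                        # hypothetical database
    "s3://my-report-bucket/athena/",  # hypothetical results location
)

# With boto3 installed and credentials configured:
#   athena = boto3.client("athena")
#   query_id = athena.start_query_execution(**request)["QueryExecutionId"]
#   state = athena.get_query_execution(QueryExecutionId=query_id)[
#       "QueryExecution"]["Status"]["State"]
#   action = next_action(state)
```

Storing the returned query ID (for example in the DynamoDB table shown in the diagram) lets a later invocation look up the status without keeping a function running while the query executes.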

Athena Workgroups
Isolation: Unique query output location per Workgroup; encrypt results with a unique AWS KMS key per Workgroup
Metrics: Workgroup metrics are published to CloudWatch; build dashboards and alerting based on them
Cost Controls: Define a per-query data-scanned threshold; any query exceeding it will be cancelled; trigger alarms to notify of increasing usage and cost
Tags: Use tags to categorize your AWS resources in different ways, for instance by purpose, owner, or environment
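The four Workgroup capabilities on this slide map onto a single configuration. The sketch below assumes the request shape of boto3's Athena `create_work_group`; the workgroup name, output location, KMS key ARN, scan limit, and tags are hypothetical values.

```python
def build_workgroup(name, output_location, kms_key_arn, max_bytes_per_query, tags):
    """Build keyword arguments for athena.create_work_group(**workgroup)."""
    return {
        "Name": name,
        "Configuration": {
            # Isolation: dedicated output location and KMS key.
            "ResultConfiguration": {
                "OutputLocation": output_location,
                "EncryptionConfiguration": {
                    "EncryptionOption": "SSE_KMS",
                    "KmsKey": kms_key_arn,
                },
            },
            # Make the workgroup settings override client-side settings.
            "EnforceWorkGroupConfiguration": True,
            # Metrics: publish query metrics to CloudWatch.
            "PublishCloudWatchMetricsEnabled": True,
            # Cost control: cancel any query that scans more than this.
            "BytesScannedCutoffPerQuery": max_bytes_per_query,
        },
        # Tags: categorize by purpose, owner, or environment.
        "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
    }


workgroup = build_workgroup(
    "analysts",                                       # hypothetical name
    "s3://my-athena-results/analysts/",               # hypothetical location
    "arn:aws:kms:us-east-1:123456789012:key/example", # hypothetical key ARN
    10 * 1024 ** 3,                                   # 10 GiB per query
    {"owner": "bi-team", "environment": "prod"},
)

# With boto3 installed and credentials configured:
#   boto3.client("athena").create_work_group(**workgroup)
```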

Amazon EMR Notebooks
EMR Cluster
AWS Management Console
EMR-managed Jupyter notebook
Users
S3 bucket
Auto-save
Amazon S3
SageMaker
Athena

Amazon EMR – High Availability
EMR Cluster
Master Node 1: Livy, ZooKeeper, HiveServer2, YARN RM, HDFS NameNode
Master Node 2: Livy, ZooKeeper, HiveServer2, YARN RM, HDFS NameNode
Master Node 3: Livy, ZooKeeper, HiveServer2, YARN RM
Component state labels: Active, Standby, Standby, Active, Active, Active

Data-warehousing with Amazon Redshift
AWS Database Migration Service
Database
Crawlers
Data Catalog
Amazon Kinesis Data Firehose
Amazon Redshift
Files
Events
Save as Parquet
Upload to S3
Redshift Spectrum
CDC Replication

Amazon Redshift
New features
Categories: Speed, Scale, Simplicity, Security
Unload to Parquet
WLM Concurrency
AWS Lake Formation integration
Auto-Vacuum & Analyze
Auto Data Distribution
Deferred Maintenance
Snapshot Scheduler
Spectrum Request Accelerator
10x average performance improvement
Elastic resize
Concurrency Scaling
Improving short query acceleration
Stored procedures
(Each item on the slide is badged either "NEW!" or "COMING SOON".)
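For the "Unload to Parquet" item on this slide, the statement involved might be assembled as in the sketch below. It assumes the `UNLOAD ... FORMAT AS PARQUET` form of the Redshift command; the query, S3 prefix, and IAM role ARN are hypothetical, and the exact options available should be checked against the Redshift documentation for the cluster version in use.

```python
def build_unload_to_parquet(select_sql, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that writes query results as Parquet.

    The query is embedded as a quoted string, so single quotes inside it
    are doubled.
    """
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


statement = build_unload_to_parquet(
    "SELECT * FROM sales WHERE region = 'EU'",             # hypothetical query
    "s3://my-data-lake/processed/sales/",                  # hypothetical prefix
    "arn:aws:iam::123456789012:role/RedshiftUnloadRole",   # hypothetical role
)
print(statement)
```

Writing results as Parquet under a data lake prefix makes them available to the other engines in this deck (Athena, EMR, Redshift Spectrum) through the Data Catalog.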

ML Insights with Amazon QuickSight
ML Anomaly
detection
ML Forecasting
Auto Narratives

AI & Machine Learning
AI services that enable developers to plug pre-built AI functionality into their apps
ML services that make it easy for any developer to get started and get deep with ML
Frameworks and interfaces for machine learning practitioners
Amazon S3 Raw Data
Initial training data is annotated by human labelers
Active learning model is trained from human-labeled data
Ambiguous data is sent to human labelers for annotation
Human-labeled data is then sent back to retrain and improve the machine learning model
Training data the model understands is labeled automatically
An accurate training data set is ready for use in Amazon SageMaker
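The labeling loop described above routes each item either to automatic labeling or to human labelers. A simplified sketch of that routing decision follows. The confidence threshold, the data layout, and the function name are assumptions made for illustration; this is not how the managed labeling service is implemented.

```python
def route_for_labeling(predictions, threshold=0.9):
    """Split model predictions into auto-labeled items and items for humans.

    `predictions` maps an item ID to a (label, confidence) pair. Items at
    or above the threshold keep the model's label; the rest are queued for
    human annotation and can later be fed back to retrain the model.
    """
    auto_labeled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labeled[item_id] = label
        else:
            needs_human.append(item_id)
    return auto_labeled, needs_human


predictions = {
    "img-1": ("cat", 0.97),
    "img-2": ("dog", 0.55),  # ambiguous: goes to a human labeler
    "img-3": ("cat", 0.91),
}
auto_labeled, needs_human = route_for_labeling(predictions)
```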

AI & Machine Learning
AI Services
Vision: Rekognition Image, Rekognition Video, Textract
Speech: Polly, Transcribe
Language: Translate, Comprehend & Comprehend Medical
Chatbots: Lex
Forecasting: Forecast
Recommendations: Personalize
ML Services
Amazon SageMaker: Build, Train, Deploy
Pre-built algorithms & notebooks
Data labeling (Ground Truth)
Algorithms & models (AWS Marketplace for ML)
One-click model training & tuning
Optimization (Neo)
Reinforcement learning
One-click deployment & hosting
Frameworks & Infrastructure
Frameworks, Interfaces
Infrastructure: EC2 P3 & P3dn, EC2 C5, FPGAs, Greengrass, Elastic Inference

Thank you!
