By using a Data Lake, you no longer need to worry about structuring or transforming data before storing it. A Data Lake on AWS enables your organization to analyze data more rapidly, helping you quickly discover new business insights. Join us for our webinar to learn about the benefits of building a Data Lake on AWS and how your organization can begin reaping its rewards. In this session, we will share a methodology for implementing a Data Lake on AWS and best practices for getting the most from your Data Lake.
Speaker: Russell Nash,
APAC Solution Architect, DW, AWS APAC
16. Compute flexibility — matching EC2 instance families to workloads (compute / memory / storage):
Compute optimized (C4, C3 families): machine learning
Memory optimized (X1, R3 families): interactive analysis
Storage optimized (D2, I2 families): large HDFS
General purpose (M4, M3 families): batch processing
36. Row format vs. column format for the table:

  ID   Age  State
  123  20   NSW
  345  25   WA
  678  40   VIC
  999  21   WA

ROW FORMAT:    123 20 NSW | 345 25 WA | 678 40 VIC | 999 21 WA
COLUMN FORMAT: 123 345 678 999 | 20 25 40 21 | NSW WA VIC WA
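The difference between the two layouts can be sketched in Python. This is a toy illustration of the idea only, not how columnar formats such as Parquet or ORC actually encode data:

```python
# Toy illustration of row-oriented vs. column-oriented layouts,
# using the small ID/Age/State table from the slide above.
rows = [
    (123, 20, "NSW"),
    (345, 25, "WA"),
    (678, 40, "VIC"),
    (999, 21, "WA"),
]

# ROW FORMAT: all values of each record are stored together.
row_format = [value for record in rows for value in record]

# COLUMN FORMAT: all values of each column are stored together.
column_format = [value for column in zip(*rows) for value in column]

print(row_format)     # [123, 20, 'NSW', 345, 25, 'WA', ...]
print(column_format)  # [123, 345, 678, 999, 20, 25, 40, 21, ...]

# An analytic query like AVG(Age) needs only the Age column.
# In the columnar layout that is one contiguous slice; in the
# row layout every record must be touched to reach each Age.
ages = column_format[4:8]
print(sum(ages) / len(ages))  # 26.5
```

This is why analytic engines scan far less data when tables are stored column-wise: a query that reads two columns out of fifty skips the other forty-eight entirely.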
41. AWS Solution Builder – Data Lake on AWS
Reference Architecture deployment
via CloudFormation
Configures core services to tag,
search and catalogue datasets
Deploys a console to search and
browse available datasets
http://amzn.to/2nTVjcp
Editor's Notes
Let’s look at a traditional analytics pipeline that is characterized by an ETL process that occurs before the data is loaded into the Data Warehouse.
The problem with this is that everyone sees the same curated data, which may be summarized and aggregated.
Users don’t get access to raw data.
The data lake contains the raw data and this then allows different users to have their own ETL processes to format the data the way they need it.
When you look at the requirements for a data lake, Hadoop seems like the perfect choice: it is scalable, extensible, and very flexible. It can run on commodity hardware, has a vast ecosystem of tools, and appears to be cost effective to run.
However, there is one issue that makes Hadoop by itself less appealing as a data lake.
If you use Hadoop’s storage layer (HDFS) to store your data then you are coupling the storage with the compute.
If you need more storage space, you have to add more machines (virtual or otherwise), and this increases your compute capacity as well.
For maximum flexibility and cost effectiveness you need to separate compute and storage and scale them both independently. Using S3 for storage and Amazon EMR as your compute layer allows you to do that.
Amazon EMR simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances.
The EMR File System allows EMR clusters to efficiently and securely use Amazon S3 as an object store for Hadoop. You can store your data in Amazon S3 and use multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload, which can be more efficient than a single cluster serving multiple workloads with different requirements. For example, you might have one cluster that is optimized for I/O and another that is optimized for CPU, each processing the same data set in Amazon S3. Additionally, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.
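A minimal sketch of what a transient, S3-backed cluster looks like through boto3's EMR API (run_job_flow). The bucket name, instance types, counts, and release label below are placeholders, not recommendations:

```python
def build_emr_request(name, log_bucket, core_count, task_count):
    """Build a run_job_flow request for a transient EMR cluster.
    Input and output live at s3:// paths via EMRFS, so the cluster
    can be terminated without losing any data."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # placeholder release label
        "LogUri": f"s3://{log_bucket}/logs/",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1},
                # Core nodes run HDFS; keep these On-Demand.
                {"InstanceRole": "CORE", "InstanceType": "m4.large",
                 "InstanceCount": core_count},
                # Task nodes hold no HDFS data, so Spot is safe here.
                {"InstanceRole": "TASK", "InstanceType": "m4.large",
                 "InstanceCount": task_count, "Market": "SPOT"},
            ],
            # Terminate the cluster when its steps finish; the data
            # set stays in S3 for the next cluster to pick up.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With boto3 installed and AWS credentials configured, the request
# would be submitted as:
#   boto3.client("emr").run_job_flow(**build_emr_request(...))
request = build_emr_request("io-optimized-cluster", "my-datalake-bucket", 2, 4)
```

Because each cluster is just a request like this, nothing stops you from running an I/O-optimized and a CPU-optimized cluster side by side against the same S3 data set.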
Amazon EMR makes it easy to use Spot instances so you can save both time and money. Amazon EMR clusters include 'core nodes' that run HDFS and 'task nodes' that do not; task nodes are ideal for Spot because if the Spot price increases and you lose those instances, you will not lose data stored in HDFS.
Amazon EMR supports powerful and proven Hadoop tools such as Hive, Pig, HBase, and Impala. Additionally, it can run distributed computing frameworks besides Hadoop MapReduce such as Spark or Presto using bootstrap actions. You can also use Hue and Zeppelin as GUIs for interacting with applications on your cluster.
At the very heart of solving the constraints customers face is the notion of decoupling storage from compute.
For maximum flexibility and cost effectiveness separating compute and storage allows each to scale independently.
And this is the very first step in building a data lake on AWS.
Athena is a fully managed serverless service.
There is no provisioning or administration to be performed and the service is available instantly.
Pricing is per query, based on the amount of data scanned.
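Since there is no cluster to stand up, querying the lake reduces to a single API call. A minimal sketch using Athena's start_query_execution operation; the SQL, database name, and output bucket are placeholders, and the client is passed in so the same function works with a real boto3 client or a test stub:

```python
def run_athena_query(athena, sql, database, output_bucket):
    """Submit a query to Athena; nothing to provision first.
    `athena` is an Athena client, e.g. boto3.client("athena")."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes its result files to this S3 location.
        ResultConfiguration={
            "OutputLocation": f"s3://{output_bucket}/results/"
        },
    )
    # The ID is used to poll status and fetch results later.
    return response["QueryExecutionId"]
```

Because the charge is driven by bytes scanned, the columnar, compressed formats discussed earlier make the same query both faster and cheaper.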