Optimizing data lakes with Amazon S3 - STG303 - Chicago AWS Summit

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimizing data lakes with Amazon S3
John Mallory
Storage business development manager
Amazon Web Services
STG303

Fortnite
125+ million players
Data provides a constant feedback loop for game designers
Up-to-the-minute analysis of gamer satisfaction to drive gamer engagement
Resulting in the most popular game played in the world

Epic Games uses data lakes and analytics
Entire analytics platform running on AWS
Amazon S3 leveraged as a data lake
All telemetry data is collected with Amazon Kinesis
Real-time analytics done through Spark on Amazon EMR, Amazon
DynamoDB to create scoreboards and real-time queries
Use Amazon EMR for large-batch data processing
Game designers use data to inform their decisions
[Architecture diagram]
Sources: game clients, game servers, launcher, game services (through Kinesis); other sources: APIs, databases, Amazon S3
Near-real-time pipelines: Spark on Amazon EMR, DynamoDB, Grafana, Scoreboards API, limited raw data (real-time ad hoc SQL), user ETL (metric definition)
Batch pipelines: S3 (data lake), ETL using Amazon EMR, Tableau/BI, ad hoc SQL

Finding value in data is a journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Methods
AI/ML
SQL Query

Why use AWS for Big Data & Analytics?
Agility
Scalability
Get to insights faster
Broadest and deepest
capabilities
Low cost
Data migrations made easy

More data lakes and analytics than anywhere else
More than 10,000 data lakes on AWS

Defining the AWS data lake
Data lakes provide:
Relational and non-relational data
Scale-out to EBs
Diverse set of analytics and machine learning tools
Work on data without any data movement
Designed for low-cost storage and analytics
[Diagram]
Data warehouse sources: OLTP, ERP, CRM, LOB
Data warehouse use: business intelligence
Data lake sources: devices, web, sensors, social
Data lake components and uses: catalog, machine learning, DW queries, big data processing, interactive, real-time

Data lake on AWS
Central Storage (scalable, secure, cost-effective): Amazon S3
Catalog & Search: AWS Glue, DynamoDB, Amazon Elasticsearch Service
Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
Manage & Secure: AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch
Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

User-defined functions
• Bring your own functions & code
• Execute without provisioning servers
Processing and querying in place
Fully managed process & query
• Catalog, transform & query data in Amazon
S3
• No physical instances to manage
Lambda function

Amazon S3 is the best place for data lakes
Most ways to bring data in
Best security, compliance, and audit capabilities
Object-level controls
Unmatched durability, availability, and scalability
Business insights into your data

Optimize costs with data tiering
Hot → cold: HDFS, Amazon S3 Standard, Amazon S3 Standard-Infrequent Access, Amazon S3 Glacier
✓ Use Amazon EMR/Hadoop with local HDFS for hottest datasets
✓ Store cooler data in Amazon S3 and cold data in Amazon S3 Glacier to reduce costs
✓ Use Amazon S3 analytics to optimize tiering strategy
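As an illustration of the tiering idea above, here is a minimal sketch of an S3 lifecycle configuration that moves cooler data to Standard-IA and cold data to Glacier. The prefix, day counts, and bucket name are hypothetical; the dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, which is shown only as a comment so the sketch runs without AWS credentials.

```python
import json


def tiering_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier.

    S3 requires objects to stay at least 30 days in Standard before a
    lifecycle transition to Standard-IA, hence the lower bound below.
    """
    if not 30 <= ia_after_days < glacier_after_days:
        raise ValueError("expected 30 <= ia_after_days < glacier_after_days")
    return {
        "ID": "tier-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical prefix and thresholds; tune them from S3 analytics output.
lifecycle_configuration = {"Rules": [tiering_rule("logs/", 30, 90)]}
print(json.dumps(lifecycle_configuration, indent=2))

# With credentials configured, the configuration could be applied with:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```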

Your choice of Amazon S3 storage classes
Access frequency: frequent → infrequent
Amazon S3 Standard
• Active, frequently accessed data
• Milliseconds access
• ≥ 3 AZ
• From: $0.0210/GB
Amazon S3 Intelligent-Tiering NEW!
• Data with changing access pattern
• ≥ 3 AZ
• From: $0.0210 to $0.0125/GB
• Monitoring fee per object
• Min storage duration
Amazon S3 Standard-IA
• Infrequently accessed data
• ≥ 3 AZ
• From: $0.0125/GB
• Retrieval fee per GB
• Min object size
Amazon S3 One Zone-IA
• Re-creatable less-accessed data
• 1 AZ
• From: $0.0100/GB
• Min object size
Amazon S3 Glacier
• Archive data
• Minutes to hours access
• ≥ 3 AZ
• From: $0.0040/GB
• Min object size
Amazon S3 Glacier Deep Archive NEW!
• Archive data
• Hours access
• ≥ 3 AZ
• From: $0.00099/GB
• Min object size
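To make the price differences concrete, the following sketch compares monthly storage cost across classes using the "From" prices quoted on this slide. It deliberately ignores retrieval fees, monitoring fees, minimum object sizes, and minimum storage durations, so treat the figures as rough lower bounds; the 100,000 GB dataset size is a hypothetical example.

```python
# "From" prices per GB-month as listed on the slide (2019 figures).
PRICE_PER_GB_MONTH = {
    "Standard": 0.0210,
    "Standard-IA": 0.0125,
    "One Zone-IA": 0.0100,
    "Glacier": 0.0040,
    "Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(gigabytes, storage_class):
    """Storage-only cost in USD per month; excludes request and retrieval fees."""
    return gigabytes * PRICE_PER_GB_MONTH[storage_class]


if __name__ == "__main__":
    dataset_gb = 100_000  # hypothetical dataset size
    for storage_class in PRICE_PER_GB_MONTH:
        cost = monthly_storage_cost(dataset_gb, storage_class)
        print(f"{storage_class:<22} ${cost:,.2f}/month")
```

Because Intelligent-Tiering is priced as a range that depends on access patterns, it is left out of the table here.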

Amazon S3 Intelligent-Tiering NEW!
Automates cost savings
Automatically optimizes storage costs for
data with changing access patterns
Moves objects between two storage tiers:
• Frequent Access Tier
• Infrequent Access Tier
Monitors access patterns and auto-tiers on
granular object level
Milliseconds access, ≥ 3 AZ, monitoring fee
per object, minimum storage duration
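A toy model of the per-object auto-tiering decision might look like the sketch below. The 30-day idle threshold is not stated on the slide; it is an assumption taken from how the service was described at launch, and the function name is hypothetical. The real service makes this decision internally, so this is only a way to reason about which tier an object would likely be in.

```python
from datetime import date

# Assumption: objects idle for 30 consecutive days move to the infrequent tier.
INFREQUENT_AFTER_DAYS = 30


def current_tier(last_access, today):
    """Return the tier an object would be expected to sit in."""
    idle_days = (today - last_access).days
    if idle_days >= INFREQUENT_AFTER_DAYS:
        return "Infrequent Access"
    return "Frequent Access"


if __name__ == "__main__":
    print(current_tier(date(2019, 1, 1), date(2019, 3, 1)))
```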

Amazon S3 Glacier Deep Archive NEW!
Lowest-cost storage available in the cloud
$0.00099/GB/month
Less than 1/4 the cost of Amazon S3 Glacier
No tape to manage
Designed for 11 9’s durability
Recover data in hours
Coming soon

A Data Lake Needs to
Accommodate a Wide
Variety of Concurrent
Data Sources
Rapidly ingest all data sources
IoT, Sensor Data, Clickstream Data,
Social Media Feeds, Streaming Logs
Oracle, MySQL, MongoDB, DB2,
SQL Server, Amazon RDS
On-premises ERP, Mainframes,
Lab Equipment, NAS Storage
Offline Sensor Data, NAS,
On-premises Hadoop
On-premises Data Lakes, EDW,
Large Scale Data Collection
Ingest
Methods

AWS Transfer for SFTP NEW!
Fully managed service enabling transfer of data over SFTP, while stored in Amazon S3
Seamless migration of existing workflows
Native integration with AWS services
Simple to use
Cost-effective
Secure and compliant
Fully managed in AWS

AWS DataSync NEW!
Transfer service that simplifies, automates, and accelerates data movement
Combines the speed and reliability of network acceleration software with the cost-effectiveness of open source tools
Replicate data to AWS for business continuity
Transfer data for timely in-cloud analysis
Migrate active application data to AWS
Transfers up to 10 Gbps per agent
Simple data movement to Amazon S3 or Amazon EFS
Secure and reliable transfers
AWS integrated
Pay as you go

Process data in place . . .
Amazon S3
Amazon Athena
Amazon Redshift Spectrum
Amazon SageMaker
AWS Glue

Amazon S3 Select
Select a subset of your object’s data using a SQL expression

Improved performance for data lakes
As customers store larger and larger datasets in Amazon S3,
Amazon S3 Select offers up to a 400% performance improvement

Amazon S3 Select enhancements NEW!
Now Supports:
CSV, JSON, JSON arrays, and Parquet
formats
GZIP, BZIP2, and Snappy compression
Integrated with Spark, Hive, and Presto on Amazon EMR
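For readers who have not used S3 Select, the sketch below builds the request for a SQL expression over a GZIP-compressed CSV object. The bucket, key, and column names are hypothetical. The keyword arguments follow boto3's `select_object_content` call, which is shown only in comments so the example runs without AWS access.

```python
# Compression types accepted for CSV and JSON input; Parquet objects use
# columnar compression inside the file instead.
SUPPORTED_COMPRESSION = {"NONE", "GZIP", "BZIP2"}


def build_select_request(bucket, key, expression, compression="GZIP"):
    """Assemble keyword arguments for an S3 Select query on a CSV object."""
    if compression not in SUPPORTED_COMPRESSION:
        raise ValueError(f"unsupported compression: {compression}")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row holds column names
            "CompressionType": compression,
        },
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical object and columns.
request = build_select_request(
    bucket="example-data-lake",
    key="telemetry/2019/05/30/events.csv.gz",
    expression=(
        "SELECT s.player_id, s.score FROM S3Object s "
        "WHERE CAST(s.score AS INT) > 1000"
    ),
)

# With credentials configured, the query could be run with:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode("utf-8"))
```

Only the selected rows and columns leave S3, which is where the performance gain comes from.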

Seamless integration with Amazon S3
Link your Amazon S3 dataset to your Amazon FSx for Lustre file system, then . . .
Data stored in Amazon S3 is loaded to Amazon FSx for processing
Output of processing returned to Amazon S3 for retention
When your workload finishes, simply delete your file system.

Amazon FSx for Lustre
For compute-intensive data processing
use cases like HPC or machine learning
Raw data stored in Amazon S3 is loaded to
FSx for Lustre for processing
Output of processing returned to
Amazon S3 for retention

Amazon FSx for Lustre performance
Massively scalable performance
100+ GB/s throughput | Millions of IOPS | Consistent sub-millisecond latencies
Parallel file system
Supports hundreds of thousands of cores
SSD-based

Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
• Compress & store/archive as raw input
• Columnar compressed are generally preferred
• Parquet or ORC
• Smaller storage footprint = lower cost
• More efficient scan & query
• Row-oriented (Avro) is good for full data scans
• Organize into partitions
• Coalescing to larger partitions over time
Key considerations are cost, performance, & support
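The "organize into partitions" advice is often implemented with Hive-style `key=value` prefixes, which engines such as Athena, Spark, and Hive can use to prune partitions. The sketch below assumes a date-based layout; the table name, file name, and `year/month/day` scheme are illustrative choices, not something the slide prescribes.

```python
from datetime import date


def partition_prefix(table, day):
    """Return a Hive-style partition prefix for one day of data."""
    return f"{table}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def object_key(table, day, filename):
    """Return the full S3 key for a file stored in that day's partition."""
    return partition_prefix(table, day) + filename


if __name__ == "__main__":
    # Hypothetical table and Parquet file name.
    print(object_key("telemetry", date(2019, 5, 30), "part-0000.parquet"))
```

Coalescing over time could then mean rewriting many small daily files into fewer, larger files under a coarser prefix such as `year=/month=`.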

Data prep is ~80% of data lake work
[Pie chart of time spent: building training sets; cleaning and organizing data; collecting datasets; mining data for patterns; refining algorithms; other]

Set up a catalog, ETL, and data prep
with AWS Glue
Serverless provisioning, configuration, and
scaling to run your ETL jobs on Apache Spark
Pay only for the resources used for jobs
Crawl your data sources, identify data
formats, and suggest schemas and
transformations
Automates the effort in building,
maintaining, and running ETL jobs

Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
Timeline: new raw data arrives (< 22:00 UTC) → crawl raw dataset → run ‘optimize’ job → crawl optimized dataset → ready for reporting (SLA deadline 02:00 UTC)
Events and actions:
• Data arrives in Amazon S3 → start crawler
• Crawler succeeds → start job or trigger
• Job succeeds → start crawler
• Reporting dataset ready
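One way to picture the event-driven flow is a single Lambda-style handler that maps each incoming event to the next pipeline step. This is a sketch under several assumptions: the crawler and job names are hypothetical, and the event field names (`detail-type`, `crawlerName`, `jobName`, `state`) should be checked against the current AWS Glue and Amazon S3 event documentation. The Glue client is passed in so the routing logic can be exercised without AWS access.

```python
# Hypothetical resource names.
RAW_CRAWLER = "raw-crawler"
OPTIMIZE_JOB = "optimize-job"
OPTIMIZED_CRAWLER = "optimized-crawler"


def next_action(event):
    """Map an incoming event to (action, resource_name)."""
    # S3 notification: new raw data arrived, so crawl the raw dataset.
    records = event.get("Records", [])
    if any(r.get("eventSource") == "aws:s3" for r in records):
        return ("start_crawler", RAW_CRAWLER)

    detail_type = event.get("detail-type")
    detail = event.get("detail", {})
    if detail_type == "Glue Crawler State Change" and detail.get("state") == "Succeeded":
        if detail.get("crawlerName") == RAW_CRAWLER:
            return ("start_job_run", OPTIMIZE_JOB)
        if detail.get("crawlerName") == OPTIMIZED_CRAWLER:
            return ("reporting_ready", None)
    if (detail_type == "Glue Job State Change"
            and detail.get("state") == "SUCCEEDED"
            and detail.get("jobName") == OPTIMIZE_JOB):
        return ("start_crawler", OPTIMIZED_CRAWLER)
    return ("ignore", None)


def handler(event, context=None, glue=None):
    """Lambda-style entry point; `glue` would be boto3.client('glue') in AWS."""
    action, name = next_action(event)
    if glue is not None:
        if action == "start_crawler":
            glue.start_crawler(Name=name)
        elif action == "start_job_run":
            glue.start_job_run(JobName=name)
    return action
```

Keeping the routing in a pure function makes each transition easy to unit test before wiring up CloudWatch Events rules.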

Security challenges with data lakes
Data challenges
• Controlling access to data
• Data masking, row / column / cell level
encryption, key management
• Data loss / exfiltration
• Loss of data integrity
• Data provenance
• Compliance requirements (GDPR
and others)
Management challenges
• Central administration
• Federated authentication,
typically with Active Directory
• Role-based access control (RBAC)
• Centralized audit
• End-to-end data protection (at-rest and in-transit)

AWS helps you secure
Compliance
AWS Artifact
Amazon Inspector
AWS CloudHSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
Amazon Virtual Private Cloud
(Amazon VPC)
Encryption
AWS Certificate Manager
AWS Key Management Service
(AWS KMS)
Encryption at rest
Encryption in transit
Bring your own keys, HSM
support
Identity
AWS Identity and Access
Management (IAM)
AWS Single Sign-On
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customers need to have multiple levels of security, identity and access management, encryption,
and compliance to secure their data lake

Data lake security
• Data storage
• Metadata

Control access to data
Configure Amazon S3 permissions
• Implement your access control matrix using IAM
policies
• Use S3 bucket policies for easy cross-account data
sharing
• Limit role-based access from an Amazon EMR
cluster’s Amazon Elastic Compute Cloud (Amazon
EC2) instance profile
• Authorize access from other tools such as Amazon
Redshift using IAM roles
IAM Principals Amazon EMR Amazon Redshift
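As a sketch of the cross-account sharing bullet, the function below builds a bucket policy that lets another AWS account list a bucket and read objects under a prefix. The bucket name, account ID, and prefix are placeholders; the consuming account would still need matching IAM permissions for its own users or roles.

```python
import json


def cross_account_read_policy(bucket, account_id, prefix=""):
    """Bucket policy granting read-only access to another AWS account."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountList",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "CrossAccountRead",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder values for illustration only.
policy = cross_account_read_policy("example-data-lake", "111122223333", "curated/")
print(json.dumps(policy, indent=2))

# With credentials configured, the policy could be attached with:
#   import boto3
#   boto3.client("s3").put_bucket_policy(
#       Bucket="example-data-lake", Policy=json.dumps(policy))
```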

Block public access to Amazon S3
Amazon S3 provides four settings
• BlockPublicAcls—rejects new public object or bucket ACLs
• IgnorePublicAcls—ignores existing public object or bucket ACLs
• BlockPublicPolicy—rejects new public bucket access policy
• RestrictPublicBuckets—restricts access to only AWS services and authorized users
within the bucket owner’s account
But what is “public”?
• Public object (or bucket) ACL → grants permissions to members of the
predefined AllUsers or AuthenticatedUsers groups (grantees)
• Public bucket policy → any policy that does not limit its grants to fixed values (no wildcards) in the Principal and Condition elements
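The four settings and the ACL definition of "public" can be sketched as follows. The configuration dictionary matches the shape boto3's `put_public_access_block` expects, and the grant structure mirrors what `get_bucket_acl` returns; the bucket name is a placeholder and the helper name `acl_is_public` is hypothetical.

```python
# Turn on all four Block Public Access settings.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Predefined groups whose presence as a grantee makes an ACL public.
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def acl_is_public(grants):
    """Return True if any grant targets AllUsers or AuthenticatedUsers."""
    return any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GROUP_URIS
        for grant in grants
    )


# With credentials configured, the settings could be applied with:
#   import boto3
#   boto3.client("s3").put_public_access_block(
#       Bucket="example-data-lake",  # placeholder
#       PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK,
#   )
```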

S3 Object Lock NEW!
Immutable Amazon S3 Objects
• Write Once Read Many (WORM) Protection for Amazon S3 Objects
• Object or bucket control of WORM & retention attributes
Retention Management Controls
• Define retention periods in your app or with bucket-level defaults
• Objects locked for the duration of the retention period
• Support for legal hold scenarios
Data Protection and Compliance
• Assessed for use in SEC 17a-4, CFTC, and FINRA environments
• Extra protection against accidental or malicious delete

Metadata security
AWS Glue Data Catalog
• Apache Hive metastore compatible
• Track data evolution using schema versioning
• Integrates with Hive, Spark, Presto, Athena, and
Amazon Redshift Spectrum
• Use crawlers to classify your data in one central list
that is searchable

Metadata security
Key learnings
• Create and maintain centralized data catalog
• Enable cross-account access
• Use IAM policies to control catalog access—similar to S3 bucket
policies
• Encrypt metadata in AWS Glue Data Catalog

AWS Glue Data Catalog—resource policies
• Fine-grained access control to Data Catalog using IAM policies
• Restrict what users can view and query
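A fine-grained catalog policy might be sketched as below: an IAM policy document that allows read-only metadata calls on one database and a chosen set of tables. The region, account ID, database, and table names are placeholders, and the action list is a minimal illustrative set rather than a recommended baseline.

```python
import json


def catalog_read_policy(region, account_id, database, tables):
    """IAM policy allowing read access to selected Data Catalog tables."""
    base = f"arn:aws:glue:{region}:{account_id}"
    resources = [f"{base}:catalog", f"{base}:database/{database}"]
    resources += [f"{base}:table/{database}/{table}" for table in tables]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    policy = catalog_read_policy("us-east-1", "111122223333", "analytics", ["sessions"])
    print(json.dumps(policy, indent=2))
```

Tables not listed in `Resource` stay invisible to the principal, which is how the policy restricts what can be viewed and queried.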

Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics

Building data lakes can still take months

AWS Lake Formation
Build a secure data lake in days
Identify, ingest, clean, and transform data
Enforce security policies across multiple services
Gain and manage new insights

How it works

3 simple steps to an AWS data lake
1. Collect & Centralize
Remove Data Silos
Aggregate Data
Better Agility
More Data => Insights
Amazon Ingest & Storage: Amazon S3, Amazon S3 Glacier, AWS DataSync, AWS Storage Gateway, AWS Snow family*, Kinesis
2. Catalog & Transform
Know What You Have
Better Data Management
Quicker Time to Results
Higher Quality Data
AWS Glue: Crawl, Discover, & Catalog Data; ETL Data
3. Analytics & Insights
Extract Value from Data
Analyze & Report on Data
Apply Machine Learning
Visualize & Consume Results
Amazon Analytics & ML: Athena, Amazon EMR, Amazon Redshift, Amazon SageMaker, Amazon Rekognition, Amazon EC2 + Amazon FSx for Lustre
