© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT
Optimizing data lakes with Amazon S3
John Mallory
Storage business development manager
Amazon Web Services
STG303
125+ million players
Data provides a constant feedback loop
for game designers
Up-to-the-minute analysis of gamer satisfaction to
drive gamer engagement
Resulting in the most popular
game played in the world
Fortnite
Epic Games uses data lakes and analytics
Entire analytics platform running on AWS
Amazon S3 leveraged as a data lake
All telemetry data is collected with Amazon Kinesis
Real-time analytics run on Spark on Amazon EMR, with Amazon DynamoDB used to create scoreboards and serve real-time queries
Use Amazon EMR for large-batch data processing
Game designers use data to inform their decisions
[Architecture diagram]
Sources: game clients, game servers, launcher, game services
Ingest: Kinesis; other sources: APIs, databases, Amazon S3
Near-real-time pipelines: Spark on Amazon EMR and DynamoDB, with user ETL (metric definition), serving Grafana, the scoreboards API, and limited raw data (real-time ad hoc SQL)
Batch pipelines: S3 (data lake) with ETL using Amazon EMR, serving Tableau/BI and ad hoc SQL
Finding value in data is a journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Methods
SQL Query
AI/ML
Why use AWS for Big Data & Analytics?
Agility Scalability
Get to insights faster
Broadest and deepest
capabilities
Low cost
Data migrations made easy
More data lakes and analytics than anywhere else
More than 10,000 data lakes on AWS
Defining the AWS data lake
Data lakes provide:
Relational and non-relational data
Scale out to exabytes (EB)
Diverse set of analytics and machine learning tools
Work on data without any data movement
Designed for low-cost storage and analytics
[Diagram: OLTP, ERP, CRM, and LOB systems feed a data warehouse used for business intelligence; devices, web, sensors, and social sources feed the data lake, which has a catalog and supports machine learning, DW queries, big data processing, interactive, and real-time analytics]
Data lake on AWS
Central Storage (scalable, secure, cost-effective): Amazon S3
Catalog & Search: AWS Glue, DynamoDB, Amazon Elasticsearch Service
Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
Manage & Secure: AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch
Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS
Processing and querying in place
User-defined functions (Lambda function)
• Bring your own functions & code
• Execute without provisioning servers
Fully managed process & query
• Catalog, transform & query data in Amazon S3
• No physical instances to manage
Amazon S3 is the best place for data lakes
• Most ways to bring data in
• Best security, compliance, and audit capabilities
• Object-level controls
• Unmatched durability, availability, and scalability
• Business insights into your data
Optimize costs with data tiering
Hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Standard-Infrequent Access, Amazon S3 Glacier
✓ Use Amazon EMR/Hadoop with local HDFS for the hottest datasets
✓ Store cooler data in Amazon S3 and cold data in Amazon S3 Glacier to reduce costs
✓ Use Amazon S3 analytics to optimize your tiering strategy
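One way to put this tiering into practice is an S3 lifecycle rule. The sketch below is illustrative only: the prefix, the 30-day and 90-day thresholds, and the bucket name are assumptions rather than values from the talk. It builds the rule as a plain dictionary, so it runs without AWS credentials, and only the optional `apply_lifecycle` helper calls boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def build_tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build one lifecycle rule that moves aging objects to cooler storage classes.

    The day thresholds are assumed defaults; tune them with S3 analytics output.
    """
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


def apply_lifecycle(bucket, rules):
    """Apply the rules to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules}
    )


rule = build_tiering_rule("telemetry/")  # hypothetical prefix
print(json.dumps(rule, indent=2))
# apply_lifecycle("example-data-lake", [rule])  # hypothetical bucket name
```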
Your choice of Amazon S3 storage classes
Storage classes by access frequency, from frequent to infrequent:

Amazon S3 Standard
• Active, frequently accessed data
• Milliseconds access
• ≥ 3 AZ
• From: $0.0210/GB

Amazon S3 Intelligent-Tiering (NEW!)
• Data with changing access pattern
• Milliseconds access
• ≥ 3 AZ
• From: $0.0210 to $0.0125/GB
• Monitoring fee per object
• Min storage duration

Amazon S3 Standard-IA
• Infrequently accessed data
• Milliseconds access
• ≥ 3 AZ
• From: $0.0125/GB
• Retrieval fee per GB
• Min storage duration
• Min object size

Amazon S3 One Zone-IA
• Re-creatable, less-accessed data
• Milliseconds access
• 1 AZ
• From: $0.0100/GB
• Retrieval fee per GB
• Min storage duration
• Min object size

Amazon S3 Glacier
• Archive data
• Minutes to hours access
• ≥ 3 AZ
• From: $0.0040/GB
• Retrieval fee per GB
• Min storage duration
• Min object size

Amazon S3 Glacier Deep Archive (NEW!)
• Archive data
• Hours access
• ≥ 3 AZ
• From: $0.00099/GB
• Retrieval fee per GB
• Min storage duration
• Min object size
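To make the price gap between classes concrete, the following sketch multiplies a dataset size by the "From" prices listed on this slide. It assumes those prices are per GB per month, and it ignores retrieval fees, monitoring fees, minimum storage durations, and minimum object sizes, so treat the output as a rough lower bound rather than a bill estimate.

```python
# "From" prices per GB, copied from the slide (2019 figures; assumed per month).
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.0210,
    "S3 Standard-IA": 0.0125,
    "S3 One Zone-IA": 0.0100,
    "S3 Glacier": 0.0040,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(size_gb, storage_class):
    """Storage-only monthly cost in USD; excludes request and retrieval fees."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


size_gb = 100 * 1024  # a hypothetical 100 TB dataset
for name in PRICE_PER_GB_MONTH:
    print(f"{name:<26} ${monthly_storage_cost(size_gb, name):>10,.2f}")

# Ratio behind the "less than 1/4 the cost of Amazon S3 Glacier" claim.
deep_vs_glacier = (
    PRICE_PER_GB_MONTH["S3 Glacier Deep Archive"] / PRICE_PER_GB_MONTH["S3 Glacier"]
)
print(f"Deep Archive / Glacier price ratio: {deep_vs_glacier:.4f}")
```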
Amazon S3 Intelligent-Tiering NEW!
Automates cost savings
Automatically optimizes storage costs for data with changing access patterns
Moves objects between two storage tiers:
• Frequent Access Tier
• Infrequent Access Tier
Monitors access patterns and auto-tiers at the granular object level
Milliseconds access, ≥ 3 AZ, monitoring fee per object, minimum storage duration
Amazon S3 Glacier Deep Archive NEW!
Lowest-cost storage available in the cloud
• No tape to manage
• $0.00099/GB/month
• Less than 1/4 the cost of Amazon S3 Glacier
• Designed for 11 9's durability
• Recover data in hours
Coming soon
Rapidly ingest all data sources
A data lake needs to accommodate a wide variety of concurrent data sources
Ingest methods, by source type:
• IoT, sensor data, clickstream data, social media feeds, streaming logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, mainframes, lab equipment, NAS storage
• Offline sensor data, NAS, on-premises Hadoop
• On-premises data lakes, EDW, large-scale data collection
AWS Transfer for SFTP NEW!
Fully managed service enabling transfer of data over SFTP, while stored in Amazon S3
• Seamless migration of existing workflows
• Native integration with AWS services
• Simple to use
• Cost-effective
• Secure and compliant
• Fully managed in AWS
AWS DataSync NEW!
Transfer service that simplifies, automates, and accelerates data movement
Combines the speed and reliability of network acceleration software with the cost-effectiveness of open source tools
• Simple data movement to Amazon S3 or Amazon EFS
• Transfers up to 10 Gbps per agent
• Secure and reliable transfers
• AWS integrated
• Pay as you go
• Migrate active application data to AWS
• Transfer data for timely in-cloud analysis
• Replicate data to AWS for business continuity
Process data in place . . .
Amazon S3: Athena, Amazon Redshift Spectrum, Amazon SageMaker, AWS Glue
Amazon S3 Select
Select a subset of your object’s data using a SQL expression
Improved performance for data lakes
As customers store larger and larger datasets in Amazon S3,
Amazon S3 Select offers up to a 400% performance improvement
Amazon S3 Select enhancements NEW!
Now Supports:
CSV, JSON, JSON arrays, and Parquet
formats
GZIP, BZIP2, and Snappy compression
Integrated with Spark, Hive, and Presto on Amazon EMR
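As a rough illustration of selecting a subset of an object with SQL, the sketch below builds the arguments for boto3's `select_object_content` call against a GZIP-compressed CSV object. The bucket, key, and column names (`player_id`, `score`, `region`) are hypothetical. The request builder is a pure function; `run_select` needs boto3 and AWS credentials and is not called here.

```python
def build_select_request(bucket, key, sql):
    """Build keyword arguments for an S3 Select query over a gzipped CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row supplies column names
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},
    }


def run_select(request):
    """Run the query and return the matching records as text (needs AWS access)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    s3 = boto3.client("s3")
    response = s3.select_object_content(**request)
    chunks = []
    for event in response["Payload"]:  # event stream of Records, Stats, End, ...
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


request = build_select_request(
    "example-data-lake",                # hypothetical bucket
    "telemetry/2019/04/events.csv.gz",  # hypothetical key
    "SELECT s.player_id, s.score FROM S3Object s WHERE s.region = 'us-east-1'",
)
print(request["Expression"])
```

Only the selected columns and rows cross the network, which is where the performance gain for large objects comes from.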
Seamless integration with Amazon S3
Link your Amazon S3 dataset to your Amazon FSx for Lustre file system, then . . .
• Data stored in Amazon S3 is loaded to Amazon FSx for processing
• Output of processing is returned to Amazon S3 for retention
When your workload finishes, simply delete your file system.
Amazon FSx for Lustre
For compute-intensive data processing
use cases like HPC or machine learning
Raw data stored in Amazon S3 is loaded to
FSx for Lustre for processing
Output of processing returned to
Amazon S3 for retention
Amazon FSx for Lustre performance
Massively scalable performance
• 100+ GB/s throughput | Millions of IOPS | Consistent sub-millisecond latencies
• Parallel file system
• Supports hundreds of thousands of cores
• SSD-based
Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, and JSON are easy, but not efficient
  • Compress & store/archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented formats (Avro) are good for full data scans
• Organize into partitions
  • Coalesce into larger partitions over time
Key considerations are cost, performance, & support
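The "organize into partitions" advice is often implemented with Hive-style `key=value` prefixes, which query engines can use to skip irrelevant data. The sketch below shows one such key layout; the dataset name, the year/month/day granularity, and the file name are assumptions, not a layout prescribed by the talk.

```python
from datetime import datetime, timezone


def partitioned_key(dataset, event_time, filename):
    """Return an S3 object key with Hive-style year/month/day partitions."""
    return (
        f"{dataset}/year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/{filename}"
    )


key = partitioned_key(
    "telemetry",  # hypothetical dataset prefix
    datetime(2019, 4, 2, 13, 5, tzinfo=timezone.utc),
    "part-0000.parquet",
)
print(key)  # telemetry/year=2019/month=04/day=02/part-0000.parquet
```

Starting with daily partitions and later rewriting them into fewer, larger files is one way to follow the coalescing advice without changing the key scheme.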
Data prep is ~80% of data lake work
Building training sets
Cleaning and organizing data
Collecting datasets
Mining data for patterns
Refining algorithms
Other
Set up a catalog, ETL, and data prep
with AWS Glue
Serverless provisioning, configuration, and
scaling to run your ETL jobs on Apache Spark
Pay only for the resources used for jobs
Crawl your data sources, identify data
formats, and suggest schemas and
transformations
Automates the effort in building,
maintaining, and running ETL jobs
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
[Pipeline diagram]
• New raw data arrives (< 22:00 UTC): data arrives in Amazon S3 → start crawler
• Crawl raw dataset: crawler succeeds → start job or trigger
• Run ‘optimize’ job: job succeeds → start crawler
• Crawl optimized dataset: crawler succeeds → reporting dataset ready
• Ready for reporting (SLA deadline 02:00 UTC)
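The event-driven flow above could be wired up with a single Lambda function that reacts to each event and starts the next step. This is a minimal sketch, not the pipeline shown in the talk: the crawler and job names are hypothetical, and the event shapes (an S3 notification with `eventSource` set to `aws:s3`, and CloudWatch Events with `detail-type` values `Glue Crawler State Change` and `Glue Job State Change`) are assumptions to check against the events you actually receive. The Glue client is injectable so the routing logic can be exercised without AWS access.

```python
RAW_CRAWLER = "raw-dataset-crawler"              # hypothetical names
OPTIMIZE_JOB = "optimize-job"
OPTIMIZED_CRAWLER = "optimized-dataset-crawler"


def next_action(event):
    """Map an incoming event to the next pipeline step, or None to ignore it."""
    records = event.get("Records", [])
    if records and records[0].get("eventSource") == "aws:s3":
        return ("start_crawler", RAW_CRAWLER)          # new raw data arrived
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})
    if detail_type == "Glue Crawler State Change" and detail.get("state") == "Succeeded":
        if detail.get("crawlerName") == RAW_CRAWLER:
            return ("start_job_run", OPTIMIZE_JOB)     # raw crawl finished
        if detail.get("crawlerName") == OPTIMIZED_CRAWLER:
            return ("report_ready", None)              # optimized crawl finished
    if detail_type == "Glue Job State Change" and detail.get("state") == "SUCCEEDED":
        if detail.get("jobName") == OPTIMIZE_JOB:
            return ("start_crawler", OPTIMIZED_CRAWLER)
    return None


def handler(event, context, glue=None):
    """Lambda entry point; pass a fake `glue` object to run it locally."""
    action = next_action(event)
    if action is None:
        return "ignored"
    step, name = action
    if step == "report_ready":
        return "reporting dataset ready"
    if glue is None:
        import boto3  # imported lazily so next_action runs without AWS access

        glue = boto3.client("glue")
    if step == "start_crawler":
        glue.start_crawler(Name=name)
    else:
        glue.start_job_run(JobName=name)
    return f"{step}:{name}"
```

Keeping the routing in `next_action` separates the decision from the side effect, which makes the SLA-critical ordering easy to check in isolation.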
Security challenges with data lakes
Data challenges
• Controlling access to data
• Data masking, row / column / cell level encryption, key management
• Data loss / exfiltration
• Loss of data integrity
• Data provenance
• Compliance requirements (GDPR and others)
Management challenges
• Central administration
• Federated authentication, typically with Active Directory
• Role-based access control (RBAC)
• Centralized audit
• End-to-end data protection (at-rest and in-transit)
AWS helps you secure
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, Amazon Virtual Private Cloud (Amazon VPC)
Encryption: AWS Certificate Manager, AWS Key Management Service (AWS KMS), encryption at rest, encryption in transit, bring your own keys, HSM support
Identity: AWS Identity and Access Management (IAM), AWS Single Sign-On, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
Data lake security
• Data storage
• Metadata
Control access to data
Configure Amazon S3 permissions for IAM principals, Amazon EMR, and Amazon Redshift
• Implement your access control matrix using IAM policies
• Use S3 bucket policies for easy cross-account data sharing
• Limit role-based access from an Amazon EMR cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile
• Authorize access from other tools such as Amazon Redshift using IAM roles
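For the cross-account sharing bullet, a bucket policy along these lines is one common shape. This sketch only builds the policy document; the bucket name, account ID, and prefix are placeholders, and a real policy would normally be narrowed to specific roles rather than the whole account.

```python
import json


def cross_account_read_policy(bucket, account_id, prefix=""):
    """Build a bucket policy granting another account read-only access."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder bucket, account ID, and prefix.
policy = cross_account_read_policy("example-data-lake", "111122223333", "curated/")
print(json.dumps(policy, indent=2))
```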
Block public access to Amazon S3
Amazon S3 provides four settings
• BlockPublicAcls: rejects new public object or bucket ACLs
• IgnorePublicAcls: ignores existing public object or bucket ACLs
• BlockPublicPolicy: rejects new public bucket access policies
• RestrictPublicBuckets: restricts access to only AWS services and authorized users within the bucket owner’s account
But what is “public”?
• Public object (or bucket) ACL → grants permissions to members of the predefined AllUsers or AuthenticatedUsers groups (grantees)
• Public bucket policy → any policy that does not limit its grants to fixed values in the Principal and Condition elements
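The four settings map directly onto the configuration accepted by boto3's `put_public_access_block`. The sketch below builds that configuration and adds a small helper that applies the ACL definition of "public" given above to a list of grants. The grant structure mirrors what `get_bucket_acl` returns, and the bucket name is a placeholder.

```python
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def block_all_public_access():
    """Configuration that turns on all four Block Public Access settings."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def acl_is_public(grants):
    """True if any grant targets the AllUsers or AuthenticatedUsers group."""
    return any(g.get("Grantee", {}).get("URI") in PUBLIC_GROUP_URIS for g in grants)


def apply_block(bucket):
    """Apply the settings to one bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without AWS access

    boto3.client("s3").put_public_access_block(
        Bucket=bucket, PublicAccessBlockConfiguration=block_all_public_access()
    )


example_grants = [
    {"Grantee": {"Type": "Group",
                 "URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
     "Permission": "READ"},
]
print(acl_is_public(example_grants))
# apply_block("example-data-lake")  # placeholder bucket name
```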
S3 Object Lock NEW!
Immutable Amazon S3 Objects
• Write Once Read Many (WORM) Protection for Amazon S3 Objects
• Object or bucket control of WORM & retention attributes
Retention Management Controls
• Define retention periods in your app or with bucket-level defaults
• Objects locked for the duration of the retention period
• Support for legal hold scenarios
Data Protection and Compliance
• Assessed for use in SEC 17a-4, CFTC, and FINRA environments
• Extra protection against accidental or malicious delete
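Defining a retention period "in your app" could look like the sketch below, which computes a retain-until date and passes it to boto3's `put_object_retention`. The 365-day period, the COMPLIANCE mode, and the bucket and key names are assumptions for illustration. The call only succeeds on a bucket that has Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def retention_until(days, mode="COMPLIANCE", now=None):
    """Build the Retention argument: lock mode plus the retain-until timestamp."""
    now = now or datetime.now(timezone.utc)
    return {"Mode": mode, "RetainUntilDate": now + timedelta(days=days)}


def lock_object(bucket, key, days):
    """Apply a retention period to one object (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so retention_until runs without AWS access

    boto3.client("s3").put_object_retention(
        Bucket=bucket, Key=key, Retention=retention_until(days)
    )


retention = retention_until(365, now=datetime(2019, 4, 2, tzinfo=timezone.utc))
print(retention["Mode"], retention["RetainUntilDate"].isoformat())
# lock_object("example-data-lake", "audit/2019/04/02.log", 365)  # placeholders
```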
Metadata security
AWS Glue Data Catalog
• Apache Hive metastore compatible
• Track data evolution using schema versioning
• Integrates with Hive, Spark, Presto, Athena, and Amazon Redshift Spectrum
• Use crawlers to classify your data in one central, searchable list
Metadata security
Key learnings
• Create and maintain centralized data catalog
• Enable cross-account access
• Use IAM policies to control catalog access—similar to S3 bucket
policies
• Encrypt metadata in AWS Glue Data Catalog
AWS Glue Data Catalog—resource policies
• Fine-grained access control to the Data Catalog using IAM policies
• Restrict what each user or role can view and query
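A policy restricting a principal to reading one table's metadata might be shaped like the sketch below. The region, account ID, database, and table names are placeholders, and the three actions are a minimal assumed set; real query engines may need additional Glue permissions.

```python
import json


def read_only_table_policy(region, account_id, database, table):
    """Build an IAM policy allowing read access to a single Data Catalog table."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/{table}",
                ],
            }
        ],
    }


# Placeholder region, account ID, database, and table.
catalog_policy = read_only_table_policy("us-east-1", "111122223333", "telemetry", "events")
print(json.dumps(catalog_policy, indent=2))
```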
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
Building data lakes can still take months
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
How it works
3 simple steps to an AWS data lake
1. Collect & Centralize
Amazon Ingest & Storage: Amazon S3, Amazon S3 Glacier, AWS DataSync, AWS Storage Gateway, AWS Snow family*, Kinesis
• Remove Data Silos
• Aggregate Data
• Better Agility
• More Data => Insights
2. Catalog & Transform
AWS Glue: Crawl, Discover, & Catalog Data; ETL Data
• Know What You Have
• Better Data Management
• Quicker Time to Results
• Higher Quality Data
3. Analytics & Insights
Amazon Analytics & ML: Athena, Amazon EMR, Amazon Redshift, Amazon SageMaker, Amazon Rekognition, Amazon EC2 + Amazon FSx for Lustre
• Extract Value from Data
• Analyze & Report on Data
• Apply Machine Learning
• Visualize & Consume Results
Thank you!
John Mallory
johmallo@amazon.com

Optimizing data lakes with Amazon S3 - STG303 - Chicago AWS Summit

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Optimizing data lakes with Amazon S3 John Mallory Storage business development manager Amazon Web Services S T G 3 0 3
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T 125+ million players Data provides a constant feedback loop for game designers Up-to-the-minute analysis of gamer satisfaction to drive gamer engagement Resulting in the most popular game played in the world Fortnite
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Epic Games uses data lakes and analytics Entire analytics platform running on AWS Amazon S3 leveraged as a data lake All telemetry data is collected with Amazon Kinesis Real-time analytics done through Spark on Amazon EMR, Amazon DynamoDB to create scoreboards and real-time queries Use Amazon EMR for large-batch data processing Game designers use data to inform their decisions Game clients Game servers Launcher Game services N E A R R E A L T I M E P I P E L I N E N E A R R E A L T I M E P I P E L I N E Grafana Scoreboards API Limited Raw Data (real-time ad hoc SQL)User ETL (metric definition) Spark on Amazon EMR DynamoDB NEAR-REAL-TIME PIPELINES BATCH PIPELINES ETL using Amazon EMR Tableau/BI Ad hoc SQLS3 (Data Lake) Kinesis APIs Databases Amazon S3 Other sources
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Finding value in data is a journey Business Monitoring Business Insights New Business Opportunity Business Optimization Business Transformation Evolving Tools and Methods AI/MLSQL Query
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Why use AWS for Big Data & Analytics? Agility Scalability Get to insights faster Broadest and deepest capabilities Low cost Data migrations made easy
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T More data lakes and analytics than anywhere else Morethan 10,000 data lakes on AWS
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Defining the AWS data lake Data lakes provide: Relational and non-relational data Scale-out to EBs Diverse set of analytics and machine learning tools Work on data without any data movement Designed for low-cost storage and analytics OLTP ERP CRM LOB Data Warehouse Business Intelligence Data Lake 100110000100101011100101010 111001010100001011111011010 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big Data Processing Interactive Real-time
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Data lake on AWS Catalog & Search Access & User Interfaces Data Ingestion Analytics & Serving Amazon S3 DynamoDB Amazon Elasticsearch Service AWS AppSync Amazon API Gateway Amazon Cognito AWS KMS AWS CloudTrail Manage & Secure AWS IAM Amazon CloudWatch AWS Snowball AWS Storage Gateway Amazon Kinesis Data Firehose AWS Direct Connect AWS Database Migration Service Amazon Athena Amazon EMR AWS Glue Amazon Redshift Amazon DynamoDB Amazon QuickSight Amazon Kinesis Amazon Elasticsearch Service Amazon Neptune Amazon RDS Central Storage Scalable, secure, cost- effective AWS Glue
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T User-defined functions • Bring your own functions & code • Execute without provisioning servers Processing and querying in place Fully managed process & query • Catalog, transform & query data in Amazon S3 • No physical instances to manage Lambda function
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon S3 is the best place for data lakes Most ways to bring data in Best security, compliance, and audit capabilities Object-level controls Unmatched durability, availability, and scalability Business insights into your data
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Optimize costs with data tiering Hot Cold Amazon S3 standard Amazon S3— infrequent access Amazon S3 Glacier HDFS ✓ Use Amazon EMR/Hadoop with local HDFS for hottest datasets ✓ Store cooler data in Amazon S3 and cold in Amazon Simple Storage Service Glacier to reduce costs ✓ Use Amazon S3 analytics to optimize tiering strategy Amazon S3 analytics
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Your choice of Amazon S3 storage classes Access FrequencyFrequent Infrequent • Active, frequently accessed data • Milliseconds access • > 3 AZ • From: $0.0210/GB • Data with changing access pattern • Milliseconds access • > 3 AZ • From: $0.0210 to $0.0125/GB • Monitoring fee per obj. • Min storage duration • Infrequently accessed data • Milliseconds access • > 3 AZ • From: $0.0125/GB • Retrieval fee per GB • Min storage duration • Min object size Amazon S3 Standard Amazon S3 Standard-IA Amazon S3 One Zone-IA Amazon S3 Glacier • Re-creatable less- accessed data • Milliseconds access • 1 AZ • From: $0.0100/GB • Retrieval fee per GB • Min storage duration • Min object size • Archive data • Minutes to hours access • > 3 AZ • From: $0.0040/GB • Retrieval fee per GB • Min storage duration • Min object size Amazon S3 Intelligent-Tiering Amazon S3 Glacier Deep Archive • Archive data • Hours access • > 3 AZ • From: $0.00099/GB • Retrieval fee per GB • Min storage duration • Min object size N E W ! N E W !
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon S3 Intelligent-Tiering NEW! Automates cost savings Automatically optimizes storage costs for data with changing access patterns Moves objects between two storage tiers: • Frequent Access Tier • Infrequent Access Tier Monitors access patterns and auto-tiers on granular object level Milliseconds access, > 3 AZ, monitoring fee per object, minimum storage duration
  • 14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon S3 Glacier Deep Archive NEW! No tape to manage $0.00099/GB/month Less than 1/4 the cost of Amazon S3 Glacier Designed for 11 9’s durability Recover data in hours Lowest-cost storage available in the cloud C o m i n g s o o n
  • 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T A Data Lake Needs to Accommodate a Wide Variety of Concurrent Data Sources Rapidly ingest all data sources IoT, Sensor Data, Clickstream Data, Social Media Feeds, Streaming Logs Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS On-premises ERP, Mainframes, Lab Equipment, NAS Storage Offline Sensor Data, NAS, On-premises Hadoop On-premises Data Lakes, EDW, Large Scale Data Collection Ingest Methods
  • 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS transfer for SFTP NEW! Fully-managed service enabling transfer of data over SFTP, while stored in Amazon S3 Seamless migration of existing workflows Native integration with AWS services Simple to use Cost- effective Secure and compliantFully managed in AWS
  • 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS integrated AWS Transfer service that simplifies, automates, and accelerates data movement Transfers up to 10 Gbps per agent Pay as you goSecure and reliable transfers Replicate data to AWS for business continuity Transfer data for timely in- cloud analysis Migrate active application data to AWS Combines the speed and reliability of network acceleration software with the cost-effectiveness of open source tools Simple data movement to Amazon S3 or Amazon EFS AWS DataSync NEW!
  • 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Process data in place . . . Amazon S3 Athena Amazon Redshift Spectrum Amazon SageMaker AWS Glue
  • 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon S3 Select Select a subset of your object’s data using a SQL expression
  • 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Improved performance for data lakes As customers store larger and larger datasets in Amazon S3, Amazon S3 Select offers up to a 400% performance improvement
  • 21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon S3 Select enhancements NEW! Now Supports: CSV, JSON, JSON arrays, and Parquet formats GZIP, BZIP2, and Snappy compression Integrated with Spark, Hive, and Presto on Amazon EMR
  • 22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Seamless integration with Amazon S3 Data stored in Amazon S3 is loaded to Amazon FSx for processing Output of processing returned to Amazon S3 for retention When your workload finishes, simply delete your file system. Link your Amazon S3 dataset to your Amazon FSx for Lustre file system, then . . .
  • 23. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon FSx for Lustre For compute-intensive data processing use cases like HPC or machine learning Raw data stored in Amazon S3 is loaded to FSx for Lustre for processing Output of processing returned to Amazon S3 for retention
  • 24. Amazon FSx for Lustre performance: massively scalable performance (100+ GB/s throughput, millions of IOPS, consistent sub-millisecond latencies); parallel file system; supports hundreds of thousands of cores; SSD-based.
  • 25. Choosing the right data formats. There is no such thing as the “best” data format; all involve tradeoffs, depending on workload and tools. CSV, TSV, and JSON are easy but not efficient: compress them and store/archive them as raw input. Columnar compressed formats (Parquet or ORC) are generally preferred: a smaller storage footprint means lower cost, and scans and queries are more efficient. Row-oriented formats (Avro) are good for full data scans. Organize data into partitions, coalescing to larger partitions over time. Key considerations are cost, performance, and support.
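One common way to organize data into partitions is a Hive-style `name=value` key layout, which engines such as Athena, Spark, and Hive can use to prune partitions. The sketch below builds such an object key; the table name, the year/month/day scheme, and the file name are assumptions for illustration, not a layout prescribed by the slides.

```python
from datetime import date


def partition_key(table, day, filename):
    """Build a Hive-style partitioned object key: table/year=YYYY/month=MM/day=DD/file."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical table and file names.
key = partition_key("events", date(2019, 3, 7), "part-0000.parquet")
print(key)  # events/year=2019/month=03/day=07/part-0000.parquet
```

Coalescing to larger partitions over time could then mean rewriting many small daily files into, for example, one larger file per month under the `month=` prefix.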
  • 26. Data prep is ~80% of data lake work. Chart categories: building training sets; cleaning and organizing data; collecting datasets; mining data for patterns; refining algorithms; other.
  • 27. Set up a catalog, ETL, and data prep with AWS Glue: serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark; pay only for the resources used for jobs; crawl your data sources, identify data formats, and suggest schemas and transformations; automate the effort of building, maintaining, and running ETL jobs.
  • 28. Event-driven AWS Glue ETL pipeline: let Amazon CloudWatch Events and AWS Lambda drive the pipeline. Stages: new raw data arrives (before 22:00 UTC); crawl raw dataset; run ‘optimize’ job; crawl optimized dataset; ready for reporting (SLA deadline 02:00 UTC). Events and the actions they trigger: data arrives in Amazon S3 → start crawler; crawler succeeds → start job or trigger; job succeeds → start crawler; reporting dataset ready.
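The event-driven flow on this slide can be sketched as a small lookup table that a Lambda function might consult when a CloudWatch Events notification arrives. The event and action names below are hypothetical labels chosen for this sketch; in a real deployment the actions would presumably map to calls such as boto3's `glue.start_crawler` and `glue.start_job_run`, which are not invoked here.

```python
# Hypothetical event names mapped to the next pipeline action.
NEXT_ACTION = {
    "raw_data_arrived": "start_raw_crawler",
    "raw_crawler_succeeded": "start_optimize_job",
    "optimize_job_succeeded": "start_optimized_crawler",
    "optimized_crawler_succeeded": "mark_reporting_ready",
}

# For simulation only: the success event each action is assumed to emit.
EMITTED_ON_SUCCESS = {
    "start_raw_crawler": "raw_crawler_succeeded",
    "start_optimize_job": "optimize_job_succeeded",
    "start_optimized_crawler": "optimized_crawler_succeeded",
}


def handle_event(event_name):
    """Return the next action for an event, or None if the event is not handled."""
    return NEXT_ACTION.get(event_name)


def simulate_happy_path(first_event="raw_data_arrived"):
    """Walk the pipeline assuming every step succeeds; return the actions taken."""
    actions = []
    event = first_event
    while event is not None:
        action = handle_event(event)
        if action is None:
            break
        actions.append(action)
        event = EMITTED_ON_SUCCESS.get(action)
    return actions


actions = simulate_happy_path()
```

Because each step is started only by the success event of the previous one, no scheduler has to guess how long a crawl or job takes, which helps when the dataset must be ready by a fixed deadline.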
  • 29. Security challenges with data lakes. Data challenges: controlling access to data; data masking, row/column/cell-level encryption, and key management; data loss/exfiltration; loss of data integrity; data provenance; compliance requirements (GDPR and others). Management challenges: central administration; federated authentication, typically with Active Directory; role-based access control (RBAC); centralized audit; end-to-end data protection (at rest and in transit).
  • 30. AWS helps you secure: customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake. Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail. Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, Amazon Virtual Private Cloud (Amazon VPC). Encryption: AWS Certificate Manager, AWS Key Management Service (AWS KMS), encryption at rest, encryption in transit, bring your own keys, HSM support. Identity: AWS Identity and Access Management (IAM), AWS Single Sign-On, Amazon Cloud Directory, AWS Directory Service, AWS Organizations.
  • 31. Data lake security: data storage; metadata.
  • 32. Control access to data by configuring Amazon S3 permissions: implement your access control matrix using IAM policies; use S3 bucket policies for easy cross-account data sharing; limit role-based access from an Amazon EMR cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile; authorize access from other tools such as Amazon Redshift using IAM roles. Diagram labels: IAM principals, Amazon EMR, Amazon Redshift.
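As an illustration of cross-account sharing with a bucket policy, the sketch below generates a read-only policy document for a single prefix. The bucket name, prefix, and account ID are placeholders, and the choice of actions (`s3:ListBucket`, `s3:GetObject`) is an assumption about what "sharing" should permit.

```python
import json


def cross_account_read_policy(bucket, prefix, account_id):
    """Return a bucket policy granting another account read access to one prefix."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListPrefix",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Only allow listing keys under the shared prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "AllowReadObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Placeholder bucket, prefix, and account ID.
policy = cross_account_read_policy("example-data-lake", "curated", "111122223333")
policy_json = json.dumps(policy, indent=2)
```

The consuming account still has to grant its own users or roles matching IAM permissions; the bucket policy alone only opens the door on the owner's side.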
  • 33. Block public access to Amazon S3. Amazon S3 provides four settings: BlockPublicAcls rejects new public object or bucket ACLs; IgnorePublicAcls ignores existing public object or bucket ACLs; BlockPublicPolicy rejects new public bucket access policies; RestrictPublicBuckets restricts access to only AWS services and authorized users within the bucket owner’s account. But what is “public”? A public object (or bucket) ACL grants permissions to members of the predefined AllUsers or AuthenticatedUsers groups (grantees). A public bucket policy is one that doesn’t grant permissions to only fixed values in its Principal and Condition elements.
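A minimal sketch of the four settings and of the ACL half of the "public" definition follows. The configuration dictionary matches the shape boto3's `put_public_access_block` expects; `acl_is_public` is a simplified, hypothetical helper that only inspects grantee group URIs and is not how Amazon S3 itself evaluates public access.

```python
# Predefined S3 groups whose presence in an ACL grant makes that ACL public.
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def block_all_public_access():
    """Return a configuration that turns on all four Block Public Access settings."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def acl_is_public(grants):
    """Return True if any grant targets the AllUsers or AuthenticatedUsers group."""
    return any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GROUP_URIS
        for grant in grants
    )


config = block_all_public_access()

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_public_access_block(
#       Bucket="example-data-lake",  # hypothetical bucket
#       PublicAccessBlockConfiguration=config,
#   )
```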
  • 34. S3 Object Lock (new). Immutable Amazon S3 objects: write once, read many (WORM) protection for Amazon S3 objects; object- or bucket-level control of WORM and retention attributes. Retention management controls: define retention periods in your app or with bucket-level defaults; objects are locked for the duration of the retention period; support for legal hold scenarios. Data protection and compliance: assessed for use in SEC 17a-4, CFTC, and FINRA environments; extra protection against accidental or malicious deletion.
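Defining a retention period "in your app" could look roughly like the sketch below, which computes a retain-until date and assembles the arguments for boto3's `put_object_retention` call without sending it. The bucket, key, and 365-day period are hypothetical; the two retention modes named in the code (`GOVERNANCE` and `COMPLIANCE`) are the modes Object Lock offers, although the slide does not discuss them.

```python
from datetime import datetime, timedelta, timezone

VALID_MODES = ("GOVERNANCE", "COMPLIANCE")


def retention_args(bucket, key, days, mode="GOVERNANCE", now=None):
    """Assemble put_object_retention arguments that lock an object for `days` days."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    if days <= 0:
        raise ValueError("days must be positive")
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Retention": {
            "Mode": mode,
            "RetainUntilDate": now + timedelta(days=days),
        },
    }


# Hypothetical bucket and key; a fixed start time keeps the example reproducible.
args = retention_args(
    "example-data-lake",
    "audit/2019/report.parquet",
    days=365,
    now=datetime(2019, 1, 1, tzinfo=timezone.utc),
)

# With boto3 installed, and Object Lock enabled on the bucket:
#   import boto3
#   boto3.client("s3").put_object_retention(**args)
```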
  • 35. Metadata security: AWS Glue Data Catalog. Apache Hive metastore compatible; track data evolution using schema versioning; integrates with Hive, Spark, Presto, Athena, and Amazon Redshift Spectrum; use crawlers to classify your data in one central, searchable list.
  • 36. Metadata security, key learnings: create and maintain a centralized data catalog; enable cross-account access; use IAM policies to control catalog access, similar to S3 bucket policies; encrypt metadata in the AWS Glue Data Catalog.
  • 37. AWS Glue Data Catalog resource policies: fine-grained access control to the Data Catalog using IAM policies; restrict what users can view and query.
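To make "restrict what users can view and query" concrete, here is a sketch of a Data Catalog resource policy that lets one account read the metadata of a single table. The region, account IDs, database, and table names are placeholders, and the list of read actions is an assumption about what a query engine typically needs.

```python
import json


def catalog_read_policy(region, owner_account, reader_account, database, table):
    """Return a Glue resource policy granting read-only access to one table."""
    base = f"arn:aws:glue:{region}:{owner_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{reader_account}:root"},
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetPartitions",
                ],
                # Table-level access also needs the enclosing catalog and database.
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/{table}",
                ],
            }
        ],
    }


# Placeholder region, accounts, database, and table.
glue_policy = catalog_read_policy(
    "us-east-1", "111122223333", "444455556666", "analytics", "events"
)
glue_policy_json = json.dumps(glue_policy)

# With boto3 installed and credentials configured, this could be attached as:
#   import boto3
#   boto3.client("glue").put_resource_policy(PolicyInJson=glue_policy_json)
```

Access to the catalog entry does not by itself grant access to the underlying objects; the Amazon S3 permissions still have to allow reading the data.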
  • 38. Typical steps of building a data lake: (1) set up storage; (2) move data; (3) cleanse, prep, and catalog data; (4) configure and enforce security and compliance policies; (5) make data available for analytics.
  • 39. Building data lakes can still take months.
  • 40. AWS Lake Formation: build a secure data lake in days. Identify, ingest, clean, and transform data; enforce security policies across multiple services; gain and manage new insights.
  • 41. How it works
  • 42. 3 simple steps to an AWS data lake. (1) Collect & centralize, using Amazon ingest & storage (Amazon S3, Amazon S3 Glacier, AWS DataSync, AWS Storage Gateway, AWS Snow family*, Kinesis): remove data silos; aggregate data; better agility; more data => insights. (2) Catalog & transform, using AWS Glue (crawl, discover, and catalog data; ETL data): know what you have; better data management; quicker time to results; higher quality data. (3) Analytics & insights, using Amazon analytics & ML (Athena, Amazon EMR, Amazon Redshift, Amazon SageMaker, Amazon Rekognition, Amazon EC2 + Amazon FSx for Lustre): extract value from data; analyze and report on data; apply machine learning; visualize and consume results.
  • 43. Thank you! John Mallory, johmallo@amazon.com