We will introduce key concepts for a data lake and present aspects related to its implementation. We will also discuss critical success factors, pitfalls to avoid, operational aspects, and insights on how AWS enables a server-less data lake architecture.
Speaker: Sebastien Menant, Solutions Architect, Amazon Web Services
4. Definition
“A data lake provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”
- Wikipedia
5. Characteristics of a Data Lake
• Collect Everything
• Dive in Anywhere
• Flexible Access
13. New Business Outcomes and Capabilities
• Enable New Insights in Your Data
• Cost Savings of Compute and Storage
• Use the Right Tool for the Job
• Increase Durability of Data
• Charge Storage Costs to Owner
• Streaming and Real-time Analysis
Retain all your data, for years!
18. Requirements for Storage
• Multi-year Scalable Storage Capability
• High Durability
• Store Raw Data from Any Input Sources
• Support for Any Data Type
• Low Cost
21. Recommendations #1
• S3 Buckets
• Close to Users and Compute
• Select Region for Regulatory Compliance
• Naming
• Human-readable Path
• Random Hash Prefix for Optimal Partitioning
• Format
• Structured vs Unstructured + Compression
• CSV, Parquet, ORC, JSON, XML, logs, etc
• GZIP for small files, Avro, LZO, Snappy
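The naming recommendation above (a human-readable path combined with a short hash prefix) can be sketched as follows. This is a minimal illustration, not an AWS API: the function name `build_object_key`, the `source/yyyy/mm/dd/filename` layout and the four-character prefix length are all assumptions made for the example.

```python
import hashlib
from datetime import date


def build_object_key(source: str, day: date, filename: str, prefix_len: int = 4) -> str:
    """Return an object key made of a short hash prefix and a readable path.

    The layout and prefix length are illustrative assumptions.
    """
    readable = f"{source}/{day:%Y/%m/%d}/{filename}"
    # The prefix is derived from the readable path, so the same inputs
    # always produce the same key.
    digest = hashlib.sha256(readable.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable}"


key = build_object_key("clickstream", date(2016, 4, 27), "events-0001.json.gz")
print(key)
```

The part of the key after the first `/` stays readable for people browsing the bucket, while the leading characters vary from object to object.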
22. Recommendations #2
• Optimise
• Store Everything
• Use Large Files with Split-able Format
• Lifecycle Policies for Cost-savings
• Tagging for Cost Allocation
• Security
• Encryption
• Bucket Policies, ACL, Tagging, CloudTrail
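As a rough sketch of the lifecycle recommendation above, the snippet below builds a lifecycle configuration as a plain dictionary, in the shape accepted by the S3 lifecycle API. The helper name `lifecycle_rules`, the `raw/` prefix, the 30 and 365 day thresholds and the bucket name in the comment are assumptions chosen for illustration.

```python
import json


def lifecycle_rules(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    Thresholds are illustrative; pick values that match your access patterns.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"tier-{rule_name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Move to infrequent access first, then to archive storage.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = lifecycle_rules("raw/")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the configuration could be
# applied like this (the bucket name is a placeholder):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=config)
```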
23. Requirements for Ingestion
• Batch File Support
• Traditional ETL
• Streaming Data
• Consumption of any Dataset as a Stream
• Low Latency Analytics
• Replay-ability from the Data Lake
• Server-less ETL Capabilities
24. Key Services for Ingestion
Amazon Kinesis Firehose
1. Easy to use with Agent
2. Automatic Elasticity
3. Near Real-time
4. Simultaneous Destinations
Amazon Kinesis Streams
1. Enables Custom Processing
2. Continuous Data Collection
3. Real-time
4. API Driven for Custom Apps
25. [Diagram: Data Sources; Amazon Kinesis stream across three Availability Zones; AWS Lambda, KCL App, EMR; S3, DynamoDB, Redshift, Elasticsearch]
27. Recommendations
• Reminder
• Added Complexity needs Business Justification
• Select the Right Tools
• Real-time Analysis: Apache Spark Streaming, Storm, Flink
• Firehose to Redshift for BI and Dashboards
• Tips
• AWS Lambda for ETL Transformation
• Persist Streams into S3
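One way to read the "AWS Lambda for ETL Transformation" tip is a Lambda function attached to a Firehose delivery stream as a record transformer. The sketch below follows the documented request and response shape for that integration (base64-encoded `data`, a `recordId`, and a `result` status); the payload fields `user` and `action` are hypothetical and stand in for whatever your records contain.

```python
import base64
import json


def handler(event, context):
    """Transform each incoming record, or flag it as failed if it cannot be parsed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: keep two fields and normalise case.
            transformed = {"user": payload["user"], "action": payload["action"].lower()}
            line = json.dumps(transformed) + "\n"  # newline-delimited JSON for S3
            data = base64.b64encode(line.encode("utf-8")).decode("ascii")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError, AttributeError):
            # Hand the original bytes back so the failure can be inspected later.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Because the handler is a plain function, it can be exercised locally with a hand-built event before it is deployed.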
31. Requirements for Catalogue and Search
• Metadata Index
• Automated Metadata Processing
• Discovery and Search
• Data Classification
• Server-less and Event-driven
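The server-less, event-driven requirement above could be met by a Lambda function that reacts to S3 object-created notifications and derives a catalogue entry for each object. The sketch below only builds the entries; the item fields (`size_bytes`, `extension` and so on) are an assumed schema, and the write to a metadata index is left as a comment.

```python
from urllib.parse import unquote_plus


def catalogue_items(event: dict) -> list:
    """Extract one metadata item per object from an S3 event notification."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        filename = key.rsplit("/", 1)[-1]
        items.append({
            "bucket": s3["bucket"]["name"],
            "key": key,
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag", ""),
            "event_time": record.get("eventTime", ""),
            "extension": filename.rsplit(".", 1)[-1].lower() if "." in filename else "",
        })
    return items


def handler(event, context):
    items = catalogue_items(event)
    # A deployed function would write each item to the metadata index here,
    # for example a DynamoDB table (table name and schema are up to you).
    return {"indexed": len(items)}
```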
41. Recommendations
• Start Early
• Security Needs Practice!
• Federate with your Corporate Directory
• Best Practice
• Use CloudTrail and CloudWatch
• Encrypt Where Possible
• Select Bucket Region for Regulatory Compliance
• Tips
• IAM Policies, S3 Versioning and MFA Delete
• Lambda for Data Masking
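The "Lambda for Data Masking" tip can be illustrated with a small masking routine of the kind such a function might call. This is a sketch under stated assumptions: the function name `mask_record`, the salted-hash approach, the 16-character token length and the e-mail pattern are all illustrative choices, not a prescribed method.

```python
import hashlib
import re

# Simple e-mail pattern for illustration; it is not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_record(record: dict, sensitive_fields: set, salt: str) -> dict:
    """Return a copy of `record` with sensitive values replaced.

    Fields listed in `sensitive_fields` become a salted hash token, so equal
    inputs still map to equal tokens. E-mail addresses found in other string
    fields are redacted.
    """
    masked = {}
    for field, value in record.items():
        if field in sensitive_fields:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[field] = digest[:16]
        elif isinstance(value, str):
            masked[field] = EMAIL_RE.sub("[email removed]", value)
        else:
            masked[field] = value
    return masked


row = {"customer_id": 42, "note": "contact a@b.com please", "amount": 9.5}
print(mask_record(row, {"customer_id"}, salt="example-salt"))
```

Hashing with a salt keeps masked values joinable across datasets without exposing the original identifier; the salt itself would need to be kept secret.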
42. API and UI
[Diagram: Storage and Ingestion, Catalogue and Search, Security, API and UI]
43. Requirements for API and UI
• Serve Data and Capabilities to Customers
• Programmatically
• Search Catalogue
• Run Compute
• Extend Access Control Management
• And… Use of Familiar Visualisation Tools
44. Key Services for API and UI
Amazon API Gateway
1. Performance at Any Scale
2. Create RESTful Frontend
3. Managed API Lifecycle
AWS Lambda
1. Enables Server-less API
2. Custom Logic for Services
3. Automatic Scaling
46. Recommendations
• Tips
• Go Server-less!
• Extend Existing AWS Services and Build Custom Logic
• Data Management, Processing and Transformations
• API Gateway for Data Access
• Serve the Data, Search and Compute via RESTful APIs
• Distribute a Custom SDK
• Extend the Solution
• Build Advanced Security Controls using Metadata Index
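To make the "serve search via RESTful APIs" idea concrete, here is a minimal sketch of a Lambda function behind API Gateway using the Lambda proxy integration, where the function receives `queryStringParameters` and returns `statusCode`, `headers` and a string `body`. The in-memory `CATALOGUE` list and the `tag` query parameter are hypothetical stand-ins for a real metadata or search index.

```python
import json

# Hypothetical in-memory catalogue; a real deployment would query the
# metadata or search index instead.
CATALOGUE = [
    {"key": "raw/clickstream/2016/04/27/events.json.gz", "tags": ["clickstream", "raw"]},
    {"key": "curated/sales/2016/q1.parquet", "tags": ["sales", "curated"]},
]

JSON_HEADERS = {"Content-Type": "application/json"}


def handler(event, context):
    """Answer GET /search?tag=... with the matching catalogue entries."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    tag = params.get("tag")
    if not tag:
        return {
            "statusCode": 400,
            "headers": JSON_HEADERS,
            "body": json.dumps({"error": "missing 'tag' query parameter"}),
        }
    matches = [entry for entry in CATALOGUE if tag in entry["tags"]]
    return {
        "statusCode": 200,
        "headers": JSON_HEADERS,
        "body": json.dumps({"results": matches}),
    }
```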
47. The Whole Picture…
[Diagram: Storage and Ingestion, Catalogue and Search, Security, API and UI]
48. [Architecture diagram: Storage and Ingestion (Amazon EMR, Amazon RDS, Amazon S3, Amazon Glacier, Amazon Kinesis); Security (AWS KMS, AWS IAM); API and UI (Amazon API Gateway, AWS Lambda); Users; Amazon Redshift; Catalogue and Search (AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch)]
49. A Data Lake is…
• Foundation of Data Storage and Streaming Data
• Metadata index to help Categorise and Govern
• Search Index to Enable Data Discovery
• Robust Set of Security Controls
• Governance Through Technology Not Policy
• Interface to Expose Data and Capabilities to Users
55. Next Steps
• How to Get Started
• AWS Documentation
• Getting Started Guide
• AWS Training & Certification
• Big Data on AWS
• AWS Partner Network
• AWS Professional Services
• Big Data Specialists
56. AWS Training & Certification
• Intro Videos & Labs: Free videos and labs to help you learn to work with 30+ AWS services, in minutes!
• Training Classes: In-person and online courses to build technical skills, taught by accredited AWS instructors
• Online Labs: Practice working with AWS services in a live environment and learn how related services work together
• AWS Certification: Validate technical skills and expertise, identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training
57. Your Training Next Steps:
✓ Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer
✓ Register & attend AWS instructor led training
✓ Get Certified
AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag