A Data Lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Because data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know beforehand what questions you want to ask of your data. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
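To make the "store as-is, decide the questions later" idea concrete, here is a minimal schema-on-read sketch. It is an illustration only: the sample events, the field names and the `read_with_schema` helper are assumptions made up for this example, not part of the session material.

```python
import json

# Raw events stored "as-is": records do not have to share the same fields.
# (Hypothetical sample data, for illustration only.)
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 19.5, "ts": 2}',
    '{"device": "sensor-7", "temperature": 21.3, "ts": 3}',
]


def read_with_schema(raw_lines, fields):
    """Project each raw record onto the fields a given question needs."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields a record does not have become None; nothing is rejected
        # at write time because no schema was enforced then.
        yield {name: record.get(name) for name in fields}


# Two different questions, two different schemas, the same stored data.
purchases = [
    row for row in read_with_schema(RAW_EVENTS, ["user", "amount"])
    if row["amount"] is not None
]
readings = [
    row for row in read_with_schema(RAW_EVENTS, ["device", "temperature"])
    if row["device"] is not None
]
```

The schema lives in the reader rather than in the store, so a new question only needs a new field list, not a reload of the data.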
3. Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
4. Expanding access requirements
Data consumers: Data scientists, Data analysts, Business users, Engagement platforms, Automation / events
1. More personas need access to data, through appropriate tools
2. More systems need to link to data for decision and process automation
3. Users need to be able to find information, and access it securely
5. Exponential growth of business data
1. Data must be captured from diverse sources at speed and scale
2. Data needs to be pulled together, breaking down traditional silos
3. Benefits need to far outweigh the costs of collection and analysis
Data sources: Transactions, ERP, Connected devices, Social media, Web logs / cookies
6. Modern data architecture
Insights to enhance business applications, new digital services
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
7. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
8. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Services:
• Data Warehouse: Amazon Redshift
• Legacy Apps: Amazon RDS
• Schemaless: Amazon Elasticsearch Service
• Direct Query: Amazon Athena
• Near-Zero Latency: Amazon DynamoDB
• Semi/Unstructured: Amazon EMR
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
9. Characteristics of a Data Lake
• Collect Anything
• Dive in Anywhere
• Flexible Access
• Future Proof
10. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Services: Amazon Redshift (Data Warehouse), Amazon RDS (Legacy Apps), Amazon Elasticsearch Service (Schemaless), Amazon Athena (Direct Query), Amazon DynamoDB (Near-Zero Latency), Amazon EMR (Semi/Unstructured)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Amazon S3:
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• Scalable: store as much as you need; scale storage and compute independently
• Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB
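As a rough worked example of what the "11 9s" and "99.99%" design targets on this slide mean in practice, the arithmetic can be sketched as below. This treats the durability figure as an independent annual loss probability per object, which is a simplifying assumption for illustration; the function names and the object count are likewise made up for this sketch.

```python
from fractions import Fraction

# Design targets quoted on the slide, kept as exact fractions so that the
# arithmetic is not distorted by floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)  # eleven 9s
AVAILABILITY = Fraction(9_999, 10_000)                   # 99.99%


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost in a year (simplified model)."""
    return object_count * (1 - DURABILITY)


def unavailable_minutes_per_year(days_per_year=365):
    """Minutes per year outside the availability design target."""
    return float((1 - AVAILABILITY) * days_per_year * 24 * 60)


# Hypothetical lake holding ten million objects.
loss = expected_objects_lost_per_year(10_000_000)
years_per_lost_object = 1 / loss
downtime = unavailable_minutes_per_year()
```

Under these assumptions, ten million stored objects correspond to an expected loss of one object roughly every 10,000 years, and 99.99% availability leaves a budget of about 52.6 minutes of unavailability per year.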
11. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Staged Data (Data Lake): Amazon S3
Services: Amazon Redshift (Data Warehouse), Amazon RDS (Legacy Apps), Amazon Elasticsearch Service (Schemaless), Amazon Athena (Direct Query), Amazon DynamoDB (Near-Zero Latency), Amazon EMR (Semi/Unstructured)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
13. Important Components of a Data Lake
• Ingest & Store
• Catalogue & Search
• Protect & Secure
• Access & User Interface
14. Data Ingestion into S3
• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• AWS Storage Gateway
• S3 Transfer Acceleration
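Whichever ingestion path is used, objects landing in S3 need a key layout that later query tools can navigate. The sketch below shows one commonly used convention, date-partitioned keys in the `name=value` style; the layout, the `raw` prefix, the `build_object_key` helper and the sample file name are assumptions chosen for illustration, not something prescribed by the session.

```python
from datetime import datetime, timezone


def build_object_key(source, event_time, file_name, prefix="raw"):
    """Build a date-partitioned object key for an ingested file.

    `event_time` is expected to be a timezone-aware datetime; it is
    normalised to UTC so that partitions do not depend on the sender's
    local time zone.
    """
    t = event_time.astimezone(timezone.utc)
    return (
        f"{prefix}/source={source}/"
        f"year={t.year:04d}/month={t.month:02d}/day={t.day:02d}/"
        f"{file_name}"
    )


# Hypothetical web-log file captured late on 1 June 2017 (UTC).
key = build_object_key(
    "weblogs",
    datetime(2017, 6, 1, 23, 30, tzinfo=timezone.utc),
    "part-0001.json",
)
```

Grouping objects under source and date prefixes lets a query engine read only the partitions a question touches instead of scanning the whole bucket.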
15. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Staged Data (Data Lake): Amazon S3
Services: Amazon Redshift (Data Warehouse), Amazon RDS (Legacy Apps), Amazon Elasticsearch Service (Schemaless), Amazon Athena (Direct Query), Amazon DynamoDB (Near-Zero Latency), Amazon EMR (Semi/Unstructured)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
16. Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
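The catalogue responsibilities listed above (metadata such as ownership and lineage, datasets as collections of prefixes, discovery, and a lookup an entitlements service could call) can be sketched in memory as follows. This is a toy illustration under stated assumptions: the class names, methods, bucket paths and team names are all hypothetical, and a real catalogue would sit on a managed metadata store rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    owner: str                                        # ownership metadata
    prefixes: list                                    # dataset = collection of prefixes
    derived_from: list = field(default_factory=list)  # lineage (upstream datasets)
    tags: set = field(default_factory=set)            # used for discovery


class DataCatalogue:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, tag):
        """Data discovery: names of all datasets carrying a tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def lineage(self, name):
        """All upstream datasets of `name`, nearest first."""
        seen, queue = [], list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen

    def prefixes_for(self, name):
        """Lookup an entitlements service could use to scope access."""
        return list(self._entries[name].prefixes)


# Hypothetical entries; bucket and team names are made up.
catalogue = DataCatalogue()
catalogue.register(DatasetEntry(
    "customers_raw", "crm-team",
    ["s3://example-lake/raw/crm/customers/"], tags={"customer", "pii"}))
catalogue.register(DatasetEntry(
    "customers_clean", "data-eng",
    ["s3://example-lake/processed/customers/"],
    derived_from=["customers_raw"], tags={"customer"}))
catalogue.register(DatasetEntry(
    "churn_scores", "data-science",
    ["s3://example-lake/processed/ml/churn/"],
    derived_from=["customers_clean"], tags={"ml"}))
```

Keeping prefixes behind a dataset name is what makes the catalogue an abstraction layer: consumers ask for "customers_clean" and never need to know where in the bucket it lives.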
18. AWS Glue
• Managed Transform Engine
• Job Scheduler
• Data Catalog
• Built on Apache Spark
• Integrated with S3, RDS, Redshift & any JDBC-compliant data store
19. Implement the right cloud security controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
• Pre-signed S3 URLs
Encryption
• SSL endpoints
• Server-Side Encryption (SSE-S3)
• S3 Server-Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption
Audit & Compliance
• Bucket access logs
• Lifecycle management policies
• Versioning & MFA Delete
• Certifications: HIPAA, PCI, SOC 1/2/3, etc.
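Two of the controls above, bucket policies and encryption, are often combined in a single policy document. The sketch below builds such a policy as plain JSON: it denies requests that do not arrive over SSL and denies uploads that do not request SSE-KMS. The bucket name and the `build_bucket_policy` helper are hypothetical, and the statements are a starting point under those assumptions rather than a complete security configuration.

```python
import json


def build_bucket_policy(bucket_name):
    """Return a bucket policy document as a Python dictionary."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that is not made over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not ask for SSE-KMS encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


# "example-data-lake" is a made-up bucket name.
policy_json = json.dumps(build_bucket_policy("example-data-lake"), indent=2)
```

The resulting JSON string is what would be attached to the bucket; generating it from code keeps the policy reviewable and repeatable across buckets.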
20. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Staged Data (Data Lake): Amazon S3
Services: Amazon Redshift (Data Warehouse), Amazon RDS (Legacy Apps), Amazon Elasticsearch Service (Schemaless), Amazon Athena (Direct Query), Amazon DynamoDB (Near-Zero Latency), Amazon EMR (Semi/Unstructured)
Security and monitoring: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
21. Modern data architecture
Diagram layers: Data sources, Ingest, Speed (Real-time), Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Raw Data: Amazon S3
Staged Data (Data Lake): Amazon S3
Processing:
• ETL: Amazon EMR
• Advanced Analytics: MLlib
• Event Capture: Amazon Kinesis
• Stream Analysis: Amazon EMR
Services: Amazon Redshift (Data Warehouse), Amazon RDS (Legacy Apps), Amazon Elasticsearch Service (Schemaless), Amazon Athena (Direct Query), Amazon DynamoDB (Near-Zero Latency), Amazon EMR (Semi/Unstructured)
Security and monitoring: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
28. Google Analytics (GA)
• Free and easy
• Excellent for initial
• Good learning materials
29. Google Analytics (GA)
• Free version has latency and accuracy issues
• GA 360 (Premium) + BigQuery are expensive
• Not flexible enough
30. Our needs
• Large data volume
• Raw data for Machine Learning
• Flexible for further processing
• Low latency
35. Experience on AWS
• Complete and Integrated
• Quick: 2 person-weeks for the first version
• Easy to scale
• Minimal maintenance cost
36. Future
• More serverless in future
• S3 as data lake
• Click events, system logs, etc.
• Raw and processed data (such as ML results)
• Hot on disk, cold on S3
• Explore AWS Machine Learning
Diagram components: AWS Lambda, API Gateway, Kinesis Firehose, S3, Redshift, Redshift Spectrum, Athena, EMR (SparkML), P2 (Deep Learning AMI), QuickSight
Diagram labels: Serverless, Raw data, Processed data, Hot data, Cold data, Direct Query, Machine learning, ML, Visualization
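The "hot on disk, cold on S3" idea from this slide can be sketched as a simple age-based routing rule: recent partitions stay on local warehouse disk, older ones are left in S3 and queried in place. The 90-day window, the function names and the sample dates are assumptions invented for this illustration; the session does not state an actual threshold.

```python
from datetime import date

HOT_WINDOW_DAYS = 90  # hypothetical cut-off, not taken from the session


def storage_tier(partition_date, today, hot_window_days=HOT_WINDOW_DAYS):
    """Classify a daily partition as 'hot' (on disk) or 'cold' (on S3)."""
    age_in_days = (today - partition_date).days
    return "hot" if age_in_days <= hot_window_days else "cold"


def plan_partitions(partition_dates, today):
    """Group partitions by tier, oldest first within each tier."""
    plan = {"hot": [], "cold": []}
    for d in sorted(partition_dates):
        plan[storage_tier(d, today)].append(d.isoformat())
    return plan


# Made-up partitions evaluated on a made-up "today".
today = date(2017, 6, 30)
plan = plan_partitions(
    [date(2017, 6, 1), date(2017, 1, 15), date(2017, 4, 15)],
    today,
)
```

A rule like this keeps the most frequently queried data on fast storage while the long tail stays in the lake, which is the cost trade-off the slide points at.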