Accelerated Data Lakes Deep Dive Webinar - Paul Macey
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Paul Macey
Specialist Solution Architect, Big Data & Analytics
AWS Public Sector
Accelerated Data Lakes
Deep Dive Webinar
Agenda
Organisational data challenges
Accelerated data lake
Architecture
Onboarding
Demonstration
Wrap up
Questions
Organisational data challenges
• Silos
• Governance
• Scalability
• Security
Accelerated Data Lake
• Security: Day 0
• Data governance & metadata
• Data centralised & scalable
• SQL & BI ready
• Analytical & Data Science foundation
• Repeatable & extensible
Accelerated Data Lake
Available today @ GitHub
https://github.com/aws-samples/accelerated-data-lake
Includes
• Data lake pipeline (CloudFormation)
• Instructions
• Data configuration, security and metadata templates
Delivery
• Professional services
• AWS partners
Accelerated Data Lake
Flywheel of success
1) Start small
2) Establish a repeatable workflow
3) Deliver benefits
4) Improve and iterate
5) Repeat
Architecture
Accelerated Data Lake
High Level Data Flow
Process Initiation
• Time based or event driven
Lambda Functions
• Validation
• Apply Security
• Attach Metadata
• Catalog object
• File movement
• Alerts
Data Lake Storage (S3 buckets)
• Staging
• Raw
• Curated
• Gold
• Data discovery
• Logs
Data Lake Metadata
• Data Catalog
Enabling Analytics and Insights
• Big Data, Querying, ETL & ML
• Database / BI
Accelerated Data Lake
File Processing Pipeline - S3 lambda event example
1) A file arrives in the S3 Staging bucket.
2) A Lambda function is triggered when the object is created.
3) The Lambda function passes the S3 event data payload to an AWS Step Functions workflow.
4) The step function moves through a repeatable data file onboarding process:
• Get the file specification
• Validate the data
• Add security tags to the S3 object
• Add metadata to the S3 object
• Add object metadata to DynamoDB
• Index metadata into Elasticsearch
• Move the file from the Staging bucket to the Raw bucket
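The hand-off from the Lambda function to the step function could look roughly like the sketch below. It is not the code shipped in the GitHub repository: the `STATE_MACHINE_ARN` environment variable, the `build_execution_input` helper and the shape of the execution input are assumptions made for illustration. The only AWS call used is the boto3 Step Functions client's `start_execution`.

```python
import json
import os
import uuid
from urllib.parse import unquote_plus


def build_execution_input(event):
    """Collect bucket and key for every record in an S3 event notification."""
    files = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        files.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return {"files": files}


def handler(event, context, sfn_client=None, state_machine_arn=None):
    """Lambda entry point: start one onboarding execution per S3 event."""
    if sfn_client is None:
        import boto3  # imported lazily so the module loads without the AWS SDK
        sfn_client = boto3.client("stepfunctions")
    # STATE_MACHINE_ARN is a hypothetical environment variable name.
    arn = state_machine_arn or os.environ["STATE_MACHINE_ARN"]
    response = sfn_client.start_execution(
        stateMachineArn=arn,
        name=f"onboard-{uuid.uuid4()}",
        input=json.dumps(build_execution_input(event)),
    )
    return {"executionArn": response["executionArn"]}
```

Passing the client as an optional argument keeps the handler compatible with the standard `(event, context)` Lambda signature while letting it be exercised locally with a stub.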
Onboarding
Accelerated Data Lake
Data onboarding process
1) Create a new file specification entry in a DynamoDB table
The table in the data lake solution is called “Data Sources”
2) Create a folder structure within the S3 staging bucket for the new data type
This keeps everything in S3 organised and also optimised for later use
File Specification Settings
• File Settings
• S3 Object Tags
• Simple Metadata
• Extended Metadata
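A file specification with these four groups of settings might be represented as in the sketch below. Every attribute name here (`fileNamePattern`, `fileSettings`, `tags`, `metadata`, `extendedMetadata`) is hypothetical and is not taken from the repository's templates; the tag and metadata values are borrowed from the examples elsewhere in this deck. The helper functions only illustrate how a pipeline step could look up and sanity-check a specification.

```python
import re

# The four groups of settings named on the slide (attribute names are assumed).
REQUIRED_SECTIONS = ("fileSettings", "tags", "metadata", "extendedMetadata")

EXAMPLE_SPEC = {
    # Hypothetical pattern that ties staged object keys to this specification.
    "fileNamePattern": r"^Wireless/CMX/\d{4}/\d{2}/.+\.csv$",
    "fileSettings": {"fileFormat": "csv", "delimiter": ","},
    "tags": {"Classification": "Internal", "PII": "False"},
    "metadata": {
        "Data Owner": "User Interaction Team",
        "Data Source": "prod_int_extraction_dw",
    },
    "extendedMetadata": {"description": "Illustrative entry only"},
}


def missing_sections(spec):
    """Return the required sections that a specification does not define."""
    return [name for name in REQUIRED_SECTIONS if name not in spec]


def find_spec(object_key, specs):
    """Return the first specification whose pattern matches the object key."""
    for spec in specs:
        if re.match(spec["fileNamePattern"], object_key):
            return spec
    return None
```

For example, `find_spec("Wireless/CMX/2019/01/data.csv", [EXAMPLE_SPEC])` returns the example entry, while a key that matches no pattern returns `None` and could be routed to an alert.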
Accelerated Data Lake
Example S3 storage structure
Folder levels: Source Env / Schema, then Description / Table / View
Example names: CMX, API-Raw, DB_Prod_HR, Staff
Buckets: dev-staging, dev-raw, dev-curated, dev-gold, dev-validation, dev-data discovery (sandpit), dev-logging
Diagram labels: Data Flow; Validated & Approved
Example folder layout:
Wireless/CMX/2019/01
Wireless/API-Raw/2019/01
Example queries:
select count(userid) from CMX where year = 2019
select CMX.userid
from CMX
inner join "API-Raw"
on CMX.userid = "API-Raw".userid
where CMX.year = 2019
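The folder layout above (domain, source, year, month) can be generated and parsed with a couple of small helpers. This is a sketch under the assumption that keys always follow the four-level layout shown on the slide; the function names are made up for illustration.

```python
from datetime import date


def staging_key(domain, source, file_name, received):
    """Build an object key such as Wireless/CMX/2019/01/data.csv."""
    return f"{domain}/{source}/{received:%Y}/{received:%m}/{file_name}"


def partition_of(object_key):
    """Split a key built by staging_key back into its folder levels."""
    domain, source, year, month, file_name = object_key.split("/", 4)
    return {
        "domain": domain,
        "source": source,
        "year": int(year),
        "month": int(month),
        "file_name": file_name,
    }
```

Keeping year and month as separate folder levels is what lets a query such as `where year = 2019` scan only the matching prefixes once those levels are registered as partitions in the query engine.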
S3 Object Tags and Metadata
Example S3 objects: Image1001.jpg (jpeg image data), data.csv
S3 object tags
Key | Value
Classification | Internal
PII | False
Use Case | Interaction Extracts
Team | Analytics
S3 metadata
Key | Value
Policy | facility_internal
MD5 | ab3116cded134
Data Owner | User Interaction Team
Data Source | prod_int_extraction_dw
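Attaching tags and metadata like these to an existing object could be sketched as follows. The helper names are hypothetical and this is not the repository's implementation. It relies on two boto3 S3 client calls, `put_object_tagging` and `copy_object` with `MetadataDirective="REPLACE"` (copying an object onto itself is the usual way to change user metadata, for objects up to 5 GB). Metadata keys are normalised to lowercase, hyphenated names because they travel as HTTP headers; that normalisation rule is an assumption of this sketch.

```python
import re

MAX_OBJECT_TAGS = 10  # Amazon S3 allows at most 10 tags per object


def to_tag_set(tags):
    """Convert a plain dict into the TagSet structure S3 expects."""
    if len(tags) > MAX_OBJECT_TAGS:
        raise ValueError(f"S3 objects accept at most {MAX_OBJECT_TAGS} tags")
    return [{"Key": key, "Value": str(value)} for key, value in tags.items()]


def to_metadata(metadata):
    """Normalise keys such as 'Data Owner' to 'data-owner' (assumed convention)."""
    return {
        re.sub(r"[^a-z0-9]+", "-", key.lower()).strip("-"): str(value)
        for key, value in metadata.items()
    }


def apply_tags_and_metadata(s3_client, bucket, key, tags, metadata):
    """Replace the user metadata, then set the object tags."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=to_metadata(metadata),
        MetadataDirective="REPLACE",
    )
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": to_tag_set(tags)},
    )
```

In a real pipeline `s3_client` would be `boto3.client("s3")`; passing it in keeps the helper easy to exercise with a stub.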
Demonstration
IoT Simulator
https://aws.amazon.com/solutions/iot-device-simulator/
Demonstration
Wrap up
• Security: Day 0
• Data governance & metadata
• Data centralised & scalable
• SQL & BI ready
• Analytical & Data Science foundation
• Repeatable & extensible
The accelerated data lake solution:
• Can enable your data
• Supports data security and data governance
• Can grow and scale in harmony with your organisation
• Gives you access to AWS’s analytics, ML and AI ecosystem
References
Amazon S3 security
https://aws.amazon.com/s3/faqs/#Security
https://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
AWS Accelerated Data Lake (Git)
https://github.com/aws-samples/accelerated-data-lake
AWS Accelerated Data Lake Blog (part 1 & 2)
https://aws.amazon.com/blogs/publicsector/from-data-silos-to-data-domains-bringing-common-data-together
https://aws.amazon.com/blogs/publicsector/securing-your-data-by-knowing-your-data
Our data lake story: How Woot.com built a serverless data lake on AWS
https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws
Additional slides
Data Classifications
Data Classification: High Level Control Recommendations
1 - Sensitive
• Must obtain data manager approval to access the data.
• Data encryption is mandatory for transmission and storage.
• Access log reporting is required and is sent to the data manager and data sponsor for review on a regular basis.
• Failed access events trigger alerts for follow-up.
• Must obtain data sponsor, Legal and Security approval before sharing data with external parties or hosting in cloud services.
• Review and update the data classifications at least every 6 months.
2 - Restricted
• Must obtain data manager approval to access the data.
• Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and preferred for storage in dedicated environments.
• Access log is required and is sent to the data manager and data sponsor for review on a less regular basis.
• Failed access events trigger alerts for follow-up.
• Must obtain data manager, Legal and Security approval before sharing data with external parties or hosting in cloud services.
• Review and update the data classifications at least every 6 months.
3 - Internal
• Data SME may approve group/role access (member change without data manager approval).
• Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and optional for storage in dedicated environments.
• Access log is required.
• Review and update the data classifications at least annually.
4 - Public
• Apply baseline security controls.
• Access granted to anyone who asks.
• Data encryption is not required for storage or transmission.
• Review and update the data classifications at least annually.
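Two of the controls above, storage encryption and review frequency, lend themselves to a small lookup. The sketch below encodes them as data so that a pipeline or audit script could check them; the dictionary names, the `cloud_or_shared` / `dedicated` keys and the month-counting rule are assumptions made for illustration, not part of the solution.

```python
from datetime import date

# Review interval in months, taken from the recommendations above.
REVIEW_INTERVAL_MONTHS = {
    "Sensitive": 6,
    "Restricted": 6,
    "Internal": 12,
    "Public": 12,
}

# Storage encryption requirement per classification and environment type.
STORAGE_ENCRYPTION = {
    "Sensitive": {"cloud_or_shared": "mandatory", "dedicated": "mandatory"},
    "Restricted": {"cloud_or_shared": "mandatory", "dedicated": "preferred"},
    "Internal": {"cloud_or_shared": "mandatory", "dedicated": "optional"},
    "Public": {"cloud_or_shared": "not required", "dedicated": "not required"},
}


def months_between(start, end):
    """Whole calendar months elapsed between two dates."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def review_due(classification, last_review, today):
    """True once the review interval for the classification has elapsed."""
    return months_between(last_review, today) >= REVIEW_INTERVAL_MONTHS[classification]
```

For instance, `STORAGE_ENCRYPTION["Restricted"]["dedicated"]` yields `"preferred"`, matching the Restricted row above.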
Amazon S3 security and data governance
Bucket level security
Object level security
Object Tagging
• Access to Amazon S3 buckets and S3 objects is intrinsically connected to IAM policies.
• Object tagging enables granular control over access to objects.
• The object tagging key/values are stored in DynamoDB.
Figure panels: Tag values in Amazon DynamoDB; IAM Policy; Tag entries on an S3 object
Handling IT security risk in an AWS data lake
IAM Policies and S3 object tagging
Data Domain | Data Classification | Policy | Example Tags | Restricted to
Wireless | Public | Wireless_Public | Classification=Public, Domain=Wireless | (not specified)
Wireless | Internal | Wireless_Internal | Classification=Internal, Domain=Wireless | (not specified)
Wireless | Restricted | Wireless_Restricted | Classification=Restricted, Domain=Wireless | Users, Groups or Services with data owner approval
Wireless | Sensitive | Wireless_Sensitive | Classification=Sensitive, Domain=Wireless | Users, Groups or Services with data owner approval
Within IAM, we can create a policy where a user can access an S3 object provided the object has the object tags (key-value pairs):
Classification : Restricted
Domain : Wireless
Diagram: example IAM policies and example S3 tags for the Wireless domain (CMX, Cisco, Netgear, Other) across the classifications Public, Internal, Restricted and Sensitive
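An IAM policy of the kind described above can be expressed with the `s3:ExistingObjectTag/<key>` condition key, which allows an action only when the object already carries the given tag. The sketch below builds such a policy document as a Python dict; the bucket name and the helper name are hypothetical, and a production policy would normally be scoped further.

```python
import json


def tag_based_read_policy(bucket, required_tags):
    """Allow s3:GetObject only on objects carrying all of the required tags."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsWithMatchingTags",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Several keys under one StringEquals must all match (logical AND).
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{key}": value
                        for key, value in required_tags.items()
                    }
                },
            }
        ],
    }


# Hypothetical bucket name; tags taken from the Wireless_Restricted example.
wireless_restricted = tag_based_read_policy(
    "dev-raw", {"Classification": "Restricted", "Domain": "Wireless"}
)
policy_json = json.dumps(wireless_restricted, indent=2)
```

The resulting JSON in `policy_json` could be attached to a user, group or role as a managed or inline policy.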
Policy examples
Examples of policy types that define how objects can be accessed.
Policy Type | Description
Content Policies | What the data contains and its classification
Service Policies | What state the data is in
Development and System Policies | What environment the data is located in and what actions can be performed on the data
Metadata Policies | Who has what access to object metadata
Tagging Policies | Who has what access to object tagging
Content Policy examples
Policy | Data Domain | Example Tags | Data Classification | Restricted to
Wireless_Public | Wireless | Classification=Public, Domain=Wireless | Public | (not specified)
Wireless_Internal | Wireless | Classification=Internal, Domain=Wireless | Internal | (not specified)
Wireless_Restricted | Wireless | Classification=Restricted, Domain=Wireless | Restricted | Users, Groups or Services with data owner approval
Wireless_Highly_Restricted | Wireless | Classification=Highly_Restricted, Domain=Wireless | Highly_Restricted | Users, Groups or Services with data owner approval
Facility_Public | Facility | Classification=Public, Domain=Facility | Public | (not specified)
Facility_Internal | Facility | Classification=Internal, Domain=Facility | Internal | (not specified)
Facility_Restricted | Facility | Classification=Restricted, Domain=Facility | Restricted | Users, Groups or Services with data owner approval
Facility_Highly_Restricted | Facility | Classification=Highly_Restricted, Domain=Facility | Highly_Restricted | Users, Groups or Services with data owner approval
Service Policy examples
Policy | Description | Tagging Rules | Restricted to
S3_Raw | Landing location of data files from source systems | Bucket=Raw | System, service accounts and platform engineers
S3_Staging | Location of data files after object tagging and metadata applied | Bucket=Staging | Chief data office and data science teams
S3_Curated | Optimised versions of data files | Bucket=Curated | Chief data office, data science and analytics teams
S3_Gold | Business consumable versions of data and one-off data sets | Bucket=Gold | Chief data office, data science, analytics teams and business users as required
S3_Validation | Data files generated from insights or aggregations that need to be validated and have security and metadata applied | Bucket=Validation | Chief data office / Information system admins
S3_Data_Discovery | Data files that have been taken from other S3 locations or external sources that are blended with other datasets or used for data science and analytics | Bucket=Data_Discovery | Data science and analytics teams
S3_Logs | CloudTrail and CloudWatch logs | Bucket=Logs | ISO, system, service accounts and platform engineers. Other teams on request
Development and System Policy examples
Policy | Description | Tagging Rules | Restricted to
Dev_Read | Read only access to Development Environment | Environment=Dev, Access=Read | System, service accounts and platform engineers
Test_Read | Read only access to Test Environment | Environment=Test, Access=Read | System, service accounts and platform engineers
Prod_Read | Read only access to Production Environment | Environment=Prod, Access=Read | (not specified)
Dev_Write | Write access to Development Environment | Environment=Dev, Access=Write | System, service accounts and platform engineers
Test_Write | Write access to Test Environment | Environment=Test, Access=Write | System, service accounts and platform engineers
Prod_Write | Write access to Production Environment | Environment=Prod, Access=Write | System accounts
Metadata Policy examples
Policy | Description | Tagging Rules | Restricted to
Metadata_Read | Read only access to metadata | Domain=Metadata, Access=Read | (not specified)
Metadata_Write | Write access to metadata | Domain=Metadata, Access=Write | Chief data office / Information System Administrators, Data Owners, Data Stewards
Tagging Policy examples
Policy | Description | Tagging Rules | Restricted to
Tagging_Read | Read only access | Domain=Tags, Access=Read | (not specified)
Tagging_Write | Write access | Domain=Tags, Access=Write | ISO, Chief data office / Information System Administrators, Data Owners, Data Stewards
IAM policy examples
Example | Required Policies
An analytics user requests read only access to Public Facility data in Raw, Curated and Gold buckets | Prod_Read; S3_Raw, S3_Curated or S3_Gold; Facility_Public
A data science team member requests write access to the data discovery environment and read access to the curated bucket | Prod_Write with S3_Data_Discovery, or Prod_Read with S3_Curated
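One way to reason about how these policies combine is to treat each request as a set of attributes and each grant as a group of policies whose tagging rules must all match. The sketch below is only an illustrative model of the two examples above; it is not how IAM evaluates policies, and the `POLICY_TAG_RULES` table, grant structure and function names are assumptions.

```python
# Tagging rules copied from the policy tables in this deck.
POLICY_TAG_RULES = {
    "Prod_Read": {"Environment": "Prod", "Access": "Read"},
    "Prod_Write": {"Environment": "Prod", "Access": "Write"},
    "S3_Raw": {"Bucket": "Raw"},
    "S3_Curated": {"Bucket": "Curated"},
    "S3_Gold": {"Bucket": "Gold"},
    "S3_Data_Discovery": {"Bucket": "Data_Discovery"},
    "Facility_Public": {"Classification": "Public", "Domain": "Facility"},
}


def rules_match(rules, request):
    """True when every tagging rule equals the matching request attribute."""
    return all(request.get(key) == value for key, value in rules.items())


def is_allowed(grants, request):
    """A request is allowed if all policies of at least one grant match it."""
    return any(
        all(rules_match(POLICY_TAG_RULES[name], request) for name in grant)
        for grant in grants
    )


# Example 1: one grant per bucket, because the buckets are alternatives.
ANALYTICS_USER = [
    ["Prod_Read", bucket_policy, "Facility_Public"]
    for bucket_policy in ("S3_Raw", "S3_Curated", "S3_Gold")
]

# Example 2: write to data discovery, or read from curated.
DATA_SCIENTIST = [
    ["Prod_Write", "S3_Data_Discovery"],
    ["Prod_Read", "S3_Curated"],
]
```

Grouping policies into grants keeps the "or" in the second example intact: the data scientist can write to data discovery and read from curated, but the combination of Prod_Write with S3_Curated is never formed.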