© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Paul Macey
Specialist Solution Architect, Big Data & Analytics
AWS Public Sector
Accelerated Data Lakes
Deep Dive Webinar
Agenda
Organisational data challenges
Accelerated data lake
Architecture
Onboarding
Demonstration
Wrap up
Questions
Organisational data challenges
Silos
Governance
Scalability
Security
Accelerated Data Lake
Security Day 0
Data governance & metadata
Data centralised & scalable
SQL & BI ready
Analytical & Data Science foundation
Repeatable & extensible
Accelerated Data Lake
Available today @ GitHub
https://github.com/aws-samples/accelerated-data-lake
Includes
• Data lake pipeline (CloudFormation)
• Instructions
• Data configuration, security and metadata templates
Delivery
• Professional services
• AWS partners
Accelerated Data Lake
Flywheel of success
• Start small
• Establish a repeatable workflow
• Deliver benefits
• Improve and iterate
• Repeat
Architecture
Accelerated Data Lake
High Level Data Flow
Process Initiation: time based or event driven
Lambda Functions
• Validation
• Apply Security
• Attach Metadata
• Catalog object
• File movement
• Alerts
Data Lake Storage: S3 buckets
• Staging
• Raw
• Curated
• Gold
• Data discovery
• Logs
Metadata: Data Catalog
Data Lake: Enabling Analytics and Insights
Analytics & Insights
• Big Data, Querying
• ETL & ML
• Database / BI
Accelerated Data Lake
File Processing Pipeline - S3 Lambda event example
1) A file arrives in the S3 Staging bucket
2) A Lambda function is triggered when the object is created
3) The Lambda function passes the S3 event data payload to an AWS Step Functions state machine
4) The state machine moves through a repeatable data file onboarding process:
• Get the file specification
• Validate data
• Add security tags to S3 object
• Add metadata to S3 object
• Add object metadata to DynamoDB
• Index metadata into Elasticsearch
• Move file from the Staging bucket to the Raw bucket
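The hand-off from the Lambda function to the state machine can be sketched roughly as follows. This is a minimal illustration, not the code shipped in the GitHub repository: the payload field names, the sample event and STATE_MACHINE_ARN are assumptions made for the example, and the call that starts the execution is left as a comment so the sketch runs without AWS credentials.

```python
import json
from urllib.parse import unquote_plus


def build_execution_input(s3_event: dict) -> str:
    """Turn an S3 'object created' event into JSON input for the state machine."""
    record = s3_event["Records"][0]
    payload = {
        "bucket": record["s3"]["bucket"]["name"],
        # S3 URL-encodes object keys in event notifications (spaces arrive as '+').
        "key": unquote_plus(record["s3"]["object"]["key"]),
        "size": record["s3"]["object"].get("size", 0),
        "eventTime": record["eventTime"],
    }
    return json.dumps(payload)


def lambda_handler(event, context):
    execution_input = build_execution_input(event)
    # In a deployed function the execution would be started here, for example:
    #   boto3.client("stepfunctions").start_execution(
    #       stateMachineArn=STATE_MACHINE_ARN, input=execution_input)
    # STATE_MACHINE_ARN is a hypothetical configuration value.
    return execution_input


# Hypothetical event, trimmed to the fields the sketch reads.
sample_event = {
    "Records": [{
        "eventTime": "2019-01-15T10:00:00.000Z",
        "s3": {
            "bucket": {"name": "dev-staging"},
            "object": {"key": "Wireless/CMX/2019/01/cmx+extract.csv", "size": 1024},
        },
    }]
}
print(lambda_handler(sample_event, None))
```

Keeping the payload construction in its own function makes it possible to test the key decoding locally before wiring the function to a bucket notification.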
Onboarding
Accelerated Data Lake
Data onboarding process
1) Create a new file specification entry in a DynamoDB table
The table in the data lake solution is called “Data Sources”
2) Create a folder structure within the S3 staging bucket for the new data type
This keeps everything in S3 organised and optimised for later use
File Specification Settings
• File Settings
• S3 Object Tags
• Simple Metadata
• Extended Metadata
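To make the four groups of settings concrete, here is a hypothetical file specification entry and a lookup that matches an incoming object key against it. Every field name and value below is an assumption chosen for illustration; the schema actually used by the "Data Sources" table is defined in the GitHub repository and may differ.

```python
import re

# Hypothetical file specification entry. The four top-level sections mirror
# the slide; all field names inside them are illustrative assumptions.
file_specification = {
    "fileSettings": {
        "stagingPrefix": "Wireless/CMX/",
        "fileNamePattern": r"cmx_\d{8}\.csv",
    },
    "s3ObjectTags": {"Classification": "Internal", "Domain": "Wireless"},
    "simpleMetadata": {"data-owner": "User Interaction Team"},
    "extendedMetadata": {"description": "CMX extracts, one file per day"},
}


def find_specification(specifications, object_key):
    """Return the first specification whose prefix and file name pattern match."""
    for spec in specifications:
        settings = spec["fileSettings"]
        if not object_key.startswith(settings["stagingPrefix"]):
            continue
        file_name = object_key.rsplit("/", 1)[-1]
        if re.fullmatch(settings["fileNamePattern"], file_name):
            return spec
    return None


match = find_specification([file_specification], "Wireless/CMX/2019/01/cmx_20190115.csv")
print(match["s3ObjectTags"] if match else "no specification found")
```

A file with no matching specification returns None, which is the point at which a pipeline of this kind would typically raise an alert instead of moving the file on.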
Accelerated Data Lake
Example S3 storage structure
Buckets (data flows between them once validated & approved): dev-raw, dev-staging, dev-curated, dev-gold, dev-validation, dev-data discovery sandpit, dev-logging
Folder levels: Source Env / Schema, then Description / Table / View
Example names: CMX, API-Raw, DB_Prod_HR, Staff
Example folder tree:
Wireless
  CMX
    2019
      01
  API-Raw
    2019
      01
select count(userid) from CMX where year = 2019;
select CMX.userid
from CMX
inner join "API-Raw"
on CMX.userid = "API-Raw".userid
where CMX.year = 2019;
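The folder tree above can be produced by a small helper that derives the year and month folders from the date a file was received. This is a sketch under the assumption that keys follow the domain / source / year / month layout shown on the slide; the function name and sample file name are hypothetical.

```python
from datetime import date


def build_object_key(domain: str, source: str, received: date, file_name: str) -> str:
    """Build an S3 object key laid out as domain/source/YYYY/MM/file_name."""
    return "/".join([domain, source, f"{received:%Y}", f"{received:%m}", file_name])


print(build_object_key("Wireless", "CMX", date(2019, 1, 15), "cmx_20190115.csv"))
# Wireless/CMX/2019/01/cmx_20190115.csv
```

If the query engine maps those folders to year and month partitions, a filter such as year = 2019 only has to read the matching folders.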
S3 Object Tags and Metadata
Example S3 objects: Image1001.jpg (jpeg image data), data.csv
S3 object tags
Key: Value
Classification: Internal
PII: False
Use Case: Interaction Extracts
Team: Analytics
S3 metadata
Key: Value
Policy: facility_internal
MD5: ab3116cded134
Data Owner: User Interaction Team
Data Source: prod_int_extraction_dw
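As a rough sketch of how these values might be prepared before an upload: S3 accepts object tags as a URL-query-encoded string and allows at most 10 tags per object, while user-defined metadata travels as HTTP headers, so keys with spaces need normalising. The sketch assumes the first key/value list on the slide holds the tags and the second the metadata; the helper names and the lower-case, hyphenated key convention are assumptions, not part of the solution.

```python
from urllib.parse import quote, urlencode

MAX_TAGS_PER_OBJECT = 10  # S3 limit on the number of tags per object


def encode_tagging(tags: dict) -> str:
    """Encode tags as URL query parameters, the form the S3 tagging header expects."""
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError("S3 allows at most 10 tags per object")
    return urlencode(tags, quote_via=quote)


def to_metadata_key(label: str) -> str:
    """Normalise a human-readable label into a header-safe metadata key (assumed convention)."""
    return label.strip().lower().replace(" ", "-")


object_tags = {
    "Classification": "Internal",
    "PII": "False",
    "Use Case": "Interaction Extracts",
    "Team": "Analytics",
}
object_metadata = {
    to_metadata_key("Data Owner"): "User Interaction Team",
    to_metadata_key("Data Source"): "prod_int_extraction_dw",
}

print(encode_tagging(object_tags))
print(object_metadata)
```

With boto3, the encoded string and the dictionary would be passed as the Tagging and Metadata arguments of put_object.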
Demonstration
IoT Simulator
https://aws.amazon.com/solutions/iot-device-simulator/
Demonstration
Wrap up
Security Day 0
Data governance & metadata
Data centralised & scalable
SQL & BI ready
Analytical & Data Science foundation
Repeatable & extensible
The accelerated data lake solution:
• Can enable your data
• Supports data security and data governance
• Can grow and scale in harmony with your organisation
• Can be granted access to the AWS analytics, ML and AI ecosystem
References
Amazon S3 security
https://aws.amazon.com/s3/faqs/#Security
https://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
AWS Accelerated Data Lake (Git)
https://github.com/aws-samples/accelerated-data-lake
AWS Accelerated Data Lake Blog (part 1 & 2)
https://aws.amazon.com/blogs/publicsector/from-data-silos-to-data-domains-bringing-common-data-together
https://aws.amazon.com/blogs/publicsector/securing-your-data-by-knowing-your-data
Our data lake story: How Woot.com built a serverless data lake on AWS
https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws
Additional slides
Data Classifications
Data Classification: High Level Control Recommendations
1 - Sensitive
• Must obtain data manager approval to access the data.
• Data encryption is mandatory for transmission and storage.
• Access log reporting is required, which is sent to the data manager and data sponsor for review on a regular basis.
• Failed access events trigger alerts for follow-up.
• Must obtain data sponsor, Legal, & Security approval before sharing data to external parties or hosting in cloud services.
• Review and update the data classifications at least every 6 months.
2 - Restricted
• Must obtain data manager approval to access the data.
• Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and preferred for storage in dedicated environments.
• Access log is required, which is sent to the data manager and data sponsor for review on a less regular basis.
• Failed access events trigger alerts for follow-up.
• Need to obtain data manager, Legal, & Security approval before sharing data to external parties or hosting in cloud services.
• Review and update the data classifications at least every 6 months.
3 - Internal
• Data SME may approve group/role access (member change without data manager approval).
• Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and optional for storage in dedicated environments.
• Access log is required.
• Review and update the data classifications at least annually.
4 - Public
• Apply baseline security controls.
• Access granted to anyone who asks.
• Data encryption is not required for storage or transmission.
• Review and update the data classifications at least annually.
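Part of the classification table above can be expressed as a lookup so that a pipeline step can check controls programmatically. The sketch below encodes only the encryption and review-interval recommendations listed for each level; the class name, field names and the review_overdue helper are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationControls:
    encrypt_in_transit: bool        # encryption mandatory for transmission
    encrypt_in_cloud_storage: bool  # encryption mandatory for cloud or shared storage
    review_interval_months: int     # maximum time between classification reviews


# Values transcribed from the recommendations listed for each level.
CONTROLS = {
    "Sensitive": ClassificationControls(True, True, 6),
    "Restricted": ClassificationControls(True, True, 6),
    "Internal": ClassificationControls(True, True, 12),
    "Public": ClassificationControls(False, False, 12),
}


def review_overdue(classification: str, months_since_review: int) -> bool:
    """True when the last classification review is older than the allowed interval."""
    return months_since_review > CONTROLS[classification].review_interval_months


print(review_overdue("Restricted", 8))  # True: Restricted data is reviewed every 6 months
print(review_overdue("Internal", 8))    # False: Internal data is reviewed annually
```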
Amazon S3 security and data governance
Bucket level security
Object level security
Object Tagging
• Access to Amazon S3 buckets and S3 objects is intrinsically connected to IAM policies.
• Object tagging enables granular control over access to objects.
• The object tagging key/values are stored in DynamoDB.
[Figures: tag values in Amazon DynamoDB; IAM policy; tag entries on an S3 object]
Handling IT security risk in an AWS data lake
IAM Policies and S3 object tagging
Wireless_Public
  Data domain: Wireless
  Data classification: Public
  Example tags: Classification=Public, Domain=Wireless
Wireless_Internal
  Data domain: Wireless
  Data classification: Internal
  Example tags: Classification=Internal, Domain=Wireless
Wireless_Restricted
  Data domain: Wireless
  Data classification: Restricted
  Example tags: Classification=Restricted, Domain=Wireless
  Restricted to: Users, Groups or Services with data owner approval
Wireless_Sensitive
  Data domain: Wireless
  Data classification: Sensitive
  Example tags: Classification=Sensitive, Domain=Wireless
  Restricted to: Users, Groups or Services with data owner approval
Within IAM, we can create a policy where a user can access an S3 object provided the object has the following object tags (key-value pairs):
Classification : Restricted
Domain : Wireless
[Diagram: example IAM policies matched to example S3 tags. Domain: Wireless (CMX, Cisco, Netgear, Other). Classifications: Public, Internal, Restricted, Sensitive.]
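A tag-based policy of the kind described above might look like the following sketch. It builds an IAM policy document that allows s3:GetObject only when the object carries both tags, using the s3:ExistingObjectTag condition key. The bucket name dev-raw and the function name are assumptions for illustration; the policies shipped with the solution may be structured differently.

```python
import json


def tag_restricted_read_policy(bucket: str, domain: str, classification: str) -> dict:
    """Build an IAM policy allowing reads only on objects tagged with both values."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": f"Read{domain}{classification}",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                # Keys inside one condition operator are combined with AND,
                # so the object must carry both tags.
                "StringEquals": {
                    "s3:ExistingObjectTag/Classification": classification,
                    "s3:ExistingObjectTag/Domain": domain,
                }
            },
        }],
    }


# "dev-raw" is a hypothetical bucket name used only for this example.
policy = tag_restricted_read_policy("dev-raw", "Wireless", "Restricted")
print(json.dumps(policy, indent=2))
```

Because access is decided by the tags rather than by the object path, re-tagging an object changes who can read it without moving the file.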
Policy examples
Examples of policy types that define how objects can be accessed.
Policy Type: Description
Content Policies: What the data contains and its classification
Service Policies: What state the data is in
Development and System Policies: What environment the data is located in and what actions can be performed on the data
Metadata Policies: Who has what access to object metadata
Tagging Policies: Who has what access to object tagging
Content Policy examples
Wireless_Public
  Data domain: Wireless
  Data classification: Public
  Example tags: Classification=Public, Domain=Wireless
Wireless_Internal
  Data domain: Wireless
  Data classification: Internal
  Example tags: Classification=Internal, Domain=Wireless
Wireless_Restricted
  Data domain: Wireless
  Data classification: Restricted
  Example tags: Classification=Restricted, Domain=Wireless
  Restricted to: Users, Groups or Services with data owner approval
Wireless_Highly_Restricted
  Data domain: Wireless
  Data classification: Highly_Restricted
  Example tags: Classification=Highly_Restricted, Domain=Wireless
  Restricted to: Users, Groups or Services with data owner approval
Facility_Public
  Data domain: Facility
  Data classification: Public
  Example tags: Classification=Public, Domain=Facility
Facility_Internal
  Data domain: Facility
  Data classification: Internal
  Example tags: Classification=Internal, Domain=Facility
Facility_Restricted
  Data domain: Facility
  Data classification: Restricted
  Example tags: Classification=Restricted, Domain=Facility
  Restricted to: Users, Groups or Services with data owner approval
Facility_Highly_Restricted
  Data domain: Facility
  Data classification: Highly_Restricted
  Example tags: Classification=Highly_Restricted, Domain=Facility
  Restricted to: Users, Groups or Services with data owner approval
Service Policy examples
S3_Raw
  Description: Landing location of data files from source systems
  Tagging rules: Bucket=Raw
  Restricted to: System, service accounts and platform engineers
S3_Staging
  Description: Location of data files after object tagging and metadata applied
  Tagging rules: Bucket=Staging
  Restricted to: Chief data office and data science teams
S3_Curated
  Description: Optimised versions of data files
  Tagging rules: Bucket=Curated
  Restricted to: Chief data office, data science and analytics teams
S3_Gold
  Description: Business consumable versions of data and one-off data sets
  Tagging rules: Bucket=Gold
  Restricted to: Chief data office, data science, analytics teams and business users as required
S3_Validation
  Description: Data files generated from insights or aggregations that need to be validated and have security and metadata applied
  Tagging rules: Bucket=Validation
  Restricted to: Chief data office / Information system admins
S3_Data_Discovery
  Description: Data files that have been taken from other S3 locations or external sources that are blended with other datasets or used for data science and analytics
  Tagging rules: Bucket=Data_Discovery
  Restricted to: Data science and analytics teams
S3_Logs
  Description: CloudTrail and CloudWatch logs
  Tagging rules: Bucket=Logs
  Restricted to: ISO, system, service accounts and platform engineers. Other teams on request
Development and System Policy examples
Dev_Read
  Description: Read only access to Development Environment
  Tagging rules: Environment=Dev, Access=Read
  Restricted to: System, service accounts and platform engineers
Test_Read
  Description: Read only access to Test Environment
  Tagging rules: Environment=Test, Access=Read
  Restricted to: System, service accounts and platform engineers
Prod_Read
  Description: Read only access to Production Environment
  Tagging rules: Environment=Prod, Access=Read
Dev_Write
  Description: Write access to Development Environment
  Tagging rules: Environment=Dev, Access=Write
  Restricted to: System, service accounts and platform engineers
Test_Write
  Description: Write access to Test Environment
  Tagging rules: Environment=Test, Access=Write
  Restricted to: System, service accounts and platform engineers
Prod_Write
  Description: Write access to Production Environment
  Tagging rules: Environment=Prod, Access=Write
  Restricted to: System accounts
Metadata Policy examples
Metadata_Read
  Description: Read only access to metadata
  Tagging rules: Domain=Metadata, Access=Read
Metadata_Write
  Description: Write access to metadata
  Tagging rules: Domain=Metadata, Access=Write
  Restricted to: Chief data office / Information System Administrators, Data Owners, Data Stewards
Tagging Policy examples
Tagging_Read
  Description: Read only access
  Tagging rules: Domain=Tags, Access=Read
Tagging_Write
  Description: Write access
  Tagging rules: Domain=Tags, Access=Write
  Restricted to: ISO, Chief data office / Information System Administrators, Data Owners, Data Stewards
IAM policy examples
Example: An analytics user requests read only access to Public Facility data in Raw, Curated and Gold buckets
  Required policies: Prod_Read; S3_Raw, S3_Curated or S3_Gold; Facility_Public
Example: A data science team member requests write access to the data discovery environment and read access to the curated bucket
  Required policies: Prod_Write with S3_Data_Discovery, or Prod_Read with S3_Curated
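The way these examples combine policies can be sketched as a small resolver that turns an access request into the list of policy names it needs. The naming pattern (Environment_Access, S3_Bucket, Domain_Classification) is inferred from the example tables; the function itself is hypothetical and is not part of the published solution.

```python
def required_policies(environment, access, buckets, domain=None, classification=None):
    """List the policy names a request needs, following the naming used in the examples."""
    policies = [f"{environment}_{access}"]               # development and system policy
    policies += [f"S3_{bucket}" for bucket in buckets]   # one service policy per bucket
    if domain and classification:
        policies.append(f"{domain}_{classification}")    # content policy
    return policies


# Analytics user: read only access to Public Facility data in Raw, Curated and Gold.
print(required_policies("Prod", "Read", ["Raw", "Curated", "Gold"], "Facility", "Public"))

# Data science team member: write to data discovery, read from curated.
print(required_policies("Prod", "Write", ["Data_Discovery"]))
print(required_policies("Prod", "Read", ["Curated"]))
```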

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
 
Building intelligent applications using AI services
Building intelligent applications using AI servicesBuilding intelligent applications using AI services
Building intelligent applications using AI services
 
Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...Driving performance & security across your industrial facility with AWS - SVC...
Driving performance & security across your industrial facility with AWS - SVC...
 
Grid computing in the cloud for Financial Services industry - CMP205-I - New ...
Grid computing in the cloud for Financial Services industry - CMP205-I - New ...Grid computing in the cloud for Financial Services industry - CMP205-I - New ...
Grid computing in the cloud for Financial Services industry - CMP205-I - New ...
 
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
 
Managing Enterprise security in the Cloud
Managing Enterprise security in the CloudManaging Enterprise security in the Cloud
Managing Enterprise security in the Cloud
 
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdfBuilding data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
 
Building Enterprise Solutions with Blockchain and Ledger Technology - SVC202 ...
Building Enterprise Solutions with Blockchain and Ledger Technology - SVC202 ...Building Enterprise Solutions with Blockchain and Ledger Technology - SVC202 ...
Building Enterprise Solutions with Blockchain and Ledger Technology - SVC202 ...
 
Discuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS SummitDiscuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
 
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
 
What's new in Amazon Aurora - ADB204 - Santa Clara AWS Summit.pdf
What's new in Amazon Aurora - ADB204 - Santa Clara AWS Summit.pdfWhat's new in Amazon Aurora - ADB204 - Santa Clara AWS Summit.pdf
What's new in Amazon Aurora - ADB204 - Santa Clara AWS Summit.pdf
 
What's New with Amazon S3, Amazon EFS, and Other AWS Storage Services - STG20...
What's New with Amazon S3, Amazon EFS, and Other AWS Storage Services - STG20...What's New with Amazon S3, Amazon EFS, and Other AWS Storage Services - STG20...
What's New with Amazon S3, Amazon EFS, and Other AWS Storage Services - STG20...
 
Favorire l'innovazione passando da applicazioni monolitiche ad architetture m...
Favorire l'innovazione passando da applicazioni monolitiche ad architetture m...Favorire l'innovazione passando da applicazioni monolitiche ad architetture m...
Favorire l'innovazione passando da applicazioni monolitiche ad architetture m...
 
Creare e gestire Data Lake e Data Warehouses
Creare e gestire Data Lake e Data WarehousesCreare e gestire Data Lake e Data Warehouses
Creare e gestire Data Lake e Data Warehouses
 
Add intelligence to applications - AIM205 - Santa Clara AWS Summit.pdf
Add intelligence to applications - AIM205 - Santa Clara AWS Summit.pdfAdd intelligence to applications - AIM205 - Santa Clara AWS Summit.pdf
Add intelligence to applications - AIM205 - Santa Clara AWS Summit.pdf
 
Developing intelligent robots with AWS RoboMaker - SVC207 - New York AWS Summit
Developing intelligent robots with AWS RoboMaker - SVC207 - New York AWS SummitDeveloping intelligent robots with AWS RoboMaker - SVC207 - New York AWS Summit
Developing intelligent robots with AWS RoboMaker - SVC207 - New York AWS Summit
 
Sicurezza in AWS automazione e best practice
Sicurezza in AWS automazione e best practiceSicurezza in AWS automazione e best practice
Sicurezza in AWS automazione e best practice
 
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
 
Migrating Business Critical Applications to AWS
Migrating Business Critical Applications to AWSMigrating Business Critical Applications to AWS
Migrating Business Critical Applications to AWS
 
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
 

Similar a Accelerated Data Lakes Deep Dive Webinar - Paul Macey

Similar a Accelerated Data Lakes Deep Dive Webinar - Paul Macey (20)

Accelerated Data Lakes Webinar
Accelerated Data Lakes WebinarAccelerated Data Lakes Webinar
Accelerated Data Lakes Webinar
 
Immersion Day - Como construir seu Data Lake em dias na AWS
Immersion Day - Como construir seu Data Lake em dias na AWSImmersion Day - Como construir seu Data Lake em dias na AWS
Immersion Day - Como construir seu Data Lake em dias na AWS
 
Building a Data Lake on S3 for IoT Workloads
Building a Data Lake on S3 for IoT WorkloadsBuilding a Data Lake on S3 for IoT Workloads
Building a Data Lake on S3 for IoT Workloads
 
Data Lifecycle Management
Data Lifecycle ManagementData Lifecycle Management
Data Lifecycle Management
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
Best Practices to Secure Data Lake on AWS (ANT327) - AWS re:Invent 2018
Best Practices to Secure Data Lake on AWS (ANT327) - AWS re:Invent 2018Best Practices to Secure Data Lake on AWS (ANT327) - AWS re:Invent 2018
Best Practices to Secure Data Lake on AWS (ANT327) - AWS re:Invent 2018
 
BI & Analytics
BI & AnalyticsBI & Analytics
BI & Analytics
 
AWS Public Datasets: Learnings from Staging Petabytes of Data for Analysis in...
AWS Public Datasets: Learnings from Staging Petabytes of Data for Analysis in...AWS Public Datasets: Learnings from Staging Petabytes of Data for Analysis in...
AWS Public Datasets: Learnings from Staging Petabytes of Data for Analysis in...
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
 
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
 
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLPreparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
How a Biotech Firm Streamlined Data Protection on AWS
 How a Biotech Firm Streamlined Data Protection on AWS How a Biotech Firm Streamlined Data Protection on AWS
How a Biotech Firm Streamlined Data Protection on AWS
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
 

Más de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Accelerated Data Lakes Deep Dive Webinar - Paul Macey

  • 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Paul Macey Specialist Solution Architect, Big Data & Analytics AWS Public Sector Accelerated Data Lakes Deep Dive Webinar
  • 2. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Organisational data challenges Accelerated data lake Architecture Onboarding Demonstration Wrap up Questions
  • 3. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Organisational data challenges Silos Governance ? ScalabilitySecurity
  • 4. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerated Data Lake Security Day 0 Data governance & metadata Data centralised & scalable SQL & BI ready Analytical & Data Science foundation Repeatable & extensible
  • 5. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Available today @ GitHub https://github.com/aws-samples/accelerated-data-lake Includes Data lake pipeline (CloudFormation) Instructions Data configuration, security and metadata templates Delivery Professional services AWS partners Accelerated Data Lake
  • 6. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerated Data Lake Flywheel of success Start Small Establish a Repeatable Workflow Deliver benefits Improve and Iterate Repeat
  • 7. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architecture
  • 8. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerated Data Lake High Level Data Flow Lambda Functions • Validation • Apply Security • Attach Metadata • Catalog object • File movement • Alerts Time based or Event Driven ProcessInitiation S3 buckets • Staging • Raw • Curated • Gold • Data discovery • Logs Data Lake Storage Metadata Data Catalog Data Lake Enabling Analytics and Insights Big Data, Querying ETL & ML Database / BI Analytics & Insights
  • 9. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerated Data Lake File Processing Pipeline - S3 lambda event example File arrives in the S3 Staging bucket A lambda function is triggered when the object is created The lambda passes the S3 event data payload to an AWS Step Function The step function moves through a repeatable data file onboarding process Validate Data Add security tags to S3 object Add metadata to S3 object Add object metadata to DynamoDB Index metadata into ElasticSearch Move file from the Staging bucket to Raw bucket Get the file specification
  • 10. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Onboarding
  • 11. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerated Data Lake Data onboarding process 1) Create an new file specification entry in a DynamoDb table The table in the data lake solution is called “Data Sources” 1) Create a folder structure within the S3 staging bucket for the new data type This is just to keep everything in S3 organised but also optimised for later use 
  • 12. File specification settings: file settings; S3 object tags; simple metadata; extended metadata.
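A file specification entry could be assembled along these lines before it is written to the table. This is only a sketch: every attribute name below (`fileType`, `fileSettings`, `stagingPrefix` and so on) is hypothetical and simply mirrors the four groups of settings listed above, so the real templates in the repository should be treated as the reference. The ten-tag check reflects the S3 limit on tags per object.

```python
MAX_S3_OBJECT_TAGS = 10  # S3 allows at most 10 tags per object


def build_file_specification(file_type, staging_prefix, object_tags,
                             simple_metadata, extended_metadata):
    """Group the settings for one data type into a single item (hypothetical layout)."""
    if len(object_tags) > MAX_S3_OBJECT_TAGS:
        raise ValueError("an S3 object can carry at most 10 tags")
    return {
        "fileType": file_type,                               # hypothetical key name
        "fileSettings": {"stagingPrefix": staging_prefix},   # folder created in step 2
        "s3ObjectTags": dict(object_tags),
        "simpleMetadata": dict(simple_metadata),
        "extendedMetadata": dict(extended_metadata),
    }


spec = build_file_specification(
    file_type="wireless-cmx",
    staging_prefix="Wireless/CMX/",
    object_tags={"Classification": "Internal", "PII": "False"},
    simple_metadata={"Data Owner": "User Interaction Team"},
    extended_metadata={},
)
```

With boto3 available, such an item could then be stored with `boto3.resource("dynamodb").Table(table_name).put_item(Item=spec)`, where `table_name` is whatever the deployment actually calls its data sources table.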
  • 13. Accelerated Data Lake: example S3 storage structure (diagram)
      Buckets: dev-staging; dev-raw; dev-curated; dev-gold; dev-validation; dev-data discovery sandpit; dev-logging
      Flow labels: Data Flow; Data Flow, Validated & Approved
      Folder levels: Source; Env / Schema; Description / Table / View
      Example entries: CMX; API-Raw; DB_Prod_HR; Staff
      Example paths: Wireless / CMX / 2019 / 01; API-Raw / 2019 / 01
      Example queries:
        select count(userid) from CMX where year=2019
        select userid from CMX inner join API-RAW on CMX.userid=API-RAW.userid where year=2019
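A consistent folder structure is easier to keep if object keys are generated rather than typed by hand. The helper below is a hypothetical sketch that follows the source / dataset / year / month layout suggested by the example paths above; the function name, argument order and file name are assumptions, not part of the solution.

```python
def build_object_key(source, dataset, year, month, filename):
    """Build an S3 object key of the form source/dataset/YYYY/MM/filename."""
    for part in (source, dataset, filename):
        if not part or "/" in part:
            raise ValueError("path segments must be non-empty and must not contain '/'")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Zero-pad year and month so keys sort chronologically.
    return "/".join([source, dataset, f"{year:04d}", f"{month:02d}", filename])


key = build_object_key("Wireless", "CMX", 2019, 1, "extract.csv")
```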
  • 14. S3 object tags and metadata
      Example objects: Image1001.jpg (JPEG image data); data.csv (S3 object)
      S3 object tags (Key | Value):
        Classification | Internal
        PII | False
        Use Case | Interaction Extracts
        Team | Analytics
      S3 metadata (Key | Value):
        Policy | facility_internal
        MD5 | ab3116cded134
        Data Owner | User Interaction Team
        Data Source | prod_int_extraction_dw
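To make the distinction concrete, the sketch below shapes the example tags into the `TagSet` structure that boto3's `put_object_tagging` expects and builds a separate user metadata dictionary with an MD5 digest of the file body. The helper names and the lower-case metadata key spellings are assumptions made for illustration.

```python
import hashlib


def to_tagging(tags):
    """Convert a plain dict into the Tagging structure used by put_object_tagging."""
    return {"TagSet": [{"Key": key, "Value": value} for key, value in tags.items()]}


def build_metadata(body, policy, data_owner, data_source):
    """Build user metadata for an object; key names here are hypothetical."""
    return {
        "policy": policy,
        "md5": hashlib.md5(body).hexdigest(),  # content fingerprint, not a security control
        "data-owner": data_owner,
        "data-source": data_source,
    }


tagging = to_tagging({
    "Classification": "Internal",
    "PII": "False",
    "Use Case": "Interaction Extracts",
    "Team": "Analytics",
})
metadata = build_metadata(b"", "facility_internal", "User Interaction Team",
                          "prod_int_extraction_dw")
```

Tags can be changed on an existing object and referenced in IAM policy conditions, whereas user metadata is set when the object is written, which is one reason to keep security attributes in tags and descriptive attributes in metadata.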
  • 15. Demonstration
  • 16. IoT Simulator: https://aws.amazon.com/solutions/iot-device-simulator/
  • 17. Demonstration
  • 18. Wrap up
      Security from day 0; data governance & metadata; data centralised & scalable; SQL & BI ready; analytical & data science foundation; repeatable & extensible.
      The accelerated data lake solution can:
        • Enable your data
        • Support data security and data governance
        • Grow and scale in harmony with your organisation
        • Be granted access to AWS’s analytics, ML and AI ecosystem
  • 19. References
      Amazon S3 security:
        https://aws.amazon.com/s3/faqs/#Security
        https://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
      AWS Accelerated Data Lake (Git):
        https://github.com/aws-samples/accelerated-data-lake
      AWS Accelerated Data Lake Blog (part 1 & 2):
        https://aws.amazon.com/blogs/publicsector/from-data-silos-to-data-domains-bringing-common-data-together
        https://aws.amazon.com/blogs/publicsector/securing-your-data-by-knowing-your-data
      Our data lake story: How Woot.com built a serverless data lake on AWS:
        https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws
  • 20. Additional slides
  • 21. Data classifications: high level control recommendations
      1 - Sensitive
        • Must obtain data manager approval to access the data.
        • Data encryption is mandatory for transmission and storage.
        • Access log reporting is required, which is sent to the data manager and data sponsor for review on a regular basis.
        • Failed access events trigger alerts for follow-up.
        • Must obtain data sponsor, Legal and Security approval before sharing data with external parties or hosting in cloud services.
        • Review and update the data classifications at least every 6 months.
      2 - Restricted
        • Must obtain data manager approval to access the data.
        • Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and preferred for storage in dedicated environments.
        • Access log is required, which is sent to the data manager and data sponsor for review on a less regular basis.
        • Failed access events trigger alerts for follow-up.
        • Must obtain data manager, Legal and Security approval before sharing data with external parties or hosting in cloud services.
        • Review and update the data classifications at least every 6 months.
      3 - Internal
        • Data SME may approve group/role access (member change without data manager approval).
        • Data encryption is mandatory for transmission, mandatory for storage in cloud or shared environments and optional for storage in dedicated environments.
        • Access log is required.
        • Review and update the data classifications at least annually.
      4 - Public
        • Apply baseline security controls.
        • Access granted to anyone who asks.
        • Data encryption is not required for storage or transmission.
        • Review and update the data classifications at least annually.
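Controls like these are easier to apply consistently when they are held as data that pipeline code can look up. The sketch below encodes only the storage encryption and review interval recommendations from the table above; the dictionary layout and the environment labels `cloud_or_shared` and `dedicated` are assumptions for illustration.

```python
# Storage encryption recommendation per classification and environment type.
STORAGE_ENCRYPTION = {
    "Sensitive": {"cloud_or_shared": "mandatory", "dedicated": "mandatory"},
    "Restricted": {"cloud_or_shared": "mandatory", "dedicated": "preferred"},
    "Internal": {"cloud_or_shared": "mandatory", "dedicated": "optional"},
    "Public": {"cloud_or_shared": "not required", "dedicated": "not required"},
}

# Maximum number of months between classification reviews.
REVIEW_INTERVAL_MONTHS = {"Sensitive": 6, "Restricted": 6, "Internal": 12, "Public": 12}


def storage_encryption(classification, environment):
    """Look up the storage encryption recommendation for a classification."""
    return STORAGE_ENCRYPTION[classification][environment]


def review_due(classification, months_since_review):
    """Return True when the classification review interval has elapsed."""
    return months_since_review >= REVIEW_INTERVAL_MONTHS[classification]
```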
  • 22. Amazon S3 security and data governance: bucket level security; object level security.
  • 23. Object tagging
        • Access to Amazon S3 buckets and S3 objects is intrinsically tied to IAM policies.
        • Object tagging enables granular control over access to objects.
        • The object tagging key/values are stored in DynamoDB.
      Diagram labels: tag values in Amazon DynamoDB; IAM policy; tag entries on an S3 object
  • 24. Handling IT security risk in an AWS data lake: IAM policies and S3 object tagging
      Data Domain | Data Classification | Policy | Example Tags | Restricted to
        Wireless | Public | Wireless_Public | Classification=Public, Domain=Wireless | -
        Wireless | Internal | Wireless_Internal | Classification=Internal, Domain=Wireless | -
        Wireless | Restricted | Wireless_Restricted | Classification=Restricted, Domain=Wireless | Users, groups or services with data owner approval
        Wireless | Sensitive | Wireless_Sensitive | Classification=Sensitive, Domain=Wireless | Users, groups or services with data owner approval
      Within IAM, we can create a policy where a user can access an S3 object provided the object has the required object tags (key/value pairs), for example Classification: Restricted and Domain: Wireless.
      Diagram labels: example IAM policies; example S3 tags; domains (Wireless: CMX, Cisco, Netgear; Other Domain); classifications (Public, Internal, Restricted, Sensitive)
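A policy of this kind can be expressed with the `s3:ExistingObjectTag/<key>` condition key, which IAM evaluates against the tags already on the object. The sketch below builds such a policy document as a Python dictionary; the function name and the bucket name are assumptions, and the actual policies used by the solution may be structured differently.

```python
import json


def tag_restricted_read_policy(bucket, required_tags):
    """Allow s3:GetObject only on objects carrying every required tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # All keys inside one StringEquals block must match (logical AND).
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{key}": value
                        for key, value in required_tags.items()
                    }
                },
            }
        ],
    }


wireless_restricted = tag_restricted_read_policy(
    "dev-raw",  # hypothetical bucket name
    {"Classification": "Restricted", "Domain": "Wireless"},
)
policy_json = json.dumps(wireless_restricted, indent=2)
```

Because the condition refers to tags rather than key prefixes, the same policy keeps working when objects move between folders, provided their tags travel with them.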
  • 25. Policy examples: policy types that define how objects can be accessed
      Policy Type | Description
        Content Policies | What the data contains and its classification
        Service Policies | What state the data is in
        Development and System Policies | What environment the data is located in and what actions can be performed on the data
        Metadata Policies | Who has what access to object metadata
        Tagging Policies | Who has what access to object tagging
  • 26. Content policy examples
      Policy | Data Domain | Example Tags | Data Classification | Restricted to
        Wireless_Public | Wireless | Classification=Public, Domain=Wireless | Public | -
        Wireless_Internal | Wireless | Classification=Internal, Domain=Wireless | Internal | -
        Wireless_Restricted | Wireless | Classification=Restricted, Domain=Wireless | Restricted | Users, groups or services with data owner approval
        Wireless_Highly_Restricted | Wireless | Classification=Highly_Restricted, Domain=Wireless | Highly_Restricted | Users, groups or services with data owner approval
        Facility_Public | Facility | Classification=Public, Domain=Facility | Public | -
        Facility_Internal | Facility | Classification=Internal, Domain=Facility | Internal | -
        Facility_Restricted | Facility | Classification=Restricted, Domain=Facility | Restricted | Users, groups or services with data owner approval
        Facility_Highly_Restricted | Facility | Classification=Highly_Restricted, Domain=Facility | Highly_Restricted | Users, groups or services with data owner approval
  • 27. Service policy examples
      Policy | Description | Tagging Rules | Restricted to
        S3_Raw | Landing location of data files from source systems | Bucket=Raw | System, service accounts and platform engineers
        S3_Staging | Location of data files after object tagging and metadata are applied | Bucket=Staging | Chief data office and data science teams
        S3_Curated | Optimised versions of data files | Bucket=Curated | Chief data office, data science and analytics teams
        S3_Gold | Business consumable versions of data and one-off data sets | Bucket=Gold | Chief data office, data science, analytics teams and business users as required
        S3_Validation | Data files generated from insights or aggregations that need to be validated and have security and metadata applied | Bucket=Validation | Chief data office / information system admins
        S3_Data_Discovery | Data files taken from other S3 locations or external sources that are blended with other datasets or used for data science and analytics | Bucket=Data_Discovery | Data science and analytics teams
        S3_Logs | CloudTrail and CloudWatch logs | Bucket=Logs | ISO, system, service accounts and platform engineers; other teams on request
  • 28. Development and system policy examples
      Role | Description | Tagging Rules | Restricted to
        Dev_Read | Read only access to Development Environment | Environment=Dev, Access=Read | System, service accounts and platform engineers
        Test_Read | Read only access to Test Environment | Environment=Test, Access=Read | System, service accounts and platform engineers
        Prod_Read | Read only access to Production Environment | Environment=Prod, Access=Read | -
        Dev_Write | Write access to Development Environment | Environment=Dev, Access=Write | System, service accounts and platform engineers
        Test_Write | Write access to Test Environment | Environment=Test, Access=Write | System, service accounts and platform engineers
        Prod_Write | Write access to Production Environment | Environment=Prod, Access=Write | System accounts
  • 29. Metadata policy examples
      Policy | Description | Tagging Rules | Restricted to
        Metadata_Read | Read only access to metadata | Domain=Metadata, Access=Read | -
        Metadata_Write | Write access to metadata | Domain=Metadata, Access=Write | Chief data office / Information System Administrators, Data Owners, Data Stewards
  • 30. Tagging policy examples
      Policy | Description | Tagging Rules | Restricted to
        Tagging_Read | Read only access | Domain=Tags, Access=Read | -
        Tagging_Write | Write access | Domain=Tags, Access=Write | ISO, Chief data office / Information System Administrators, Data Owners, Data Stewards
  • 31. IAM policy examples
      Example | Required Policies
        An analytics user requests read only access to Public Facility data in Raw, Curated and Gold buckets | Prod_Read; S3_Raw, S3_Curated or S3_Gold; Facility_Public
        A data science team member requests write access to the data discovery environment and read access to the curated bucket | Prod_Write and S3_Data_Discovery, or Prod_Read and S3_Curated
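One way to read the first example is that access needs one policy from each of three groups: an environment policy, a bucket policy and a content policy. The check below is a hypothetical sketch of that reading, not how IAM itself evaluates policies; the function name and the grouping are assumptions.

```python
def satisfies(granted_policies, required_groups):
    """Return True if at least one policy from every required group is granted."""
    granted = set(granted_policies)
    return all(granted & set(group) for group in required_groups)


# First example: read only access to Public Facility data in Raw, Curated or Gold.
analytics_request = [
    {"Prod_Read"},                          # environment and access level
    {"S3_Raw", "S3_Curated", "S3_Gold"},    # any one of the bucket policies
    {"Facility_Public"},                    # content policy
]

allowed = satisfies({"Prod_Read", "S3_Gold", "Facility_Public"}, analytics_request)
denied = satisfies({"Prod_Read", "S3_Gold"}, analytics_request)
```

Here `allowed` is true because one policy from each group is present, while `denied` is false because no content policy has been granted.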