© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dickson Yue
Solution Architect
21 June 2017
Building Your First Data Lake
Modern Data Architectures on AWS
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
Expanding access requirements
• Data scientists
• Automation / events
• Business users
• Data analysts
• Engagement platforms
1. More personas need access to data, through appropriate tools
2. More systems need to link to data for decision and process automation
3. Users need to be able to find information, and access it securely
Exponential growth of business data
1. Data must be captured from diverse sources at speed and scale
2. Data needs to be pulled together, breaking down traditional silos
3. Benefits need to far outweigh the costs of collection and analysis
• Transactions
• ERP
• Connected devices
• Social media
• Web logs / cookies
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Serving
Processing paths: Speed (Real-time), Scale (Batch)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Serving
Processing paths: Speed (Real-time), Scale (Batch)
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Serving
Processing paths: Speed (Real-time), Scale (Batch)
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Serving:
• Data Warehouse: Amazon Redshift
• Legacy Apps: Amazon RDS
• Schemaless: Amazon Elasticsearch
• Direct Query: Amazon Athena
• Near-Zero Latency: Amazon DynamoDB
• Semi/Unstructured: Amazon EMR
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Characteristics of a Data Lake
• Future Proof
• Flexible Access
• Dive in Anywhere
• Collect Anything
Amazon S3
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• Scalable: store as much as you need; scale storage and compute independently
• Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Serving
Processing paths: Speed (Real-time), Scale (Batch)
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Staged Data (Data Lake): Amazon S3
Serving:
• Data Warehouse: Amazon Redshift
• Legacy Apps: Amazon RDS
• Schemaless: Amazon Elasticsearch
• Direct Query: Amazon Athena
• Near-Zero Latency: Amazon DynamoDB
• Semi/Unstructured: Amazon EMR
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
Important Components of a Data Lake
• Ingest & Store
• Catalogue & Search
• Protect & Secure
• Access & User Interface
Data Ingestion into S3
• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• AWS Storage Gateway
• S3 Transfer Acceleration
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Serving
Processing paths: Speed (Real-time), Scale (Batch)
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Staged Data (Data Lake): Amazon S3
Serving:
• Data Warehouse: Amazon Redshift
• Legacy Apps: Amazon RDS
• Schemaless: Amazon Elasticsearch
• Direct Query: Amazon Athena
• Near-Zero Latency: Amazon DynamoDB
• Semi/Unstructured: Amazon EMR
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata
  Ownership, data lineage
• Data abstraction layer
  Customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
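The "customer data = collection of prefixes" idea can be illustrated with a small sketch. It is not the speakers' implementation: the class names, field names, and sample prefixes below are hypothetical, and an in-memory list stands in for the metadata store. It shows how a catalogue entry might map a logical dataset to S3 prefixes and resolve an object key back to its dataset by longest matching prefix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetEntry:
    """One catalogue record: a logical dataset backed by S3 prefixes (hypothetical schema)."""
    name: str
    owner: str
    prefixes: List[str]
    lineage: List[str] = field(default_factory=list)  # names of upstream datasets


class DataCatalogue:
    """In-memory stand-in for the metadata store."""

    def __init__(self) -> None:
        self._entries: List[DatasetEntry] = []

    def register(self, entry: DatasetEntry) -> None:
        self._entries.append(entry)

    def find_dataset(self, key: str) -> Optional[DatasetEntry]:
        """Return the dataset whose longest prefix matches the object key, if any."""
        best: Optional[DatasetEntry] = None
        best_len = -1
        for entry in self._entries:
            for prefix in entry.prefixes:
                if key.startswith(prefix) and len(prefix) > best_len:
                    best, best_len = entry, len(prefix)
        return best


# Sample data: all names and prefixes are made up for illustration.
catalogue = DataCatalogue()
catalogue.register(DatasetEntry("customer", "crm-team", ["crm/customers/", "web/profiles/"]))
catalogue.register(DatasetEntry("customer_vip", "crm-team", ["crm/customers/vip/"], lineage=["customer"]))

match = catalogue.find_dataset("crm/customers/vip/2017/06/21/part-0.json")
print(match.name if match else "no dataset")
```

An entitlements service could call a lookup like `find_dataset` to decide which dataset-level permission applies to a requested object.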
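Populating Metadata and Search
1. PUT OBJECT: an object is written to the S3 bucket
2. CREATE OBJECT: the object-created event invokes AWS Lambda
3. PUT ITEM: Lambda writes the object's metadata to Amazon DynamoDB
4. UPDATE STREAM: the change appears in DynamoDB Streams and invokes a second AWS Lambda function
5. UPDATE INDEX: that function updates the Amazon Elasticsearch index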
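The S3-to-DynamoDB part of this diagram could be sketched as a Lambda handler along the following lines. This is a minimal illustration, not the presenters' code: the table name `data-lake-metadata` and the item attribute names are assumptions. The event fields used (`Records`, `s3.bucket.name`, `s3.object.key`, `size`, `eTag`, `eventTime`) follow the S3 event notification format, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def metadata_items(event: dict) -> list:
    """Turn an S3 event notification into one metadata item per object."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys are URL-encoded in S3 events
        items.append({
            "object_id": f"{bucket}/{key}",  # hypothetical partition key
            "bucket": bucket,
            "key": key,
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag", ""),
            "event_time": record.get("eventTime", ""),
        })
    return items


def handler(event, context, table=None):
    """Lambda entry point; `table` is injectable so the sketch can run offline."""
    if table is None:
        import boto3  # only needed when running inside AWS
        table = boto3.resource("dynamodb").Table("data-lake-metadata")  # assumed table name
    items = metadata_items(event)
    for item in items:
        table.put_item(Item=item)
    return {"indexed": len(items)}
```

With DynamoDB Streams enabled on the table, a second function of similar shape could read the stream records and update the search index.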
AWS Glue
• Managed Transform Engine
• Job Scheduler
• Data Catalog
• Built on Apache Spark
• Integrated with S3, RDS, Redshift & any JDBC-compliant data store
Implement the right cloud security controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
• Pre-signed S3 URLs
Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Audit & Compliance
• Bucket access logs
• Lifecycle Management Policies
• Versioning & MFA Delete
• Certifications: HIPAA, PCI, SOC 1/2/3, etc.
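As one illustration of the "bucket policies" and "SSL endpoints" items, the sketch below builds a bucket policy that denies requests made without TLS and denies uploads that do not request server-side encryption. The bucket name `my-data-lake` is a placeholder, and this is only one possible policy rather than a recommendation from the slides. The condition keys `aws:SecureTransport` and `s3:x-amz-server-side-encryption` are standard AWS policy keys.

```python
import json


def secure_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy enforcing TLS and encrypted uploads."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not use SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [arn, f"{arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that omit the server-side encryption header.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{arn}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
        ],
    }


# "my-data-lake" is a hypothetical bucket name.
print(json.dumps(secure_bucket_policy("my-data-lake"), indent=2))
```

The resulting JSON document could be attached to the bucket through the console, the CLI, or an infrastructure template.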
• AWS CloudTrail
• AWS IAM
• Amazon CloudWatch
• AWS KMS
• Raw Data: Amazon S3
• ETL: Amazon EMR
• Advanced Analytics: MLlib
• Event Capture: Amazon Kinesis
• Stream Analysis: Amazon EMR
• AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
Angus Tse
Director of Engineering
21 June 2017
Clickstream Analytics Pipeline
HK01
Growth
"Data beats emotions"
Sean Rad, Founder & CEO, Tinder
Clickstream Analytics
Google Analytics (GA)
Pros
• Free and easy
• Excellent for initial
• Good learning materials
Cons
• Free version has latency and accuracy issues
• GA 360 (Premium) + BigQuery are expensive
• Not flexible enough
Our needs
• Large data volume
• Raw data for Machine Learning
• Flexible for further processing
• Low latency
Building a scalable pipeline on AWS
Piwik
• Open-source analytics platform
• Realtime dashboard
• Web & mobile SDK
• PageView
• Content / Media
• A/B Test
Phase 1
• API Gateway
• AWS Lambda
• Kinesis Firehose
• Redshift
• QuickSight
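The slide lists the Phase 1 components but not how they are wired. Assuming API Gateway invokes Lambda through a proxy integration and Lambda forwards each click to a Kinesis Firehose delivery stream, the handler might look roughly like the sketch below. The stream name `clickstream` and the `page` and `action` fields are hypothetical. `put_record` is the boto3 Firehose call for single records.

```python
import json
import time


def build_record(event: dict, now: float) -> bytes:
    """Convert an API Gateway proxy event into one newline-delimited JSON record."""
    body = json.loads(event.get("body") or "{}")
    identity = event.get("requestContext", {}).get("identity", {})
    click = {
        "received_at": int(now),
        "source_ip": identity.get("sourceIp"),
        "page": body.get("page"),      # hypothetical tracking field
        "action": body.get("action"),  # hypothetical tracking field
    }
    # A trailing newline keeps records separable after Firehose concatenates them.
    return (json.dumps(click, sort_keys=True) + "\n").encode("utf-8")


def handler(event, context, firehose=None, stream="clickstream"):
    """Lambda entry point; `firehose` is injectable so the sketch can run offline."""
    if firehose is None:
        import boto3  # only needed when running inside AWS
        firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName=stream,  # assumed delivery stream name
        Record={"Data": build_record(event, time.time())},
    )
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

A Firehose delivery stream can then be configured to load the buffered records into Redshift for dashboards.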
Experience on AWS
• Complete and integrated
• Quick: 2 man-weeks for the first version
• Easy to scale
• Minimal maintenance cost
Future
• More serverless in future
• S3 as data lake
• Click events, system logs, etc.
• Raw and processed data (such as ML results)
• Hot on disk, cold on S3
• Explore AWS Machine Learning
Future architecture (diagram)
Components: API Gateway, AWS Lambda, Kinesis Firehose, Redshift, QuickSight, S3
Machine learning: EMR (SparkML), ML, P2 (Deep learning AMI)
Data tiers: Hot data, Cold data, Raw data, Processed data
Direct Query: Athena, Redshift Spectrum
Other labels: Visualization, Serverless
Thanks
Download HK01
We're Hiring
Join Us: hk01.com/job
Summary
1. S3 as data lake
2. Pick the right tool to match the persona requirements
3. Go serverless

Building your Datalake on AWS

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dickson Yue Solution Architect 21 June 2017 Building Your First Data Lake Modern Data Architectures on AWS
  • 2. Today's conversation Business drivers for a Data Lake Designing and building Production use cases
  • 3. Outcome 1 : Modernize and consolidate • Insights to enhance business applications and create new digital services Outcome 2 : Innovate for new revenues • Personalization, demand forecasting, risk analysis Outcome 3 : Real-time engagement • Interactive customer experience, event-driven automation, fraud detection Outcome 4 : Automate for expansive reach • Automation of business processes and physical infrastructure Business Outcomes on a Modern Data Architecture
  • 4. Expanding access requirements Data scientists Automation / events Business users Data analysts Engagement platforms 1. More personas need access to data, through appropriate tools 2. More systems need to link to data for decision and process automation 3. Users need to be able to find information, and access it securely
  • 5. Exponential growth of business data 1. Data must be captured from diverse sources at speed and scale 2. Data needs to be pulled together, breaking down traditional silos 3. Benefits need to far outweigh the costs of collection and analysis Transactions ERP Connected devices Social mediaWeb logs / cookies
  • 6. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data analysts Data scientists Business users Engagement platforms Automation / events
  • 7. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Transactions Web logs / cookies ERP Data analysts Data scientists Business users Engagement platformsConnected devices Social media Automation / events
  • 8. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP Connected devices Social media
  • 9. Characteristics of a Data Lake Future Proof Flexible Access Dive in Anywhere Collect Anything
  • 10. Modern data architecture (diagram, as on slide 8), with the properties of Amazon S3 overlaid. Durable: designed for 11 9s of durability. Available: designed for 99.99% availability. Scalable: store as much as you need; scale storage and compute independently. Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB.
  • 11. Modern data architecture (diagram, as on slide 8), adding Amazon S3 as the staged data store (the data lake).
  • 12. Today's conversation: Business drivers for a Data Lake; Designing and building; Production use cases.
  • 13. Important Components of a Data Lake: Ingest & Store; Catalogue & Search; Protect & Secure; Access & User Interface.
  • 14. Data Ingestion into S3: AWS Direct Connect, AWS Snowball, ISV Connectors, Amazon Kinesis Firehose, AWS Storage Gateway, S3 Transfer Acceleration.
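For the streaming path named on this slide (Amazon Kinesis Firehose), producers usually group records before sending them. The sketch below only shows the batching step and runs offline: the actual send through an AWS SDK is left out. The per-call limits (500 records, 4 MiB) are my understanding of the Firehose batch API and should be checked against current documentation; the event fields and the function name `batch_records` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500            # assumed per-call record limit
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024  # assumed per-call size limit


def batch_records(events: Iterable[Dict],
                  max_records: int = MAX_RECORDS_PER_BATCH,
                  max_bytes: int = MAX_BYTES_PER_BATCH) -> Iterator[List[bytes]]:
    """Yield lists of newline-delimited JSON payloads that fit both limits."""
    batch: List[bytes] = []
    batch_bytes = 0
    for event in events:
        payload = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
        if len(payload) > max_bytes:
            raise ValueError("single record exceeds the batch size limit")
        # Start a new batch when adding this record would break a limit.
        if batch and (len(batch) >= max_records
                      or batch_bytes + len(payload) > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(payload)
        batch_bytes += len(payload)
    if batch:
        yield batch


# Hypothetical click events; each resulting batch would be one send call.
events = [{"page": "/home", "user": i} for i in range(1200)]
batches = list(batch_records(events))
```

Newline-delimited JSON is chosen here so that the objects landing in S3 can be read line by line by downstream tools; that choice is an assumption, not something the slide specifies.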
  • 15. Modern data architecture (diagram, as on slide 11), adding the ingest layer: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis.
  • 16. Building a Data Catalogue • Aggregated information about your storage & streaming layer • Storage service for metadata (ownership, data lineage) • Data abstraction layer (customer data = collection of prefixes) • Enabling data discovery • API for use by entitlements service
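The catalogue ideas on this slide (metadata such as ownership and lineage, a dataset as a collection of prefixes, and discovery) can be sketched as a small in-memory structure. This is an illustration only, not how AWS Glue or any specific catalogue service works; the class names, dataset names, and the `example-lake` bucket are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    name: str
    owner: str
    prefixes: List[str]  # S3 prefixes that together make up the dataset
    derived_from: List[str] = field(default_factory=list)  # upstream datasets


class DataCatalogue:
    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        """Data discovery: case-insensitive substring match on dataset names."""
        term = term.lower()
        return sorted(n for n in self._entries if term in n.lower())

    def prefixes_for(self, name: str) -> List[str]:
        """Abstraction layer: resolve a dataset name to its S3 prefixes."""
        return list(self._entries[name].prefixes)

    def lineage(self, name: str) -> List[str]:
        """Return all upstream dataset names, nearest first."""
        seen: List[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].derived_from)
        return seen


catalogue = DataCatalogue()
catalogue.register(DatasetEntry(
    "customer_raw", "ingest-team", ["s3://example-lake/raw/customer/"]))
catalogue.register(DatasetEntry(
    "customer_clean", "data-eng", ["s3://example-lake/processed/customer/"],
    derived_from=["customer_raw"]))
catalogue.register(DatasetEntry(
    "customer_features", "data-science", ["s3://example-lake/ml/customer/"],
    derived_from=["customer_clean"]))
```

An entitlements service, as mentioned on the slide, could call `prefixes_for` to translate a dataset-level grant into prefix-level permissions.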
  • 18. AWS Glue: Managed Transform Engine, Job Scheduler, Data Catalog. Built on Apache Spark. Integrated with S3, RDS, Redshift & any JDBC-compliant data store.
  • 19. Implement the right cloud security controls. Security: • Identity and Access Management (IAM) policies • Bucket policies • Access Control Lists (ACLs) • Private VPC endpoints to Amazon S3 • Pre-signed S3 URLs. Encryption: • SSL endpoints • Server-Side Encryption with S3-managed keys (SSE-S3) • Server-Side Encryption with customer-provided or KMS-managed keys (SSE-C, SSE-KMS) • Client-side encryption. Audit & Compliance: • Bucket access logs • Lifecycle management policies • Versioning & MFA delete • Certifications: HIPAA, PCI, SOC 1/2/3, etc.
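Two of the controls listed on this slide, SSL endpoints and server-side encryption, are commonly enforced with a bucket policy. The sketch below builds such a policy as a plain Python dictionary; it does not call AWS. The bucket name `example-data-lake` and the function name are hypothetical, and the policy assumes SSE-S3 (`AES256`); a bucket using SSE-KMS would need a different condition value.

```python
import json
from typing import Dict


def build_bucket_policy(bucket: str) -> Dict:
    """Return a policy that denies plain-HTTP access and unencrypted uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not request SSE-S3 encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "AES256"
                    }
                },
            },
        ],
    }


policy = build_bucket_policy("example-data-lake")
policy_json = json.dumps(policy, indent=2)  # text to attach to the bucket
```

Explicit deny statements are used because they take precedence over any allow granted elsewhere, for example in an IAM policy.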
  • 20. Modern data architecture (diagram, as on slide 15), adding security and monitoring services: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS.
  • 21. Modern data architecture (diagram, as on slide 20), adding the processing layers: Amazon S3 (raw data), Amazon EMR (ETL), MLlib (advanced analytics), Amazon Kinesis (event capture), Amazon EMR (stream analysis).
  • 22. Today's conversation: Business drivers for a Data Lake; Designing and building; Production use cases.
  • 23. Clickstream Analytics Pipeline, HK01. Angus Tse, Director of Engineering. 21 June 2017.
  • 28. Google Analytics (GA) • Free and easy • Excellent for initial • Good learning materials
  • 29. Google Analytics (GA). Pros: • Free and easy • Excellent for initial • Good learning materials. Cons: • Free version has latency and accuracy issues • GA 360 (Premium) + BigQuery are expensive • Not flexible enough
  • 30. Our needs: • Large data volume • Raw data for machine learning • Flexible for further processing • Low latency
  • 31. Building a scalable pipeline on AWS
  • 32. Piwik • Open-source analytics platform • Real-time dashboard • Web & mobile SDK • Tracks: PageView, Content / Media, A/B Test
  • 35. Experience on AWS • Complete and integrated • Quick: 2 man-weeks for the first version • Easy to scale • Minimal maintenance cost
  • 36. Future • More serverless in future • S3 as data lake: click events, system logs, etc.; raw and processed data (such as ML results) • Hot data on disk, cold data on S3 • Explore AWS machine learning. Diagram components: AWS Lambda, API Gateway, Kinesis Firehose, Redshift, QuickSight, S3, EMR (Spark ML), Athena, Redshift Spectrum, P2 with Deep Learning AMI. Diagram labels: serverless, hot data, cold data, raw data, processed data, direct query, machine learning, visualization.
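One way to picture "raw and processed data" in S3 together with a hot/cold split is a date-partitioned key layout plus an age-based tier rule. The sketch below is an assumption-laden illustration, not HK01's actual design: the `year=/month=/day=` folder convention, the 30-day hot window, and all names are hypothetical.

```python
from datetime import date, datetime, timezone

HOT_WINDOW_DAYS = 30  # hypothetical cutoff between hot and cold data


def event_key(stage: str, event_type: str, ts: datetime, file_name: str) -> str:
    """Build an S3 object key partitioned by stage, event type, and day."""
    if stage not in ("raw", "processed"):
        raise ValueError("stage must be 'raw' or 'processed'")
    return (f"{stage}/{event_type}/"
            f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{file_name}")


def storage_tier(event_day: date, today: date,
                 hot_window_days: int = HOT_WINDOW_DAYS) -> str:
    """Recent data stays 'hot' (warehouse disk); older data is 'cold' (S3 only)."""
    age_days = (today - event_day).days
    return "hot" if age_days <= hot_window_days else "cold"


key = event_key("raw", "click_event",
                datetime(2017, 6, 21, 12, 0, tzinfo=timezone.utc),
                "part-0001.json.gz")
```

Partitioning by date keeps direct-query engines from scanning the whole bucket when a query covers only a few days, which is the usual motivation for this kind of layout.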
  • 38. Thanks. Join us: hk01.com/job
  • 39. Summary: 1. S3 as data lake. 2. Pick the right tool to match the persona requirements. 3. Go serverless.