© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Neeraj Verma – AWS Solutions Architect
Saurav Mahanti – Senior Manager – Information Systems
Dario Rivera – AWS Solutions Architect
November 28, 2016
How to Build a Big Data Analytics
Data Lake
What to expect from this short talk
• Data Lake concept
• Data Lake - Important Capabilities
• AMGEN’s Data Lake initiative
• How to Build a Data Lake in your AWS account
Data Lake Concept
What is a Data Lake?
A Data Lake is a new and increasingly
popular way to store and analyze
massive volumes and heterogeneous
types of data in a centralized repository.
Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”
Benefits of a Data Lake – All Data in One Place
“Why is the data distributed in
many locations? Where is the
single source of truth ?”
Store and analyze all of your data,
from all of your sources, in one
centralized location.
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
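Schema-on-read can be illustrated in a few lines of Python: raw records are stored exactly as they arrive, and each consumer projects the fields it needs at query time. The record layout below is invented purely for illustration:

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
raw_records = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2016-11-28T10:00:00Z"}',
    '{"device": "sensor-2", "temp_c": 19.0, "ts": "2016-11-28T10:01:00Z", "humidity": 40}',
]

def read_with_schema(records, fields):
    """Apply a schema at read time: project only the fields a given
    analysis needs, tolerating records that lack some of them."""
    out = []
    for raw in records:
        doc = json.loads(raw)
        out.append({f: doc.get(f) for f in fields})
    return out

# Two consumers, two schemas, one copy of the data.
temps = read_with_schema(raw_records, ["device", "temp_c"])
climate = read_with_schema(raw_records, ["device", "temp_c", "humidity"])
```

The same idea is what engines like Hive or Presto do at scale: the table definition lives with the query, not with the stored bytes.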
Important Capabilities of a
“Data Lake”
Important components of a Data Lake
• Ingest and Store
• Catalogue & Search
• Protect & Secure
• Access & User Interface
[Architecture diagram: the data analytics pipeline — COLLECT → INGEST/STORE → PROCESS/ANALYZE → CONSUME]
• Collect: web and mobile apps (Amazon EC2), devices, sensors and IoT platforms (AWS IoT), data centers (AWS Direct Connect, AWS Import/Export Snowball), logging (Amazon CloudWatch, AWS CloudTrail) — producing records, documents, files, messages, and streams
• Ingest/Store: stream and queue (Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon SQS, Apache Kafka, Amazon DynamoDB Streams), file (Amazon S3), NoSQL/SQL/cache (Amazon DynamoDB, Amazon RDS, Amazon ElastiCache)
• Process/Analyze: batch (Amazon EMR, Presto), stream (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda), interactive SQL (Amazon Redshift), machine learning (Amazon Machine Learning), search (Amazon Elasticsearch Service), ETL
• Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE/API, apps & services
Many Tools to Support the Data Analytics Lifecycle
Ingest and Store
Ingest real time and batch data
Support for any type of data at scale
Durable
Low cost
Use S3 as Data Substrate – Apply to Compute as Needed
[Diagram: Amazon S3 as the central, highly durable, low-cost, scalable data substrate — loaded via Import/Export Snowball, with compute attached as needed: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm]
Amazon S3 as your cluster’s persistent data store
• Separate compute and storage
• Resize and shut down analytics compute environments with no data loss
• Point multiple compute clusters at the same data in Amazon S3
Data Ingestion into Amazon S3
• AWS Direct Connect
• AWS Snowball
• ISV connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway
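However the data arrives, landing it under a consistent, date-partitioned key layout pays off later, because engines like EMR/Hive and Redshift COPY can prune by date. A minimal sketch — the bucket name, prefix scheme, and the commented `put_object` call are illustrative assumptions, not part of any specific solution:

```python
from datetime import datetime, timezone

def build_object_key(source, filename, ts=None):
    """Build a date-partitioned S3 key (year=/month=/day=) so that
    downstream query engines can prune partitions by date."""
    ts = ts or datetime.now(timezone.utc)
    return (f"raw/{source}/year={ts.year}/month={ts.month:02d}/"
            f"day={ts.day:02d}/{filename}")

# With boto3 (not run here), the upload itself is a single call:
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-data-lake",  # hypothetical bucket name
#     Key=build_object_key("clickstream", "events-0001.json.gz"),
#     Body=payload,
#     ServerSideEncryption="AES256",
# )
```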
Metadata lake
• Used for summary statistics and data classification management
• Simplified model for data discovery & governance
Catalogue & Search
Catalogue & Search Architecture
[Architecture: data collectors (EC2, ECS) put objects into an S3 bucket; "object created" and "object deleted" events trigger an AWS Lambda function that puts items into a metadata index (DynamoDB); the DynamoDB update stream triggers a second AWS Lambda function that extracts search fields and updates the index in Amazon Elasticsearch Service]
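The Lambda stage at the front of this flow can be sketched as a small handler that turns each S3 event record into a metadata item; the field names are hypothetical, and the actual DynamoDB/Elasticsearch writes are omitted:

```python
def extract_search_fields(s3_event_record):
    """Shape one S3 event notification record into a metadata item
    suitable for a DynamoDB put and an Elasticsearch index update."""
    s3 = s3_event_record["s3"]
    key = s3["object"]["key"]
    return {
        "bucket": s3["bucket"]["name"],
        "key": key,
        "size_bytes": s3["object"].get("size", 0),
        "suffix": key.rsplit(".", 1)[-1] if "." in key else "",
        "event": s3_event_record.get("eventName", ""),
    }

def handler(event, context):
    """Lambda entry point: fan each record out to the index.
    (DynamoDB put_item / Elasticsearch index calls would go here.)"""
    return [extract_search_fields(r) for r in event.get("Records", [])]
```

The `Records` / `s3.bucket.name` / `s3.object.key` shape matches the S3 event notification format that S3 delivers to Lambda.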
Protect and Secure
• Access control – authentication & authorization
• Data protection – encryption
• Logging and monitoring
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• Private VPC endpoints to Amazon S3
• SSL endpoints

Encryption
• Server-side encryption (SSE-S3)
• S3 server-side encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption

Compliance
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
Implement the right controls
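As one concrete control from the list above, a bucket policy can deny any `PutObject` request that does not ask for server-side encryption. A sketch — the bucket name is a placeholder, and AWS guidance typically pairs this statement with a second one using a `Null` condition to also catch requests that omit the header entirely:

```python
import json

def require_sse_policy(bucket):
    """Bucket policy sketch denying PutObject requests that do not
    request SSE-S3 (swap the value for "aws:kms" to require SSE-KMS)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            },
        }],
    }

# The policy document is attached to the bucket as JSON.
policy_json = json.dumps(require_sse_policy("my-data-lake"))
```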
API & User Interface
• Exposes the data lake to customers
• Programmatically query the catalogue
• Expose a search API
• Ensures that entitlements are respected
API & UI Architecture
[Architecture: users sign in through user management and a static website; requests flow through API Gateway to AWS Lambda, which queries the metadata index]
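The search Lambda behind the API might translate a caller's request into a query against the metadata index, folding the caller's entitlements in as a filter so results only include datasets their groups may see. A sketch — field names such as `name` and `owner_group` are assumptions about the index mapping:

```python
def build_search_query(term, allowed_groups, max_hits=25):
    """Build an Elasticsearch-style query DSL document: free-text
    match on the dataset name, filtered by the caller's groups so
    entitlements are enforced at query time."""
    return {
        "size": max_hits,
        "query": {
            "bool": {
                "must": [{"match": {"name": term}}],
                "filter": [{"terms": {"owner_group": allowed_groups}}],
            }
        },
    }
```

Filtering inside the query (rather than post-filtering results) keeps paging correct and never leaks even the existence of unauthorized datasets.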
Putting It All Together
Common Add-On Capabilities to a Data Lake
• Backend ETL
• Organizational control gates for dataset access
• Data correlation identification via machine learning
• Dataset updates to dependent compute environments
Data Lake is a Journey
There are multiple implementation methods for building a Data Lake:
• Using a combination of specialized/open-source tools from various providers to build a customized solution
• Using managed services from AWS such as S3, Amazon Elasticsearch Service, DynamoDB, Cognito, Lambda, etc., as a core Data Lake solution
• Using a Data-Lake-as-a-Service solution
Data Lake Evolution @ Amgen
SAURAV MAHANTI
Senior Manager - Information Systems
• One Data Lake, one code-base
• Multiple locations (cloud / on-premise)
• Common scalable infrastructure
• Multiple business functions
• Shared features
A CONCEPTUAL VIEW OF THE DATA LAKE
DATA LAKE PLATFORM
Common Tools and Capabilities
• Innovative new tools and capabilities are reused across all functions (e.g. search, data processing & storage, visualization tools)
• Functions can manage their own data, while contributing to the common data layer
• Business applications are built to meet specific information needs, from simple data access and data visualization to complex statistical/predictive models

Example applications by function:
• Manufacturing Data: Batch Genealogy Visualization, PD Self-Service Analytics, Instrument Bus Data Search
• Real World Data: FDA Sentinel Analytics, Epi Programmer's Workbench, Patient Population Analytics
• Commercial Data: US Field Sales Reporting, Global Forecasting, US Marketing Analytics

Best Practices Awards: Bio-IT World 2016 Winner
HIGH-LEVEL COMPONENT ARCHITECTURE of the Data Lake

[Component diagram: Procure/Ingest (data ingestion adapter, managed infrastructure, self-service adapter, application accelerator) → Curate and Enrichment (self-service data integration tools, IS-owned automated integration, data catalog, reference data linkage) → Storage and Processing (user data workspace; raw, mastered, and apps zones; elastic compute capability) → Analytics Toolkit (analytical toolkit, data science toolkit, BI reporting, user-specific apps) → Access Portal]

• Procure/Ingest – a collection of tools and technologies to easily move data to and from the Data Lake
• Curate and Enrichment layer – enhances the value and usability of the data through automation, cataloging, and linking
• Storage and Processing layer – the core of the Enterprise Data Lake that enables its scale and flexibility
• Centralized Analytics Toolkit – provides different reporting and analytics solutions for end users to access the data on the Data Lake
DATA PROCESSING AND STORAGE LAYER

[Diagram: the Storage and Processing layer — user data workspace; raw, mastered, and apps zones; elastic compute capability — shown in the context of the Procure/Ingest, Curate and Enrichment, and Centralized Analytics Toolkit layers]

• The combination of Hadoop HDFS and Amazon S3 provides the right cost/performance balance with unlimited scalability, while maintaining security and encryption at the data file level
• Powerful execution engines like YARN, MapReduce, and Spark bring the "compute to the data"
• Hive and Impala provide SQL over structured data; HBase is used for NoSQL/transactional jobs
• Solr is used for search capabilities over documents
• MarkLogic and Neo4j provide semantic and graph capabilities
• Amazon Redshift and Amazon EMR provide low-cost elastic computing, with the ability to spin up clusters for data processing and metrics calculations
PROCURE AND INGEST – Pre-built and configurable common components to load any type of data into the Lake

[Diagram: the Procure/Ingest layer — data ingestion adapter, managed infrastructure, self-service adapter, application accelerator — shown in the context of the other Data Lake layers]

Cloud data integration tools like SnapLogic enable data analysts and end users to build data pipelines into the Data Lake using pre-built connectors to various cloud-hosted services like Box, and make it easy to move data sets between HDFS, S3, sFTP, and file shares.
Structured Data Ingestion – a common component for scheduled production data loads of incremental or full data. It uses Python and native Hadoop tools like Sqoop and Hive for efficiency and speed.

Real-Time Data Ingestion – for real-time or streaming data. It uses the Kafka messaging queue, Spark Streaming, HBase, and Java.

Unstructured Data Pipeline – a Morphlines document pipeline for document ingestion, text processing, and indexing. It uses Morphlines, the Lily indexer, and HBase.
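For the structured-ingestion path, the Sqoop invocation can be assembled in Python and handed to a scheduler. The flags below are standard Sqoop import options; the connection string, table, and target directory are placeholders:

```python
def sqoop_import_args(jdbc_url, table, target_dir, mappers=4):
    """Argument list for a Sqoop import (illustrative; connection
    details are placeholders). Pass to subprocess.run on a cluster
    node where the sqoop CLI is installed."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC URL of the source database
        "--table", table,               # source table to pull
        "--target-dir", target_dir,     # HDFS/S3 destination directory
        "--num-mappers", str(mappers),  # parallel map tasks
        "--as-avrodatafile",            # write Avro for schema evolution
    ]
```

Wrapping the CLI this way keeps the scheduling logic (incremental vs. full load, retries) in Python while Sqoop handles the parallel data movement.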
CURATE AND ENRICHMENT
Enhance the value of the data by linking and cataloging
• Reference Data Linkage – connect datasets to ontologies and vocabularies to get more relevant and better results
• Data Catalog – an enterprise-wide metadata catalog that stores, describes, indexes, and shows how to access any registered data asset
• Analytical Subject Areas – build targeted applications using data integration tools like SnapLogic, or deploy packaged applications or data marts that transform the data in the Lake for consumption by end-user tools (e.g. PD Self-Service Analytics, FDA Sentinel Analytics, US Marketing Analytics)
CENTRALIZED ANALYTICS TOOLKIT – provides reporting and analytics solutions for end users to access the data in the Lake

• Analytics and visualization tools are the most common methods used to query and analyze data in the Lake
• Data Science Toolkit – enables analysts and data scientists to use tools like SAS, R, and Jupyter (Python notebooks) to connect to their analytics sandbox within the Lake and submit distributed computing jobs
• Reporting and business intelligence tools like MicroStrategy and Cognos can be deployed either directly on the Lake or on derived Analytical Subject Areas
• Focused applications that target a specific use case can be built using open-source products and deployed to a specific user community through the portal
• Data Lake Portal – provides a user-friendly, mobile-enabled, and secure portal to all the end-user applications
Business Impact

Real World Data Platform: a shared capability used by multiple business functions from R&D and Commercial
• RWD Data Lake – claims, EMR+; datasets converted to the common OMOP data model (Asia, US, EU, ROW)
• Superior processing speed enables simultaneous processing of terabytes of data
• Design & analytic toolbox, with pre-calculated cohorts for all pipeline & marketed medicines
• Descriptive analytics: rapid RWD query, targeted demand, therapy areas, study design
• Advanced analytics: Spotfire, R, SAS, Achilles
• Outcomes: marketed product defense, clinical trial speed & outcomes, product value, benefit:risk, evidence-based research
That’s a lot of Work! –
Where do I even Start?!
Introducing: the AWS Data Lake Solution
Presented by: Dario Rivera – AWS Solutions Architect
Built by: Sean Senior – AWS Solutions Builder

A package of code, deployed via CloudFormation into your AWS account.
Architecture Overview
Data Lake Serverless Composition
• Amazon S3 (2 buckets)
  • Primary data lake content storage
  • Static website hosting
• Amazon API Gateway (1 RESTful API)
• Amazon DynamoDB (7 tables)
• Amazon Cognito User Pools (1 user pool)
• AWS Lambda
  • 5 microservice functions
  • 1 Amazon API Gateway custom authorizer function
  • 1 Amazon Cognito User Pool event trigger function (pre sign-up, post confirmation)
• Amazon Elasticsearch Service (1 cluster)
• AWS IAM (7 policies, 7 roles)
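Launching the stack amounts to a single CloudFormation call. A sketch of the parameters — the stack name, template URL, and the `AdministratorEmail` parameter key are hypothetical; substitute the values published with the solution on AWS Answers:

```python
def launch_params(template_url, admin_email):
    """Parameters for a CloudFormation create_stack call deploying
    the Data Lake Solution (names here are placeholders)."""
    return {
        "StackName": "data-lake-solution",
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": "AdministratorEmail",
             "ParameterValue": admin_email},
        ],
        # Required because the stack creates IAM roles and policies.
        "Capabilities": ["CAPABILITY_IAM"],
    }

# With boto3 (not run here):
# import boto3
# boto3.client("cloudformation").create_stack(
#     **launch_params("https://example.com/data-lake.template",
#                     "admin@example.com"))
```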
Demo of
AWS Data Lake Solution
http://tinyurl.com/DataLakeSolution
All of this, and it costs less than $1/hour to run the Data Lake Solution*
* Excluding storage and analytics environment costs
The Data Lake Solution will be available at the end of Q4, via AWS Answers:
https://aws.amazon.com/answers/
Thank you!
Neeraj Verma – rajverma@amazon.com
Saurav Mahanti – smahanti@amgen.com
Dario Rivera – darior@amazon.com
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neeraj Verma – AWS Solutions Architect | Saurav Mahanti – Senior Manager – Information Systems | Dario Rivera – AWS Solutions Architect. November 28, 2016. How to Build a Big Data Analytics Data Lake
  • 2. What to expect from this short talk • Data Lake concept • Data Lake - Important Capabilities • AMGEN’s Data Lake initiative • How to Build a Data Lake in your AWS account
  • 4. What is a Data Lake? A data lake is a new and increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
  • 5. Benefits of a Data Lake – Quick Ingest Quickly ingest data without needing to force it into a pre-defined schema. “How can I collect data quickly from various sources and store it efficiently?”
  • 6. Benefits of a Data Lake – All Data in One Place. “Why is the data distributed in many locations? Where is the single source of truth?” Store and analyze all of your data, from all of your sources, in one centralized location.
  • 7. Benefits of a Data Lake – Storage vs Compute. “How can I scale up with the volume of data being generated?” Separating your storage and compute allows you to scale each component as required.
  • 8. Benefits of a Data Lake – Schema on Read “Is there a way I can apply multiple analytics and processing frameworks to the same data?” A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
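The schema-on-read idea above can be sketched in a few lines: the same raw records are stored once, and each consumer projects its own schema at read time instead of forcing one schema at write time. The record shape and field names here are invented for illustration, not from the talk.

```python
import json

# Raw events as they landed in the lake -- no schema was imposed on write.
RAW_EVENTS = [
    '{"user": "u1", "ts": "2016-11-28T10:00:00Z", "page": "/home", "ms": 120}',
    '{"user": "u2", "ts": "2016-11-28T10:00:05Z", "page": "/cart", "ms": 340}',
]

def read_as_clickstream(lines):
    """Analytics view: who visited which page (one schema on read)."""
    return [{"user": r["user"], "page": r["page"]}
            for r in map(json.loads, lines)]

def read_as_latency(lines):
    """Performance view: response time per event (a different schema on read)."""
    return [{"ts": r["ts"], "ms": r["ms"]}
            for r in map(json.loads, lines)]

print(read_as_clickstream(RAW_EVENTS)[0])  # {'user': 'u1', 'page': '/home'}
```

Two processing frameworks can consume the same stored bytes this way without either one dictating the storage format.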
  • 9. Important Capabilities of a “Data Lake”
  • 10. Important components of a Data Lake Ingest and Store Catalogue & Search Protect & Secure Access & User Interface
  • 11. Many tools to support the data analytics lifecycle. COLLECT: web apps, mobile apps, devices and sensors (AWS IoT), data centers (AWS Direct Connect, AWS Import/Export Snowball), logging (Amazon CloudWatch, AWS CloudTrail) — producing records, documents, files, messages, and streams. INGEST/STORE: stream (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka), queue (Amazon SQS), file (Amazon S3), cache (Amazon ElastiCache), NoSQL (Amazon DynamoDB, Amazon DynamoDB Streams), SQL (Amazon RDS). PROCESS/ANALYZE: batch (Amazon EMR, Presto), streaming (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon SQS apps), ML (Amazon Machine Learning), SQL (Amazon Redshift), search (Amazon Elasticsearch Service), interactive (Amazon EC2). CONSUME: analysis and visualization (Amazon QuickSight), notebooks, IDEs, APIs, apps and services.
  • 12. Ingest and Store – ingest real-time and batch data; support for any type of data at scale; durable; low cost.
  • 13. Use S3 as the data substrate – apply compute as needed (EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm, Import/Export Snowball). Amazon S3: highly durable, low-cost, scalable storage.
  • 14. Amazon S3 as your cluster’s persistent data store. Separate compute and storage; resize and shut down analytics compute environments with no data loss; point multiple compute clusters at the same data in Amazon S3.
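The "point multiple clusters at the same data" pattern boils down to agreeing on one canonical S3 location per dataset and configuring each engine against it. A minimal sketch, with placeholder bucket and prefix names (not from the talk):

```python
# Sketch: several transient compute clusters share one S3 prefix as the
# persistent store. Clusters can be resized or terminated freely; the
# data outlives them.
def dataset_uri(bucket, prefix, scheme="s3://"):
    """Canonical location of a dataset in the lake."""
    return f"{scheme}{bucket}/{prefix.strip('/')}/"

LAKE = dataset_uri("example-data-lake", "raw/clickstream/2016/11/28")

# Each engine is configured against the same URI -- illustrative config
# dicts only, not real API calls:
spark_read   = {"format": "json", "path": LAKE}        # e.g. an EMR Spark job
presto_table = {"external_location": LAKE}             # e.g. a Presto external table
```

Because both configurations reference the same URI, a new cluster sees exactly the data an earlier cluster wrote, with no copy step.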
  • 15. Data ingestion into Amazon S3: AWS Direct Connect, AWS Snowball, ISV connectors, Amazon Kinesis Firehose, S3 Transfer Acceleration, AWS Storage Gateway.
  • 16. Catalogue & Search – a metadata lake used for summary statistics and data classification management; a simplified model for data discovery & governance.
  • 17. Catalogue & Search architecture: data collectors (EC2, ECS) put objects into an S3 bucket; object-created and object-deleted events trigger an AWS Lambda function that extracts search fields and puts an item into the metadata index (DynamoDB); the DynamoDB update stream triggers a second Lambda function that updates the index in Amazon Elasticsearch Service.
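The "extract search fields" step of that pipeline can be sketched as a small Lambda-style handler: pull the bucket, key, size and event type out of an S3 notification so they can be written to the metadata index. The notification field names follow the standard S3 event format; the index schema itself is an assumption for illustration.

```python
# Sketch of the "Extract Search Fields" function in the architecture above.
def extract_search_fields(s3_event):
    """Flatten an S3 event notification into metadata-index items."""
    items = []
    for rec in s3_event.get("Records", []):
        items.append({
            "bucket": rec["s3"]["bucket"]["name"],
            "key": rec["s3"]["object"]["key"],
            "size": rec["s3"]["object"].get("size", 0),
            "event": rec["eventName"],  # e.g. "ObjectCreated:Put"
        })
    return items

sample = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "lake-raw"},
           "object": {"key": "clickstream/part-0001.json", "size": 1024}},
}]}
# A real handler would follow with dynamodb.put_item(Item=...) for each
# item; the DynamoDB update stream then drives the Elasticsearch index.
```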
  • 18. Protect and Secure – access control (authentication & authorization); data protection (encryption); logging and monitoring.
  • 19. Implement the right controls. Security:  Identity and Access Management (IAM) policies,  bucket policies,  Access Control Lists (ACLs),  query string authentication,  private VPC endpoints to Amazon S3. Encryption:  SSL endpoints,  Server-Side Encryption (SSE-S3),  S3 Server-Side Encryption with provided/managed keys (SSE-C, SSE-KMS),  client-side encryption. Compliance:  bucket access logs,  lifecycle management policies,  ACLs,  versioning & MFA delete,  certifications (HIPAA, PCI, SOC 1/2/3, etc.).
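One way to enforce the encryption controls listed above is a bucket policy that rejects any PUT lacking a server-side-encryption header. A minimal sketch, with a placeholder bucket name; the `Null` condition on `s3:x-amz-server-side-encryption` is a standard S3 policy condition key:

```python
import json

def require_sse_policy(bucket):
    """Bucket policy denying s3:PutObject requests without SSE."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            # True when the SSE header is absent -> request is denied.
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }],
    }

policy_json = json.dumps(require_sse_policy("example-data-lake"))
# Applied with boto3 as: s3.put_bucket_policy(Bucket=..., Policy=policy_json)
```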
  • 20. API & User Interface – exposes the data lake to customers; programmatically query the catalogue; expose a search API; ensures that entitlements are respected.
  • 21. API & UI architecture: users reach a static website backed by user management; requests flow through API Gateway and AWS Lambda to the metadata index.
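The "entitlements are respected" requirement is typically enforced at the front door with an API Gateway custom authorizer: a Lambda function validates the caller's token and returns an IAM policy allowing or denying the invocation. The response shape below follows the custom authorizer contract; the token-validation logic is omitted, and the principal and ARN values are placeholders.

```python
# Sketch of a custom authorizer's output: an IAM policy document that
# API Gateway evaluates before invoking the backing Lambda.
def authorizer_response(principal_id, method_arn, allow):
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow" if allow else "Deny",
                "Resource": method_arn,
            }],
        },
    }

# A real handler would first verify the caller's JWT / Cognito token,
# then call authorizer_response(...) with the result of that check.
```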
  • 22. Putting It All Together
  • 24. Common add-on capabilities to the data lake: backend ETL; organizational control gates for dataset access; data correlation identification via machine learning; dataset updates to dependent compute environments.
  • 25. A data lake is a journey. There are multiple implementation methods for building a data lake: • using a combination of specialized/open-source tools from various providers to build a customized solution • using managed services from AWS such as S3, Amazon Elasticsearch Service, DynamoDB, Cognito, Lambda, etc., as a core data lake solution • using a data-lake-as-a-service solution
  • 26. Data Lake Evolution @ Amgen SAURAV MAHANTI Senior Manager - Information Systems
  • 27. One Data Lake One code-base Multiple Locations (Cloud / On-Premise) Common scalable infrastructure Multiple Business Functions Shared Features
  • 28. A conceptual view of the data lake. Data lake platform – common tools and capabilities: innovative new tools and capabilities are reused across all functions (e.g. search, data processing & storage, visualization tools). Functions can manage their own data while contributing to the common data layer. Business applications are built to meet specific information needs, from simple data access and data visualization to complex statistical/predictive models. Manufacturing data: batch genealogy visualization, PD self-service analytics, instrument bus data search. Real-world data: FDA Sentinel analytics, Epi programmer’s workbench, patient population analytics. Commercial data: US field sales reporting, global forecasting, US marketing analytics. Best Practices Awards – Bio-IT World 2016 winner.
  • 29. High-level component architecture of the data lake. Procure/Ingest – a collection of tools and technologies (data ingestion adapter, self-service adapter, application accelerator, managed infrastructure) to easily move data to and from the data lake. Curate and Enrichment – enhances the value and usability of the data through automation, cataloging and linking (self-service data integration tools, IS-owned automated integration, data catalog, reference data linkage). Storage and Processing – the core of the enterprise data lake that enables its scale and flexibility (user data workspace; raw, mastered, and apps zones; elastic compute capability). Analytics Toolkit – provides different reporting and analytics solutions for end users to access the data on the data lake (analytical toolkit, data science toolkit, BI reporting, user-specific apps, access portal).
  • 30. Data processing and storage layer – the core of the enterprise data lake that enables its scale and flexibility. The combination of Hadoop HDFS and Amazon S3 provides the right cost/performance balance with unlimited scalability while maintaining security and encryption at the data-file level. Powerful execution engines like YARN, MapReduce and Spark bring the “compute to the data.” Hive and Impala provide SQL over structured data; HBase is used for NoSQL/transactional jobs. Solr is used for search capabilities over documents. MarkLogic and Neo4j provide semantic and graph capabilities. Amazon Redshift and Amazon EMR provide low-cost elastic computing with the ability to spin up clusters for data processing and metrics calculations.
  • 31. Procure and Ingest – pre-built and configurable common components to load any type of data into the lake. Cloud data integration tools like SnapLogic enable data analysts and end users to build data pipelines into the data lake using pre-built connectors to cloud-hosted services like Box, and make it easy to move data sets between HDFS, S3, sFTP and file shares. Structured data ingestion – a common component for scheduled production data loads of incremental or full data; it uses Python and native Hadoop tools like Sqoop and Hive for efficiency and speed. Real-time data ingestion – for real-time or streaming data; it uses the Kafka messaging queue, Spark Streaming, HBase and Java. Unstructured data pipeline – a Morphlines document pipeline for document ingestion, text processing and indexing; it uses Morphlines, Lily indexer and HBase.
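The scheduled incremental loads described above are often driven by generating a Sqoop import command per table per run. A hedged sketch: the flags (`--incremental append`, `--check-column`, `--last-value`, `--target-dir`) are standard Sqoop arguments, but the connection string, table and column names are placeholders, not Amgen's.

```python
# Illustrative only: build a Sqoop incremental-import command like the
# scheduled structured-data loader might issue.
def sqoop_incremental_cmd(jdbc_url, table, check_column, last_value, target):
    """Import only rows whose check_column exceeds the last loaded value."""
    return (
        "sqoop import"
        f" --connect {jdbc_url}"
        f" --table {table}"
        " --incremental append"
        f" --check-column {check_column}"
        f" --last-value {last_value}"
        f" --target-dir {target}"
    )

cmd = sqoop_incremental_cmd(
    "jdbc:mysql://db.example.com/sales", "orders",
    "updated_at", "2016-11-27", "/data/raw/orders",
)
# A scheduler (cron, Oozie, etc.) would run `cmd` and persist the new
# high-water mark for the next incremental run.
```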
  • 32. Curate and Enrichment – enhance the value of the data by linking and cataloging. Reference data linkage – connect datasets to ontologies and vocabularies to get more relevant and better results. Data catalog – an enterprise-wide metadata catalog that stores, describes, indexes, and shows how to access any registered data asset. Analytical subject areas – build targeted applications using data integration tools like SnapLogic, or deploy packaged applications or data marts that transform the data in the lake for consumption by end-user tools (e.g. PD self-service analytics, FDA Sentinel analytics, US marketing analytics).
  • 33. Centralized analytics toolkit – provides reporting and analytics solutions for end users to access the data in the lake. Analytics and visualization tools are the most common methods used to query and analyze data in the lake. The data science toolkit enables analysts and data scientists to use tools like SAS, R and Jupyter (Python notebooks) to connect to their analytics sandbox within the lake and submit distributed-computing jobs. Reporting and business intelligence tools like MicroStrategy and Cognos can be deployed either directly on the lake or on derived analytical subject areas. Focused applications that target a specific use case can be built using open-source products and deployed to a specific user community through the portal. The data lake portal provides a user-friendly, mobile-enabled and secure portal to all the end-user applications.
  • 34. Real-world data platform – a shared capability used by multiple business functions from R&D and Commercial. RWD data lake: claims, EMR and more, with datasets converted to the common OMOP data model (US, EU, Asia, rest of world); superior processing speed enables simultaneous processing of terabytes of data. Design & analytic toolbox: pre-calculated cohorts for all pipeline and marketed medicines, rapid descriptive RWD query, study design, advanced analytics (Spotfire, R, SAS, Achilles). Business impact: marketed product defense, clinical trial speed & outcomes, targeted demand across therapy areas, product value, benefit:risk, evidence-based research.
  • 35. That’s a lot of Work! – Where do I even Start?!
  • 36. Introducing: Data Lake Solution. Presented by: Dario Rivera – AWS Solutions Architect. Built by: Sean Senior – AWS Solutions Builder.
  • 37. Data Lake Solution Package of Code – Deployed via CloudFormation into your AWS Account
  • 39. Data Lake Solution serverless composition: • Amazon S3 (2 buckets) – primary data lake content storage; static website hosting • Amazon API Gateway (1 RESTful API) • Amazon DynamoDB (7 tables) • Amazon Cognito User Pools (1 user pool) • AWS Lambda – 5 microservice functions, 1 Amazon API Gateway custom authorizer function, 1 Amazon Cognito User Pool event trigger function (pre sign-up, post confirmation) • Amazon Elasticsearch Service (1 cluster) • AWS IAM (7 policies, 7 roles)
  • 40. Demo of AWS Data Lake Solution http://tinyurl.com/DataLakeSolution
  • 42. All of this, and it costs less than $1/hour to run the Data Lake Solution* (*excluding storage and analytics environment costs).
  • 43. The Data Lake Solution will be available at the end of Q4, via AWS Answers: https://aws.amazon.com/answers/
  • 44. Thank you! Neeraj Verma – rajverma@amazon.com Saurav Mahanti – smahanti@amgen.com Dario Rivera – darior@amazon.com