BIG DATA ANALYTICS IN THE CLOUD
Sushant Rao
Cloud Product Marketing @ Cloudera
Rohit Pujari
Solutions Architect @ Amazon Web Services
© Cloudera, Inc. All rights reserved.
Primary Advantages for Cloud
● Agility
○ Speed of making changes to meet business / technical needs
● Scalable & Elastic
○ Scale up and down quickly
● Reliable
○ Multiple options to ensure infrastructure / services are available
● Cost effectiveness
○ Pay for what you use (but may not be cheaper than on-prem)
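To make the "pay for what you use" caveat concrete, here is a minimal sketch comparing a fixed monthly on-prem cost with usage-based cloud pricing. Every figure and name in it (`ON_PREM_MONTHLY`, `CLOUD_HOURLY_PER_NODE`, `NODES`) is a hypothetical placeholder, not Cloudera or AWS pricing.

```python
# Hypothetical figures for illustration only; not real vendor pricing.
ON_PREM_MONTHLY = 20_000.0        # assumed fixed monthly cost of an on-prem cluster
CLOUD_HOURLY_PER_NODE = 1.50      # assumed hourly cost of one cloud node
NODES = 30                        # assumed cluster size


def cloud_monthly_cost(hours_used: float) -> float:
    """Usage-based cost: you pay only for the hours the cluster runs."""
    return hours_used * NODES * CLOUD_HOURLY_PER_NODE


def break_even_hours() -> float:
    """Hours per month at which cloud cost equals the fixed on-prem cost."""
    return ON_PREM_MONTHLY / (NODES * CLOUD_HOURLY_PER_NODE)


if __name__ == "__main__":
    for hours in (100, 300, 730):  # 730 is roughly a full month, always on
        cost = cloud_monthly_cost(hours)
        verdict = "cheaper" if cost < ON_PREM_MONTHLY else "not cheaper"
        print(f"{hours:>4} h/month: ${cost:,.0f} in cloud ({verdict} than on-prem)")
    print(f"break-even at about {break_even_hours():.0f} h/month")
```

Under these assumed numbers, intermittent workloads stay well below the fixed cost, while an always-on cluster exceeds it, which is the situation the caveat describes.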
Big Data Use Cases for Cloud
● Corporate directive to leverage the cloud
○ C-level has decided to utilize the cloud more
● Disaster Recovery “location” in the cloud
○ Back up all data to the cloud, without a second “physical” location
● On-demand data mart / data engineering
○ Separate environment for new, production workloads
○ Ad-hoc workloads that run intermittently
● Sandbox environment for workloads
○ Environment to test queries and algorithms
Cloudera’s Solution for Data Analytics / Engineering in Cloud
• The modern platform for machine learning and analytics
○ Numerous functions for all types of jobs and queries
• with multiple deployment options
○ On-premises, public cloud (including multi-cloud), and hybrid
• and one shared data experience
○ Framework for consistent security, governance, and metadata management across applications and deployments
The Modern Platform for Machine Learning & Analytics
DATA ENGINEERING: DATA PROCESSING
• Cost efficient
• Reliable
• Scalable
• Based on Spark, MapReduce, Hive & Pig
• Supported by Workload Analytics
DATA WAREHOUSE: FAST BI & SQL
• Flexibility
• Elastic scale
• Go beyond SQL
• Based on Impala & Hive
• SQL dev environment
• Supported by Workload Analytics
DATA SCIENCE: MACHINE LEARNING
• Fast dev to production
• Secure self-serve
• Based on Python, R, and Spark
• ML dev environment (CDSW)
OPERATIONAL DATABASE: ONLINE & REAL-TIME
• High throughput, low latency
• Strongly consistent
• Based on HBase, Kudu & Spark Streaming
Cloudera’s Vision for AI and Machine Learning
Modern Enterprise Platform, Tools, and Expert Guidance to help you Unlock Business Value with ML / AI
• Agile platform to build, train, and deploy scalable ML applications
• Enterprise data science tools to accelerate team productivity
• Expert guidance, services & training to fast track value & scale
With Multiple Deployment Options
[Diagram: three deployment options]
• Traditional infrastructure (combined storage and compute)
• Cloud infrastructure (decoupled storage and compute), via Cloudera Altus Director (infrastructure services)
• Cloud infrastructure (decoupled storage and compute), via Cloudera Altus Services
Workloads shown: Operational Database, Data Engineering, Data Warehouse, Data Science; Data Engineering and Data Warehouse are shown a second time
Cloudera Enterprise with SDX
Benefits for IT infra & ops
● Central control and security
● Focus on curating not firefighting
Benefits for users
● Value from single source of truth
● Bring the best tools for each job
[Diagram: platform layers]
WORKLOADS: Data Engineering, Data Science, Data Warehouse, Operational Database, 3rd Party Services
COMMON SERVICES: Data Catalog, Governance, Security, Lifecycle Management
STORAGE: HDFS, Amazon S3, Microsoft ADLS, Kudu
Many Options for Data Analytics / Engineering in the Cloud
[Diagram: paths to Altus Director and Altus Services]
• Existing On-Prem Deployment
• Starting New Deployment
Journey from On-Prem Cluster to Cloud
[Diagram, built up over four slides. Each stage shows a Cloudera cluster with COMPUTE (Data Engineering, Data Warehouse, Data Science), DATA CONTEXT (Security, Metadata, Governance), and STORAGE (HDFS or cloud object store).]
0 - ON PREMISES: persistent Cloudera cluster on bare metal, storage on HDFS
1 - LIFT AND SHIFT: persistent Cloudera cluster in the customer cloud, storage on HDFS
2 - OBJECT STORAGE: persistent Cloudera cluster in the customer cloud, storage on a cloud object store
3 - CLOUD NATIVE ARCHITECTURES: transient Altus clusters (Data Engineering, Data Warehouse) alongside a persistent Director cluster in the customer cloud, with data context and storage on a cloud object store, and the Cloudera Altus control plane in the Cloudera cloud
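Stage 2 of the journey swaps HDFS for a cloud object store. The sketch below illustrates what that means for data locations, assuming the Hadoop S3A connector's `s3a://` scheme and a hypothetical bucket named `example-data-lake`. The helper `to_object_store_uri` is made up for this example and is not part of any Cloudera tool.

```python
from urllib.parse import urlparse

# Hypothetical bucket name used only for this sketch.
BUCKET = "example-data-lake"


def to_object_store_uri(hdfs_uri: str, bucket: str = BUCKET) -> str:
    """Map an hdfs:// location to the same path under an s3a:// bucket."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Keep the directory layout; only the storage endpoint changes.
    return f"s3a://{bucket}{parsed.path}"


if __name__ == "__main__":
    print(to_object_store_uri("hdfs://namenode:8020/warehouse/sales/2018"))
```

Keeping the directory layout unchanged is a simplifying assumption; a real migration would also need to copy the data and update table metadata.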
Customer Examples
Many Cloudera customers (Global 5K) use the public cloud
• Online retailer
○ Over 2,000 nodes with ~2PB of data on AWS running in an active-active configuration
○ Transforming data with Spark and then analyzing with Apache Hive
• German chain of coffee retailers and cafés
○ 30+ nodes with 50TB of data on AWS
○ Modern Cloudera platform with an Impala data warehouse
• Global information company
○ 50+ nodes on Microsoft Azure and 20+ nodes on AWS
○ Replaced Netezza with Hadoop and uses both Impala and Spark for analytics
Security Use Case
Cloudera uses the cloud as well
Altus-based solution cut costs by more than 50% compared to the initial implementation
Cloudera Altus
Key Differentiators
• Multi-function: Unified platform for data engineering, data warehouse, and data science
• Multi-cloud: Options for on-premises, public cloud (including multi-cloud), and hybrid
• SDX: Integrated shared data experience across multi-function clusters
Rohit Pujari, Solutions Architect
AWS Security & Compliance
Why is security traditionally so hard?
Lack of visibility
Low degree of automation
Before… Move fast OR Stay secure
Now… Move fast AND Stay secure
Making life easier
Choosing security does not mean giving up on convenience or introducing complexity
The most sensitive workloads run on AWS
“We can be even more secure in the AWS cloud
than in our own datacenters.”
—Tom Soderstrom, CTO, NASA JPL
“We knew the cloud was the only way to get the scalability,
speed, and security our customers expect from 3M.”
—Rick Austin, 3M Health Information Systems
“We determined that security in AWS is superior to our on-premises
data center across several dimensions, including patching,
encryption, auditing and logging, entitlements, and compliance.”
—John Brady, CISO, FINRA (Financial Industry Regulatory Authority)
Benefits of a Data Lake - All Data is in One Place
Analyze all of your data, from all of your sources, in one stored location
“Why is the data distributed in many locations? Where is the single source of truth?”
Why Amazon S3 for a Data Lake?
Durable
▪ Designed for 11 9s of durability
Available
▪ Designed for 99.99% availability
High performance
▪ Multipart upload
▪ Range GET
▪ Scalable throughput
Scalable
▪ Store as much as you need
▪ Scale storage and compute independently
▪ No minimum usage commitments
Integrated Partner Tools
▪ Cloudera EDH
▪ Cloudera Altus
▪ Cloudera Impala
Easy to use
▪ Simple REST API
▪ AWS SDKs
▪ Simple management tools
▪ Event notification
▪ Lifecycle policies
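The durability and availability figures on this slide translate into concrete expectations. Here is a worked sketch, assuming "11 9s" means 99.999999999% annual durability per object and that 99.99% availability is measured over a 365-day year. The object count is a hypothetical example.

```python
from fractions import Fraction

# "11 9s" of durability: 99.999999999% per object per year (assumed reading).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)
# 99.99% availability leaves 0.01% of the time unavailable.
UNAVAILABLE_FRACTION = Fraction(1, 10**4)

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year at the design durability."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def allowed_downtime_minutes_per_year() -> Fraction:
    """Downtime budget implied by the design availability."""
    return MINUTES_PER_YEAR * UNAVAILABLE_FRACTION


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical data lake size
    lost = expected_objects_lost_per_year(objects)
    print(f"expected losses per year: {float(lost)}")
    print(f"about one object every {1 / lost} years")
    print(f"downtime budget: {float(allowed_downtime_minutes_per_year())} min/year")
```

Exact fractions are used instead of floats so that the tiny loss probability is not distorted by rounding.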
Data Ingestion into Amazon S3
▪ AWS Direct Connect
▪ AWS Snowball
▪ ISV Connectors
▪ Kafka/Flume
▪ Amazon Kinesis Firehose
▪ Amazon S3 Transfer Acceleration
▪ AWS Storage Gateway
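Whichever ingestion path is used, large files headed for S3 are usually uploaded in parts. The sketch below plans part boundaries under S3's documented multipart limits: parts of at least 5 MiB except the last, and at most 10,000 parts. The function `plan_parts` and its default part size are hypothetical choices for illustration. The resulting byte ranges use the same inclusive `bytes=start-end` form as an HTTP Range GET.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB      # S3 minimum for every part except the last
MAX_PARTS = 10_000           # S3 maximum number of parts per upload


def plan_parts(total_size: int, part_size: int = 8 * MIB) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for each part."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the file would otherwise need too many parts.
    while (total_size + part_size - 1) // part_size > MAX_PARTS:
        part_size *= 2
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    for start, end in plan_parts(20 * MIB):
        print(range_header(start, end))
```

Doubling the part size is just one simple way to stay under the part limit; the sketch does not enforce S3's upper bounds on part or object size.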
Strong Security Controls
Security
▪ Identity and Access Management (IAM) policies
▪ Bucket policies
▪ Access Control Lists (ACLs)
▪ Private VPC endpoints to Amazon S3
▪ Amazon S3 object tagging to manage access policies
Encryption
▪ SSL endpoints
▪ Server-side encryption (SSE-S3)
▪ S3 server-side encryption with customer-provided or KMS-managed keys (SSE-C, SSE-KMS)
▪ Client-side encryption
Compliance
▪ Bucket access logs
▪ Lifecycle management policies
▪ Access Control Lists (ACLs)
▪ Versioning and MFA deletes
▪ Certifications: HIPAA, PCI, SOC 1/2/3, etc.
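As one illustration of how bucket policies and encryption settings combine, the sketch below builds a policy document that denies non-TLS requests and denies uploads that do not request SSE-KMS. The bucket name `example-data-lake` is hypothetical. The condition keys `aws:SecureTransport` and `s3:x-amz-server-side-encryption` are standard IAM and S3 keys, but treat the policy as a starting point to check against AWS documentation, not a vetted configuration.

```python
import json

BUCKET = "example-data-lake"  # hypothetical bucket name


def build_bucket_policy(bucket: str = BUCKET) -> dict:
    """Build a bucket policy that requires TLS and SSE-KMS on uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not arrive over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not ask for SSE-KMS encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bucket_policy(), indent=2))
```

The code only constructs the JSON document; attaching it to a bucket is left to whichever AWS tool or SDK is in use.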
Move to AWS
Strengthen your security posture
▪ Automate with deeply integrated security tools and services
▪ Inherit global security and compliance controls
▪ Highest standards for privacy and data security
▪ Largest network of security partners and solutions
▪ Scale with superior visibility and control that satisfies the most risk-sensitive orgs
Highest standards for privacy
▪ Encrypt data in transit and at rest with keys managed by AWS Key Management Service (KMS), or manage your own encryption keys with AWS CloudHSM using FIPS 140-2 Level 3 validated HSMs
▪ Meet data residency requirements: choose an AWS Region and AWS will not replicate your content elsewhere unless you choose to do so
▪ Access services and tools that enable you to build GDPR-compliant infrastructure on top of AWS
▪ Comply with local data privacy laws by controlling who can access content, its lifecycle and disposal
Inherit global security and compliance controls
Data Analytics / Engineering with Cloudera
BUSINESS VALUE
• Lower risk of data breach
• Analysts more productive on jobs
• Self-service (no shadow IT) and more productive
• IT more strategic, less admin time
CLOUDERA ADVANTAGES
• Deployment choices and no lock-in
• Same solution as on-premises and multi-cloud
• Eliminate data copies
• Single security framework with universally shared metadata
• Easy to track data lineage
• Unified services
Ready to try Data Analytics / Engineering in the Cloud?
Have an existing cluster for DW / DE
• Up to $2K Free AWS Credits*
• Email: awsoffer@cloudera.com
Don’t have an existing cluster
• Free Altus DE / DW Trial
• https://sso.cloudera.com/register.html
*Must work with AWS and Cloudera account managers on POC to be eligible for offer
THANK YOU
APPENDIX
Cloudera Pricing / Acquisition
• Acquisition Options
• Pay-as-you-go usage-based pricing
• Node-based license subscription
• Free 30-day trial
• Pre-pay of cloud credits
• Free version that can be deployed in the cloud
• Pricing - https://www.cloudera.com/products/pricing.html

Leveraging the Cloud for Big Data Analytics 12.11.18

  • 1. BIG DATA ANALYTICS IN THE CLOUD Sushant Rao Cloud Product Marketing @ Cloudera Rohit Pujari Solutions Architect @ Amazon Web Services
  • 2. © Cloudera, Inc. All rights reserved.2 Primary Advantages for Cloud ● Agility ○ Speed of making changes to meet business / technical needs ● Scalable & Elastic ○ Scale up and down quickly ● Reliable ○ Multiple options to ensure infrastructure / services are available ● Cost effectiveness ○ Pay for what you use (but may not be cheaper than on-prem)
  • 3. © Cloudera, Inc. All rights reserved.3 Big Data Use Cases for Cloud ● Corporate directive to leverage the cloud ○ C-level has decided to utilize the cloud more ● Disaster Recovery “location” in the cloud ○ Backup all data to the cloud, without a second “physical” location ● On-demand data mart / data engineering ○ Separate environment for new, production workloads ○ Ad-hoc workloads that run intermittently ● Sandbox environment for workloads ○ Environment to test queries and algorithms
  • 4. © Cloudera, Inc. All rights reserved.4 Cloudera’s Solution for Data Analytics / Engineering in Cloud • The modern platform for machine learning and analytics ○ Numerous functions for all types of jobs and queries • with multiple deployment options ○ On-premises, Public cloud (including multi-), and Hybrid • and one shared data experience ○ Framework for consistent security, governance, and metadata management across applications and deployments
  • 5. © Cloudera, Inc. All rights reserved.5 The Modern Platform for Machine Learning & Analytics OPERATIONAL DATABASE DATA ENGINEERING DATA WAREHOUSE DATA SCIENCE DATA PROCESSING • Cost efficient • Reliable • Scalable • Based on Spark, MapReduce, Hive & Pig • Supported by Workload Analytics FAST BI & SQL • Flexibility • Elastic scale • Go beyond SQL • Based on Impala & Hive • SQL dev enviro • Supported by Workload Analytics MACHINE LEARNING • Fast dev to production • Secure self-serve • Based on Python, R, and Spark • ML dev environment (CDSW) ONLINE & REAL-TIME • High throughput, low latency • Strongly consistent • Based on Hbase, Kudu & Spark streaming
  • 6. Cloudera’s Vision for AI and Machine Learning: Modern Enterprise Platform, Tools, and Expert Guidance to help you Unlock Business Value with ML / AI • Agile platform to build, train, and deploy scalable ML applications • Enterprise data science tools to accelerate team productivity • Expert guidance, services & training to fast track value & scale
  • 7. With Multiple Deployment Options [diagram] • Infrastructure: Traditional Infrastructure (combined storage and compute); Cloud Infrastructure (decoupled storage and compute) via Cloudera Altus Director; Cloud Infrastructure (decoupled storage and compute) via Cloudera Altus Services • Services: Operational Database, Data Engineering, Data Warehouse, Data Science; Data Engineering and Data Warehouse via Cloudera Altus Services
  • 8. Cloudera Enterprise with SDX. Benefits for IT infra & ops ● Central control and security ● Focus on curating not firefighting. Benefits for users ● Value from single source of truth ● Bring the best tools for each job. [Diagram] WORKLOADS: 3rd Party Services, Data Engineering, Data Science, Data Warehouse, Operational Database. COMMON SERVICES: Data Catalog, Governance, Security, Lifecycle Management. STORAGE: HDFS, Amazon S3, Microsoft ADLS, Kudu
  • 9. Many Options for Data Analytics / Engineering in the Cloud • Altus Director • Altus Services • Existing On-Prem Deployment
  • 10. Many Options for Data Analytics / Engineering in the Cloud • Altus Director • Altus Services • Existing On-Prem Deployment • Starting New Deployment
  • 11. Journey from On-Prem Cluster to Cloud [diagram] • 0 - ON PREMISES: bare metal; Cloudera cluster (persistent) with compute (Data Engineering, Data Warehouse, Data Science), data context (Security, Metadata, Governance), and storage (HDFS, cloud object store)
  • 12. Journey from On-Prem Cluster to Cloud [diagram] • 0 - ON PREMISES: bare metal; Cloudera cluster (persistent) with compute, data context, and storage (HDFS, cloud object store) • 1 - LIFT AND SHIFT: customer cloud; Cloudera cluster (persistent) with the same compute, data context, and storage layers (HDFS, cloud object store)
  • 13. Journey from On-Prem Cluster to Cloud [diagram] • 0 - ON PREMISES: bare metal; Cloudera cluster (persistent) with compute, data context, and storage (HDFS, cloud object store) • 1 - LIFT AND SHIFT: customer cloud; Cloudera cluster (persistent) with the same layers (HDFS, cloud object store) • 2 - OBJECT STORAGE: customer cloud; Cloudera cluster (persistent) with the same compute and data context layers, storage on cloud object store
  • 14. Journey from On-Prem Cluster to Cloud [diagram] • 0 - ON PREMISES: bare metal; Cloudera cluster (persistent) with compute, data context, and storage (HDFS, cloud object store) • 1 - LIFT AND SHIFT: customer cloud; Cloudera cluster (persistent) with the same layers (HDFS, cloud object store) • 2 - OBJECT STORAGE: customer cloud; Cloudera cluster (persistent) with the same compute and data context layers, storage on cloud object store • 3 - CLOUD NATIVE ARCHITECTURES: customer cloud with Cloudera clusters (transient, Altus) providing Data Engineering and Data Warehouse compute, a Cloudera cluster (persistent, Director) with compute and data context, and storage on cloud object store with data context; Cloudera cloud with the Cloudera Altus control plane
  • 15. Customer Examples: Many Cloudera customers (Global 5K) use public cloud • Online retailer: over 2,000 nodes with ~2PB of data on AWS running in an active-active configuration; transforming data with Spark and then analyzing with Apache Hive • German chain of coffee retailers and cafés: 30+ nodes with 50TB of data on AWS; modern Cloudera platform with an Impala data warehouse • Global information company: 50+ nodes on Microsoft Azure and 20+ nodes on AWS; replaced Netezza with Hadoop and leveraging both Impala and Spark for analytics
  • 16. Security Use Case: Cloudera is using cloud as well. Altus-based solution saved more than 50% in cost compared to the initial implementation
  • 17. Cloudera Altus Key Differentiators • Multi-function: Unified platform for data engineering, data warehouse, and data science • Multi-cloud: Option for on-premises, Public cloud (including multi-), and Hybrid • SDX: Integrated shared data experience across multi-function clusters
  • 18. Rohit Pujari, Solutions Architect AWS Security & Compliance
  • 19. Why is security traditionally so hard? Lack of visibility Low degree of automation
  • 20. Before… Move fast OR Stay secure. Now… Move fast AND Stay secure
  • 21. Making life easier Choosing security does not mean giving up on convenience or introducing complexity
  • 22. The most sensitive workloads run on AWS “We can be even more secure in the AWS cloud than in our own datacenters.” —Tom Soderstrom, CTO, NASA JPL “We knew the cloud was the only way to get the scalability, speed, and security our customers expect from 3M.” —Rick Austin, 3M Health Information Systems “We determined that security in AWS is superior to our on-premises data center across several dimensions, including patching, encryption, auditing and logging, entitlements, and compliance.” —John Brady, CISO, FINRA (Financial Industry Regulatory Authority)
  • 23. Benefits of a Data Lake - All Data is in One Place Analyze all of your data, from all of your sources, in one stored location “Why is the data distributed in many locations? Where is the single source of truth?”
  • 24. Why Amazon S3 for a Data Lake? Durable: Designed for 11 9s of durability. Available: Designed for 99.99% availability. High performance ▪ Multiple upload ▪ Range GET ▪ Scalable throughput. Scalable ▪ Store as much as you need ▪ Scale storage and compute independently ▪ No minimum usage commitments. Integrated Partner Tools ▪ Cloudera EDH ▪ Cloudera Altus ▪ Cloudera Impala. Easy to use ▪ Simple REST API ▪ AWS SDKs ▪ Simple management tools ▪ Event notification ▪ Lifecycle policies
  • 25. Data Ingestion into Amazon S3: AWS Direct Connect, AWS Snowball, ISV Connectors, Kafka/Flume, Amazon Kinesis Firehose, Amazon S3 Transfer Acceleration, AWS Storage Gateway
  • 26. Strong Security Controls. Security ▪ Identity and Access Management (IAM) policies ▪ Bucket policies ▪ Access Control Lists (ACLs) ▪ Private VPC endpoints to Amazon S3 ▪ Amazon S3 object tagging to manage access policies. Encryption ▪ SSL endpoints ▪ Server-side encryption (SSE-S3) ▪ S3 server-side encryption with provided keys (SSE-C, SSE-KMS) ▪ Client-side encryption. Compliance ▪ Bucket access logs ▪ Lifecycle management policies ▪ Access Control Lists (ACLs) ▪ Versioning and MFA deletes ▪ Certifications: HIPAA, PCI, SOC 1/2/3, etc.
  • 27. Move to AWS: Strengthen your security posture • Automate with deeply integrated security tools and services • Inherit global security and compliance controls • Highest standards for privacy and data security • Largest network of security partners and solutions • Scale with superior visibility and control that satisfies the most risk-sensitive orgs
  • 28. Highest standards for privacy • Encrypt data in transit and at rest with keys managed by AWS Key Management Service (KMS), or manage your own encryption keys with AWS CloudHSM using FIPS 140-2 Level 3 validated HSMs • Meet data residency requirements: choose an AWS Region and AWS will not replicate your data elsewhere unless you choose to do so • Access services and tools that enable you to build GDPR-compliant infrastructure on top of AWS • Comply with local data privacy laws by controlling who can access content, its lifecycle and disposal
  • 29. Inherit global security and compliance controls
  • 30. Data Analytics / Engineering with Cloudera [diagram: Cloudera advantages add up to business value ($)]. BUSINESS VALUE • Lower risk of data breach • Analysts more productive on jobs • Self-service (no shadow IT) and more productive • IT more strategic, less admin time. CLOUDERA ADVANTAGES • Deployment choices and no lock-in • Same solution as on-premises and multi-cloud • Eliminate data copies • Single security framework with universally shared metadata • Easy to track data lineage • Unified services
  • 31. Ready to try Data Analytics / Engineering in the Cloud? • Have an existing cluster for DW / DE: Up to $2K Free AWS Credits*; Email: awsoffer@cloudera.com • Don’t have an existing cluster: Free Altus DE / DW Trial; https://sso.cloudera.com/register.html • *Must work with AWS and Cloudera account managers on POC to be eligible for offer
  • 33. APPENDIX
  • 34. Cloudera Pricing / Acquisition • Acquisition Options • Pay-as-you-go usage-based pricing • Node-based license subscription • Free 30-day trial • Pre-pay of cloud credits • Free version that can be deployed in the cloud • Pricing - https://www.cloudera.com/products/pricing.html

Editor's notes

  1. Let’s keep this interactive. Please do ask questions as we go along
  2. Start with an overview of our strategy, which has three pillars. First is a multi-function platform that has both machine learning and analytics. For the work our customers are doing, siloed products won’t get it done. Next is the flexibility to choose the deployment that best meets the needs of their applications, data, and security / governance. Lastly, there is a framework to ensure consistency across applications and deployments. Let’s go deeper into these.
  3. Our customers come from the Global 5K, and the complex workloads these companies run require more than a point product. So, we provide a platform that covers data engineering, data warehouse, data science, and operational analytics. The platform also includes data ingestion, such as with Kafka, and other components such as Apache Solr, which provides capabilities to analyze text and logs. Companies can use these through pay-as-you-go usage-based pricing, a node-based license subscription, pre-paid cloud credits, or a free version that can be deployed in the cloud.
  4. Hadoop and Spark are the starting point, but they are not everything our customers need. So, those are some of the kinds of applied machine learning Research & Advising capabilities that Cloudera focuses on to help our clients be successful with enterprise machine learning. We also couple this with Professional Services & Training, and with our modern, unified Data Platform and enterprise Data Science tooling. I’ll spend the rest of this talk focusing on the latter capabilities. *** Old notes / reference *** With our modern, open platform and enterprise tools, we enable clients to build and deploy AI solutions at scale, efficiently and securely, anywhere they want. And we couple that with Cloudera Fast Forward Labs expert guidance to help clients realize their AI future, faster. Ideal Foundation: Agile platform to build, train, and deploy scalable ML applications. Cloudera's modern platform with SDX enables secure, shared data access with consistent context, breaking down data & workflow silos. It combines data warehousing and ML on a single platform that runs anywhere, at scale. It is built on open tech for future-proof innovation. Enterprise ML Made Easy: Enterprise data science tools to accelerate team productivity. CDSW eases the machine learning workflow. It supports modern, open data science and ML tooling and team collaboration for innovation & agility, with enterprise-grade data management, security, and governance. Fast track to value & scale: Expert guidance, services & training to fast track value & scale. Cloudera Fast Forward Labs helps you design & execute your ML strategy and enables rapid, practical application of emerging ML technologies to your business. Cloudera PS provides proven delivery of scalable, production-grade ML systems.
  5. So we introduced Cloudera SDX, or shared data experience, the foundation of Cloudera Enterprise. SDX makes it possible for companies to run dozens, even hundreds, of analytic applications against a common pool of data. One logical cluster provides a shared data experience to multiple workloads and tenants. SDX applies a centralized, consistent framework for catalog, security, governance, management, data ingest, and more. It makes it faster, easier, and safer for organizations, teams, and people to develop and deploy high-value, multi-function use cases like customer next best offer, clinical prediction, and risk modeling. SDX cuts through silos to unify data, analytics, management, security, and governance, and empowers self-service. It combines the strengths of on-premises and cloud-only deployments: * multi-function support * shared data experience * information security model * cost management * tenant isolation * workload elasticity * self service * speed of deployment
  6. - Cloudera Infosec wanted to use Apache Spot to analyze security events in our network. - Our IT team didn’t want them to run their workload on the production cluster, due to typical isolation / uptime concerns for business-critical workloads. - They were running on their own cluster, but that was underutilized and a waste of money. - So, they migrated the workload to Altus Services. - After moving to Altus Services, costs dropped by 50% due to better utilization.
  7. Since we’re discussing how to migrate Hadoop workloads to AWS, we’re aware of how important it is to break down data silos and build a well-governed data lake to which different business units can subscribe to fulfill their analytics needs. AWS adds a global dimension to the concept of a data lake: you can build a policy-driven data lake that respects geographic boundaries, not just from a data storage perspective but also from a data processing standpoint.
  8. Amazon S3 is a global service that allows you to store data in 18 regions around the world. S3 is a highly available, web-scale object store that is designed for 11 9s of durability. It is an infinitely scalable data storage infrastructure at very low cost compared to HDFS. S3 is designed to be highly flexible: you can store any data in any format you want, so you can store Hadoop-compatible formats like Parquet, ORC, Avro, JSON, CSV, and others. And you can access it in a variety of ways, such as over the REST API, command line tools, the Hadoop S3A client, etc. Almost all AWS partner products that work with data are integrated with S3, including Cloudera EDH, Altus, and Impala.
  9. And there is a host of options to bring data into S3. If the majority of your data is on-premises, you can use AWS Direct Connect to establish a high-throughput dedicated connection from your premises to AWS. Once you have Direct Connect in place, you can use tools of your choice to send the data to S3. If you have data in the range of terabytes to petabytes, and sending it over the network is not time-efficient, you can use AWS Snowball devices for secure physical transport. For streaming data, you can use Flume, Kafka, and Kinesis to land that data in S3. S3 Transfer Acceleration enables fast data transfer over long distances between your client and an S3 bucket. For example, a user in Australia who is uploading data to an S3 bucket in the US can take advantage of S3 Transfer Acceleration, which makes use of globally distributed edge locations: once the data arrives at an edge location, it is routed to S3 over an optimized network path. You also have the option to use AWS Storage Gateway, which can expose an S3 bucket as an NFS mount that you can use to store and retrieve data. You can also use cloud-backed storage volumes to asynchronously back up point-in-time snapshots of your data to S3. As you can see, S3 allows you to build a truly global, policy-driven data lake.
  10. Also, you get strong security controls with S3. You can securely send your data to S3 via SSL endpoints. You can encrypt data at rest: with S3 server-side encryption, you can configure your S3 buckets to automatically encrypt data before storing it. You can use the Key Management Service from AWS if you wish to control the encryption keys. In addition to that, you can use your own encryption libraries to encrypt the data before storing it in S3. There are a number of ways to control access to your data. You can use IAM policies and bucket policies, which define which user, group, or role can access what resources and data. VPC endpoints allow you to further lock down your S3 buckets so that they can be accessed only from your logically isolated section of the AWS cloud. You can use tags to classify your data and define fine-grained access control based on them. From a compliance perspective, S3 captures access logs, a full audit trail of who has accessed what data, when, and from where. You can version your objects and set up MFA for delete as an extra layer of protection. S3 is compliant with HIPAA, PCI, and SOC 1, 2, and 3, to give you even more confidence that you can safely store and process sensitive data.
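
The design targets and transfer options mentioned in notes 8 and 9 can be turned into rough planning numbers. The sketch below is only back-of-the-envelope arithmetic: the object count, data volume, link speed, and 80% link utilization are hypothetical inputs chosen for illustration, and the durability and availability figures are the design targets quoted on the slides, not measured results.

```python
from fractions import Fraction

# Design targets quoted in the slides (targets, not guarantees or measurements).
DURABILITY = 1 - Fraction(1, 10**11)   # "11 9s" = 99.999999999% per object per year
AVAILABILITY = Fraction(9999, 10000)   # 99.99%

MINUTES_PER_YEAR = 365 * 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the durability target."""
    return object_count * (1 - DURABILITY)


def expected_downtime_minutes_per_year() -> Fraction:
    """Minutes of unavailability per year implied by the availability target."""
    return MINUTES_PER_YEAR * (1 - AVAILABILITY)


def transfer_days(data_bytes: int, link_bits_per_second: int,
                  utilization: float = 0.8) -> float:
    """Days needed to push data_bytes over a network link.

    utilization is an assumed fraction of the link that is actually usable.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_second * utilization)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    objects = 10_000_000                  # hypothetical data lake size
    loss = expected_objects_lost_per_year(objects)
    print(f"Expected loss: one object roughly every {1 / loss} years")
    print(f"Implied downtime: {float(expected_downtime_minutes_per_year()):.2f} min/year")

    hundred_tb = 100 * 10**12             # hypothetical 100 TB (decimal) migration
    one_gbps = 10**9                      # hypothetical dedicated 1 Gbps link
    print(f"100 TB over 1 Gbps: {transfer_days(hundred_tb, one_gbps):.1f} days")
```

Numbers like these are what make a physical transfer device worth considering once the data volume reaches hundreds of terabytes: the network time grows linearly with volume, while the link speed is usually fixed.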
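
Several of the controls in note 10 (SSL-only access, encryption at rest with KMS-managed keys, and locking a bucket down to a VPC endpoint) are typically expressed as a bucket policy. The following is a minimal sketch, not a policy taken from the talk: the bucket name and VPC endpoint ID are hypothetical placeholders, and the helper `build_bucket_policy` is introduced here purely for illustration. It only builds and prints the JSON document; applying it to a real bucket is left out.

```python
import json


def build_bucket_policy(bucket: str, vpc_endpoint_id: str) -> dict:
    """Build an S3 bucket policy document with three deny rules.

    bucket and vpc_endpoint_id are caller-supplied; the values used in the
    example below are placeholders, not real resources.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, objects_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not request SSE-KMS encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects_arn,
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
            {
                # Reject requests that do not come through the given VPC endpoint.
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, objects_arn],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder names for illustration only.
    policy = build_bucket_policy("example-data-lake", "vpce-0123456789abcdef0")
    print(json.dumps(policy, indent=2))
```

Explicit deny statements are used because a deny overrides any allow granted elsewhere in IAM. The VPC endpoint rule in particular also blocks access from outside that endpoint, including administrative access, so a policy like this is best tried on a test bucket first.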