Learn how organizations are deriving unique customer insights, improving product and service efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you’ll see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
22. The most sensitive workloads run on AWS
“We can be even more secure in the AWS cloud than in our own datacenters.”
—Tom Soderstrom, CTO, NASA JPL
“We knew the cloud was the only way to get the scalability, speed, and security our customers expect from 3M.”
—Rick Austin, 3M Health Information Systems
“We determined that security in AWS is superior to our on-premises data center across several dimensions, including patching, encryption, auditing and logging, entitlements, and compliance.”
—John Brady, CISO, FINRA (Financial Industry Regulatory Authority)
23. Benefits of a Data Lake - All Data is in One Place
Analyze all of your data, from all of your sources, in a single storage location
“Why is the data distributed in many locations? Where is the single source of truth?”
24. Why Amazon S3 for a Data Lake?
Durable
▪ Designed for 11 9s of durability
Available
▪ Designed for 99.99% availability
High performance
▪ Multipart upload
▪ Range GET
▪ Scalable throughput
Scalable
▪ Store as much as you need
▪ Scale storage and compute independently
▪ No minimum usage commitments
Integrated Partner Tools
▪ Cloudera EDH
▪ Cloudera Altus
▪ Cloudera Impala
Easy to use
▪ Simple REST API
▪ AWS SDKs
▪ Simple management tools
▪ Event notification
▪ Lifecycle policies
25. Data Ingestion into Amazon S3
▪ AWS Direct Connect
▪ AWS Snowball
▪ ISV Connectors (Kafka/Flume)
▪ Amazon Kinesis Firehose
▪ Amazon S3 Transfer Acceleration
▪ AWS Storage Gateway
26. Strong Security Controls
Security
▪ Identity and Access Management (IAM) policies
▪ Bucket policies
▪ Access Control Lists (ACLs)
▪ Private VPC endpoints to Amazon S3
▪ Amazon S3 object tagging to manage access policies
Encryption
▪ SSL endpoints
▪ Server-side encryption (SSE-S3)
▪ Server-side encryption with provided keys (SSE-C, SSE-KMS)
▪ Client-side encryption
Compliance
▪ Bucket access logs
▪ Lifecycle management policies
▪ Access Control Lists (ACLs)
▪ Versioning and MFA deletes
▪ Certifications—HIPAA, PCI, SOC 1/2/3, etc.
27. Move to AWS - Strengthen your security posture
▪ Automate with deeply integrated security tools and services
▪ Inherit global security and compliance controls
▪ Highest standards for privacy and data security
▪ Largest network of security partners and solutions
▪ Scale with superior visibility and control that satisfies the most risk-sensitive orgs
28. Highest standards for privacy
▪ Encrypt data in transit and at rest, with keys managed by AWS Key Management Service (KMS), or manage your own encryption keys with AWS CloudHSM using FIPS 140-2 Level 3 validated HSMs
▪ Meet data residency requirements: choose an AWS Region and AWS will not replicate your data elsewhere unless you choose to do so
▪ Access services and tools that enable you to build GDPR-compliant infrastructure on top of AWS
▪ Comply with local data privacy laws by controlling who can access content, its lifecycle, and its disposal
Let’s keep this interactive. Please do ask questions as we go along.
We’ll start with an overview of our strategy, which has three pillars.
First is a multi-function platform with both machine learning and analytics. For the work our customers are doing, siloed products won’t get it done.
Next is the flexibility to choose the deployment that best meets the needs of their applications, data, and security/governance requirements.
Last is a framework to ensure consistency across applications and deployments.
Let’s go deeper into these.
Our customers are the Global 5000, and for these companies, the complex workloads they are running require more than a point product. So, we provide a platform that covers data engineering, data warehousing, data science, and operational analytics.
The platform also includes data ingestion, such as with Kafka, and other components such as Apache Solr, which provides capabilities to analyze text and logs.
Companies have the option of pay-as-you-go usage-based pricing, a node-based license subscription, pre-paid cloud credits, as well as a free version that can be deployed in the cloud.
Hadoop and Spark are the starting point, but they’re not everything customers need.
So, those are some of the kinds of applied machine learning Research & Advising capabilities that Cloudera focuses on to help our clients be successful with enterprise machine learning.
We also couple this with Professional Services & Training, and with our modern, unified Data Platform and enterprise Data Science tooling.
I’ll spend the rest of this talk focusing on the latter capabilities.
*** Old notes / reference ***
With our modern, open platform and enterprise tools, we enable clients to build and deploy AI solutions at scale, efficiently and securely, anywhere they want. And we couple that with Cloudera Fast Forward Labs expert guidance to help clients realize their AI future, faster.
Ideal Foundation: Agile platform to build, train, and deploy scalable ML applications
Cloudera's modern platform with SDX enables secure, shared data access with consistent context, breaking down data & workflow silos
Combines data warehousing and ML on a single platform that runs anywhere, at scale
Built on open tech for future-proof innovation
Enterprise ML Made Easy: Enterprise data science tools to accelerate team productivity
CDSW eases the machine learning workflow
Supports modern, open data science and ML tooling and team collaboration for innovation & agility
With enterprise-grade data management, security, and governance
Fast track to value & scale: Expert guidance, services & training to fast track value & scale
Cloudera Fast Forward Labs helps you design & execute your ML strategy
Enables rapid, practical application of emerging ML technologies to your business
Cloudera PS for proven delivery of scalable, production-grade ML systems
So we introduced Cloudera SDX, or Shared Data Experience, the foundation of Cloudera Enterprise.
SDX makes it possible for companies to run dozens, even hundreds, of analytic applications against a common pool of data. One logical cluster provides a shared data experience to multiple workloads and tenants.
SDX applies a centralized, consistent framework for catalog, security, governance, management, data ingest and more.
It makes it faster, easier, and safer for organizations, teams, people to develop and deploy high-value, multi-function use cases like customer next best offer, clinical prediction, and risk modeling.
SDX cuts through silos to unify data, analytics, management, security, and governance, and empowers self-service
It combines the strengths of on-premises and cloud-only deployments:
* multi-function support
* shared data experience
* information security model
* cost management
* tenant isolation
* workload elasticity
* self service
* speed of deployment
- Cloudera Infosec wanted to use Apache Spot to analyze security events in our network
- Our IT didn't want them to run their workload on the production cluster due to typical isolation and uptime concerns for business-critical workloads.
- They were running on their own cluster, but that was underutilized and a waste of money
- So, they migrated the workload to Altus Services
- After using Altus Services, the costs dropped by 50% due to better utilization.
Since we’re discussing how to migrate Hadoop workloads to AWS, we’re aware of how important it is to break down data silos and build a well-governed data lake that different business units can subscribe to in order to fulfill their analytics needs. AWS adds a global dimension to the concept of a data lake: you can build a policy-driven data lake that respects geographic boundaries, not just from a data storage perspective but also from a data processing standpoint.
Amazon S3 is a global service that allows you to store data in 18 regions around the world. S3 is a highly available, web-scale object store designed for 11 9s of durability. It offers infinitely scalable data storage infrastructure at a very low cost compared to HDFS. S3 is designed to be highly flexible: you can store any data in any format you want, including Hadoop-compatible formats like Parquet, ORC, Avro, JSON, and CSV. And you can access it in a variety of ways, such as over the REST API, command-line tools, or the Hadoop S3A client.
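To make this concrete, here is a minimal sketch of writing and reading an object with the AWS SDK for Python (boto3); the bucket and key names are placeholders, not resources from this deck.

import boto3

s3 = boto3.client("s3")

# Upload a local Parquet file into the data lake bucket (placeholder names)
s3.upload_file("events.parquet", "example-datalake-bucket", "raw/events/events.parquet")

# Read the same object back over the S3 API
obj = s3.get_object(Bucket="example-datalake-bucket", Key="raw/events/events.parquet")
data = obj["Body"].read()

# The same object is also reachable from Hadoop/Spark through the S3A client, e.g.
#   spark.read.parquet("s3a://example-datalake-bucket/raw/events/")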
Almost all AWS partner products that work with data are integrated with S3, including Cloudera EDH, Cloudera Altus, and Cloudera Impala.
And there are a host of options for bringing data into S3.
If the majority of your data is on-premises, you can use AWS Direct Connect to establish a high-throughput, dedicated connection from your premises to AWS. Once you have Direct Connect in place, you can use the tools of your choice to send the data to S3.
If you have data in the range of terabytes to petabytes and sending it over the network is not time-efficient, you can use AWS Snowball devices for secure physical transport.
For streaming data, you can use Flume, Kafka, or Amazon Kinesis Firehose to land that data in S3.
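As a rough sketch of that streaming path, the snippet below pushes JSON records to a Kinesis Data Firehose delivery stream that is assumed to already exist with an S3 destination; the stream name is a placeholder.

import boto3, json

firehose = boto3.client("firehose")

# "example-to-s3" is a hypothetical delivery stream configured to deliver into S3
firehose.put_record(
    DeliveryStreamName="example-to-s3",
    Record={"Data": json.dumps({"event": "click", "user": 42}).encode() + b"\n"},
)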
S3 Transfer Acceleration enables fast data transfer over long distances between your client and an S3 bucket. For example, if a user in Australia is trying to upload data to an S3 bucket in the US, they can take advantage of S3 Transfer Acceleration, which makes use of globally distributed edge locations; once the data arrives at an edge location, it is routed to S3 over an optimized network path.
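Here is a hedged sketch of what enabling and using Transfer Acceleration could look like with boto3; the bucket name is a placeholder.

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Turn on Transfer Acceleration for the (placeholder) bucket
s3.put_bucket_accelerate_configuration(
    Bucket="example-datalake-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Uploads from this client now go through the accelerated edge endpoint
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big-file.bin", "example-datalake-bucket", "uploads/big-file.bin")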
You also have the option of using AWS Storage Gateway, which can expose an S3 bucket as an NFS mount that you can use to store and retrieve data. You can also use cloud-backed storage volumes to asynchronously back up point-in-time snapshots of your data to S3.
As you can see, S3 allows you to build a truly global, policy-driven data lake.
Also, you get strong security controls with S3.
You can securely send your data to S3 via SSL endpoints.
You can encrypt data at rest. With S3 server-side encryption, you can configure your S3 buckets to automatically encrypt data before storing it. You can use AWS Key Management Service (KMS) if you wish to control the encryption keys.
In addition to that, you can use your own encryption libraries to encrypt the data before storing it in S3.
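As a minimal sketch (placeholder bucket name and key alias), default server-side encryption with a KMS key can be configured on the bucket like this:

import boto3

s3 = boto3.client("s3")

# Default-encrypt every new object in the bucket with a customer-managed KMS key
s3.put_bucket_encryption(
    Bucket="example-datalake-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-datalake-key",  # placeholder key alias
                }
            }
        ]
    },
)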
There are a number of ways through which you can control access to your data.
You can use IAM policies and bucket policies, which define which user, group, or role can access what resources and data.
You can use VPC endpoints to further lock down your S3 buckets so that they can only be accessed from your logically isolated section of the AWS cloud.
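To illustrate, here is a sketch of a bucket policy that denies any access that does not come through a specific VPC endpoint; the bucket name and endpoint ID are placeholders.

import boto3, json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-datalake-bucket",
                "arn:aws:s3:::example-datalake-bucket/*",
            ],
            # Placeholder VPC endpoint ID
            "Condition": {"StringNotEquals": {"aws:sourceVpce": "vpce-0123456789abcdef0"}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="example-datalake-bucket", Policy=json.dumps(policy)
)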
You can use tags to classify your data and define fine-grained access control based on them.
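For example, here is a sketch of tagging an object with a classification so that an IAM policy condition on s3:ExistingObjectTag/classification could restrict who may read it; names and values are placeholders.

import boto3

s3 = boto3.client("s3")

# Tag an existing object; access policies can then match on this tag
s3.put_object_tagging(
    Bucket="example-datalake-bucket",
    Key="raw/events/events.parquet",
    Tagging={"TagSet": [{"Key": "classification", "Value": "pii"}]},
)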
From a compliance perspective:
S3 captures access logs, giving you a full audit trail of who has accessed what data, when, and from where.
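A minimal sketch of turning on server access logging, delivering logs into a separate (placeholder) logging bucket:

import boto3

s3 = boto3.client("s3")

# Deliver access logs for the data lake bucket into a dedicated logging bucket
# (the target bucket must grant the S3 log delivery group permission to write)
s3.put_bucket_logging(
    Bucket="example-datalake-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-logging-bucket",
            "TargetPrefix": "s3-access-logs/",
        }
    },
)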
You can version your objects and set up MFA delete as an extra layer of protection.
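And a sketch of enabling versioning on the (placeholder) bucket; turning on MFA delete additionally requires the root user's MFA device, passed via the MFA argument.

import boto3

s3 = boto3.client("s3")

# Keep every version of every object written to the bucket
s3.put_bucket_versioning(
    Bucket="example-datalake-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# MFA delete would use VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"}
# together with MFA="<device-serial> <code>" supplied by the root user.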
S3 is compliant with HIPAA, PCI, and SOC 1, 2, and 3, giving you even more confidence that you can safely store and process sensitive data.