The first and possibly most important task you perform when you deploy your Cloudera cluster is securing it. Get it wrong and you may inadvertently, and unknowingly, introduce risk to the business. Getting it right only after several attempts leaves you looking back at wasted effort and false starts. So how do you get it right the first time?
Markets, and customers, can only expand as quickly as the human element is able to support them. Right now demand is far outpacing the supply of qualified big data professionals. Maintaining a training function is critical for Cloudera because we need to maintain a capable delivery ecosystem that allows our customers to thrive within the Hadoop environment.
Recruitment is one option for organizations to overcome this barrier, but that path comes with an additional challenge: finding the right candidates. When it comes to emerging technology skills, it’s a seller’s market. There is significant competition for a finite pool of skilled technologists, and this competition will only increase as the use of this technology increases.
Faced with an ever-tightening supply of qualified job applicants, organizations are finding that the costs to recruit new employees far exceed the costs to train existing ones, and also that current employees are more than willing to be trained.
The need for IT talent is only going to increase in an ever-expanding range of industries. Consider that by 2020, GE, known primarily as a manufacturer, expects to generate $15 billion from software, which would make it one of the top 10 software companies in the world.
Or consider that 70 percent of Monsanto’s total jobs are already in science, technology, engineering, or math. Certainly many of those are in chemical and crop engineering, but increasingly, many are in IT, analytics, the Internet of Things, and digital operations. Monsanto is competing for skills not just with other agribusinesses but with companies in all industries.
Organizations also need to consider the costs of recruitment and attrition. Most analyses of training confirm that employees who receive training are more likely to remain with their current employers: it lets them learn new skills, and it shows that their employers are investing in them. For technologists, Hadoop, Spark, and the other projects that compose our platform open up a world of possibilities and curiosity; the work is challenging and rewarding. Several of our customers build out robust Hadoop training plans as an employee benefit, and the returns they see in platform innovation and employee retention make training a major value across both short- and long-term horizons.
The evolution of the data center over the past few decades means that IT decisions are now critical not just for back-office operations but for nearly every aspect of a business. With “big data” in particular, the technologies in use are closely linked to an organization’s customers and markets. Business leaders are therefore tasked with transforming their businesses to accommodate the realities of the data-driven market. In some cases this means updating hardware and implementing new software, but it also means upgrading the skills of internal staff.
If the talent of your staff is a concern, you are not alone.
Cloudera, and analyst firms such as IDC, have polled organizations about enterprise software deployments. Not surprisingly, one of the primary areas of concern for Cloudera prospects and customers is the skills of their staff. This is a new way of computing, and harnessing the benefits of a Cloudera subscription requires employees who are familiar with the tools included in the platform and understand how best to leverage them for their use case.
IDC looked at projects more generally, soliciting input from over 500 managers implementing IT projects on the critical factors in a project’s success. Since we are discussing training and building out a team of experts on this call, you can probably guess that it was not the software, clearly defined business objectives, or a solid project plan that mattered most. Overwhelmingly, managers ranked the skill and dedication of the project team as the factor that played the largest part in the success of their project.
We want to make sure that customers include the human element needed to roll out a successful project as they consider a Cloudera subscription.
I’ve alluded to some of these options earlier in the presentation, but to ensure there is clarity on our delivery options: we offer both public and private training.
Public training courses are scheduled around the globe by Cloudera and by our Authorized Training Partners. Authorized Training Partner instructors go through the same procedures as Cloudera instructors, also regularly provide field services in their regions, and allow for local-language delivery in areas where we do not have direct coverage.
Public training schedules can be found on Cloudera’s website where you can search by course title and/or location of interest. Public training is a nice option if you have just a few team members that need training, or you need to get someone ramped up in a short timeframe. Students are able to interact with their peers from other organizations implementing Cloudera solutions, and a live instructor.
Private Training is reserved for a customer who wants their entire team to be trained. Normally we say that if you have seven or more students who need the same training class, it’s worth your while to explore our private training option. We’ll send an instructor to a location of your choice to deliver training specific to your needs. Usually the training is one of the courses I described earlier in this presentation, but if needed, we can also customize the content to align it with your business objectives. To be clear, “customization” is not new content creation; it is assembling an agenda from our portfolio of content that makes sense for the customer. Some examples would be adding Spark ML or JEP to Spark and Hadoop training to make it a five-day course, or cutting Pig from Data Analyst training to make it a three-day course. We generally recommend against customizing a course by pulling disparate topics from many classes: the result usually has no flow or connection, and the students leave with more questions than answers. Our courses build on concepts throughout the duration of the class. Customization is encouraged, but shouldn’t be abused.
Private Training courses are available for “up to 10” or “up to 16” students.
Virtual training is live training that is delivered over the internet. Both public and private classes can be delivered in this manner. From a public perspective, it’s a popular option for individuals who are not local to one of our training locations. Private customers with geographically dispersed teams also find this a means to save on the travel costs of bringing the team to a central location.
OnDemand training is a library of pre-recorded training classes, which allows for 24x7, self-paced training in a searchable environment. Our entire portfolio of content is available in this format, and students leverage a cloud-based lab environment to complete the same hands-on exercises we deliver in the live classrooms. Courses can be bought as a library, or by individual title.
Certification I’ve touched on earlier. Certifications may be bought in bulk via PO, or purchased directly via our website. Certification candidates are remotely monitored and are not required to go into a testing center to complete the exam. All you need is an internet connection. Prices range from $295 for CCA-level exams to $400 for the CCP: Data Engineer exam, or $600 per CCP: Data Science exam.
…and here, in summary, is what I covered in the past three slides. Over time, we will be adding courses to the Administrator training path focused on Security, Cloud, and Architecture; look for those in the next calendar year.
We also have plans to iterate and/or augment our Developer, Data Analyst, and Data Scientist content to reflect the evolution of the technology.
This talk is mainly about security implementation from both an engineering and a support perspective.
Data breach incidents are increasing year by year. This year alone there have been a number of high profile breaches.
Security is built deep into Hadoop, but it does not work out of the box.
Rome was not built in a day.
As you will learn during your security implementation process, it takes a lot of configuration and best practices to make a secure Hadoop cluster.
Good news: Cloudera Manager and Navigator are here to the rescue!
Cloudera’s platform is built on top of Apache Hadoop technology. It is the first Hadoop platform to achieve PCI compliance.
New York State Department of Financial Services
Breach Notification
Right to Access
Right to be Forgotten
Data portability
Privacy by Design
Data Protection Officers
But obviously it takes more than good people and processes. You need the right technology.
Let’s get down to brass tacks on what the software is about.
We’re based on an open source core: a complete, integrated enterprise platform leveraging open source.
Hybrid open source (HOSS) business model: a core set of platform capabilities that we actively contribute back into the community.
We layer value-added software on top; that’s how we run our business.
But what’s truly differentiating about our platform is the enterprise experience you get. It’s why we’re able to claim seven of the top ten banks and nine of the top ten telcos as customers. For regulated industries, the enterprise experience is critical.
Multi-cloud – No vendor lock-in. Work in the environment of your choice. Better pricing leverage
Managed TCO – Multiple pricing and deployment options
Integrated – Integrated components with shared metadata, security and operations
Secure - Protect sensitive data from unauthorized access – encryption, key management
Compliance – Full auditing and visibility
Governance – Ensure data veracity
Apps share data, rather than data replicated for apps
Lower costs because less data to replicate
More secure because data is in one central location
Easier to build apps because data is easily accessible
Open architecture to share data with other teams and workloads, including data science
As a customer, you will most likely not interact with Cloudera’s platform directly; typically, customers access it indirectly through partner products. To ensure security guarantees carry through that path, we certify partner products with security in mind. For the purposes of this talk, I am going to briefly mention Cloudera’s certification process from a security perspective.
Customers should also hire Cloudera-certified administrators, or engage professional services from Cloudera SI partners.
A little bit on partner product certification
https://docs.google.com/a/cloudera.com/document/d/1XwRV_bVZrM90JsPhHxLYAgd6vCdvT7qQ-k8eIQ2QYsk/edit?usp=sharing
Upstream = reports coming from the Apache projects. Each Apache project has a private security@ mailing alias.
We obey Apache’s security policy.
Internal = reports coming from within Cloudera. Cloudera Engineering runs several security-weakness detection tools looking for security issues in the software.
External = reports coming from a third party or a customer.
Cloudera works hard to provide security on top of the big data platform.
In this talk, I will present the best practices and common pitfalls of security implementation on Hadoop, based on my experience working with customers.
Source: https://www.cloudera.com/documentation/enterprise/latest/topics/sg_edh_overview.html#topic_ads_t2q_1r
Achieving data security is costly. Depending on the use cases and the sensitivity of the data, each enterprise may decide which level of security is desired.
Typically, enterprises implement security on Hadoop step by step.
Alternatively, they hire Cloudera PS to create a custom security implementation plan and complete these steps in one pass.
https://cloudera.app.box.com/files/0/f/6321638305/1/f_56252438130
TPC-DS
The performance impact is very small.
This was tested with Key Trustee; HSM is currently very slow.
AES-NI
As the results below show, the overhead of encryption was 2% in query execution time and 3.1% in CPU time.
A secure system takes more than just a good product. It also requires experienced people to integrate it and operate it. These people must receive the proper training.
Technology: Cloudera’s platform and certified partners’ products, post-sell support
People: Cloudera PS team or SI partners, consulting firms, customer’s admin, users
Process: SOP, documentation, regular audits, compliance plan, not covered in this talk
Depend on existing firewalls.
Leverage the enterprise’s existing firewall mechanisms to set up the perimeter.
First line of defense.
The firewall exposes only the gateway nodes for submitting jobs, plus the CM and CN interfaces.
System chart: CM, master nodes (HA), worker nodes, firewalls.
Cloudera’s platform does not manage user authentication itself. Instead, it relies on external authentication mechanisms for that purpose, such as Kerberos, LDAP, or AD.
With simple authentication, the user name is taken from the local operating system, but keeping local accounts consistent across all nodes is too much effort. Use AD plus SSSD or Centrify instead.
CDH is composed of many open source projects, and as a result not all of them support the same set of authentication mechanisms. Simple, Kerberos, LDAP, and SAML are supported across the platform.
AD integration – it is likely your enterprise is already using Active Directory for user identity control.
--- Use SSSD instead of LdapGroupsMapping.
--- Create a dedicated OU for the cluster.
--- Use LDAP over SSL (LDAPS).
Select a good base DN so that AD lookups return quickly; a slow lookup can stall all operations.
LDAP authentication can be used for CM, Hue, Hive, and Impala. The latency of LDAP requests/responses is critical for cluster performance.
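To sanity-check that latency, here is a minimal sketch of an LDAPS bind using plain JNDI. The host, port, and service account DN are assumptions for illustration, and the JVM truststore must already contain the AD certificate.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.InitialDirContext;

public class LdapBindCheck {
  public static void main(String[] args) throws Exception {
    Hashtable<String, String> env = new Hashtable<>();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, "ldaps://ad.example.com:636"); // hypothetical host
    env.put(Context.SECURITY_AUTHENTICATION, "simple");
    env.put(Context.SECURITY_PRINCIPAL, "CN=svc-hadoop,OU=Cluster,DC=example,DC=com"); // hypothetical DN
    env.put(Context.SECURITY_CREDENTIALS, System.getenv("LDAP_PASSWORD"));

    long start = System.nanoTime();
    InitialDirContext ctx = new InitialDirContext(env); // performs the bind
    System.out.printf("Bind succeeded in %d ms%n", (System.nanoTime() - start) / 1_000_000);
    ctx.close();
  }
}

If the bind takes more than a few hundred milliseconds, revisit the base DN and the network path before pointing CM, Hue, Hive, or Impala at the directory.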
User identity can be forged easily.
It is okay to have an unsecured dev cluster or PoC cluster.
This should be the _minimal_ security requirement for any production cluster
Kerberos is a cryptographic authentication mechanism.
Key Distribution Center KDC
Kerberos -- mapping Kerberos principals to user names (auth_to_local rules)
Simple authentication = no authentication
Time synchronization -- NTP
Keytab handling – a keytab stores the principal’s secret keys and is required for Hadoop services; handle it with care
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_s3_cm_principal.html
CM makes it extremely easy.
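CM generates and distributes the service principals and keytabs for you. For your own applications that must authenticate to a Kerberized cluster, the standard Hadoop client API handles the ticket work. A minimal sketch, with a hypothetical principal and keytab path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in from a keytab; the principal and path are hypothetical.
    UserGroupInformation.loginUserFromKeytab(
        "etl-svc@EXAMPLE.COM", "/etc/security/keytabs/etl-svc.keytab");

    // All subsequent Hadoop calls run as the authenticated principal.
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/user/etl-svc")));
  }
}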
Authentication is a prerequisite for authorization.
Access control lists (ACLs) restrict who can submit work to dynamic resource pools and administer them.
Cloudera Navigator: Enable Audit Collection
Audit log retention
Provenance use case
A number of business decisions and transactions rely on the verifiability of the data used in those decisions and transactions. Data-verification questions might include:
How was this mortgage credit score computed?
How can I prove that this number on a sales report is correct?
What data sources were used in this calculation?
Auditing use case
What was a specific user doing on a specific day?
Who deleted a particular directory?
What happened to data in a production database, and why is it no longer available?
A backup/DR cluster that is purely for DR purposes
(replicating between multiple untrusted Kerberos realms)
https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/
One Kerberos realm per cluster
BDR runs from the destination cluster, so the destination realm must be configured to trust the source realm.
The DR cluster should not be used for any purposes other than DR.
AES/CTR/NoPadding is the encryption algorithm used for HDFS at-rest encryption.
At-rest encryption is required by PCI-DSS, FISMA, HIPAA
Separation of duties -- NameNode vs. KMS
The HDFS superuser cannot decrypt the encryption keys.
At-rest encryption is more complex than in-transit encryption because the keys are typically not rotated for a long time, so a more elaborate mechanism is needed to protect them.
An encryption zone can only be created on an empty directory. The workaround for existing data is to run hdfs distcp to copy the files into the EZ (see the sketch below).
Supports at most 256-bit encryption.
“Always-on encryption zone”/“nested encryption zone” support landed in CDH 5.7, but there is no CM support yet, i.e., it doesn’t work end to end.
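For reference, a minimal sketch of creating an encryption zone programmatically through the HdfsAdmin client API. The nameservice, path, and key name are hypothetical, and the key must already exist in the KMS (e.g. created with hadoop key create).

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class CreateEncryptionZone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI nameservice = URI.create("hdfs://nameservice1"); // hypothetical
    Path zone = new Path("/secure/pii");                  // hypothetical

    // The zone root must be an empty directory.
    FileSystem fs = FileSystem.get(nameservice, conf);
    fs.mkdirs(zone);

    // Requires HDFS administrator privileges. The key material lives in
    // the KMS, never in HDFS, preserving the NameNode/KMS separation of duties.
    HdfsAdmin admin = new HdfsAdmin(nameservice, conf);
    admin.createEncryptionZone(zone, "piiZoneKey");

    // Existing data must then be copied in (e.g. hdfs distcp) so that it
    // is encrypted on write.
  }
}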
https://www.cloudera.com/documentation/enterprise/latest/topics/encryption_ref_arch.html
Resource planning & requirements:
Deployment considerations: at least 2 KMS proxies and at least 2 Key Trustee Servers (a total of 4 hosts). The KTS should be a separate cluster, and the two clusters should be separated by a firewall.
Key Trustee Servers are active-passive. If the active server is down, the passive one can serve reads, but not writes.
Key Trustee Servers should be on their own boxes.
KTS HA: if either one fails, only reads are allowed. That does not affect reading or writing encrypted files, but new encryption zones cannot be created.
You may have more than 2 KMS proxies for load-balancing purposes. KMS is CPU-intensive, so use hardware equivalent to the NameNode’s.
Hardware security module (HSM)
https://cloudera.app.box.com/files/0/f/6321638305/1/f_56252438130
TPC-DS
Misconfiguration
Use AES/CTR/NoPadding as the Data Transfer Encryption Algorithm; the default key length is 128 bits, with 256 bits available (managed by CM).
Low entropy: check /proc/sys/kernel/random/entropy_avail (see the sketch after this list).
Hardware acceleration
OpenSSL library
Entropy
Configuration
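A minimal sketch of an entropy check you could run on each node. The warning threshold is an arbitrary assumption; a common remediation when entropy runs low is rng-tools (rngd).

import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyCheck {
  public static void main(String[] args) throws Exception {
    // Linux exposes the kernel entropy pool size at this path.
    String raw = new String(Files.readAllBytes(
        Paths.get("/proc/sys/kernel/random/entropy_avail"))).trim();
    int entropy = Integer.parseInt(raw);
    System.out.println("Available entropy: " + entropy);
    if (entropy < 1000) { // threshold is an assumption, tune for your fleet
      System.out.println("Low entropy; TLS and Kerberos operations may stall. "
          + "Consider running rngd (rng-tools).");
    }
  }
}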
One of the characteristics of the Hadoop platform is that a variety of tools can access the same set of data.
For example, MapReduce, Hive, Impala, Pig, and third-party software can all access HDFS.
A unified access control is therefore crucial.
Pig, Sqoop, and Kafka are also supported by Sentry.
If Impala is used, Sentry is a must: by default, Impala can be accessed as the user impala.
Third-party BI tools may not support Sentry directly, so access must be enforced through HiveServer2 (see the sketch below).
Migrating from no Sentry to Sentry is a tremendous amount of work, and hard to roll back.
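To make that concrete, here is a minimal sketch of a client going through HiveServer2 on a Kerberized cluster, so that Sentry policies apply to every query. The host and principal are hypothetical, and a valid Kerberos ticket (kinit) is assumed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Client {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // The principal clause tells the driver to authenticate via Kerberos.
    String url = "jdbc:hive2://hs2.example.com:10000/default;"
        + "principal=hive/hs2.example.com@EXAMPLE.COM";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
      // With Sentry enforced at HS2, only objects the caller is
      // authorized to see come back.
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}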
In regulated industries, regulations such as PCI or HIPAA require redaction of PII (such as SSNs).
https://www.cloudera.com/documentation/enterprise/latest/topics/sg_redaction.html
https://blog.cloudera.com/blog/2015/06/new-in-cdh-5-4-sensitive-data-redaction/
Intermediate files: certain services may spill data outside HDFS onto local disk, so additional configuration is required to ensure that this data is encrypted as well.
Navigator Encrypt is a kernel module that intercepts I/O requests to encrypted datastores, including log files, config files, temp files, and databases.
Other references: https://cloudera.app.box.com/files/0/s/firewall/1/f_202846938208
Ben and Joey were both long-time Cloudera Solution Architects.