Slide 2
# whoami
• Global Security SME Lead @hortonworks
• Senior Solutions Engineer @hortonworks
• Book author: Virtualizing Hadoop
• Co-organizer of the Atlanta Hadoop User Group
• Regular speaker at big data conferences
Slide 4
DATA – More Volume and More Types
[Figure: increasing data variety and complexity. Volumes grow from gigabytes (ERP) to terabytes (CRM) to petabytes (WEB) to exabytes (BIG DATA). Sources expand from structured business records (purchase detail, purchase record, payment record, offer details, segmentation, support contacts, customer touches) to web data (web logs, user click stream, offer history, dynamic pricing, A/B testing, affiliate networks, search marketing, behavioral targeting, dynamic funnels) to user-generated and external content (mobile web, SMS/MMS, sentiment, social networks, business data feeds, demographics, HD video, speech-to-text, product/service logs).]
Slide 5
Big Data Ecosystem
[Figure: a big data platform built on HDFS (Hadoop Distributed File System), with YARN as the data operating system running multiple engines (Script, SQL, NoSQL, Stream, Search, In-Mem, and others) under shared Security, Operations, and Governance & Integration services. Data flows in from traditional sources (EDW, OLAP datamarts, column databases, CRM, RDBMS: lending, markets, trades, compliance data, credit card, cash & equity, finance & GL, risk data) and from emerging, non-traditional sources (server logs, call center records, emails, Word documents, location data, sensor data, customer sentiment, research reports). Data repositories feed analysis & visualization for applications such as risk modeling, fraud detection, compliance (AML, KYC), Bank 3.0, information security, single view of customer, trading applications, and market data management.]
Slide 6
Compliance Adherence
• HIPAA - Health Insurance Portability and Accountability Act of 1996
• HITECH - Health Information Technology for Economic and Clinical Health Act
• PCI DSS - Payment Card Industry Data Security Standard
• SOX - Sarbanes-Oxley Act of 2002
• ISO - International Organization for Standardization
• COBIT - Control Objectives for Information and Related Technology
• Corporate security policies
Slide 12
Why Knox?
Simplified Access
• Kerberos encapsulation
• Extends API reach
• Single access point
• Multi-cluster support
• Single SSL certificate
Centralized Control
• Central REST API auditing
• Service-level authorization
• Alternative to SSH “edge node”
Enterprise Integration
• LDAP integration
• Active Directory integration
• SSO integration
• Apache Shiro extensibility
• Custom extensibility
Enhanced Security
• Protect network details
• Partial SSL for non-SSL services
• WebApp vulnerability filter
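As an illustration of the "single access point" bullet, a WebHDFS call through Knox needs only one HTTPS endpoint and basic auth; Knox handles Kerberos with the cluster behind it. The host name here is hypothetical, while the `gateway` context path, `default` topology, and `guest:guest-password` credentials are the usual Knox quick-start defaults, assumed for this sketch:

```shell
# Hypothetical Knox host; "gateway" context path and "default" topology are
# the out-of-the-box Knox values, shown here as assumptions.
KNOX_HOST=knox.example.com
URL="https://${KNOX_HOST}:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"
# One HTTPS endpoint, one SSL certificate, LDAP-backed basic auth; Knox
# performs the Kerberos handshake with the cluster behind the firewall:
#   curl -iku guest:guest-password "$URL"
echo "$URL"
```

The same pattern extends to Hive, HBase, and Oozie REST APIs: only the path after the topology name changes.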
Slide 13
Knox Deployment with Hadoop Cluster
[Figure: a load balancer (LB) and Knox instances form a web tier in the DMZ, in front of the application tier. Behind the firewall, the Hadoop cluster spans racks connected by switches: master nodes (NameNode, Secondary NameNode) on rack 1, and slave nodes (DataNodes) on racks 2 through N. Hadoop CLIs are shown as a separate access path into the cluster.]
Slide 14
What does perimeter security really mean?
[Figure: a user reaches the Hadoop REST APIs through the firewall and the Knox gateway.]
• A firewall is required at the perimeter (today).
• The Knox Gateway controls all Hadoop REST API access through the firewall.
• The firewall only allows connections through specific ports from the Knox host.
• The Hadoop cluster is mostly unaffected; Knox forwards requests to the backing service hosts (WebHDFS, Hive, HBase).
Slide 20
Kerberos Primer
[Figure: interactions between the client (with its Kerberos ticket cache), the KDC, the NameNode (NN), and a DataNode (DN).]
1. kinit: log in and get a Ticket Granting Ticket (TGT) from the KDC.
2. The client stores the TGT in its ticket cache.
3. The client uses the TGT to get a NameNode service ticket (NN-ST).
4. The client stores the NN-ST in its ticket cache.
5. The client reads/writes a file given the NN-ST and file name; the NameNode returns block locations, block IDs, and Block Access Tokens if access is permitted.
6. The client reads/writes blocks on DataNodes given a Block Access Token and block ID.
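The client-side view of the steps above can be sketched as a shell session. The realm and principal are illustrative, and the commands that need a kerberized cluster are commented out so the sketch stays self-contained:

```shell
# Step 1: obtain a TGT from the KDC and write it to the ticket cache:
#   kinit alice@EXAMPLE.COM        # realm/principal are illustrative
# Steps 2-4: klist shows the cached TGT and, once acquired, service tickets
# such as the NameNode service ticket (NN-ST):
#   klist
# Steps 5-6 happen transparently inside the HDFS client, which presents the
# NN-ST to the NameNode and Block Access Tokens to the DataNodes:
#   hdfs dfs -cat /secure1/file.txt
#
# The default cache location on Linux is a per-uid file:
CACHE="${KRB5CCNAME:-FILE:/tmp/krb5cc_$(id -u)}"
echo "ticket cache: $CACHE"
```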
Slide 23
Sample Simplified Workflow - HDFS
1. An admin sets policies for HDFS files/folders in the Ranger Policy Manager.
2. Users access HDFS: a data scientist runs a MapReduce job, application users access HDFS data through the application, and IT users access HDFS through the CLI.
3. The NameNode uses the Ranger plugin for authorization.
4. Audit logs are pushed to the audit database.
5. The NameNode provides resource access to the user/client.
Slide 24
Ranger Stacks
• Apache Ranger v0.5 supports a stack model to enable easier onboarding of new components, without requiring code changes in Apache Ranger.
Ranger-side changes: define a service type
• Create a JSON file with the following details:
  - Resources
  - Access types
  - Config to connect
• Load the JSON into Ranger.
Secured-component-side changes: develop a Ranger authorization plugin
• Include the plugin library in the secured component.
• During initialization of the service, init the RangerBasePlugin and RangerDefaultAuditHandler classes.
• To authorize access to a resource, call isAccessAllowed() with a RangerAccessRequest.
• To support resource lookup, implement RangerBaseService.lookupResource() and RangerBaseService.validateConfig().
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741207
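A minimal sketch of the service-definition JSON from the first column might look like the following. The service name, implClass, and config values are illustrative assumptions; the field names (resources, accessTypes, configs) follow Ranger's servicedef schema:

```json
{
  "name": "myservice",
  "implClass": "org.example.ranger.services.MyServiceRangerService",
  "label": "My Service",
  "resources": [
    { "itemId": 1, "name": "path", "type": "path",
      "mandatory": true, "lookupSupported": true, "label": "Path" }
  ],
  "accessTypes": [
    { "itemId": 1, "name": "read",  "label": "Read"  },
    { "itemId": 2, "name": "write", "label": "Write" }
  ],
  "configs": [
    { "itemId": 1, "name": "service.url", "type": "string",
      "mandatory": true, "label": "Service URL" }
  ]
}
```

Once this JSON is loaded, Ranger renders a policy UI for the new component with no Ranger code changes, which is the point of the stack model.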
Slide 26
Data Protection
Hadoop allows you to apply data protection policies at two different layers across the Hadoop stack:
• Storage: encrypt data at rest on disk
  – Volume level: LUKS (Linux), BitLocker (Windows)
  – Native in Hadoop: HDFS encryption
  – OS-level encryption
  – Partners: Voltage, Protegrity, DataGuise, Vormetric
• Transmission: encrypt data as it moves
  – Native in Hadoop: SSL and SASL
  – AES-256 for SSL, and for the data transfer protocol (DTP) with SASL
Slide 28
HDFS Encryption – How it works
[Figure: an HDFS client reads/writes files in an encryption zone. The NameNode stores encryption zone attributes (EZ key ID, version; HDFS-6134) and encrypted-file attributes (EDEK, IV). Both the client and the NameNode talk to the Key Management System (KMS, HADOOP-10433) through the KeyProvider API (HADOOP-10141): the NameNode hands the client an EDEK, the KMS decrypts it into a DEK, and the client reads/writes through a crypto stream using the DEK. The KMS holds the EZKs and DEKs; the NameNode never sees an unencrypted DEK.]
Acronyms:
• EZ: Encryption Zone (an HDFS directory)
• EZK: Encryption Zone Key; the master key associated with all files in an EZ
• DEK: Data Encryption Key; a unique key associated with each file. The EZ key is used to generate the DEK.
• EDEK: Encrypted DEK; the NameNode only has access to the encrypted DEK.
• IV: Initialization Vector
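To make the EZK/DEK/EDEK relationship concrete, here is a toy envelope-encryption round trip using openssl as a stand-in for the KMS. The file names are illustrative, and HDFS itself does this through the KeyProvider API with AES-CTR streams, not with these commands; only the key-wrapping idea carries over:

```shell
set -e
workdir=$(mktemp -d)
# "EZK": the zone master key, held only by the KMS
openssl rand -hex 32 > "$workdir/ezk"
# "DEK": the per-file key used for the actual crypto stream
openssl rand -hex 32 > "$workdir/dek"
# "EDEK": the DEK encrypted under the EZK; this is all the NameNode stores
openssl enc -aes-256-cbc -pbkdf2 -pass file:"$workdir/ezk" \
  -in "$workdir/dek" -out "$workdir/edek"
# On read, the client hands the EDEK to the KMS, which unwraps it to a DEK
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:"$workdir/ezk" \
  -in "$workdir/edek" -out "$workdir/dek.out"
cmp -s "$workdir/dek" "$workdir/dek.out" && echo "DEK round-trip OK"
```

Note that an attacker who reads the NameNode's metadata gets only EDEKs, which are useless without access to the KMS.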
Slide 29
HDFS Encryption – Common Commands
As HDFS admin:
• Run the KMS server
  – ./kms.sh run
• Create an encryption key
  – hadoop key create key1 -size 128
  – # Key size can be 128, 192, or 256; 256 requires the unlimited-strength JCE policy files.
• List all encryption keys
  – hadoop key list -metadata
• Create an encryption zone (as the hdfs admin user)
  – hdfs crypto -createZone -keyName key1 -path /secure1
  – Must point to an existing, empty directory
• List all encryption zones
  – hdfs crypto -listZones
As HDFS end-user (run these as a user not in the HDFS admin role):
• Read/write to HDFS unchanged
  – hdfs dfs -copyFromLocal /tmp/vinay.txt /secure1
  – hdfs dfs -cat /securehive/sal.txt
Slide 30
Encrypting Data In-Motion
Protocol / communication point / encryption mechanism:
• REST (WebHDFS, client to cluster; client to Knox)
  – REST over SSL; Knox Gateway SSL
  – SPNEGO provides a mechanism for extending Kerberos to web applications through the standard HTTP protocol
• HTTP (NameNode/JobTracker UI; MapReduce shuffle)
  – HTTPS
  – Encrypted MapReduce shuffle (MAPREDUCE-4117)
• RPC (Hadoop client; client to cluster and intra-cluster)
  – SASL: the Hadoop RPC system implements SASL, which provides different QoP levels including encryption
• JDBC/ODBC (HiveServer2)
  – SSL
• TCP/IP (data transfer; client to cluster and intra-cluster)
  – Encrypted data transfer protocol available in Hadoop; SASL support added to the DataTransferProtocol
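As a sketch, the RPC and data-transfer rows above map onto standard Hadoop configuration properties like the following (split across core-site.xml and hdfs-site.xml; the values shown select the encrypted settings):

```xml
<!-- core-site.xml: SASL QoP for Hadoop RPC
     ("privacy" = authentication + integrity + encryption) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypted DataTransferProtocol
     between clients and DataNodes -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
```

SSL for the web UIs, shuffle, and HiveServer2 is configured separately per service (keystores plus the corresponding https/ssl properties).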