Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protecting Sensitive Data in Hadoop (Dataguise) 1. © 2014 Dataguise Inc. All rights reserved.
Discovering & Protecting
Sensitive Data in Hadoop
jeremy@dataguise.com
2. Goals for Today
• Big Data for banking, healthcare, tech, government, education, etc. needs data security (but few have workable approaches in production today)
• Hadoop security approaches (what works and doesn't work from the past; challenges in the present)
• Real-world case studies (data-centric protection)
» Credit card security
» Healthcare data lake (Data-as-a-Service)
» Product analytics in the cloud
4. Data Growth
• 100% growth and 80% unstructured data by 2015 (chart in exabytes)
• …finding and classifying sensitive data will get harder
5. Real-World Unstructured Data Scenarios
• Web comment fields, customer surveys, CRM data
• Patient and doctor medical data in emails, PDFs, doctor's notes
• Voice-to-text files in Hadoop for customer service optimization
• Log data from wellheads and oil-drilling sensors
• Web e-commerce pay systems
7. Why Security in Big Data
Use cases by vertical (Refine, Explore, Enrich):
• Retail & Web: log analysis, site optimization; social network analysis; dynamic pricing; session & content optimization
• Retail: loyalty program optimization; brand & sentiment analysis; dynamic pricing / targeted offers
• Intelligence: threat identification; person-of-interest discovery; cross-jurisdiction queries
• Finance: risk modeling & fraud identification; trade performance analytics; surveillance & fraud detection; customer risk analysis; real-time upsell and cross-sell marketing offers
• Energy: smart grid production optimization; grid failure prevention; smart meters; individual power grid
• Manufacturing: supply chain optimization; customer churn analysis; dynamic delivery; replacement parts
• Healthcare & Payer: electronic medical records (EMPI); clinical trials analysis; insurance premium determination
8. Why Security in Big Data
• The same vertical use cases, overlaid with the classes of sensitive data they touch: privacy data, PCI or financial data, and personal health information (PHI)
• Nearly every use case above involves at least one of these regulated data types
9. Three Critical Considerations
1. Ensuring compliance
• The big Ps (PCI, HIPAA, Privacy), data residency, FERPA, FISMA, FERC, etc.
• 1,200 laws in 63 countries
2. Reducing breach risk
3. Quantifying both:
» How much sensitive data? ("un-announced")
» Who is adding it? (ad hoc user directories)
» Who is accessing it? (sharing, selling, re-purposing)
10. The Evolution of Hadoop Projects
Lab Project
• Hadoop as R&D; strictly data science
• Zero $$$ or selection of distribution
• Zero recognition of sensitive data or exposure
Proof Stage
• Achieving value; data lake cost savings
• Line-of-business ownership; nodal expansion
• Security elements? (unknown to InfoSec)
ROI Validity
• ROI and TCO validity
• Distribution selection and purchase
• The security "a-ha" moment
• Solved with legacy or penalty-box Hadoop
On-Demand Hadoop
• Full-scale production; ad hoc new uses
• Go faster: Spark, Kafka
• Security sanctified
11. On-Demand Hadoop
• Without adequate sensitive-data protection, customers are left to "penalty boxing" Hadoop
» "Security zones" imposed by InfoSec
» Slows business; costly and cumbersome
• Data-centric protection can set those assets free
12. Data Protection
In Hadoop
13. Security in Hadoop: In Summary
• Like cloud, mobile, and virtualization, Big Data drives fundamentally new rules in security
» Ad hoc computing, wide-open data sets
» Extended users and usages; sharing and selling
» 3 Vs moving to 6 Vs (automation, non-blocking)
• Problem #1 is compliance
» Reporting/auditing/monitoring is as important as, or more important than, data security
• Data-centric protection can help
14. Hadoop Security Framework
• Perimeter: guarding access to the cluster itself. Technical concepts: authentication, network isolation
• Access: defining what users and applications can do with data. Technical concepts: permissions, authorization
• Data: protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking
• Visibility: reporting on where data came from and how it's being used. Technical concepts: auditing, lineage
• These are the four approaches to addressing security within Hadoop (Perimeter, Access, Data, Visibility)
• Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage
15. Kerberos on Hadoop
• Kerberos (developed at MIT) has been the de facto standard for strong authentication/authorization
» Protects against user and service spoofing attacks, and allows enforcement of user HDFS access permissions
• What does Kerberos do?
» Establishes identity for clients, hosts, and services
» Prevents impersonation; passwords are never sent over the wire
» Tickets grant cryptographic "permissions" to resources
• Kerberos has been the core of authentication in native Apache Hadoop since 2010
» Used to access ecosystem services (HDFS, JobTracker, Oozie) and for server-to-server traffic authentication, BUT complex to manage!
» Lots of steps, for example:
http://www.cloudera.com/content/cloudera--content/cloudera--docs/CDH4/4.3.0/CDH4--Security--Guide/cdh4sg_topic_3.html
16. MapR Improvements on Authentication/Authorization
• Vastly simpler
» No requirement for Kerberos in core
» Identity represented using a ticket issued by MapR CLDB servers (Container Location DataBase)
» Core services secured by default
• Easier integration
» User identity independent of host or operating system
» Local to MapR (no external Kerberos required)
• Faster
» Leverages Intel-accelerated hardware crypto
17. Elements of Data-Centric Protection
1. Identify which elements you want to protect via:
» Delimiters (structured data), name-value pairs (semi-structured), or a data discovery service (unstructured)
2. Automated protection options: automatically apply protection via:
» Format-preserving encryption (FPE)
» Masking (replace, randomize, Intellimask, static)
» Redaction (nullify)
3. Audit strategy
» Sensitive data protection/access/lineage
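As an illustration of step 2, here is a minimal Python sketch of two of the protection options, format-preserving random replacement and redaction. The function names are illustrative only, not the Dataguise API:

```python
import random
import string

def mask_replace(value: str) -> str:
    """Masking by replacement: swap each character for a random one of the
    same class, preserving length, digit positions, and delimiters
    (a simple flavor of format preservation)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep delimiters such as '-' or ' '
    return "".join(out)

def redact(value: str) -> str:
    """Redaction: nullify the field entirely. Not reversible."""
    return ""
```

Real FPE (e.g., the NIST FF1/FF3 modes) is keyed and reversible; the replacement above is one-way and purely for illustration.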
18. Discovery
• Within HDFS
» Search for sensitive data per company policy: PII, PCI, …
» Handle complex data types such as addresses
» Process incrementally (default) to handle only new content
• In-flight
» Process data on the fly as it is ingested into Hadoop HDFS
» Plug-in solution for FTP, Flume, and Sqoop
» Search for sensitive data per policy: PII, PCI, HIPAA, …
» NEXT UP: Kafka
19. How Discovery Works
• MapReduce or Flume/FTP/Sqoop agent
» Root directories and drill-downs
» Can scan the entire dataset or incrementally (watermarking)
• Runs pattern, logic, context, algorithm, and ontology filters
• Can utilize white/black lists and reference sets
20. Protection Measures
• A protection plan should start with cutting
» What data can we delete/cut?
» What data can be redacted?
» Masking choices:
• Consistency
• Realistic-looking data
• Partial reveal (Intellimask): Credit Card # 4541 **** **** 3241
• What data needs reversibility?
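The partial-reveal (Intellimask-style) example above, keeping the first and last four digits while preserving the original grouping, could be sketched as follows (the function name and the unmasked middle digits in the example are illustrative):

```python
def intellimask(card: str, keep_head: int = 4, keep_tail: int = 4) -> str:
    """Partial reveal: keep the first and last digits, star the middle,
    and preserve non-digit delimiters such as spaces or dashes."""
    total = sum(c.isdigit() for c in card)
    seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            if seen < keep_head or seen >= total - keep_tail:
                out.append(ch)       # revealed head/tail digit
            else:
                out.append("*")      # masked middle digit
            seen += 1
        else:
            out.append(ch)           # keep formatting intact
    return "".join(out)
```

For instance, intellimask("4541 8123 9921 3241") yields "4541 **** **** 3241", matching the slide's example format.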
21. Encryption "vs" Masking
• Encryption:
+ Reversible
+ Trusted, with security proofs
+ The first hammer
+ De-centralized architectures
- Complex
- Key management
- Useless without robust authentication and authorization
- Data value destruction
- Needs both encrypt and decrypt tooling
• Masking:
+ Highest security
+ Realistic data
+ Range- and value-preserving
+ Once and done
+ Scale-out and distributed
+ No performance impact on usage
+ Zero need for authentication, authorization, and key management
- Not as well marketed
- Not reversible
22. Encryption "vs" Masking
• Masking:
+ Highest security
+ Realistic data
+ Range- and value-preserving
+ Format-preserving and partial reveals
+ Scale-out and distributed
+ No performance impact on usage
+ Zero need for authentication, authorization, and key management
- Not as well marketed
- Not reversible
- Perceived to grow data
• Encryption:
+ Reversible
+ Trusted, with security proofs
+ Format-preserving and partial reveals
+ Scale-out and distributed
+ The first hammer
+ De-centralized architectures
- Complex
- Key management
- Useless without robust authentication and authorization
- Data value destruction
The fundamental decision between masking and encryption comes down to reversibility:
• Some elements in analytics must resolve to the original (e.g., 66.249.22.145 or $34,332.12)
• Some elements are ideal for pseudonyms: Social Security numbers, credit card numbers, names
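For elements that are "ideal for pseudonyms," one common approach (an assumption here, not necessarily what Dataguise uses) is a keyed hash: deterministic, so masked values stay consistent and can still serve as join/index keys across datasets, yet not reversible without the key and a brute-force search of the input space:

```python
import hashlib
import hmac

def pseudonym(value: str, key: bytes, width: int = 9) -> str:
    """Deterministic pseudonym via HMAC-SHA256, truncated to a fixed-width
    digit string. Same input + same key -> same output, so two tables
    masked with the same key still join on the masked column."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return str(int(digest, 16) % 10**width).zfill(width)
```

Note the trade-off the slides describe: unlike encryption, there is no decrypt path, so this only suits fields that never need to resolve back to the original value.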
23. Real-World Performance
• Leveraging the power of MapReduce to run distributed encryption or masking
• Data volume: 2.2 TB
• Run time: 23 min
• Sensitive data: 8 of 50 columns across 2.2 Bn rows
• Run on a 360-node MapR system
• With old-world database technology, this type of job would have taken days or weeks
24. Audit Strategy
• Essential to all goals: compliance, breach protection, visibility, and metrics
• Avoids the "gotcha" moment
» Show all sensitive elements (count, location)
» Remediation applied
» Dashboard for fast access to critical policies, and drill-downs for file and user actions
25. How It Works: Detection and Protection, In-Flight or @Rest
Sources: RDBMS transaction data, data warehouse, web site, FTP.
• DgFlume Agent (Flume plug-in): 1. Detect sensitive data. 2. Protect by applying masking/encryption policies.
• DgSqoop Agent (Sqoop): 1. Detect sensitive data. 2. Protect by applying masking/encryption policies.
• DgHDFS Agent (Hadoop API; discover/mask/encrypt): 1. Detect sensitive data. 2. Protect by applying masking/encryption policies.
• DGHive, HDFS bulk decryption / Java app (Hadoop API): selective decryption based on user/role and policy.
• DGDiscover-Masker: 1. Discover in DB (Oracle, SQL.., SharePoint, files). 2. Protect by applying masking/encryption policies.
Flow into the production cluster:
1. Data discovery and protection while loaded into HDFS
2. Data masked or encrypted in HDFS with a MapReduce job
3. Users can now access the data
26. Case Studies
27. Protecting Sensitive Data in a Top Credit Card Firm
Source: credit card transactions, Omniture files
Data protection: incremental updates to HDFS automatically protected
Analysis: selective access to sensitive data based on role and app
Objectives
• Consolidate existing payment-risk analysis inside high-scale, lower-cost Hadoop
• Provide tiered access authorization for multiple business apps (fraud, risk, cross-sell)
Solution
• MapR Hadoop for a single, reliable, high-performance data analysis platform
• Dataguise consistent masking enables analysis and unique index-key values for de-identified data
• Unique ability to output protected data in an adjacent column, or appended with a delimiter inside the existing column, to protect data while governing access via authorization rules
Results & Benefits
• Continuous real-time protection (job runs every 5 mins on ingest)
• Analytics draws on the secure purchasing data of 90 million credit card holders across 127 countries
28. Protecting Personal Health Info (PHI) in an Aggregate Data Lake
DG FTP Agent ingests SQL data and health records over FTP; HDFS authorization is controlled through group membership in Active Directory.
Objectives
• Reduce costly and preventable readmissions, decrease mortality rates, and improve quality of life for patients
• Internal data service model: DAaaS (Data Architecture as a Service)
Solution
• Solution needs to protect structured and unstructured source data in database, data warehouse, and flat-file structures
• Customer required customization of encryption and key management to fit their existing corporate infrastructure and security policies
• Dataguise dashboard gives admins an easy way to identify directories/files containing sensitive data
Results & Benefits
• Delivered a cost-effective, easy way to determine where sensitive data resides within the cluster and how it has been protected
• Seamless access to encrypted data from a variety of data access methods (Hive, Pig, analytic tools)
29. Global Tech Product Analytics
DG Flume Agent sits on Apache Flume; smartphone device-log collectors feed AWS clouds in Korea, Singapore, US (3), UK, and Ireland, and a virtualized DG Secure writes protected data to Amazon S3.
Objectives
• Aggregate logging data (product, usage, user configuration) for all smartphones worldwide
• De-identify personal user info to ensure privacy and compliance with European/US privacy regulations
Solution
• Customer routes all device logging data into 7 global AWS clouds
• Uses the Dataguise Flume agent to protect all sensitive data being written to Amazon S3
• Runs Dataguise in AWS; also utilizes Dataguise EMR security agents to selectively decrypt for authorized analytics in AWS
Results & Benefits
• On-demand Hadoop for product analytics, user behavior, and supply-chain optimization
• High performance and high scale-out paramount
• 100% cloud-based security
30. Hadoop Data-Protection Checklist
• Discover sensitive data
• Automate protective measures
• Integrate into Hadoop authorization
• With continuous real-time tracking
• Dashboards, reports, auditing
• Automated risk assessment/scoring
• Automated inference protection (roadmap)
31. Thank You
Jeremy Stieglitz
VP Products
jeremy@dataguise.com