End-to-End Security and Auditing in a
Big-Data-as-a-Service (BDaaS) Deployment
Nanda Vijaydev - BlueData
Abhiraj Butala - BlueData
Big-Data-as-a-Service (BDaaS)
“A mechanism for the delivery of statistical analysis tools and
information that helps organizations understand and use insights
gained from large information sets in order to gain a competitive
advantage.”
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications,
Analytics
Multi-Tenant Big-Data-as-a-Service
[Diagram: three tenants (Marketing, R&D, Manufacturing) with the use cases 360 Customer View, Log Analysis, and Predictive Maintenance; clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4; a Data/Storage layer containing the Data Lake and Staging]
• Multiple compute services (Hadoop, BI, Spark)
• There is a shared Data Lake (shared HDFS)
Why BDaaS? – Compute Side Of The Story
• The set of applications that interact with
Hadoop keeps growing
• Various versions of the same app/distro
run in parallel
• Enterprises need to scale compute
up and down based on usage
• A model similar to Amazon AWS with S3
as storage and applications on EC2
Why BDaaS? – Data Side Of The Story
• Production cluster access takes time and
is generally restricted
• Staging clusters may not have all the data
• Data exists on other storage systems such
as NFS; Isilon is common
• Users also want to upload arbitrary files
for analysis
Hadoop – A Collection Of Services
Hadoop is a collection of storage and compute services such as HDFS, HBase,
Hive, YARN, Solr, and Kafka
Security In Hadoop
• Authenticate the user into the Hadoop ecosystem
– Each service has its own integration with LDAP/AD for
authentication
• Authorize users and limit their actions to selected services.
Authorization is granted separately for each service.
Examples:
– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’ and
‘-wx’ to user ‘bob’
– Enable column-level access to a Hive table: “Customer.Name”
and “Customer.PhoneNumber” are accessible only to some users
and groups
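The HDFS example above can be read as a lookup in a three-character permission string. The sketch below only illustrates that reading. The helper names `allows` and `can` are hypothetical, and treating an unlisted user as having no access is an assumption made for the example, not something HDFS or the slides state.

```python
# Permissions from the example for the folder /user/customer.
folder_perms = {"alice": "r-x", "bob": "-wx"}

def allows(perm: str, action: str) -> bool:
    """Return True if a permission string such as 'r-x' grants the action."""
    position = {"read": 0, "write": 1, "execute": 2}[action]
    return perm[position] != "-"

def can(user: str, action: str) -> bool:
    """Check a user against the example folder; unlisted users get '---' (assumption)."""
    return allows(folder_perms.get(user, "---"), action)

for user in ("alice", "bob"):
    granted = [a for a in ("read", "write", "execute") if can(user, a)]
    print(user, granted)
```

Here alice can list and traverse the folder but not modify it, while bob can modify and traverse it but not list it.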
Ranger – A Pluggable Security Framework
• Ranger works with a common user DB (LDAP/AD) for authentication
• Provides a plug-in for individual Hadoop services to enable
authorization
• Allows users to define policies in a central location, using the web UI or
APIs
• Users can define their own plug-in for a custom service and manage
it centrally via Ranger Admin
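As a rough illustration of defining a policy through the API rather than the web UI, the sketch below builds a policy document as a plain dictionary. The field names follow the general shape of Ranger's policy JSON as commonly documented, but they should be checked against the Ranger version in use. The service name `datalake_hdfs` and the policy name `marketing-read` are hypothetical, and no request is actually sent.

```python
import json

def hdfs_policy(service, name, path, users, access_types):
    """Build a Ranger-style HDFS policy document (field names are assumptions)."""
    return {
        "service": service,            # name of the HDFS service registered in Ranger
        "name": name,
        "isEnabled": True,
        "isAuditEnabled": True,
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [
            {
                "users": list(users),
                "accesses": [{"type": t, "isAllowed": True} for t in access_types],
            }
        ],
    }

payload = hdfs_policy(
    "datalake_hdfs", "marketing-read", "/user/customer", ["alice"], ["read", "execute"]
)
print(json.dumps(payload, indent=2))
```

A document like this would typically be submitted to the Ranger Admin policy endpoint by an administrator account; the exact endpoint and authentication depend on the deployment.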
Defining HDFS Ranger Policies
HDFS Policy List
Marketing Policy Drill Down
Security Considerations in BDaaS
[Diagram: the multi-tenant layout (Marketing, R&D, and Manufacturing clusters over the Data/Storage layer with Data Lake and Staging), annotated with three numbered considerations]
1. User identity within the Data Lake
2. User identity in the application layer
3. User identity propagation to the data layer: prevent data duplication and maintain user integrity across layers
1. Securing The Data Lake
[Diagram: the multi-tenant layout with LDAP and KDC. Steps listed: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer]
2. Securing The App Layer
[Diagram: the multi-tenant layout with LDAP and KDC, and the users Alice, Bob, and Tom. Steps listed: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer]
• App containers are integrated with LDAP
3. Identity Propagation to Data Layer
[Diagram: the multi-tenant layout with LDAP and KDC, and the users Alice, Bob, and Tom. Steps listed: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer]
User Identity Propagation
Two Ways
–Users connect directly to HDFS
• Simple Authentication
• Kerberos Authentication
–Users connect to HDFS via a Super-user
(Impersonation)
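One way to see the difference between the two routes is through WebHDFS request URLs. Under simple authentication the caller names itself with `user.name`. With impersonation, a super-user authenticates as itself and names the end user with `doas`. The sketch below only builds the URLs and sends nothing. The host name, port, and the super-user name `datatap` are assumptions for illustration, and on a Kerberized cluster the caller's identity would come from the Kerberos (SPNEGO) handshake rather than `user.name`.

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, path, op, user, doas=None):
    """Build a WebHDFS URL; 'doas' names the impersonated end user, if any."""
    params = {"op": op, "user.name": user}
    if doas is not None:
        params["doas"] = doas
    return f"http://{namenode}/webhdfs/v1{path}?{urlencode(params)}"

NAMENODE = "namenode.example.com:50070"  # hypothetical host and port

# Direct connection: alice talks to HDFS as herself.
direct = webhdfs_url(NAMENODE, "/user/customer", "LISTSTATUS", "alice")

# Super-user connection: a service account acts on behalf of alice.
proxied = webhdfs_url(NAMENODE, "/user/customer", "LISTSTATUS", "datatap", doas="alice")

print(direct)
print(proxied)
```

In both cases the effective user seen by HDFS is alice, which is what lets the same authorization policies apply.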
HDFS Direct Connections
[Diagram: the users Alice, Bob, and Tom on the Marketing, R&D, and Manufacturing clusters, with LDAP and KDC, connecting directly to the HDFS Data Lake]
HDFS Direct Connections..
– hdfs-audit.log
– Ranger policies are enforced for alice and bob as they are
the effective users
HDFS Direct Connections..
• Single Hadoop Setup
– Ideal
• Multi-tenant, Multi-application Setup
– Kerberized HDFS needs kerberized compute and services
– May not want to kerberize Dev/QA setups
– Hadoop versions should be compatible all across
– Data duplication
HDFS Super-user Connections
• Super-users perform actions on behalf of other users
(Impersonation/Proxying)
• Adding a new super-user is easy
– core-site.xml
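Super-users are declared with the `hadoop.proxyuser.<name>.hosts` and `hadoop.proxyuser.<name>.groups` properties in core-site.xml. As a minimal sketch, the snippet below generates those two properties. The super-user name `datatap`, the host name, and the group names are hypothetical values, and the output is a fragment to merge into an existing file rather than a complete core-site.xml.

```python
import xml.etree.ElementTree as ET

def proxyuser_properties(superuser, hosts, groups):
    """Build the <property> entries that let 'superuser' impersonate other users."""
    config = ET.Element("configuration")
    for suffix, values in (("hosts", hosts), ("groups", groups)):
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{superuser}.{suffix}"
        ET.SubElement(prop, "value").text = ",".join(values)
    return config

config = proxyuser_properties(
    "datatap",                       # hypothetical super-user account
    hosts=["gateway1.example.com"],  # hosts the super-user may connect from
    groups=["marketing", "rnd"],     # groups whose members may be impersonated
)
print(ET.tostring(config, encoding="unicode"))
```

Restricting hosts and groups to the smallest workable set, instead of a wildcard, limits what a compromised super-user account can do.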
HDFS Super-user Connections..
[Diagram: the users Alice, Bob, and Tom on the Marketing, R&D, and Manufacturing clusters, with LDAP and KDC, reaching the HDFS Data Lake through the DataTap Caching Service (via super-user)]
HDFS Super-user Connections..
– hdfs-audit.log
– Ranger Authorization policies still enforced, as alice and bob
are effective users
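To check who the effective user was for a given request, the `ugi=` field of an audit entry can be parsed. The sketch below assumes the usual key=value layout of HDFS audit lines, in which a proxied request appears as `ugi=<effective> (auth:PROXY) via <real> (...)`. The sample lines are invented for illustration and are not taken from the slides.

```python
import re

# Matches "ugi=alice (auth:PROXY) via datatap (auth:SIMPLE)" as well as "ugi=bob (auth:SIMPLE)".
UGI = re.compile(
    r"ugi=(?P<effective>\S+) \(auth:(?P<auth>\w+)\)"
    r"(?: via (?P<real>\S+) \(auth:\w+\))?"
)

def users_from_audit_line(line):
    """Return the effective user, the real (super) user if any, and the auth method."""
    m = UGI.search(line)
    if m is None:
        return None
    return {"effective": m["effective"], "real": m["real"], "auth": m["auth"]}

# Invented sample entries (assumed layout).
proxied = ("allowed=true\tugi=alice (auth:PROXY) via datatap (auth:SIMPLE)\t"
           "ip=/10.0.0.5\tcmd=open\tsrc=/user/customer/data.csv\tdst=null\tperm=null")
direct = ("allowed=false\tugi=bob (auth:SIMPLE)\tip=/10.0.0.6\tcmd=open\t"
          "src=/user/customer/data.csv\tdst=null\tperm=null")

print(users_from_audit_line(proxied))
print(users_from_audit_line(direct))
```

For a proxied request the real user is the super-user account, while authorization decisions are made for the effective user.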
HDFS Super-user Connections..
Multi-tenant, Multi-application Setup
– Works for applications which don’t support Kerberos (yet)
– Dev/Test setups need not be kerberized
– DataTap service can abstract version incompatibilities
– Can help avoid data duplication
– Need tight LDAP/AD integration though!
Ranger in Action
Hue Example
HDFS Permissions on Data Lake
• Set HDFS file
access for
‘/user/secret’ to
strict mode
• Set umask to ‘077’
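A umask of 077 clears every group and other permission bit on newly created files and directories. The arithmetic is sketched below, assuming the requested modes before masking are 666 for files and 777 for directories. In HDFS the umask is normally set through the `fs.permissions.umask-mode` property; the helper names here are hypothetical.

```python
def apply_umask(requested_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask

def rwx(mode: int) -> str:
    """Render the nine permission bits as a string such as 'rwx------'."""
    flags = "rwxrwxrwx"
    return "".join(f if mode & (1 << (8 - i)) else "-" for i, f in enumerate(flags))

file_mode = apply_umask(0o666, 0o077)  # assumed default request for files
dir_mode = apply_umask(0o777, 0o077)   # assumed default request for directories

print(oct(file_mode), rwx(file_mode))
print(oct(dir_mode), rwx(dir_mode))
```

With this umask only the owner keeps access to new files and directories, so any broader access has to be granted explicitly, for example through Ranger policies.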
HDFS Ranger Policies
DataTap Caching Service
Create Table via Hue
Query table via Hue - Success
Query table via Hue - Failure
Ranger Audit Logs
Key Takeaways
• BDaaS is more than Hadoop-as-a-Service
– Includes BI / ETL / Analytics + Data Science tools
• Security is an important consideration in BDaaS
• Data duplication is not an option
• Global user authentication using a centralized DB like LDAP/AD is a must
• Apache Ranger helps in enforcing global policies, provided user identities
are propagated correctly
Q & A
www.bluedata.com
Nanda Vijaydev
@nandavijaydev
Abhiraj Butala
@abhirajbutala

End-to-End Security and Auditing in a Big Data as a Service Deployment

  • 1. End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment Nanda Vijaydev - BlueData Abhiraj Butala - BlueData
  • 2. “A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.” On-Demand, Self-Service, Elastic Big Data Infrastructure, Applications, Analytics Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification Big-Data-as-a-Service (BDaaS)
  • 3. Multi-Tenant Big-Data-as-a-Service Data/Storage Prod 2.2 Dev/Test 2.4 POC 2.3 Prod 2.3 Dev/Test 2.4 MARKETING R&D MANUFACTURING 360 Customer View Log Analysis Predictive Maintenance Data LakeStaging Multiple compute services (Hadoop, BI, Spark) There is a shared Data Lake (Shared HDFS)
  • 4. Why BDaaS? – Compute Side Of The Story • Set of applications that interact with Hadoop keeps growing • Various versions of the same app/distro run in parallel • Enterprises have need to scale compute up and down based on usage • A model similar to Amazon AWS with S3 as storage and applications on EC2
  • 5. Why BDaaS? – Data Side Of The Story • Production cluster access takes time and is generally restricted • Staging clusters may not have all the data • Data exists on other storage systems such as NFS Isilon is common • Users also want to upload arbitrary files for analysis
  • 6. Hadoop – A Collection Of Services Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, Yarn, Solr, Kafka
  • 7. Security In Hadoop • Authenticate user into Hadoop ecosystem – Each service has its own integration with LDAP/AD for authentication • Authorize and limit their actions to selected services. Authorization is granted separately for each service. Example: – Folder “/user/customer” in HDFS has ‘r-x’ to user ‘alice’, and ‘- wx’ to user ‘bob’ – Enable column level access to a Hive Table. “Customer.Name” & “Customer.PhoneNumber” is only accessible by some users and groups
  • 8. Ranger – A Pluggable Security Framework • Ranger works with a common user DB (LDAP/AD) for authentication • Provides a plug-in for individual Hadoop services to enable authorization • Allows users to define policies in a central location, using WEB UI or APIs • Users can define their own plug-in for a custom service and manage them centrally via Ranger Admin
  • 9. Defining HDFS Ranger Policies HDFS Policy List Marketing Policy Drill Down
  • 10. Security Considerations in BDaaS [Diagram: MARKETING, R&D and MANUFACTURING tenants (360 Customer View, Log Analysis, Predictive Maintenance) running clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4 over shared Data/Storage (Data Lake, Staging); callouts: 1. User Identity – Data Lake, 2. User Identity – Application Level, 3. User Identity propagation to Data Layer] 1. User identity within a Data Lake 2. User identity in application layer 3. Prevent data duplication & maintain user integrity across layers
  • 11. 1. Securing The Data Lake [Same diagram, with LDAP and KDC attached to the Data Lake; callouts: 1. Authentication & Authorization – Data Lake, 2. User Identity – Application Level, 3. User Identity propagation to Data Layer]
  • 12. 2. Securing The App Layer [Same diagram, with LDAP and KDC; users Alice, Bob, Tom; same three callouts] App containers are integrated with LDAP and KDC
  • 13. 3. Identity Propagation to Data Layer [Same diagram, with LDAP and KDC at both the Data Lake and the app layer; users Alice, Bob, Tom; same three callouts]
  • 14. User Identity Propagation Two Ways –Users connect directly to HDFS • Simple Authentication • Kerberos Authentication –Users connect to HDFS via a Super-user (Impersonation)
  • 15. HDFS Direct Connections [Diagram: users Alice, Bob and Tom in the MARKETING, R&D and MANUFACTURING clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) connect directly to the HDFS Data Lake; LDAP and KDC on both sides]
  • 16. HDFS Direct Connections.. – hdfs-audit.log – Ranger policies are enforced for alice and bob as they are the effective users
  • 17. HDFS Direct Connections.. • Single Hadoop Setup – Ideal • Multi-tenant, Multi-application Setup – Kerberized HDFS needs kerberized compute and services – May not want to kerberize Dev/QA setups – Hadoop versions must be compatible across all clusters – Data duplication
  • 18. HDFS Super-user Connections • Super-users perform actions on behalf of other users (Impersonation/Proxying) • Adding a new super-user is easy – core-site.xml
  • 19. HDFS Super-user Connections.. [Diagram: users Alice, Bob and Tom in the MARKETING, R&D and MANUFACTURING clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) reach the HDFS Data Lake through the DataTap Caching Service, via a super-user; LDAP and KDC on both sides]
  • 20. HDFS Super-user Connections.. – hdfs-audit.log – Ranger Authorization policies still enforced, as alice and bob are effective users
  • 21. HDFS Super-user Connections.. Multi-tenant, Multi-application Setup – Works for applications which don’t support Kerberos (yet) – Dev/Test setups need not be kerberized – DataTap service can abstract version incompatibilities – Can help avoid data duplication – Need tight LDAP/AD integration though!
  • 23. HDFS Permissions on Data Lake • Set HDFS file access for ‘/user/secret’ to strict mode • Set umask to ‘077’
  • 27. Query table via Hue - Success
  • 28. Query table via Hue - Failure
  • 30. Key Takeaways • BDaaS is more than Hadoop-as-a-Service – Includes BI / ETL / Analytics + Data Science tools • Security is an important consideration in BDaaS • Data duplication is not an option • Global user authentication using a centralized DB like LDAP/AD is a must • Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly
  • 31. Q & A www.bluedata.com Nanda Vijaydev @nandavijaydev Abhiraj Butala @abhirajbutala
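
Slide 18 says a new super-user is added in core-site.xml but the transcript does not show the entries. The sketch below generates such a fragment, assuming Hadoop's standard proxy-user property names (`hadoop.proxyuser.<name>.hosts` and `hadoop.proxyuser.<name>.groups`). The super-user name `datatap`, the host name, and the group names are hypothetical placeholders, not values from the talk.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(superuser, hosts, groups):
    """Build the proxy-user properties for one super-user.

    hosts  : hosts the super-user may connect from
    groups : groups whose members the super-user may impersonate
    """
    prefix = f"hadoop.proxyuser.{superuser}"
    return {
        f"{prefix}.hosts": ",".join(hosts),
        f"{prefix}.groups": ",".join(groups),
    }


def to_core_site_fragment(props):
    """Render the properties as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    props = proxyuser_properties(
        "datatap", ["compute1.example.com"], ["marketing", "rnd"]
    )
    print(to_core_site_fragment(props))
```

Listing explicit hosts and groups, rather than a wildcard, limits where the super-user can connect from and whom it can impersonate, which fits the talk's point that this approach needs tight LDAP/AD integration.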

Editor's notes

  1. Tom: There are many definitions of BDaaS. Some say it is the combination of software and data, which can be hard to grasp. We say it is a functionality stack:
  2. This is what the audit logs for direct connections look like. Bob and Alice have entries as highlighted above. Ranger authorization policies are enforced.
  3. Finally, to summarize the use of direct HDFS connections: it works best in a single Hadoop setup, with a single Hadoop distro, Kerberos everywhere, and tight coupling. You may not want to kerberize Dev/QA setups, so it may not be practical.
  4. This is a standard feature supported by Hadoop ecosystem components to access HDFS data. A super-user performs operations on behalf of other users. This is also known as impersonation. Typical configuration.
  5. This is what the audit logs for connections via super-users look like. Bob and Alice have entries as highlighted above. Note that Ranger policies are still enforced for Bob and Alice, as they are the effective users!
  6. Finally, let's look at the pros and cons of using super-users.
  7. Finally, let's demonstrate all this with Hue as an example. Here, Hue is running on one of the compute nodes in a multi-tenant environment. It is trying to access data in HDFS, for which Ranger policies are enforced. Also note that Hue is LDAP-integrated.
  8. Here, the HDFS path /user/secret has restricted access. Also, the HDFS umask is set to 077, so only the owner can access the data.
  9. This is how Ranger policies are defined for HDFS. We are defining who can access the /user/secret path. Describe users nanda, abhiraj
  10. In our product, the HDFS caching service (DataTap) also supports impersonation. We won’t go into its details in this talk. Typically, it is used to load remote HDFS backends as DataTaps, as shown in this picture.
  11. Using the Hive Editor in Hue, we create a table using the path provided. Explain dtap:// path. The user here is nanda, who has read/write permissions. This will succeed, as Ranger policies allow it.
  12. Now, the same user nanda queries the table and it succeeds. Note that even though the permissions are 000, Ranger allows access to nanda, so it goes through.
  13. Next, the same operation is performed by user abhiraj. Here it fails, because Ranger does not allow abhiraj to read. Thus, Ranger policies are enforced.
  14. Finally, this is what the audit logs look like. As you can see, nanda is allowed read access and abhiraj is denied access. This shows that even though we use impersonation from remote clusters, the policies are still enforced, because the effective users are still ‘nanda’ and ‘abhiraj’.
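
Notes 5 and 14 rest on one observation: in the audit log, the effective user is the impersonated user, not the super-user. The sketch below pulls both identities out of an audit entry. The sample lines are hypothetical, written in the tab-separated `key=value` layout that HDFS audit entries generally use; they are not copied from the talk's screenshots, and the super-user name `datatap` and the paths are placeholders.

```python
import re

# Hypothetical audit entries (assumed layout, not taken from the slides).
SAMPLE_LINES = [
    "allowed=true\tugi=nanda (auth:PROXY) via datatap (auth:SIMPLE)"
    "\tip=/10.0.0.5\tcmd=open\tsrc=/user/secret/data.csv\tdst=null\tperm=null",
    "allowed=false\tugi=abhiraj (auth:PROXY) via datatap (auth:SIMPLE)"
    "\tip=/10.0.0.5\tcmd=open\tsrc=/user/secret/data.csv\tdst=null\tperm=null",
]

# "effective (auth:X) via real (auth:Y)"; the "via" part is absent
# when the user connects directly.
UGI_RE = re.compile(
    r"^(?P<effective>\S+)(?: \(auth:\w+\))?(?: via (?P<real>\S+))?"
)


def parse_audit_line(line):
    """Extract the decision and the effective/real users from one entry."""
    # Skip any timestamp/logger prefix that precedes the first field.
    start = line.find("allowed=")
    body = line[start:] if start >= 0 else line
    fields = dict(
        part.split("=", 1) for part in body.split("\t") if "=" in part
    )
    match = UGI_RE.match(fields.get("ugi", ""))
    return {
        "allowed": fields.get("allowed") == "true",
        "effective_user": match.group("effective") if match else None,
        "real_user": match.group("real") if match else None,
        "cmd": fields.get("cmd"),
        "src": fields.get("src"),
    }


if __name__ == "__main__":
    for entry in map(parse_audit_line, SAMPLE_LINES):
        print(entry)
```

Grouping parsed entries by `effective_user` rather than `real_user` gives per-person access reports even when every request arrives through the same super-user.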