End-to-End Security and Auditing in a Big Data as a Service Deployment

Nanda Vijaydev - BlueData
Abhiraj Butala - BlueData

Big-Data-as-a-Service (BDaaS)
“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
On-Demand, Self-Service, Elastic Big Data Infrastructure, Applications, Analytics

Multi-Tenant Big-Data-as-a-Service
[Diagram: tenants Marketing, R&D, and Manufacturing (use cases: 360 Customer View, Log Analysis, Predictive Maintenance) run compute clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4 on top of shared Data/Storage (Data Lake, Staging).]
• Multiple compute services (Hadoop, BI, Spark)
• There is a shared Data Lake (shared HDFS)

Why BDaaS? – Compute Side Of The Story
• The set of applications that interact with Hadoop keeps growing
• Various versions of the same app/distro run in parallel
• Enterprises need to scale compute up and down based on usage
• The model is similar to AWS, with S3 as storage and applications running on EC2

Why BDaaS? – Data Side Of The Story
• Production cluster access takes time and is generally restricted
• Staging clusters may not have all the data
• Data exists on other storage systems such as NFS; Isilon is common
• Users also want to upload arbitrary files for analysis

Hadoop – A Collection Of Services
Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka

Security In Hadoop
• Authenticate the user into the Hadoop ecosystem
– Each service has its own integration with LDAP/AD for authentication
• Authorize users and limit their actions to selected services. Authorization is granted separately for each service. Examples:
– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’ and ‘-wx’ to user ‘bob’
– Column-level access on a Hive table: “Customer.Name” and “Customer.PhoneNumber” are accessible only to selected users and groups
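The HDFS example above can be modeled in a few lines. This is only an illustrative sketch of how per-user permission strings limit actions; it is not how HDFS implements its checks, and the function and dictionary names are hypothetical.

```python
# Illustrative model of the slide's example, not HDFS's real implementation.
# Per-user permission strings for the folder "/user/customer" (values from the slide).
FOLDER_PERMS = {
    "alice": "r-x",
    "bob": "-wx",
}


def is_allowed(user: str, action: str, perms_by_user=FOLDER_PERMS) -> bool:
    """Return True if `user` may perform `action` ('r', 'w' or 'x')."""
    if action not in ("r", "w", "x"):
        raise ValueError(f"unknown action: {action!r}")
    # Users without an entry get no access at all.
    perms = perms_by_user.get(user, "---")
    return action in perms


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        granted = [a for a in "rwx" if is_allowed(user, a)]
        print(user, granted)
```

In this model alice can read and list the folder but not write to it, while bob can write but not read.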

Ranger – A Pluggable Security Framework
• Ranger works with a common user DB (LDAP/AD) for authentication
• Provides a plug-in for individual Hadoop services to enable authorization
• Allows users to define policies in a central location, using the web UI or APIs
• Users can define their own plug-in for a custom service and manage it centrally via Ranger Admin
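As a rough illustration of defining a policy through the API rather than the web UI, the sketch below builds a JSON policy body for an HDFS path. The field names follow the policy model of Ranger's public v2 REST API as best understood here and should be checked against the Ranger version in use; the service name `datalake_hdfs`, the policy name, and the `marketing` group are hypothetical. Nothing is sent over the network.

```python
import json


def build_hdfs_policy(service, name, path, users, groups, access_types):
    """Build a Ranger-style policy body for one HDFS path (assumed field names)."""
    return {
        "service": service,          # name of the HDFS service registered in Ranger
        "name": name,
        "isEnabled": True,
        "isAuditEnabled": True,
        "resources": {
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [
            {
                "users": list(users),
                "groups": list(groups),
                "accesses": [
                    {"type": access, "isAllowed": True} for access in access_types
                ],
            }
        ],
    }


if __name__ == "__main__":
    policy = build_hdfs_policy(
        service="datalake_hdfs",     # hypothetical service name
        name="marketing-customer",   # hypothetical policy name
        path="/user/customer",
        users=["alice"],
        groups=["marketing"],        # hypothetical group
        access_types=["read", "execute"],
    )
    print(json.dumps(policy, indent=2))
```

A body like this would typically be submitted to Ranger Admin's policy endpoint with an authenticated HTTP POST; the exact URL and credentials depend on the deployment.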

Defining HDFS Ranger Policies
[Screenshots: HDFS policy list; Marketing policy drill-down]

Security Considerations in BDaaS
[Diagram: compute clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) above shared Data/Storage (Data Lake, Staging)]
1. User identity within the Data Lake
2. User identity in the application layer
3. User identity propagation to the data layer: prevent data duplication and maintain user integrity across layers
1. Securing The Data Lake
[Diagram labels: LDAP, KDC; Data/Storage (Data Lake, Staging); compute clusters Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4]
Authentication & Authorization – Data Lake

2. Securing The App Layer
[Diagram labels: LDAP, KDC; users Alice, Bob, Tom; Data/Storage (Data Lake, Staging); compute clusters Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4]
App containers are integrated with LDAP

3. Identity Propagation to Data Layer
[Diagram labels: LDAP, KDC; users Alice, Bob, Tom; Data/Storage (Data Lake, Staging); compute clusters Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4]

User Identity Propagation
Two ways:
• Users connect directly to HDFS
– Simple authentication
– Kerberos authentication
• Users connect to HDFS via a super-user (impersonation)

HDFS Direct Connections
[Diagram: users Alice, Bob, and Tom on the compute clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) connect directly to the HDFS Data Lake; LDAP and KDC are also shown.]

HDFS Direct Connections..
– hdfs-audit.log
– Ranger policies are enforced for alice and bob, as they are the effective users

HDFS Direct Connections..
• Single Hadoop Setup
– Ideal
• Multi-tenant, Multi-application Setup
– Kerberized HDFS needs Kerberized compute and services
– May not want to Kerberize Dev/QA setups
– Hadoop versions must be compatible across all clusters
– Risk of data duplication

HDFS Super-user Connections
• Super-users perform actions on behalf of other users (impersonation/proxying)
• Adding a new super-user is easy
– Configured in core-site.xml
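As a sketch of what that core-site.xml change can look like, the snippet below renders the two standard proxy-user properties, `hadoop.proxyuser.<name>.hosts` and `hadoop.proxyuser.<name>.groups`. The super-user name `datatap`, the host name, and the group are hypothetical placeholders, not values from the talk.

```python
from xml.sax.saxutils import escape


def proxyuser_properties(superuser, hosts, groups):
    """Return the core-site.xml properties that let `superuser` impersonate others."""
    return {
        # Hosts from which the super-user may submit impersonated requests.
        f"hadoop.proxyuser.{superuser}.hosts": ",".join(hosts),
        # Groups whose members the super-user may impersonate.
        f"hadoop.proxyuser.{superuser}.groups": ",".join(groups),
    }


def render_core_site_fragment(props):
    """Render properties as a <configuration> fragment for core-site.xml."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    props = proxyuser_properties(
        superuser="datatap",                # hypothetical super-user name
        hosts=["gateway1.example.com"],     # hypothetical host
        groups=["marketing"],               # hypothetical group
    )
    print(render_core_site_fragment(props))
```

Restricting hosts and groups to explicit values, rather than the wildcard `*`, limits where the super-user can impersonate from and whom it can impersonate.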

HDFS Super-user Connections..
[Diagram: users Alice, Bob, and Tom on the compute clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) reach the HDFS Data Lake through the DataTap Caching Service, via a super-user; LDAP and KDC are also shown.]

– hdfs-audit.log
– Ranger authorization policies are still enforced, as alice and bob are the effective users
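To check who the effective user was for a given operation, the `ugi` field of an hdfs-audit.log entry can be parsed. The sketch below assumes the common tab-separated `key=value` layout and the `user (auth:PROXY) via superuser (auth:...)` form for impersonated calls; the sample lines are made up for illustration, and the exact layout can vary between Hadoop versions.

```python
import re

# "alice (auth:PROXY) via datatap (auth:SIMPLE)" or just "bob (auth:KERBEROS)"
UGI_RE = re.compile(
    r"^(?P<effective>\S+) \(auth:\w+\)(?: via (?P<real>\S+) \(auth:\w+\))?"
)


def parse_audit_line(line):
    """Extract the effective user, real (super-)user, command and path."""
    _, _, payload = line.partition("FSNamesystem.audit: ")
    fields = dict(
        part.split("=", 1) for part in payload.strip().split("\t") if "=" in part
    )
    match = UGI_RE.match(fields.get("ugi", ""))
    if match is None:
        raise ValueError("unrecognised audit line")
    return {
        "effective_user": match["effective"],
        "real_user": match["real"],  # None for direct connections
        "cmd": fields.get("cmd"),
        "src": fields.get("src"),
        "allowed": fields.get("allowed") == "true",
    }


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from the talk.
    direct = (
        "2016-06-28 10:15:02,123 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=bob (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\t"
        "src=/user/customer/data.csv\tdst=null\tperm=null"
    )
    proxied = (
        "2016-06-28 10:16:40,456 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=alice (auth:PROXY) via datatap (auth:SIMPLE)\tip=/10.0.0.7\t"
        "cmd=listStatus\tsrc=/user/customer\tdst=null\tperm=null"
    )
    print(parse_audit_line(direct))
    print(parse_audit_line(proxied))
```

In both cases the effective user is the end user (alice or bob), which is the identity that authorization policies are evaluated against.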

Multi-tenant, Multi-application Setup
– Works for applications that don’t support Kerberos (yet)
– Dev/Test setups need not be Kerberized
– DataTap service can abstract version incompatibilities
– Can help avoid data duplication
– Requires tight LDAP/AD integration, though

HDFS Permissions on Data Lake
• Set HDFS file access for ‘/user/secret’ to strict mode
• Set umask to ‘077’
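The effect of a ‘077’ umask can be worked out with a little bit arithmetic. The sketch below assumes the usual starting modes of 777 for new directories and 666 for new files, with the umask clearing the masked bits; in HDFS the umask is normally set through the `fs.permissions.umask-mode` property. The helper names are hypothetical.

```python
def apply_umask(requested_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask


def to_symbolic(mode: int) -> str:
    """Render the low nine permission bits as an 'rwxrwxrwx'-style string."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )


if __name__ == "__main__":
    umask = 0o077
    new_dir = apply_umask(0o777, umask)
    new_file = apply_umask(0o666, umask)
    print(oct(new_dir), to_symbolic(new_dir))    # 0o700 rwx------
    print(oct(new_file), to_symbolic(new_file))  # 0o600 rw-------
```

With this umask, newly created files and directories are accessible only to their owner, so group and other users get no access unless it is granted explicitly.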

Key Takeaways
• BDaaS is more than Hadoop-as-a-Service
– Includes BI / ETL / Analytics + Data Science tools
• Security is an important consideration in BDaaS
• Data duplication is not an option
• Global user authentication using a centralized DB like LDAP/AD is a must
• Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly

Q & A
www.bluedata.com
Nanda Vijaydev
@nandavijaydev
Abhiraj Butala
@abhirajbutala
