SlideShare una empresa de Scribd logo
1 de 35
Securing Spark Applications
Kostas Sakellis
Marcelo Vanzin
What is Security?
• Security has many facets
• This talk will focus on three areas:
– Encryption
– Authentication
– Authorization
Why do I need security?
• Multi-tenancy
• Application isolation
• User identification
• Access control enforcement
• Compliance with government regulations
Before we go further...
• Set up Kerberos
• Use HDFS (or another secure filesystem)
• Use YARN!
• Configure them for security (enable auth, encryption).
Kerberos, HDFS, and YARN provide the security backbone
for Spark.
Encryption
• In a secure cluster, data should not be visible in the clear
• Very important to financial / government institutions
What a Spark app looks like
RM NM NM
AM / Driver Executor
Executor
SparkSubmit
Control RPC
File Download
Shuffle / Cached Blocks
Shuffle
Service
Shuffle
Service
Shuffle Blocks
UI
Shuffle Blocks / Metadata
Data Flow in Spark
Every connection in the previous slide can transmit sensitive
data!
• Input data transmitted via broadcast variables
• Computed data during shuffles
• Data in serialized tasks, files uploaded with the job
How to prevent other users from seeing this data?
Encryption in Spark
• Almost all channels support encryption.
– Exception 1: UI (SPARK-2750)
– Exception 2: local shuffle / cache files (SPARK-5682)
For local files, set up YARN local dirs to point at local
encrypted disk(s) if desired. (SPARK-5682)
Encryption: Current State
Different channel, different method.
• Shuffle protocol uses SASL
• RPC / File download use SSL
SSL can be hard to set up.
• Need certificates readable on every node
• Sharing certificates not as secure
• Hard to have per-user certificate
Encryption: The Goal
SASL everywhere for wire encryption (except UI).
• Minimum configuration (one boolean config)
• Uses built-in JVM libraries
• SPARK-6017
For UI:
• Support for SSL
• Or audit UI to remove sensitive info (e.g. information on
environment page).
Authentication
Who is reading my data?
• Spark uses Kerberos
– the necessary evil
• Ubiquitous among other services
– YARN, HDFS, Hive, HBase etc.
Who’s reading my data?
Kerberos provides secure authentication.
KDC
Application
Hi I’m Bob.
Hello Bob. Here’s your TGT.
Here’s my TGT. I want to talk to HDFS.
Here’s your HDFS ticket.
User
Now with a distributed app...
KDC
Executor
Executor
Executor
Executor
Executor
Executor
Executor
Executor
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Something
is wrong.
Kerberos in Hadoop / Spark
KDCs do not allow multiple concurrent logins at the scale
distributed applications need. Hadoop services use
delegation tokens instead.
Driver
NameNode
Executor
DataNode
Delegation Tokens
Like Kerberos tickets, they have a TTL.
• OK for most batch applications.
• Not OK for long running applications
– Streaming
– Spark SQL Thrift Server
Delegation Tokens
Since 1.4, Spark can manage delegation tokens!
• Restricted to HDFS currently
• Requires user’s keytab to be deployed with application
• Still some remaining issues in client deploy mode
Authorization
How can I share my data?
Simplest form of authorization: file permissions.
• Use Unix-style permissions or ACLs to let others read
from and / or write to files and directories
• Simple, but high maintenance. Set permissions /
ownership for new files, mess with umask, etc.
More than just FS semantics...
Authorization becomes more complicated as abstractions
are created.
• Tables, columns, partitions instead of files and
directories
• Semantic gap
• Need a trusted entity to enforce access control
Trusted Service: Hive
Hive has a trusted service (“HiveServer2”) for enforcing
authorization.
• HS2 parses queries and makes sure users have access
to the data they’re requesting / modifying.
HS2 runs as a trusted user with access to the whole
warehouse. Users don’t run code directly in HS2*, so there’s
no danger of code escaping access checks.
Untrusted Apps: Spark
Each Spark app runs as the requesting user, and needs
access to the underlying files.
• Spark itself cannot enforce access control, since it’s
running as the user and is thus untrusted.
• Restricted to file system permission semantics.
How to bridge the two worlds?
Apache Sentry
• Role-based access control to resources
• Integrates with Hive / HS2 to control access to data
• Fine-grained (up to column level) controls
Hive data and HDFS data have different semantics. How to
bridge that?
The Sentry HDFS Plugin
Synchronize HDFS file permissions with higher-level
abstractions.
• Permission to read table = permission to read table’s
files
• Permission to create table = permission to write to
database’s directory
Uses HDFS ACLs for fine-grained user permissions.
Still restricted to FS view of the world!
• Files, directories, etc…
• Cannot provide column-level and row-level access
control.
• Whole table or nothing.
Still, it goes a long way in allowing Spark applications to
work well with Hive data in a shared, secure environment.
But...
Future: RecordService
A distributed, scalable, data access service for unified
authorization in Hadoop.
RecordService
RecordService
• Drop in replacement for InputFormats
• SparkSQL: Integration with Data Sources API
– Predicate pushdown, projection
RecordService
• Assume we had a table tpch.nation
column_name column_type
n_nationkey smallint
n_name string
n_regionkey smallint
n_comment string
import com.cloudera.recordservice.spark._
val context = new org.apache.spark.sql.SQLContext(sc)
val df = context.load("tpch.nation",
"com.cloudera.recordservice.spark")
val results = df.groupBy("n_regionkey")
.count()
.collect()
RecordService
RecordService
• Users can enforce Sentry permissions using views
• Allows column and row level security
> CREATE ROLE restrictedrole;
> GRANT ROLE restrictedrole to GROUP restrictedgroup;
> USE tpch;
> CREATE VIEW nation_names AS
SELECT n_nationkey, n_name
FROM tpch.nation;
> GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
...
val df = context.load("tpch.nation",
"com.cloudera.recordservice.spark")
val results = df.collect()
>> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan
request., detail:AuthorizationException: User 'kostas' does not have
privileges to execute 'SELECT' on: tpch.nation)
RecordService
...
val df = context.load("tpch.nation_names",
"com.cloudera.recordservice.spark")
val results = df.collect()
RecordService
RecordService
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Beta Download:
http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservic
e/0-1-0.html
Takeaways
• Spark can be made secure today!
• Benefits from a lot of existing Hadoop platform work
• Still work to be done
– Ease of use
– Better integration with Sentry / RecordService
References
• Encryption: SPARK-6017, SPARK-5682
• Delegation tokens: SPARK-5342
• Sentry: http://sentry.apache.org/
– HDFS synchronization: SENTRY-432
• RecordService:
http://cloudera.github.io/RecordServiceClient/
Thanks!
Questions?

Más contenido relacionado

La actualidad más candente

しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニングしばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニングオラクルエンジニア通信
 
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi FukuiInsight Technology, Inc.
 
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...オラクルエンジニア通信
 
SparkとCassandraの美味しい関係
SparkとCassandraの美味しい関係SparkとCassandraの美味しい関係
SparkとCassandraの美味しい関係datastaxjp
 
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -歩 柴田
 
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
Elasticsearch as a Distributed System
Elasticsearch as a Distributed SystemElasticsearch as a Distributed System
Elasticsearch as a Distributed SystemSatoyuki Tsukano
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
SAP HANAのソースエンドポイントとしての利用
SAP HANAのソースエンドポイントとしての利用SAP HANAのソースエンドポイントとしての利用
SAP HANAのソースエンドポイントとしての利用QlikPresalesJapan
 
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...NTT DATA Technology & Innovation
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3オラクルエンジニア通信
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Java仮想マシンの実装技術
Java仮想マシンの実装技術Java仮想マシンの実装技術
Java仮想マシンの実装技術Kiyokuni Kawachiya
 
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...NTT DATA Technology & Innovation
 
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)Hironobu Suzuki
 

La actualidad más candente (20)

しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニングしばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
 
Vacuum徹底解説
Vacuum徹底解説Vacuum徹底解説
Vacuum徹底解説
 
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui
[C13] フラッシュドライブで挑むOracle超高速化と信頼性の両立 by Masashi Fukui
 
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
 
SparkとCassandraの美味しい関係
SparkとCassandraの美味しい関係SparkとCassandraの美味しい関係
SparkとCassandraの美味しい関係
 
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -
まだ統計固定で消耗してるの? - Bind Peek をもっと使おうぜ! 2015 Edition -
 
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)
レプリケーション遅延の監視について(第40回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
Elasticsearch as a Distributed System
Elasticsearch as a Distributed SystemElasticsearch as a Distributed System
Elasticsearch as a Distributed System
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
SAP HANAのソースエンドポイントとしての利用
SAP HANAのソースエンドポイントとしての利用SAP HANAのソースエンドポイントとしての利用
SAP HANAのソースエンドポイントとしての利用
 
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3
しばちょう先生が語る!オラクルデータベースの進化の歴史と最新技術動向#3
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Java仮想マシンの実装技術
Java仮想マシンの実装技術Java仮想マシンの実装技術
Java仮想マシンの実装技術
 
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...
PostgreSQL開発コミュニティに参加しよう! ~2022年版~(Open Source Conference 2022 Online/Kyoto 発...
 
Oracle Spatial 概要説明資料
Oracle Spatial 概要説明資料Oracle Spatial 概要説明資料
Oracle Spatial 概要説明資料
 
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
 

Similar a Securing Your Apache Spark Applications

IBM Spectrum Scale Security
IBM Spectrum Scale Security IBM Spectrum Scale Security
IBM Spectrum Scale Security Sandeep Patil
 
BigData Security - A Point of View
BigData Security - A Point of ViewBigData Security - A Point of View
BigData Security - A Point of ViewKaran Alang
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big DataRommel Garcia
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big DataGreat Wide Open
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDataWorks Summit
 
Owasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet OverviewOwasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet Overviewowaspindy
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityChris Nauroth
 
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010Cloudera, Inc.
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
Hadoop Security: Overview
Hadoop Security: OverviewHadoop Security: Overview
Hadoop Security: OverviewCloudera, Inc.
 
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCombat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCloudera, Inc.
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_securityAdam Muise
 
Geek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the InternetGeek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the InternetGeekNightHyderabad
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014Akash Mahajan
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access SecurityCloudera, Inc.
 

Similar a Securing Your Apache Spark Applications (20)

Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
 
IBM Spectrum Scale Security
IBM Spectrum Scale Security IBM Spectrum Scale Security
IBM Spectrum Scale Security
 
BigData Security - A Point of View
BigData Security - A Point of ViewBigData Security - A Point of View
BigData Security - A Point of View
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Owasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet OverviewOwasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet Overview
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
From 0 to syncing
From 0 to syncingFrom 0 to syncing
From 0 to syncing
 
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Hadoop Security: Overview
Hadoop Security: OverviewHadoop Security: Overview
Hadoop Security: Overview
 
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCombat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
Geek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the InternetGeek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the Internet
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big data security
Big data securityBig data security
Big data security
 
Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...software pro Development
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 

Último (20)

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 

Securing Your Apache Spark Applications

  • 1. Securing Spark Applications Kostas Sakellis Marcelo Vanzin
  • 2. What is Security? • Security has many facets • This talk will focus on three areas: – Encryption – Authentication – Authorization
  • 3. Why do I need security? • Multi-tenancy • Application isolation • User identification • Access control enforcement • Compliance with government regulations
  • 4. Before we go further... • Set up Kerberos • Use HDFS (or another secure filesystem) • Use YARN! • Configure them for security (enable auth, encryption). Kerberos, HDFS, and YARN provide the security backbone for Spark.
  • 5. Encryption • In a secure cluster, data should not be visible in the clear • Very important to financial / government institutions
  • 6. What a Spark app looks like RM NM NM AM / Driver Executor Executor SparkSubmit Control RPC File Download Shuffle / Cached Blocks Shuffle Service Shuffle Service Shuffle Blocks UI Shuffle Blocks / Metadata
  • 7. Data Flow in Spark Every connection in the previous slide can transmit sensitive data! • Input data transmitted via broadcast variables • Computed data during shuffles • Data in serialized tasks, files uploaded with the job How to prevent other users from seeing this data?
  • 8. Encryption in Spark • Almost all channels support encryption. – Exception 1: UI (SPARK-2750) – Exception 2: local shuffle / cache files (SPARK-5682) For local files, set up YARN local dirs to point at local encrypted disk(s) if desired. (SPARK-5682)
  • 9. Encryption: Current State Different channel, different method. • Shuffle protocol uses SASL • RPC / File download use SSL SSL can be hard to set up. • Need certificates readable on every node • Sharing certificates not as secure • Hard to have per-user certificate
  • 10. Encryption: The Goal SASL everywhere for wire encryption (except UI). • Minimum configuration (one boolean config) • Uses built-in JVM libraries • SPARK-6017 For UI: • Support for SSL • Or audit UI to remove sensitive info (e.g. information on environment page).
  • 11. Authentication Who is reading my data? • Spark uses Kerberos – the necessary evil • Ubiquitous among other services – YARN, HDFS, Hive, HBase etc.
  • 12. Who’s reading my data? Kerberos provides secure authentication. KDC Application Hi I’m Bob. Hello Bob. Here’s your TGT. Here’s my TGT. I want to talk to HDFS. Here’s your HDFS ticket. User
  • 13. Now with a distributed app... KDC Executor Executor Executor Executor Executor Executor Executor Executor Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Something is wrong.
  • 14. Kerberos in Hadoop / Spark KDCs do not allow multiple concurrent logins at the scale distributed applications need. Hadoop services use delegation tokens instead. Driver NameNode Executor DataNode
  • 15. Delegation Tokens Like Kerberos tickets, they have a TTL. • OK for most batch applications. • Not OK for long running applications – Streaming – Spark SQL Thrift Server
  • 16. Delegation Tokens Since 1.4, Spark can manage delegation tokens! • Restricted to HDFS currently • Requires user’s keytab to be deployed with application • Still some remaining issues in client deploy mode
  • 17. Authorization How can I share my data? Simplest form of authorization: file permissions. • Use Unix-style permissions or ACLs to let others read from and / or write to files and directories • Simple, but high maintenance. Set permissions / ownership for new files, mess with umask, etc.
  • 18. More than just FS semantics... Authorization becomes more complicated as abstractions are created. • Tables, columns, partitions instead of files and directories • Semantic gap • Need a trusted entity to enforce access control
  • 19. Trusted Service: Hive Hive has a trusted service (“HiveServer2”) for enforcing authorization. • HS2 parses queries and makes sure users have access to the data they’re requesting / modifying. HS2 runs as a trusted user with access to the whole warehouse. Users don’t run code directly in HS2*, so there’s no danger of code escaping access checks.
  • 20. Untrusted Apps: Spark Each Spark app runs as the requesting user, and needs access to the underlying files. • Spark itself cannot enforce access control, since it’s running as the user and is thus untrusted. • Restricted to file system permission semantics. How to bridge the two worlds?
  • 21. Apache Sentry • Role-based access control to resources • Integrates with Hive / HS2 to control access to data • Fine-grained (up to column level) controls Hive data and HDFS data have different semantics. How to bridge that?
  • 22. The Sentry HDFS Plugin Synchronize HDFS file permissions with higher-level abstractions. • Permission to read table = permission to read table’s files • Permission to create table = permission to write to database’s directory Uses HDFS ACLs for fine-grained user permissions.
  • 23. Still restricted to FS view of the world! • Files, directories, etc… • Cannot provide column-level and row-level access control. • Whole table or nothing. Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment. But...
  • 24. Future: RecordService A distributed, scalable, data access service for unified authorization in Hadoop.
  • 26. RecordService • Drop in replacement for InputFormats • SparkSQL: Integration with Data Sources API – Predicate pushdown, projection
  • 27. RecordService • Assume we had a table tpch.nation column_name column_type n_nationkey smallint n_name string n_regionkey smallint n_comment string
  • 28. import com.cloudera.recordservice.spark._ val context = new org.apache.spark.sql.SQLContext(sc) val df = context.load("tpch.nation", "com.cloudera.recordservice.spark") val results = df.groupBy("n_regionkey") .count() .collect() RecordService
  • 29. RecordService • Users can enforce Sentry permissions using views • Allows column and row level security > CREATE ROLE restrictedrole; > GRANT ROLE restrictedrole to GROUP restrictedgroup; > USE tpch; > CREATE VIEW nation_names AS SELECT n_nationkey, n_name FROM tpch.nation; > GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
  • 30. ... val df = context.load("tpch.nation", "com.cloudera.recordservice.spark") val results = df.collect() >> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan request., detail:AuthorizationException: User 'kostas' does not have privileges to execute 'SELECT' on: tpch.nation) RecordService
  • 31. ... val df = context.load("tpch.nation_names", "com.cloudera.recordservice.spark") val results = df.collect() RecordService
  • 32. RecordService • Documentation: http://cloudera.github.io/RecordServiceClient/ • Beta Download: http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservic e/0-1-0.html
  • 33. Takeaways • Spark can be made secure today! • Benefits from a lot of existing Hadoop platform work • Still work to be done – Ease of use – Better integration with Sentry / RecordService
  • 34. References • Encryption: SPARK-6017, SPARK-5682 • Delegation tokens: SPARK-5342 • Sentry: http://sentry.apache.org/ – HDFS synchronization: SENTRY-432 • RecordService: http://cloudera.github.io/RecordServiceClient/