Running Apache Spark & Apache Zeppelin in Production
Vinay Shukla
Director, Product Management
August 31, 2016
Twitter: @neomythos
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who am I?
• Product Management
• Spark for 2.5+ years, Hadoop for 3+ years
• Recovering Programmer
• Blog at www.vinayshukla.com
• Twitter: @neomythos
• Addicted to Yoga, Hiking, & Coffee
• Smallest contributor to Apache Zeppelin
Vinay Shukla
Security: Rings of Defense
Perimeter Level Security
• Network Security (e.g. Firewalls)
Data Protection
• Wire encryption
• HDFS TDE/DARE
• Others
Authentication
• Kerberos
• Knox (Other Gateways)
OS Security
Authorization
• Apache Ranger/Sentry
• HDFS Permissions
• HDFS ACLs
• YARN ACL
Interacting with Spark
[Diagram: Zeppelin, Spark-Shell, the Spark Thrift Server, and a REST Server, each with its own Driver, connect to executors (Ex) running in Spark on YARN.]
Context: Spark Deployment Modes
• Spark on YARN
– Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
– Spark driver (SparkContext) runs locally on the client (yarn-client)
• Spark Shell & Spark Thrift Server run in yarn-client mode only
[Diagram: in YARN-Client mode the Spark Driver runs in the Client, separate from the App Master and Executor; in YARN-Cluster mode the Spark Driver runs inside the App Master.]
Spark on YARN
[Diagram: John Doe runs Spark Submit against the Hadoop Cluster; steps numbered 1 to 4 connect the YARN RM, Node Manager, Spark AM, Executor, and HDFS.]
Spark – Security – Four Pillars
• Authentication
• Authorization
• Audit
• Encryption
Spark leverages Kerberos on YARN
Ensure network is secure
Kerberos authentication within Spark
[Diagram: John Doe, the KDC, and a Hadoop Cluster containing the Spark AM and NameNode (NN), with steps numbered 1 to 7. Step labels, in approximate order:]
• kinit
• Get service ticket for Spark
• Use Spark ST, submit Spark job
• Spark gets NameNode (NN) service ticket
• YARN launches Spark executors using John Doe's identity
• Executor reads from HDFS using John Doe's delegation token
Spark – Kerberos - Example
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster --num-executors 3 --driver-memory 512m \
--executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Spark – Authorization
[Diagram: John Doe, the KDC, a YARN Cluster (A, B, C), HDFS, and Ranger. Step labels:]
• Client gets service ticket for Spark
• Use Spark ST, submit Spark job
• Get NameNode (NN) service ticket
• Executors read from HDFS
• Ranger checks: Can John launch this job? Can John read this file?
Encryption: Spark – Communication Channels
[Diagram: Spark Submit, RM, NM, AM/Driver, Shuffle Service, executors (Ex 1 to Ex N), and the Data Source, connected by these channels:]
• Control/RPC
• Shuffle Data
• Shuffle BlockTransfer
• Read/Write Data
• FS – Broadcast, File Download
Spark Communication Encryption Settings
Channel: setting
• Control/RPC: spark.authenticate = true (leverage YARN to distribute keys)
• Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
• Shuffle Data: NM > Ex leverages YARN-based SSL
• Read/Write Data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
• FS – Broadcast, File Download: spark.ssl.enabled = true
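The Spark-side properties on this slide could be gathered into a spark-defaults.conf fragment. The following is a minimal Python sketch, not a complete security configuration: the helper name `render_spark_defaults` is hypothetical, the channel each property is paired with is an assumption read off the slide layout, and the HDFS and YARN settings mentioned on the slide live in their own configuration files and are not covered here.

```python
# Spark properties named on the slide, keyed by the channel they are
# assumed to protect (the pairing is inferred from the slide layout).
SPARK_ENCRYPTION_SETTINGS = {
    "Control/RPC": ("spark.authenticate", "true"),
    "Shuffle BlockTransfer": ("spark.authenticate.enableSaslEncryption", "true"),
    "FS - Broadcast, File Download": ("spark.ssl.enabled", "true"),
}


def render_spark_defaults(settings):
    """Return spark-defaults.conf text: a comment line, then 'key value'."""
    lines = []
    for channel, (key, value) in settings.items():
        lines.append(f"# {channel}")
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_spark_defaults(SPARK_ENCRYPTION_SETTINGS), end="")
```

Keeping the channel name next to each property makes it easier to see which communication paths are still unencrypted.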
Sharp Edges with Spark Security
• SparkSQL – Only coarse-grained access control today
• Client > Spark Thrift Server > Spark Executors – No identity propagation on the 2nd hop
– Lowers security, forces STS to run as Hive user to read all data
– Use SparkSQL via shell or programmatic API instead
– https://issues.apache.org/jira/browse/SPARK-5159
• Spark Streaming + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet
• Spark Shuffle > Only SASL, no SSL support
• Spark Shuffle > No encryption for spill to disk or intermediate data
Fine-grained security for SparkSQL:
http://bit.ly/2bLghGz
http://bit.ly/2bTX7Pm
Apache Zeppelin Security
Apache Zeppelin: Authentication + SSL
[Diagram: John Doe reaches Zeppelin through a Firewall over SSL; Zeppelin authenticates against LDAP and runs work on executors (Ex) in Spark on YARN; steps numbered 1 to 3.]
Zeppelin + Livy E2E Security
[Diagram: John Doe authenticates to Zeppelin via LDAP; Zeppelin's Livy interpreter (labeled "Ispark Group Interpreter") calls the Livy APIs using SPNego: Kerberos; Livy talks to Spark on Yarn over Kerberos/RPC; the job runs as John Doe.]
Apache Zeppelin: Authorization
Authorization in Zeppelin
• Note level authorization
• Grant permissions (Owner, Reader, Writer) to users/groups on Notes
• LDAP group integration
• Zeppelin UI authorization
• Allow only admins to configure interpreters
• Configured in shiro.ini
Authorization at Data Level
• For Spark with Zeppelin > Livy > Spark
– Identity propagation: jobs run as end-user
• For Hive with Zeppelin > JDBC interpreter
• Shell interpreter
– Runs as end-user
Example shiro.ini entries:
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
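A quick way to confirm that the admin-only endpoints stay locked down after someone edits shiro.ini is to parse the [urls] section and check the filter chain on each pattern. The sketch below is a simplified reader written for illustration in Python; it is not Shiro's own parser, and the function names are hypothetical.

```python
SHIRO_URLS = """\
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
"""


def parse_urls_section(ini_text):
    """Return {url_pattern: [filters]} for the [urls] section, in file order."""
    rules = {}
    in_urls = False
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            in_urls = (line == "[urls]")  # track which section we are in
            continue
        if in_urls and "=" in line:
            pattern, filters = line.split("=", 1)
            rules[pattern.strip()] = [f.strip() for f in filters.split(",")]
    return rules


def admin_only(rules, pattern):
    """True if the pattern requires both authentication and the admin role."""
    filters = rules.get(pattern, [])
    return "authc" in filters and "roles[admin]" in filters


if __name__ == "__main__":
    parsed = parse_urls_section(SHIRO_URLS)
    for url_pattern in parsed:
        print(url_pattern, admin_only(parsed, url_pattern))
```

Because Shiro evaluates URL patterns in the order they are declared, the parser preserves file order so that a broader pattern placed earlier can be spotted during review.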
Apache Zeppelin: Credentials
Credentials in Zeppelin
• LDAP/AD account
– Zeppelin leverages Hadoop Credential API
• Interpreter credentials
– Not solved yet; this is still an open issue
Apache Zeppelin: AD Authentication
Configure Zeppelin to authenticate users
Zeppelin leverages Apache Shiro for authentication/authorization
1. /etc/zeppelin/conf/shiro.ini
[urls]
/api/version = anon
/** = authc
Apache Zeppelin: AD Authentication
Active Directory Authentication
1. Create an entry for the AD credential (skip this step if securing the LDAP password is not an issue)
– Zeppelin leverages Hadoop Credential API
– > hadoop credential create activeDirectoryRealm.systemPassword -provider jceks://file/etc/zeppelin/conf/credentials.jceks
– chmod 400 the credentials file so only the Zeppelin process user can read it; no other user is allowed access
2. Configure Zeppelin to use AD
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
#activeDirectoryRealm.systemPassword = Password1!
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://file/etc/zeppelin/conf/credentials.jceks
activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389
activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin"
activeDirectoryRealm.authorizationCachingEnabled = true
Spark Performance
#1 Big or Small Executor?
spark-submit --master yarn \
--deploy-mode client \
--num-executors ? \
--executor-cores ? \
--executor-memory ? \
--class MySimpleApp \
mySimpleApp.jar \
arg1 arg2
Show the Details
• Cluster resource: 72 cores + 288 GB memory
• 3 nodes, each with 24 cores and 96 GB memory
Benchmark with Different Combinations
Cores per E#    Memory per E (GB)    E per node
1               6                    18
2               12                   9
3               18                   6
6               36                   3
9               54                   2
18              108                  1
# E stands for Executor
18 cores, 108 GB memory per node
[Chart: benchmark result for each combination; the lower the better]
Except for executors that are too small or too big, performance is relatively close
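The combinations in the benchmark table follow from dividing a node's cores and memory evenly among its executors. A minimal Python sketch of that arithmetic, assuming the 18 cores and 108 GB per node quoted on the slide; the function name `executor_layouts` is hypothetical.

```python
def executor_layouts(cores_per_node, memory_gb_per_node, cores_per_executor_options):
    """Return (cores per executor, memory per executor in GB, executors per node)."""
    layouts = []
    for cores in cores_per_executor_options:
        if cores_per_node % cores != 0:
            continue  # skip layouts that would leave cores idle
        executors = cores_per_node // cores
        memory_gb = memory_gb_per_node // executors  # split memory evenly
        layouts.append((cores, memory_gb, executors))
    return layouts


if __name__ == "__main__":
    for cores, memory_gb, executors in executor_layouts(18, 108, [1, 2, 3, 6, 9, 18]):
        print(f"{cores:>2} cores x {memory_gb:>3} GB -> {executors:>2} executors per node")
```

Each row keeps the total cores and memory per node constant, so the benchmark isolates the effect of executor size rather than total resources.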
Big or Small Executor?
• Avoid executors that are too small or too big.
– Small executors decrease CPU and memory efficiency.
– Big executors introduce heavy GC overhead.
• Usually 3 to 6 cores and 10 to 40 GB of memory per executor is a preferable choice.
Anything Else?
• Executor memory != container memory
• Container memory = executor memory + overhead memory (10% of executor memory)
• Leave some resources for the OS and other services
• Enable CPU scheduling if you want to constrain CPU usage#
# CPU scheduling - http://hortonworks.com/blog/managing-cpu-resources-in-y
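The container-memory rule above can be turned into a small sizing calculation. This is a rough Python sketch under stated assumptions: the 384 MB overhead floor is the default for Spark on YARN in the releases of this period as far as I know, the 8 GB reserved for the OS and other services is an illustrative figure only, and YARN's rounding of requests to its allocation increment is ignored.

```python
OVERHEAD_PERCENT = 10    # the 10% figure quoted on the slide
MIN_OVERHEAD_MB = 384    # assumption: default overhead floor for Spark on YARN


def container_memory_mb(executor_memory_mb):
    """Executor memory plus overhead, roughly what YARN is asked to allocate."""
    overhead_mb = max(MIN_OVERHEAD_MB, executor_memory_mb * OVERHEAD_PERCENT // 100)
    return executor_memory_mb + overhead_mb


def executors_per_node(node_memory_mb, reserved_mb, executor_memory_mb):
    """Executor containers that fit after setting memory aside for the OS and other services."""
    usable_mb = node_memory_mb - reserved_mb
    return max(0, usable_mb // container_memory_mb(executor_memory_mb))


if __name__ == "__main__":
    # A 96 GB node with 8 GB reserved (illustrative) and 10 GB executors.
    print(container_memory_mb(10 * 1024))
    print(executors_per_node(96 * 1024, 8 * 1024, 10 * 1024))
```

Sizing from the container figure rather than the executor figure avoids requesting more executors than a node can actually host.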
Multi-tenancy for Spark
• Leverage YARN queues
– Set user quotas
– Set default yarn-queue in spark-defaults
– User can override for each job
• Leverage Dynamic Resource Allocation
– Specify range of executors a job uses
– This requires the external shuffle service
[Chart: Cluster resource utilization]
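The queue and dynamic-allocation settings above map onto a handful of Spark properties. A minimal Python sketch that assembles them as `--conf` arguments for spark-submit; the function names and the queue name `analytics` are hypothetical, and the YARN-side work of configuring queues and enabling the shuffle service on each NodeManager is assumed to be done separately.

```python
def multi_tenant_conf(queue, min_executors, max_executors):
    """Spark properties for a per-job YARN queue and an executor range."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.yarn.queue": queue,
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Dynamic allocation needs the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf):
    """Flatten the properties into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(multi_tenant_conf("analytics", 2, 20))))
```

Passing the queue per job is the override path; the cluster-wide default would instead set `spark.yarn.queue` in spark-defaults.conf.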
Thank You
Vinay Shukla
@neomythos

Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

Running Apache Spark & Apache Zeppelin in Production

  • 1. Running Apache Spark & Apache Zeppelin in Production. Vinay Shukla, Director, Product Management. August 31, 2016. Twitter: @neomythos
  • 2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved. Who am I? Vinay Shukla: Product Management; Spark for 2.5+ years, Hadoop for 3+ years; recovering programmer; blog at www.vinayshukla.com; Twitter: @neomythos; addicted to yoga, hiking, and coffee; smallest contributor to Apache Zeppelin.
  • 3. Security: Rings of Defense. Perimeter-level security: network security (e.g. firewalls). Data protection: wire encryption; HDFS TDE/DARE; others. Authentication: Kerberos; Knox (other gateways). OS security. Authorization: Apache Ranger/Sentry; HDFS permissions; HDFS ACLs; YARN ACLs.
  • 4. Interacting with Spark. (Diagram: Zeppelin, Spark Shell, Spark Thrift Server, and a REST server, each with its own driver, connecting to executors in Spark on YARN.)
  • 5. Context: Spark Deployment Modes. Spark on YARN: the Spark driver (SparkContext) runs either in the YARN AM (yarn-cluster) or locally on the client (yarn-client). Spark Shell and Spark Thrift Server run in yarn-client mode only. (Diagram: YARN-Client, with the Spark driver in the client, next to YARN-Cluster, with the Spark driver in the App Master.)
  • 6. Spark on YARN. (Diagram: John Doe runs Spark Submit against a Hadoop cluster; steps numbered 1 to 4 connect Spark Submit, the YARN RM, the Spark AM, the Node Manager, the executor, and HDFS.)
  • 7. Spark – Security – Four Pillars: Authentication, Authorization, Audit, Encryption. Spark leverages Kerberos on YARN. Ensure the network is secure.
  • 8. Kerberos authentication within Spark. (Diagram with steps numbered 1 to 7 among John Doe, the KDC, the Spark AM, the NameNode (NN), and the Hadoop cluster. Labels: kinit; get service ticket for Spark; use Spark ST, submit Spark job; Spark gets NameNode (NN) service ticket; YARN launches Spark executors using John Doe's identity; executor reads from HDFS using John Doe's delegation token.)
  • 9. Spark – Kerberos – Example
      kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
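The two commands on this slide could also be assembled by a small job launcher. The following is a minimal Python sketch, not part of the original deck: the function names `kinit_command` and `spark_submit_command` are hypothetical, and the code only builds argument lists; it does not run anything or contact a KDC.

```python
def kinit_command(keytab, principal):
    """Build the kinit call that obtains a Kerberos ticket from a keytab."""
    return ["kinit", "-kt", keytab, principal]


def spark_submit_command(main_class, jar, app_args, num_executors=3,
                         driver_memory="512m", executor_memory="512m",
                         executor_cores=1, master="yarn-cluster"):
    """Build the spark-submit call with the options used on the slide."""
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master,
        "--num-executors", str(num_executors),
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        jar,
    ] + [str(arg) for arg in app_args]


if __name__ == "__main__":
    auth = kinit_command("/etc/security/keytabs/johndoe.keytab",
                         "johndoe@EXAMPLE.COM")
    job = spark_submit_command("org.apache.spark.examples.SparkPi",
                               "lib/spark-examples*.jar", [10])
    # Printed for pasting into a shell, which expands the jar glob.
    print(" ".join(auth))
    print(" ".join(job))
```

Keeping the ticket acquisition and the submission as separate steps mirrors the slide: the job is only submitted after `kinit` has succeeded.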
  • 10. Spark – Authorization. (Diagram: John Doe, the KDC, a YARN cluster, HDFS, and Ranger. Labels: client gets service ticket for Spark; use Spark ST, submit Spark job; get NameNode (NN) service ticket; executors read from HDFS. Ranger checks: Can John launch this job? Can John read this file?)
  • 11. Encryption: Spark – Communication Channels. (Diagram: Spark Submit, RM, Shuffle Service, AM, Driver, NM, Ex 1 to Ex N. Channels: Shuffle Data; Control/RPC; Shuffle BlockTransfer; Data Source Read/Write Data; FS – Broadcast, File Download.)
  • 12. Spark Communication Encryption Settings
      – Shuffle Data: NM > Ex leverages YARN-based SSL
      – Control/RPC: spark.authenticate = true; leverage YARN to distribute keys
      – Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
      – Read/Write Data: depends on the data source; for HDFS, RPC (RC4 | 3DES) or SSL for WebHDFS
      – FS (Broadcast, File Download): spark.ssl.enabled = true
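One way to use this slide is as a checklist against a job's configuration. The sketch below is illustrative only: the three property names come from the slide, while `REQUIRED_SETTINGS` and `missing_encryption_settings` are hypothetical names, and the check covers only the Spark-side properties, not the YARN or HDFS settings.

```python
# Spark properties named on the slide, with the values shown there.
REQUIRED_SETTINGS = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def missing_encryption_settings(conf):
    """Return the slide's properties that are absent or not 'true' in conf."""
    missing = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = str(conf.get(key, "")).strip().lower()
        if actual != expected:
            missing.append(key)
    return sorted(missing)


if __name__ == "__main__":
    job_conf = {"spark.authenticate": "true", "spark.ssl.enabled": "false"}
    for key in missing_encryption_settings(job_conf):
        print("not enabled:", key)
```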
  • 13. Sharp Edges with Spark Security
      – SparkSQL: only coarse-grained access control today.
      – Client > Spark Thrift Server > Spark executors: no identity propagation on the 2nd hop. This lowers security and forces STS to run as the Hive user to read all data. Use SparkSQL via the shell or the programmatic API. See https://issues.apache.org/jira/browse/SPARK-5159
      – Spark Streaming + Kafka + Kerberos: issues fixed in HDP 2.4.x; no SSL support yet.
      – Spark shuffle: only SASL, no SSL support.
      – Spark shuffle: no encryption for spill to disk or intermediate data.
      – Fine-grained security for SparkSQL: http://bit.ly/2bLghGz http://bit.ly/2bTX7Pm
  • 14. Apache Zeppelin Security
  • 15. Apache Zeppelin: Authentication + SSL. (Diagram: John Doe reaches Zeppelin through a firewall over SSL; steps numbered 1 to 3 involve LDAP and Spark on YARN with its executors.)
  • 16. Zeppelin + Livy E2E Security. (Diagram: John Doe authenticates to Zeppelin via LDAP; the Ispark Group Interpreter in Zeppelin calls the Livy APIs over SPNego: Kerberos; Livy talks to Spark on YARN over Kerberos/RPC; the job runs as John Doe.)
  • 17. Apache Zeppelin: Authorization
      Authorization in Zeppelin:
      – Note-level authorization
      – Grant permissions (Owner, Reader, Writer) to users/groups on notes
      – LDAP group integration
      – Zeppelin UI authorization
      – Allow only admins to configure the interpreter
      – Configured in shiro.ini
      Authorization at Data Level:
      – For Spark with Zeppelin > Livy > Spark: identity propagation, jobs run as the end user
      – For Hive with Zeppelin > JDBC interpreter
      – Shell interpreter: runs as the end user
      shiro.ini:
      [urls]
      /api/interpreter/** = authc, roles[admin]
      /api/configurations/** = authc, roles[admin]
      /api/credential/** = authc, roles[admin]
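The `[urls]` rules on this slide can be sanity-checked before Zeppelin is restarted. The sketch below is an assumption-laden illustration: it uses Python's standard `configparser` rather than Shiro's own INI parser, so it only handles simple `path = filters` lines like the ones on the slide, and `admin_only_paths` is a hypothetical helper name.

```python
import configparser

# The [urls] section exactly as shown on the slide.
SHIRO_URLS = """
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
"""


def admin_only_paths(ini_text):
    """List URL patterns that require both authentication and the admin role."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep URL patterns exactly as written
    parser.read_string(ini_text)
    paths = []
    for path, filters in parser["urls"].items():
        names = [name.strip() for name in filters.split(",")]
        if "authc" in names and "roles[admin]" in names:
            paths.append(path)
    return paths


if __name__ == "__main__":
    for path in admin_only_paths(SHIRO_URLS):
        print("admin only:", path)
```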
  • 18. Apache Zeppelin: Credentials (credentials in Zeppelin)
      – LDAP/AD account: Zeppelin leverages the Hadoop Credential API.
      – Interpreter credentials: not solved yet. This is still an open issue.
  • 19. Apache Zeppelin: AD Authentication. Configure Zeppelin to authenticate users. Zeppelin leverages Apache Shiro for authentication/authorization.
      1. /etc/zeppelin/conf/shiro.ini
      [urls]
      /api/version = anon
      /** = authc
  • 20. Apache Zeppelin: AD Authentication (Active Directory authentication)
      1. Create an entry for the AD credential. Zeppelin leverages the Hadoop Credential API.
         hadoop credential create activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks
         chmod 400 the credential file so that only the Zeppelin process user can read it; no other user is allowed access.
      2. Configure Zeppelin to use AD:
         activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
         activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
         #activeDirectoryRealm.systemPassword = Password1!
         activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://etc/zeppelin/conf/credentials.jceks
         activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
         activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389
         activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin"
         activeDirectoryRealm.authorizationCachingEnabled = true
      Skip step 1 if securing the LDAP password is not an issue.
  • 21. Spark Performance
  • 22. #1 Big or Small Executor?
      spark-submit --master yarn --deploy-mode client --num-executors ? --executor-cores ? --executor-memory ? --class MySimpleApp mySimpleApp.jar arg1 arg2
  • 23. Show the Details. Cluster resource: 72 cores + 288 GB memory, shown as three nodes with 24 cores and 96 GB memory each.
  • 24. Benchmark with Different Combinations
      Cores per E    Memory per E (GB)    E per node
      1              6                    18
      2              12                   9
      3              18                   6
      6              36                   3
      9              54                   2
      18             108                  1
      E stands for Executor. Each combination uses 18 cores and 108 GB of memory per node. In the benchmark chart, lower is better.
      Except for executors that are too small or too big, the performance is relatively close.
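The combinations in this benchmark follow from dividing a fixed per-node budget evenly among executors. As a minimal sketch (the function name `executor_layouts` is hypothetical, and the 18-core / 108 GB budget is taken from the slide), the rows can be enumerated like this:

```python
def executor_layouts(cores_per_node, memory_gb_per_node):
    """Enumerate even splits of one node's cores and memory across executors.

    Returns (cores per executor, memory per executor in GB, executors per node).
    """
    layouts = []
    for cores_per_executor in range(1, cores_per_node + 1):
        if cores_per_node % cores_per_executor != 0:
            continue  # skip splits that would leave cores unused
        executors_per_node = cores_per_node // cores_per_executor
        memory_per_executor = memory_gb_per_node / executors_per_node
        layouts.append((cores_per_executor, memory_per_executor, executors_per_node))
    return layouts


if __name__ == "__main__":
    for cores, memory_gb, count in executor_layouts(18, 108):
        print(f"{cores:>2} cores, {memory_gb:>5.0f} GB per executor, {count:>2} per node")
```

With 18 cores and 108 GB per node this yields the six rows of the table, from 18 one-core executors down to a single 18-core executor.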
  • 25. Big or Small Executor?
      – Avoid executors that are too small or too big.
      – A small executor decreases CPU and memory efficiency.
      – A big executor introduces heavy GC overhead.
      – Usually 3 to 6 cores and 10 to 40 GB of memory per executor is a preferable choice.
  • 26. Any Other Thing?
      – Executor memory != container memory
      – Container memory = executor memory + overhead memory (10% of executor memory)
      – Leave some resources for the OS and other services
      – Enable CPU scheduling if you want to constrain CPU usage
      CPU scheduling reference: http://hortonworks.com/blog/managing-cpu-resources-in-y
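The container-memory rule on this slide can be turned into a quick sizing calculation. The sketch below uses the 10% overhead from the slide; the 384 MB minimum overhead is not on the slide and is included as an assumption based on Spark's documented default for the YARN memory overhead. The function names and the reserved-memory figure in the example are hypothetical.

```python
def container_memory_mb(executor_memory_mb, overhead_percent=10, min_overhead_mb=384):
    """Container size = executor memory + overhead (10% per the slide).

    min_overhead_mb is an assumption, not taken from the slide.
    """
    overhead = max(executor_memory_mb * overhead_percent // 100, min_overhead_mb)
    return executor_memory_mb + overhead


def executors_per_node(node_memory_mb, reserved_mb, executor_memory_mb):
    """How many executor containers fit after reserving memory for the OS."""
    usable = node_memory_mb - reserved_mb
    return usable // container_memory_mb(executor_memory_mb)


if __name__ == "__main__":
    # 96 GB node, 8 GB reserved for the OS and other services (assumed), 18 GB executors.
    print(container_memory_mb(18 * 1024))
    print(executors_per_node(96 * 1024, 8 * 1024, 18 * 1024))
```

The point of the calculation is that a node sized for exactly N executors by executor memory alone may only fit N-1 containers once overhead and the OS reservation are counted.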
  • 27. Multi-tenancy for Spark
      – Leverage YARN queues: set user quotas; set the default YARN queue in spark-defaults; users can override it for each job.
      – Leverage Dynamic Resource Allocation: specify the range of executors a job uses; this requires the shuffle service.
      (Chart: cluster resource utilization.)
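The two recommendations on this slide map onto a handful of Spark properties. The following sketch assembles them for one job; the property names are standard Spark configuration keys, while the helper names and the queue name `analytics` are hypothetical and only serve the example.

```python
def multitenant_conf(queue, min_executors, max_executors):
    """Spark properties for a YARN queue plus dynamic resource allocation."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.yarn.queue": queue,
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Dynamic allocation needs the external shuffle service, as the slide notes.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf):
    """Render the properties as --conf arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(multitenant_conf("analytics", 2, 20))))
```

The same keys could instead be placed in spark-defaults to set a cluster-wide default queue, with per-job `--conf` values overriding them.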
  • 28. Thank You. Vinay Shukla, @neomythos

Editor's notes

  1. John Doe first authenticates to Kerberos before launching Spark Shell:
     kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
     ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  2. The first step of security is network security. The second step is authentication. Most Hadoop ecosystem projects rely on Kerberos for authentication. Kerberos, the three-headed guard dog: https://en.wikipedia.org/wiki/Cerberus
  3. Controlling HDFS authorization is easy and done. Controlling Hive row/column-level authorization in Spark is a work in progress.
  4. For HDFS as a data source, use RPC or SSL with WebHDFS. For NM shuffle data, use YARN SSL. Spark supports SSL for FS (broadcast or file download). Shuffle block transfer supports SASL-based encryption; SSL is coming.
  5. Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  6. All images from Flickr Commons