This document provides instructions for setting up an interactive data analysis framework using a Cloudera Spark cluster with Kerberos authentication, a JupyterHub machine, and LDAP authentication. The key steps are:
1. Install Anaconda, Jupyter, and dependencies on the JupyterHub machine.
2. Configure JupyterHub to use LDAP for authentication via plugins like ldapcreateusers and sudospawner.
3. Set up a PySpark kernel that uses Kerberos authentication to allow users to run Spark jobs on the cluster via proxy impersonation.
4. Optional: Configure JupyterLab as the default interface and enable R, Hive, and Impala kernels.
How to Create a Multi-Tenant Environment for Interactive Data Analysis with JupyterHub and LDAP
1. How to create a multi-tenant environment for interactive data analysis with JupyterHub & LDAP
Spark Cluster + Jupyter + LDAP
2. Introduction
After this presentation you should be able to build an architecture for interactive data analysis using a Cloudera Spark cluster with Kerberos, a Jupyter machine running JupyterHub, and authentication via LDAP.
3. Architecture
This architecture enables the following:
● Transparent data-science development
● User Impersonation
● Authentication via LDAP
● Cluster upgrades do not affect development work.
● Controlled access to data and resources through Kerberos/Sentry.
● Several coding APIs (Scala, R, Python, PySpark, etc.).
● Two layers of security with Kerberos & LDAP
5. Pre-Assumptions
1. Cluster hostname: cm1.localdomain; Jupyter hostname: cm3.localdomain
2. Cluster Python version: 3.7.1
3. Cluster Manager: Cloudera Manager 5.12.2
4. YARN service and pip installed
5. Cluster authentication (pre-installed): Kerberos
a. Kerberos realm: DOMAIN.COM
6. Chosen IDE: Jupyter
7. JupyterHub machine authentication: Kerberos not installed
8. AD machine installed with hostname: ad.localdomain
9. Java 1.8 installed on both machines
10. Cluster Spark version: 2.2.0
6. Anaconda
Download and installation
su - root
wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh
chmod +x Anaconda3-2018.12-Linux-x86_64.sh
./Anaconda3-2018.12-Linux-x86_64.sh
Note 1: Replace the hostname and domain shown here with your own.
Note 2: The SudoSpawner package requires Anaconda to be installed as the root user.
Note 3: JupyterHub requires Python 3.x, so Anaconda 3 is installed.
8. Anaconda
Validate installation
anaconda-navigator
Update Conda (Only if needed)
conda update -n base -c defaults conda
Start Jupyter Notebook (If non root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Start Jupyter Notebook (if root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug --allow-root > /opt/anaconda3/log.txt 2>&1
Note: you only need to change the example values, e.g. replace the IP address with your own.
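The two start commands above differ only in the --allow-root flag. A minimal sketch of that difference, assuming a hypothetical helper named build_notebook_command (not part of Jupyter) that only assembles the argument list and does not start anything:

```python
from typing import List


def build_notebook_command(ip: str, port: int, as_root: bool = False) -> List[str]:
    """Assemble the jupyter-notebook argument list used in the steps above."""
    command = ["jupyter-notebook", f"--ip={ip}", "--port", str(port), "--debug"]
    if as_root:
        # The root variant of the command above adds this flag.
        command.append("--allow-root")
    return command


# Example values taken from the slides; replace them with your own.
non_root_command = build_notebook_command("10.111.22.333", 9001)
root_command = build_notebook_command("10.111.22.333", 9001, as_root=True)
print(" ".join(root_command))
```

The list form can be handed to subprocess.Popen together with a log file handle if you prefer to launch the server from a script.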
9. Jupyter or JupyterHub?
JupyterHub is a multi-user notebook server that:
● Manages authentication.
● Spawns single-user notebook servers on demand.
● Gives each user a complete notebook server.
How to choose?
10. JupyterHub
Install JupyterHub Package (with Http-Proxy)
conda install -c conda-forge jupyterhub
Validate Installation
jupyterhub -h
Start JupyterHub Server
jupyterhub --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Note: you only need to change the example values, e.g. replace the IP address with your own.
11. JupyterHub With LDAP
Install Simple LDAP Authenticator Plugin for JupyterHub
conda install -c conda-forge jupyterhub-ldapauthenticator
Install SudoSpawner
conda install -c conda-forge sudospawner
Install Package LDAP to be able to Create Users Locally
pip install jupyterhub-ldapcreateusers
Generate JupyterHub Config File
jupyterhub --generate-config
Note 1: you only need to change the example values, e.g. replace the IP address with your own.
Note 2: Sudospawner enables JupyterHub to spawn single-user servers without being root
12. JupyterHub With LDAP
Configure JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
import os
import pwd
import subprocess
# Function to Create User Home
def create_dir_hook(spawner):
    if not os.path.exists(os.path.join('/home/', spawner.user.name)):
        subprocess.call(["sudo", "/sbin/mkhomedir_helper", spawner.user.name])
c.Spawner.pre_spawn_hook = create_dir_hook
c.JupyterHub.authenticator_class = 'ldapcreateusers.LocalLDAPCreateUsers'
c.LocalLDAPCreateUsers.server_address = 'ad.localdomain'
c.LocalLDAPCreateUsers.server_port = 3268
c.LocalLDAPCreateUsers.use_ssl = False
c.LocalLDAPCreateUsers.lookup_dn = True
# LDAP search settings (group membership is not taken into account)
c.LocalLDAPCreateUsers.bind_dn_template = ['CN={username},DC=ad,DC=localdomain']
c.LocalLDAPCreateUsers.user_search_base = 'DC=ad,DC=localdomain'
13. JupyterHub With LDAP
c.LocalLDAPCreateUsers.lookup_dn_search_user = 'admin'
c.LocalLDAPCreateUsers.lookup_dn_search_password = 'passWord'
c.LocalLDAPCreateUsers.lookup_dn_user_dn_attribute = 'CN'
c.LocalLDAPCreateUsers.user_attribute = 'sAMAccountName'
c.LocalLDAPCreateUsers.escape_userdn = False
c.JupyterHub.hub_ip = '10.111.22.333'
c.JupyterHub.port = 9001
# Instructions Required to Add User Home
c.LocalAuthenticator.add_user_cmd = ['useradd', '-m']
c.LocalLDAPCreateUsers.create_system_users = True
c.Spawner.debug = True
c.Spawner.default_url = 'tree/home/{username}'
c.Spawner.notebook_dir = '/'
c.PAMAuthenticator.open_sessions = True
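The pre-spawn hook at the top of this config file can be tried outside JupyterHub with a stand-in spawner. This is a minimal sketch, not the hook JupyterHub runs: the home_root and run parameters are hypothetical additions for dry-run testing, and the stand-in object only provides the .user.name attribute that the hook reads.

```python
import os
import tempfile
from types import SimpleNamespace


def create_dir_hook(spawner, home_root="/home", run=None):
    """Return True when the home directory was missing and the helper command was issued."""
    home = os.path.join(home_root, spawner.user.name)
    if os.path.exists(home):
        return False
    # Same command as in the config file: sudo /sbin/mkhomedir_helper <username>
    command = ["sudo", "/sbin/mkhomedir_helper", spawner.user.name]
    if run is not None:
        # In a real deployment this would be subprocess.call; here it is injectable.
        run(command)
    return True


# Stand-in for the spawner object (assumption: only .user.name is needed).
fake_spawner = SimpleNamespace(user=SimpleNamespace(name="tpsimoes"))
recorded_commands = []

with tempfile.TemporaryDirectory() as root:
    first_call = create_dir_hook(fake_spawner, home_root=root, run=recorded_commands.append)
    os.mkdir(os.path.join(root, "tpsimoes"))  # simulate the helper having created the home
    second_call = create_dir_hook(fake_spawner, home_root=root, run=recorded_commands.append)
```

The second call does nothing because the directory already exists, which is the behaviour the hook relies on when a user logs in again.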
Start JupyterHub Server With Config File
jupyterhub -f /opt/anaconda3/jupyterhub_config.py --debug
Note: you only need to change the example values, e.g. replace the IP address with your own.
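To see what bind_dn_template produces for a given login, here is a small sketch. It assumes the authenticator fills the {username} placeholder with Python's str.format, which the placeholder syntax suggests; expand_bind_dns is a hypothetical helper name used only for illustration.

```python
from typing import List


def expand_bind_dns(templates: List[str], username: str) -> List[str]:
    """Fill each bind DN template with the login name."""
    return [template.format(username=username) for template in templates]


# Template copied from the config file above; the user name is an example.
candidate_dns = expand_bind_dns(["CN={username},DC=ad,DC=localdomain"], "tpsimoes")
print(candidate_dns)
```

Printing the expanded DN is a quick way to compare it with the entry shown by your directory browser before debugging failed logins.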
14. JupyterHub with LDAP + ProxyUser
As a reminder, for ProxyUser to work, both machines (Cluster and JupyterHub) need Java 1.8 and the same Spark version; this example uses 2.2.0.
[Cluster] Confirm Cluster Spark & Hadoop Version
spark-shell
hadoop version
[JupyterHub] Download Spark & Create Symbolic link
cd /tmp/
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz
tar zxvf spark-2.2.0-bin-hadoop2.6.tgz
mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
ln -s /opt/spark-2.2.0 /opt/spark
Note: replace the Spark and Hadoop versions with the ones used by your cluster.
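Since the download URL and the directory names depend on the Spark and Hadoop versions, the following sketch derives them from two values. The helper names are hypothetical; the URL layout is the one used by the wget command above.

```python
def spark_archive_name(spark_version: str, hadoop_profile: str) -> str:
    """Name of the unpacked directory, e.g. spark-2.2.0-bin-hadoop2.6."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_profile}"


def spark_archive_url(spark_version: str, hadoop_profile: str) -> str:
    """URL of the .tgz on archive.apache.org, following the layout of the wget command above."""
    name = spark_archive_name(spark_version, hadoop_profile)
    return f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz"


# Versions used throughout this presentation.
url = spark_archive_url("2.2.0", "2.6")
print(url)
```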
15. Jupyter Hub with LDAP + ProxyUser
[Cluster] Copy Hadoop/Hive/Spark Config files
cd /etc/spark2/conf.cloudera.spark2_on_yarn/
scp * root@10.111.22.333:/etc/hadoop/conf/
[Cluster] HDFS ProxyUser
Note: replace the IP and directories with your own values.
[JupyterHub] Create hadoop config files directory
mkdir -p /etc/hadoop/conf/
ln -s /etc/hadoop/conf/ conf.cloudera.yarn
[JupyterHub] Create spark-events directory
mkdir /tmp/spark-events
chown spark:spark /tmp/spark-events
chmod 777 /tmp/spark-events
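The same directory setup can be scripted. This is a minimal sketch that runs against a temporary directory rather than /tmp/spark-events; ensure_event_dir is a hypothetical helper, and the ownership change is left out because it needs root (shutil.chown could be used for it).

```python
import os
import stat
import tempfile


def ensure_event_dir(path: str, mode: int = 0o777) -> str:
    """Create the directory if needed and set its permission bits explicitly."""
    os.makedirs(path, exist_ok=True)
    # chmod is applied separately because makedirs is subject to the process umask.
    os.chmod(path, mode)
    return path


with tempfile.TemporaryDirectory() as base:
    event_dir = ensure_event_dir(os.path.join(base, "spark-events"))
    mode_bits = stat.S_IMODE(os.stat(event_dir).st_mode)
    print(oct(mode_bits))
```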
[JupyterHub] Test Spark 2
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 1 --driver-memory 512m --executor-memory 512m \
--executor-cores 1 --deploy-mode cluster \
--proxy-user tpsimoes --keytab /root/jupyter.keytab \
--conf spark.eventLog.enabled=true \
/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar 10
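If you want to run this test for several users, the command can be assembled in Python. This is a sketch under the assumption that the flags stay exactly as above; build_spark_pi_command is a hypothetical helper that only builds the argument list and never calls spark-submit.

```python
import shlex
from typing import List


def build_spark_pi_command(proxy_user: str, keytab: str, slices: int = 10) -> List[str]:
    """Argument list for the SparkPi smoke test shown above."""
    return [
        "spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", "yarn",
        "--num-executors", "1",
        "--driver-memory", "512m",
        "--executor-memory", "512m",
        "--executor-cores", "1",
        "--deploy-mode", "cluster",
        "--proxy-user", proxy_user,
        "--keytab", keytab,
        "--conf", "spark.eventLog.enabled=true",
        "/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar",
        str(slices),
    ]


command = build_spark_pi_command("tpsimoes", "/root/jupyter.keytab")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(argument) for argument in command))
```

Passing the list to subprocess.run avoids shell quoting problems when user names contain unusual characters.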
17. Jupyter Hub with LDAP + ProxyUser
The python.sh script was written because JupyterHub kernel configurations cannot obtain Kerberos credentials on their own, and because the LDAP package does not support proxyUser the way Zeppelin does. With this solution you can:
● Add an extra security step that requires the IDE keytab
● Enable proxyUser through the Spark flag --proxy-user ${KERNEL_USERNAME}
Edit PySpark Script
touch /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
nano /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export SPARK_MASTER_IP=10.111.22.333
export HADOOP_HOME=/etc/hadoop/conf
18. Jupyter Hub with LDAP + ProxyUser
Edit PySpark Script
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export PYSPARK_SUBMIT_ARGS="-v --master yarn --deploy-mode client --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 2 --driver-memory 1024m --executor-memory 1024m --executor-cores 2 --proxy-user ${PROXY_USER} --keytab /tmp/jupyter.keytab pyspark-shell"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel "$@"
Note: replace the IP and directories with your own values.
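The slides do not show the kernel.json that points the PySpark kernel at python.sh. The sketch below generates one, modelled on the HiveQL kernel.json that appears later in this presentation; the display name "PySpark" and the helper build_kernel_spec are assumptions, so adjust them to your setup.

```python
import json


def build_kernel_spec(script_path: str, display_name: str, language: str = "python") -> dict:
    """Kernel spec whose argv launches a wrapper script with the connection file."""
    return {
        "argv": [script_path, "-f", "{connection_file}"],
        "display_name": display_name,
        "language": language,
    }


# Script path from the step above; display name is a hypothetical choice.
spec = build_kernel_spec(
    "/opt/anaconda3/share/jupyter/kernels/pyspark/python.sh", "PySpark"
)
kernel_json = json.dumps(spec, indent=2)
print(kernel_json)
```

The output would be saved as kernel.json in the same directory as python.sh, and the script itself needs execute permission.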
20. To use JupyterLab without making it the default interface, replace "tree" with "lab" in the browser URL:
http://10.111.22.333:9001/user/tpsimoes/lab
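The URL swap can also be expressed as a small function, for example to generate links for users. This is a sketch that assumes single-user URLs have the shape /user/<name>/tree[/...]; to_lab_url is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def to_lab_url(url: str) -> str:
    """Replace the classic 'tree' view with 'lab' in a single-user server URL."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path: /user/<name>/tree[/...]; anything after 'tree' is dropped.
    if len(segments) > 3 and segments[1] == "user" and segments[3] == "tree":
        segments = segments[:3] + ["lab"]
    return urlunsplit(parts._replace(path="/".join(segments)))


lab_url = to_lab_url("http://10.111.22.333:9001/user/tpsimoes/tree")
print(lab_url)
```

URLs that do not match the expected shape are returned unchanged.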
JupyterLab
JupyterLab is the next-generation web-based interface for Jupyter.
Install JupyterLab
conda install -c conda-forge jupyterlab
Install JupyterLab Launcher
conda install -c conda-forge jupyterlab_launcher
21. JupyterLab
Making JupyterLab the default interface requires additional changes:
● Change the JupyterHub Config File
● Additional extensions (for the Hub Menu)
● Create config file for JupyterLab
Edit JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
...
# Change the values of these flags
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '/home/{username}'
# Add this flag
c.Spawner.cmd = ['jupyter-labhub']
24. R, Hive and Impala on JupyterHub
This section focuses on R, Hive, Impala and the Kerberized kernel.
The R kernel requires libraries on both machines (Cluster and Jupyter).
[Cluster & Jupyter] Install R Libs
yum install -y openssl-devel openssl libcurl-devel libssh2-devel
[Jupyter] Create SymLinks for R libs
ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0;
ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0;
[Cluster & Jupyter] To use SparkR
devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg')
Note: replace the example values with your own.
[Cluster & Jupyter] Start R & Install Packages
R
install.packages('git2r')
install.packages('devtools')
install.packages('repr')
install.packages('IRdisplay')
install.packages('crayon')
install.packages('pbdZMQ')
25. R, Hive and Impala on JupyterHub
To interact with Hive metadata and use Hive syntax directly, my recommendation is the HiveQL kernel.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Python + Hive interface (SQLAlchemy interface for Hive)
pip install pyhive
Install HiveQL Kernel
pip install --upgrade hiveqlKernel
jupyter hiveql install
Confirm HiveQL Kernel installation
jupyter kernelspec list
26. R, Hive and Impala on JupyterHub
Edit HiveQL Kernel
cd /usr/local/share/jupyter/kernels/hiveql
nano kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
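A typo in kernel.json is a common reason for a kernel not showing up or failing to start. The following sketch checks a few basics before you restart the hub; check_kernel_spec and the rules it applies are assumptions for illustration, not an official validator.

```python
import json
from typing import List


def check_kernel_spec(text: str) -> List[str]:
    """Return a list of problems found in a kernel.json document (empty if none)."""
    problems = []
    spec = json.loads(text)
    argv = spec.get("argv", [])
    if not argv:
        problems.append("argv is empty")
    if "{connection_file}" not in argv:
        problems.append("argv lacks the {connection_file} placeholder")
    if any(argument != argument.strip() for argument in argv):
        problems.append("an argv entry has leading or trailing whitespace")
    for key in ("display_name", "language"):
        if key not in spec:
            problems.append(f"missing key: {key}")
    return problems


# The HiveQL kernel.json from the step above.
hiveql_spec = """{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}"""
print(check_kernel_spec(hiveql_spec))
```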
Create and Edit HiveQL script
touch /opt/anaconda3/share/jupyter/kernels/hiveql/hiveql.sh;
nano /opt/anaconda3/share/jupyter/kernels/hiveql/hiveql.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
27. R, Hive and Impala on JupyterHub
Edit HiveQL script
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel "$@"
Note 1: replace the IP, directories and versions with your own values.
Note 2: add your users' keytabs to a chosen directory so that the kernel can run with proxyuser.
28. R, Hive and Impala on JupyterHub
To interact with Impala metadata, my recommendation is Impyla. There is a catch: Impyla needs a specific version of the thrift_sasl library, and installing it stops HiveQL from working, because hiveqlkernel 1.0.13 requires thrift-sasl==0.3.*.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install additional Libs for Impyla
pip install thrift_sasl==0.2.1; pip install sasl;
Install ipython-sql
conda install -c conda-forge ipython-sql
Install impyla
pip install impyla==0.15a1
Note: an alpha version of impyla was installed because of an incompatibility with Python 3.7 and later.
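The conflict between the thrift_sasl 0.2.1 pin used for Impyla and the thrift-sasl==0.3.* requirement of hiveqlkernel can be made explicit with a simplified check. This sketch only handles plain dotted versions with a trailing wildcard and is not a full implementation of pip's version matching; satisfies_wildcard is a hypothetical helper.

```python
def satisfies_wildcard(version: str, requirement: str) -> bool:
    """Check a dotted version against a requirement such as '0.3.*'."""
    required_parts = requirement.split(".")
    actual_parts = version.split(".")
    for position, part in enumerate(required_parts):
        if part == "*":
            # Everything up to the wildcard matched.
            return True
        if position >= len(actual_parts) or actual_parts[position] != part:
            return False
    # No wildcard: the versions must have the same number of parts.
    return len(actual_parts) == len(required_parts)


impyla_pin_ok = satisfies_wildcard("0.2.1", "0.3.*")
print(impyla_pin_ok)
```

Because the pinned version does not satisfy the requirement, keeping Impyla and the HiveQL kernel in separate environments is the safer setup.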
29. R, Hive and Impala on JupyterHub
If you need access to both Hive and Impala metadata, you can use Python + Hive with a custom Kerberized kernel.
Install Jaydebeapi package
conda install -c conda-forge jaydebeapi
Create Python Kerberized Kernel
mkdir -p /usr/share/jupyter/kernels/pythonKerb
cd /usr/share/jupyter/kernels/pythonKerb
touch kernel.json
touch pythonKerb.sh
chmod a+x /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
Note: replace the example values with your own.
Edit Kerberized Kernel
nano /usr/share/jupyter/kernels/pythonKerb/kernel.json
{"argv":
["/usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh", "-f", "{connection_file}"],
"display_name": "PythonKerberized", "language": "python",
"name": "pythonKerb"}
Edit Kerberized Kernel script
nano /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
30. R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
export CLASSPATH=$CLASSPATH:`hadoop classpath`:/etc/hadoop/*:/tmp/*
export PYTHONPATH=$PYTHONPATH:/opt/anaconda3/lib/python3.7/site-packages/jaydebeapi
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel_launcher "$@"
31. R, Hive and Impala on JupyterHub
This assumes that Impyla is not installed or, if it is, that it lives in its own environment.
HiveQL is the best kernel for accessing Hive metadata, and it is supported.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Hive interface & HiveQL Kernel
pip install pyhive; pip install --upgrade hiveqlKernel;
Jupyter Install Kernel
jupyter hiveql install
Check kernel installation
jupyter kernelspec list
32. R, Hive and Impala on JupyterHub
To access a Kerberized cluster you need a Kerberos ticket in the cache, so the solution is the following:
Edit Kerberized Kernel
nano /usr/local/share/jupyter/kernels/hiveql/kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Edit Kerberized Kernel script
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
Note: replace the example values with your own.
33. R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the HiveQL kernel
exec /opt/anaconda3/bin/python -m hiveql "$@"
Note: replace the example values with your own.
34. Interact with JupyterHub Kernels
The following examples show how to use the previously configured kernels with a Kerberized cluster.
[HiveQL] Create Connection
$$ url=hive://hive@cm1.localdomain:10000/
$$ connect_args={"auth": "KERBEROS","kerberos_service_name": "hive"}
$$ pool_size=5
$$ max_overflow=10
[Impyla] Create Connection
from impala.dbapi import connect
conn = connect(host='cm1.localdomain', port=21050, kerberos_service_name='impala', auth_mechanism='GSSAPI')
Note: replace the example values with your own.
35. Interact with JupyterHub Kernels
[Impyla] Create Connection via SQLMagic
%load_ext sql
%config SqlMagic.autocommit=False
%sql impala://tpsimoes:welcome1@cm1.localdomain:21050/db?kerberos_service_name=impala&auth_mechanism=GSSAPI
[Python] Create Connection
import jaydebeapi
import pandas as pd
conn_hive = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://cm1.localdomain:10000/db;AuthMech=1;KrbRealm=DOMAIN.COM;KrbHostFQDN=cm1.localdomain;KrbServiceName=hive;KrbAuthType=2")
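The JDBC URL above is long and easy to mistype. The sketch below assembles it from its parts, using the same option names and values as the connection string; hive_jdbc_url is a hypothetical helper and the option values are simply copied from the example.

```python
def hive_jdbc_url(host: str, port: int, database: str, realm: str, service: str = "hive") -> str:
    """Build the Kerberos-enabled Hive JDBC URL used in the example above."""
    options = {
        "AuthMech": "1",
        "KrbRealm": realm,
        "KrbHostFQDN": host,
        "KrbServiceName": service,
        "KrbAuthType": "2",
    }
    # Options are appended in insertion order, separated by semicolons.
    suffix = ";".join(f"{key}={value}" for key, value in options.items())
    return f"jdbc:hive2://{host}:{port}/{database};{suffix}"


jdbc_url = hive_jdbc_url("cm1.localdomain", 10000, "db", "DOMAIN.COM")
print(jdbc_url)
```

The resulting string can be passed as the second argument of jaydebeapi.connect.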
[Python] Kinit Keytab
import subprocess
result = subprocess.run(['kinit', '-kt', '/tmp/tpsimoes.keytab', 'tpsimoes/cm1.localdomain@DOMAIN.COM'],
                        stdout=subprocess.PIPE)
result.stdout
Note: replace the example values with your own.
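The kinit call above can be wrapped so that the keytab path and principal are derived from the user name and the result is checked. This is a sketch under the assumption that keytabs are named <user>.keytab in one directory, as in the kernel scripts; kinit_command and run_kinit are hypothetical helpers.

```python
import subprocess
from typing import List, Optional


def kinit_command(user: str, realm: str, keytab_dir: str = "/tmp",
                  host: Optional[str] = None) -> List[str]:
    """Build the kinit argument list; the principal includes the host part only if given."""
    principal = f"{user}/{host}@{realm}" if host else f"{user}@{realm}"
    return ["kinit", "-kt", f"{keytab_dir}/{user}.keytab", principal]


def run_kinit(command: List[str]) -> bool:
    """Run kinit and report success; returns False if kinit is missing or fails."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError:
        return False
    return result.returncode == 0


command = kinit_command("tpsimoes", "DOMAIN.COM", host="cm1.localdomain")
print(command)
```

Calling run_kinit(command) at the top of a notebook gives a clear True or False before any query is attempted.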