How to create a multi-tenancy for an interactive data analysis with JupyterHub & LDAP
Spark Cluster + Jupyter + LDAP
Introduction
With this presentation you should be able to create an architecture for interactive data analysis, using a Cloudera Spark cluster with Kerberos, a Jupyter machine with JupyterHub, and authentication via LDAP.
Architecture
This architecture enables the following:
● Transparent data-science development
● User impersonation
● Authentication via LDAP
● Upgrades on the cluster won't affect existing developments
● Controlled access to data and resources via Kerberos/Sentry
● Several coding APIs (Scala, R, Python, PySpark, etc.)
● Two layers of security with Kerberos & LDAP
Architecture (diagram)
Pre-Assumptions
1. Cluster hostname: cm1.localdomain Jupyter hostname: cm3.localdomain
2. Cluster Python version: 3.7.1
3. Cluster Manager: Cloudera Manager 5.12.2
4. Service Yarn & PIP Installed
5. Cluster Authentication Pre-Installed: Kerberos
a. Kerberos Realm DOMAIN.COM
6. Chosen IDE: Jupyter
7. JupyterHub Machine Authentication Not-Installed: Kerberos
8. AD Machine Installed with hostname: ad.localdomain
9. Java 1.8 installed in Both Machines
10. Cluster Spark version 2.2.0
Anaconda
Download and installation
su - root
wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh
chmod +x Anaconda3-2018.12-Linux-x86_64.sh
./Anaconda3-2018.12-Linux-x86_64.sh
Note 1: Replace the hostname and domain with your own values.
Note 2: The SudoSpawner package requires Anaconda to be installed as the root user.
Note 3: JupyterHub requires Python 3.x, therefore Anaconda 3 is installed.
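The rest of the deck assumes Anaconda lives under /opt/anaconda3; as a sketch, the installer's batch-mode flags can pin that prefix instead of accepting the interactive default:
./Anaconda3-2018.12-Linux-x86_64.sh -b -p /opt/anaconda3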
Anaconda
Path environment variables
export PATH=/opt/anaconda3/bin:$PATH
Java environment variables
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64/;
Spark environment variables
export SPARK_HOME=/opt/spark;
export SPARK_MASTER_IP=10.191.38.83;
Yarn environment variables
export YARN_CONF_DIR=/etc/hadoop/conf
Python environment variables
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip;
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py;
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python;
Note: Replace the highlighted values with your own.
Hadoop environment variables
export HADOOP_HOME=/etc/hadoop/conf;
export HADOOP_CONF_DIR=/etc/hadoop/conf;
Hive environment variables
export HIVE_HOME=/etc/hadoop/conf;
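These exports only apply to the current shell; a minimal sketch, assuming you want them to survive new sessions for the root user that runs JupyterHub, is to append them to the shell profile:
# persist the exports above for future shells (adjust the list to the variables you actually set)
cat >> /root/.bashrc <<'EOF'
export PATH=/opt/anaconda3/bin:$PATH
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64/
export SPARK_HOME=/opt/spark
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
EOF
source /root/.bashrc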
Anaconda
Validate installation
anaconda-navigator
Update Conda (Only if needed)
conda update -n base -c defaults conda
Start Jupyter Notebook (If non root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Start Jupyter Notebook (if root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug --allow-root > /opt/anaconda3/log.txt 2>&1
Note: it’s only necessary to change the highlighted, ex: for your ip.
Jupyter or JupyterHub?
JupyterHub it’s a multi-purpose notebook that:
● Manages authentication.
● Spawns single-user notebook on-demand.
● Gives each user a complete notebook
server.
How to choose?
JupyterHub
Install JupyterHub Package (with Http-Proxy)
conda install -c conda-forge jupyterhub
Validate Installation
jupyterhub -h
Start JupyterHub Server
jupyterhub --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Note: it’s only necessary to change the highlighted, ex: for your ip.
JupyterHub With LDAP
Install Simple LDAP Authenticator Plugin for JupyterHub
conda install -c conda-forge jupyterhub-ldapauthenticator
Install SudoSpawner
conda install -c conda-forge sudospawner
Install Package LDAP to be able to Create Users Locally
pip install jupyterhub-ldapcreateusers
Generate JupyterHub Config File
jupyterhub --generate-config
Note 1: it’s only necessary to change the highlighted, ex: for your ip.
Note 2: Sudospawner enables JupyterHub to spawn single-user servers without being root
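If the Hub is later run as a non-root user, SudoSpawner also needs a sudoers rule letting that user launch single-user servers. A minimal sketch following the usual SudoSpawner setup, assuming a hypothetical 'jupyterhub' service account and the Anaconda path used in this deck:
# /etc/sudoers (edit with visudo)
# users the Hub may spawn notebooks for
Runas_Alias JUPYTER_USERS = tpsimoes, jupyter
# command the Hub may run on their behalf without a password
Cmnd_Alias JUPYTER_CMD = /opt/anaconda3/bin/sudospawner
jupyterhub ALL=(JUPYTER_USERS) NOPASSWD:JUPYTER_CMD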
JupyterHub With LDAP
Configure JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
import os
import pwd
import subprocess
# Function to Create User Home
def create_dir_hook(spawner):
if not os.path.exists(os.path.join('/home/', spawner.user.name)):
subprocess.call(["sudo", "/sbin/mkhomedir_helper", spawner.user.name])
c.Spawner.pre_spawn_hook = create_dir_hook
c.JupyterHub.authenticator_class = 'ldapcreateusers.LocalLDAPCreateUsers'
c.LocalLDAPCreateUsers.server_address = 'ad.localdomain'
c.LocalLDAPCreateUsers.server_port = 3268
c.LocalLDAPCreateUsers.use_ssl = False
c.LocalLDAPCreateUsers.lookup_dn = True
# Instructions to define the LDAP search - does not take possible group users into account
c.LocalLDAPCreateUsers.bind_dn_template = ['CN={username},DC=ad,DC=localdomain']
c.LocalLDAPCreateUsers.user_search_base = 'DC=ad,DC=localdomain'
JupyterHub With LDAP
c.LocalLDAPCreateUsers.lookup_dn_search_user = 'admin'
c.LocalLDAPCreateUsers.lookup_dn_search_password = 'passWord'
c.LocalLDAPCreateUsers.lookup_dn_user_dn_attribute = 'CN'
c.LocalLDAPCreateUsers.user_attribute = 'sAMAccountName'
c.LocalLDAPCreateUsers.escape_userdn = False
c.JupyterHub.hub_ip = '10.111.22.333'
c.JupyterHub.port = 9001
# Instructions Required to Add User Home
c.LocalAuthenticator.add_user_cmd = ['useradd', '-m']
c.LocalLDAPCreateUsers.create_system_users = True
c.Spawner.debug = True
c.Spawner.default_url = 'tree/home/{username}'
c.Spawner.notebook_dir = '/'
c.PAMAuthenticator.open_sessions = True
Start JupyterHub Server With Config File
jupyterhub -f /opt/anaconda3/jupyterhub_config.py --debug
Note: it’s only necessary to change the highlighted, ex: for your ip.
JupyterHub with LDAP + ProxyUser
As a reminder, to have ProxyUser working you will need, on both machines (Cluster and JupyterHub), Java 1.8 and the
same Spark version; this example uses 2.2.0.
[Cluster] Confirm Cluster Spark & Hadoop Version
spark-shell
hadoop version
[JupyterHub] Download Spark & Create Symbolic link
cd /tmp/
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz
tar zxvf spark-2.2.0-bin-hadoop2.6.tgz
mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
ln -s /opt/spark-2.2.0 /opt/spark
Note: replace the highlighted Spark and Hadoop versions with your own.
JupyterHub with LDAP + ProxyUser
[Cluster] Copy Hadoop/Hive/Spark Config files
cd /etc/spark2/conf.cloudera.spark2_on_yarn/
scp * root@10.111.22.333:/etc/hadoop/conf/
[Cluster] HDFS ProxyUser
Note: replace the highlighted IP and directories with your own.
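The HDFS ProxyUser step appears in the original slide as a Cloudera Manager screenshot. As a sketch, the equivalent core-site.xml properties (in Cloudera Manager: HDFS → Configuration → core-site.xml safety valve), assuming 'jupyter' is the principal whose keytab is used later, would be:
<property>
  <name>hadoop.proxyuser.jupyter.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.jupyter.groups</name>
  <value>*</value>
</property>
Redeploy the client configuration and restart the affected services afterwards.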
[JupyterHub] Create hadoop config files directory
mkdir -p /etc/hadoop/conf/
ln -s /etc/hadoop/conf/ conf.cloudera.yarn
[JupyterHub] Create spark-events directory
mkdir /tmp/spark-events
chown spark:spark /tmp/spark-events
chmod 777 /tmp/spark-events
[JupyterHub] Test Spark 2
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 1 --driver-memory 512m --executor-memory 512m \
--executor-cores 1 --deploy-mode cluster \
--proxy-user tpsimoes --keytab /root/jupyter.keytab \
--conf spark.eventLog.enabled=true \
/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar 10;
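As a quick check that the job really ran as the proxied user, the driver output can be pulled from YARN once the application finishes (a sketch; <application_id> is whatever ID spark-submit printed):
yarn application -list -appStates FINISHED
yarn logs -applicationId <application_id> | grep "Pi is roughly"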
Check available kernel specs
jupyter kernelspec list
Install PySpark Kernel
conda install -c conda-forge pyspark
Confirm kernel installation
jupyter kernelspec list
Edit PySpark kernel
nano /opt/anaconda3/share/jupyter/kernels/pyspark/kernel.json
{"argv":
["/opt/anaconda3/share/jupyter/kernels/pyspark/python.sh", "-f", "{connection_file}"],
"display_name": "PySpark (Spark 2.2.0)", "language":"python" }
Create PySpark Script
cd /opt/anaconda3/share/jupyter/kernels/pyspark;
touch python.sh;
chmod a+x python.sh;
JupyterHub with LDAP + ProxyUser
The python.sh script was created due to limitations in the JupyterHub kernel configuration, which cannot obtain the
Kerberos credentials, and because the LDAP package does not allow proxyUser as is possible with Zeppelin. Therefore,
with this architecture you are able to:
● Add an extra layer of security that requires the IDE keytab
● Enable the use of proxyUser through the Spark flag --proxy-user ${KERNEL_USERNAME}
Edit PySpark Script
touch /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
nano /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export SPARK_MASTER_IP=10.111.22.333
export HADOOP_HOME=/etc/hadoop/conf
JupyterHub with LDAP + ProxyUser
Edit PySpark Script
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export PYSPARK_SUBMIT_ARGS="-v --master yarn --deploy-mode client --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 2 --driver-memory 1024m --executor-memory 1024m --executor-cores 2 --proxy-user "${PROXY_USER}" --keytab /tmp/jupyter.keytab pyspark-shell"
# Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel $@
Note: replace the highlighted IP and directories with your own.
Interact with JupyterHub
Login
http://10.111.22.333:9001/hub/login
Notebook Kernel
To use JupyterLab without making it the default interface, just replace "tree" with "lab" in the browser URL:
http://10.111.22.333:9001/user/tpsimoes/lab
JupyterLab
JupyterLab it’s the next-generation web-based
interface for Jupyter.
Install JupyterLab
conda install -c conda-forge jupyterlab
Install JupyterLab Launcher
conda install -c conda-forge jupyterlab_launcher
JupyterLab
To use the JupyterLab interface as the default on Jupyter, some additional changes are required:
● Change the JupyterHub Config File
● Additional extensions (for the Hub Menu)
● Create config file for JupyterLab
Edit JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
...
# Change the values on this Flags
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '/home/{username}'
# Add this Flag
c.Spawner.cmd = ['jupyter-labhub']
JupyterLab
Install jupyterlab-hub extension
jupyter labextension install @jupyterlab/hub-extension
Create JupyterLab Config File
cd /opt/anaconda3/share/jupyter/lab/settings/
nano page_config.json
{
"hub_prefix": "/jupyter"
}
JupyterLab
The final architecture:
R, Hive and Impala on JupyterHub
In this section the focus is on R, Hive, Impala and a Kerberized kernel.
The R kernel requires libs on both machines (Cluster and Jupyter).
[Cluster & Jupyter] Install R Libs
yum install -y openssl-devel openssl libcurl-devel libssh2-devel
[Jupyter] Create SymLinks for R libs
ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0;
ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0;
[Cluster & Jupyter] To use SparkR
devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg')
Note: Replace the highlighted values with your own.
[Cluster & Jupyter] Start R & Install Packages
R
install.packages('git2r')
install.packages('devtools')
install.packages('repr')
install.packages('IRdisplay')
install.packages('crayon')
install.packages('pbdZMQ')
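The packages above are the usual prerequisites for the R kernel, but the kernel itself still has to be installed and registered with Jupyter; a minimal sketch, assuming the IRkernel package is the one used:
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec(user = FALSE) # register the R kernel system-wide so every JupyterHub user sees it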
R, Hive and Impala on JupyterHub
To interact with Hive metadata and use its syntax directly, my recommendation is the HiveQL kernel.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Python + Hive interface (SQLAlchemy interface for Hive)
pip install pyhive
Install HiveQL Kernel
pip install --upgrade hiveqlKernel
jupyter hiveql install
Confirm HiveQL Kernel installation
jupyter kernelspec list
R, Hive and Impala on JupyterHub
Edit HiveQL Kernel
cd /usr/local/share/jupyter/kernels/hiveql
nano kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Create and Edit HiveQL script
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh;
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
R, Hive and Impala on JupyterHub
Edit HiveQL script
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the HiveQL kernel
exec /opt/anaconda3/bin/python -m hiveql $@
Note 1: replace the highlighted IP, directories and versions with your own.
Note 2: add each user's keytab to a chosen directory so that it is possible to run with proxyUser.
R, Hive and Impala on JupyterHub
To interact with Impala metadata my recommendation is Impyla, but there is a catch: due to a specific version of a
lib (thrift_sasl), the HiveQL kernel will stop working, because hiveqlkernel 1.0.13 requires thrift-sasl==0.3.*.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install additional Libs for Impyla
pip install thrift_sasl==0.2.1; pip install sasl;
Install ipython-sql
conda install -c conda-forge ipython-sql
Install impyla
pip install impyla==0.15a1
Note: an alpha version of impyla is installed due to an incompatibility with Python 3.7 and later.
R, Hive and Impala on JupyterHub
If you require access to Hive & Impala metadata, you can use Python + Hive with a Kerberized custom kernel.
Install Jaydebeapi package
conda install -c conda-forge jaydebeapi
Create Python Kerberized Kernel
mkdir -p /usr/share/jupyter/kernels/pythonKerb
cd /usr/share/jupyter/kernels/pythonKerb
touch kernel.json
touch pythonKerb.sh
chmod a+x /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
Note: Replace the highlighted values with your own.
Edit Kerberized Kernel
nano /usr/share/jupyter/kernels/pythonKerb/kernel.json
{"argv":
["/usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh", "-f", "{connection_file}"],
"display_name": "PythonKerberized", "language": "python",
"name": "pythonKerb"}
Edit Kerberized Kernel script
nano /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
#!/usr/bin/env bash
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
export CLASSPATH=$CLASSPATH:`hadoop classpath`:/etc/hadoop/*:/tmp/*
export PYTHONPATH=$PYTHONPATH:/opt/anaconda3/lib/python3.7/site-packages/jaydebeapi
# Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel_launcher $@
R, Hive and Impala on JupyterHub
Assuming that you don't have Impyla installed (or, if you do, that you created a separate environment for it),
HiveQL is the best kernel to access Hive metadata, and it is supported.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Hive interface & HiveQL Kernel
pip install pyhive; pip install --upgrade hiveqlKernel;
Jupyter Install Kernel
jupyter hiveql install
Check kernel installation
jupyter kernelspec list
R, Hive and Impala on JupyterHub
To access a Kerberized cluster you will require a Kerberos ticket in the cache, therefore the solution is the following:
Edit Kerberized Kernel
nano /usr/local/share/jupyter/kernels/hiveql/kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Edit Kerberized Kernel script
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
Note: Replace the highlighted values with your own.
R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
#!/usr/bin/env bash
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the HiveQL kernel
exec /opt/anaconda3/bin/python -m hiveql $@
Note: Replace the highlighted values with your own.
Interact with JupyterHub Kernels
The following information serves as a knowledge base on how to interact with the previously configured kernels against a Kerberized cluster.
[HiveQL] Create Connection
$$ url=hive://hive@cm1.localdomain:10000/
$$ connect_args={"auth": "KERBEROS","kerberos_service_name": "hive"}
$$ pool_size=5
$$ max_overflow=10
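After those configuration lines have been executed in a cell, a HiveQL cell simply contains the statement to run; for example (hypothetical database and table names):
SHOW DATABASES;
SELECT * FROM db.my_table LIMIT 10;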
[Impyla] Create Connection
from impala.dbapi import connect
conn = connect(host='cm1.localdomain', port=21050, kerberos_service_name='impala', auth_mechanism='GSSAPI')
Note: Replace the highlighted values with your own.
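With the connection established, Impyla follows the standard Python DB API; a short usage sketch:
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
print(cursor.fetchall())
cursor.close()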
Interact with JupyterHub Kernels
[Impyla] Create Connection via SQLMagic
%load_ext sql
%config SqlMagic.autocommit=False
%sql impala://tpsimoes:welcome1@cm1.localdomain:21050/db?kerberos_service_name=impala&auth_mechanism=GSSAPI
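Once the connection string is registered, later %sql lines run directly against Impala; for example:
result = %sql SHOW TABLES
result.DataFrame() # ipython-sql can hand the result back as a pandas DataFrame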
[Python] Create Connection
import jaydebeapi
import pandas as pd
conn_hive = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2://cm1.localdomain:10000/db;AuthMech=1;KrbRealm=DOMAIN.COM;KrbHostFQDN=cm1.localdomain;KrbServiceName=hive;KrbAuthType=2")
[Python] Kinit Keytab
import subprocess
result = subprocess.run(['kinit', '-kt', '/tmp/tpsimoes.keytab', 'tpsimoes/cm1.localdomain@DOMAIN.COM'], stdout=subprocess.PIPE)
result.stdout
Note: Replace the highlighted values with your own.
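With the ticket cached and the JDBC connection open, queries follow the DB API as well and can be loaded into pandas; a short sketch (hypothetical table name):
curs = conn_hive.cursor()
curs.execute('SHOW TABLES')
print(curs.fetchall())
curs.close()
df = pd.read_sql('SELECT * FROM db.my_table LIMIT 10', conn_hive) # pandas accepts the DB API connection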
Thanks
Big Data Engineer
Tiago Simões
Más contenido relacionado

La actualidad más candente

Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Spark Summit
 
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Vincenzo Ferme
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialTim Vaillancourt
 
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...InfluxData
 
Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBleesjensen
 
Redis Streams
Redis Streams Redis Streams
Redis Streams Redis Labs
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuCeph Community
 
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike SteenbergenMeet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergendistributed matters
 
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory Engine
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory EngineRedis vs. MongoDB: Comparing In-Memory Databases with Percona Memory Engine
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory EngineScaleGrid.io
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud FunctionsJerry Jalava
 
Cfgmgmtcamp 2023 — eBPF Superpowers
Cfgmgmtcamp 2023 — eBPF SuperpowersCfgmgmtcamp 2023 — eBPF Superpowers
Cfgmgmtcamp 2023 — eBPF SuperpowersRaphaël PINSON
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...DataStax
 
Improved alerting with Prometheus and Alertmanager
Improved alerting with Prometheus and AlertmanagerImproved alerting with Prometheus and Alertmanager
Improved alerting with Prometheus and AlertmanagerJulien Pivotto
 
Prometheus in openstack-helm
Prometheus in openstack-helmPrometheus in openstack-helm
Prometheus in openstack-helm성일 임
 

La actualidad más candente (20)

Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
 
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
 
Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDB
 
Redis Streams
Redis Streams Redis Streams
Redis Streams
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike SteenbergenMeet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
 
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory Engine
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory EngineRedis vs. MongoDB: Comparing In-Memory Databases with Percona Memory Engine
Redis vs. MongoDB: Comparing In-Memory Databases with Percona Memory Engine
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud Functions
 
Cfgmgmtcamp 2023 — eBPF Superpowers
Cfgmgmtcamp 2023 — eBPF SuperpowersCfgmgmtcamp 2023 — eBPF Superpowers
Cfgmgmtcamp 2023 — eBPF Superpowers
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Improved alerting with Prometheus and Alertmanager
Improved alerting with Prometheus and AlertmanagerImproved alerting with Prometheus and Alertmanager
Improved alerting with Prometheus and Alertmanager
 
Prometheus in openstack-helm
Prometheus in openstack-helmPrometheus in openstack-helm
Prometheus in openstack-helm
 

Similar a How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap

How to create a secured multi tenancy for clustered ML with JupyterHub
How to create a secured multi tenancy for clustered ML with JupyterHubHow to create a secured multi tenancy for clustered ML with JupyterHub
How to create a secured multi tenancy for clustered ML with JupyterHubTiago Simões
 
Provisioning with Puppet
Provisioning with PuppetProvisioning with Puppet
Provisioning with PuppetJoe Ray
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)HungWei Chiu
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
Puppet for Developers
Puppet for DevelopersPuppet for Developers
Puppet for Developerssagarhere4u
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Harmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetHarmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetAchieve Internet
 
Automating Complex Setups with Puppet
Automating Complex Setups with PuppetAutomating Complex Setups with Puppet
Automating Complex Setups with PuppetKris Buytaert
 
Automating complex infrastructures with Puppet
Automating complex infrastructures with PuppetAutomating complex infrastructures with Puppet
Automating complex infrastructures with PuppetKris Buytaert
 
[Devconf.cz][2017] Understanding OpenShift Security Context Constraints
[Devconf.cz][2017] Understanding OpenShift Security Context Constraints[Devconf.cz][2017] Understanding OpenShift Security Context Constraints
[Devconf.cz][2017] Understanding OpenShift Security Context ConstraintsAlessandro Arrichiello
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and MaintenanceJazkarta, Inc.
 
k8s practice 2023.pptx
k8s practice 2023.pptxk8s practice 2023.pptx
k8s practice 2023.pptxwonyong hwang
 
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)Advanced Eclipse Workshop (held at IPC2010 -spring edition-)
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)Bastian Feder
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Chef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureChef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureMichaël Lopez
 
Puppet: Eclipsecon ALM 2013
Puppet: Eclipsecon ALM 2013Puppet: Eclipsecon ALM 2013
Puppet: Eclipsecon ALM 2013grim_radical
 
How to create a secured cloudera cluster
How to create a secured cloudera clusterHow to create a secured cloudera cluster
How to create a secured cloudera clusterTiago Simões
 

Similar a How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap (20)

How to create a secured multi tenancy for clustered ML with JupyterHub
How to create a secured multi tenancy for clustered ML with JupyterHubHow to create a secured multi tenancy for clustered ML with JupyterHub
How to create a secured multi tenancy for clustered ML with JupyterHub
 
Provisioning with Puppet
Provisioning with PuppetProvisioning with Puppet
Provisioning with Puppet
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
One-Man Ops
One-Man OpsOne-Man Ops
One-Man Ops
 
Puppet for Developers
Puppet for DevelopersPuppet for Developers
Puppet for Developers
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Harmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetHarmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and Puppet
 
Automating Complex Setups with Puppet
Automating Complex Setups with PuppetAutomating Complex Setups with Puppet
Automating Complex Setups with Puppet
 
Automating complex infrastructures with Puppet
Automating complex infrastructures with PuppetAutomating complex infrastructures with Puppet
Automating complex infrastructures with Puppet
 
EC CUBE 3.0.x installation guide
EC CUBE 3.0.x installation guideEC CUBE 3.0.x installation guide
EC CUBE 3.0.x installation guide
 
[Devconf.cz][2017] Understanding OpenShift Security Context Constraints
[Devconf.cz][2017] Understanding OpenShift Security Context Constraints[Devconf.cz][2017] Understanding OpenShift Security Context Constraints
[Devconf.cz][2017] Understanding OpenShift Security Context Constraints
 
Cooking with Chef
Cooking with ChefCooking with Chef
Cooking with Chef
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and Maintenance
 
k8s practice 2023.pptx
k8s practice 2023.pptxk8s practice 2023.pptx
k8s practice 2023.pptx
 
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)Advanced Eclipse Workshop (held at IPC2010 -spring edition-)
Advanced Eclipse Workshop (held at IPC2010 -spring edition-)
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Chef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureChef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructure
 
Puppet: Eclipsecon ALM 2013
Puppet: Eclipsecon ALM 2013Puppet: Eclipsecon ALM 2013
Puppet: Eclipsecon ALM 2013
 
How to create a secured cloudera cluster
How to create a secured cloudera clusterHow to create a secured cloudera cluster
How to create a secured cloudera cluster
 

Más de Tiago Simões

How to go the extra mile on monitoring
How to go the extra mile on monitoringHow to go the extra mile on monitoring
How to go the extra mile on monitoringTiago Simões
 
How to scheduled jobs in a cloudera cluster without oozie
How to scheduled jobs in a cloudera cluster without oozieHow to scheduled jobs in a cloudera cluster without oozie
How to scheduled jobs in a cloudera cluster without oozieTiago Simões
 
How to implement a gdpr solution in a cloudera architecture
How to implement a gdpr solution in a cloudera architectureHow to implement a gdpr solution in a cloudera architecture
How to implement a gdpr solution in a cloudera architectureTiago Simões
 
How to configure a hive high availability connection with zeppelin
How to configure a hive high availability connection with zeppelinHow to configure a hive high availability connection with zeppelin
How to configure a hive high availability connection with zeppelinTiago Simões
 
How to install and use multiple versions of applications in run-time
How to install and use multiple versions of applications in run-timeHow to install and use multiple versions of applications in run-time
How to install and use multiple versions of applications in run-timeTiago Simões
 
Hive vs impala vs spark - tuning
Hive vs impala vs spark - tuningHive vs impala vs spark - tuning
Hive vs impala vs spark - tuningTiago Simões
 
How to create a multi tenancy for an interactive data analysis
How to create a multi tenancy for an interactive data analysisHow to create a multi tenancy for an interactive data analysis
How to create a multi tenancy for an interactive data analysisTiago Simões
 

Más de Tiago Simões (7)

How to go the extra mile on monitoring
How to go the extra mile on monitoringHow to go the extra mile on monitoring
How to go the extra mile on monitoring
 
How to scheduled jobs in a cloudera cluster without oozie
How to scheduled jobs in a cloudera cluster without oozieHow to scheduled jobs in a cloudera cluster without oozie
How to scheduled jobs in a cloudera cluster without oozie
 
How to implement a gdpr solution in a cloudera architecture
How to implement a gdpr solution in a cloudera architectureHow to implement a gdpr solution in a cloudera architecture
How to implement a gdpr solution in a cloudera architecture
 
How to configure a hive high availability connection with zeppelin
How to configure a hive high availability connection with zeppelinHow to configure a hive high availability connection with zeppelin
How to configure a hive high availability connection with zeppelin
 
How to install and use multiple versions of applications in run-time
How to install and use multiple versions of applications in run-timeHow to install and use multiple versions of applications in run-time
How to install and use multiple versions of applications in run-time
 
Hive vs impala vs spark - tuning
Hive vs impala vs spark - tuningHive vs impala vs spark - tuning
Hive vs impala vs spark - tuning
 
How to create a multi tenancy for an interactive data analysis
How to create a multi tenancy for an interactive data analysisHow to create a multi tenancy for an interactive data analysis
How to create a multi tenancy for an interactive data analysis
 

Último

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 

Último (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap

  • 1. How-to create a multi tenancy for an interactive data analysis with JupyterHub & LDAP Spark Cluster + Jupyter + LDAP
  • 2. Introduction With this presentation you should be able to create an architecture for a framework of an interactive data analysis by using a Cloudera Spark Cluster with Kerberos, a Jupyter machine with JupyterHub and authentication via LDAP.
  • 3. Architecture This architecture enables the following: ● Transparent data-science development ● User Impersonation ● Authentication via LDAP ● Upgrades on Cluster won’t affect the developments. ● Controlled access to the data and resources by Kerberos/Sentry. ● Several coding API’s (Scala, R, Python, PySpark, etc…). ● Two layers of security with Kerberos & LDAP
  • 5. Pre-Assumptions 1. Cluster hostname: cm1.localdomain Jupyter hostname: cm3.localdomain 2. Cluster Python version: 3.7.1 3. Cluster Manager: Cloudera Manager 5.12.2 4. Service Yarn & PIP Installed 5. Cluster Authentication Pre-Installed: Kerberos a. Kerberos Realm DOMAIN.COM 6. Chosen IDE: Jupyter 7. JupyterHub Machine Authentication Not-Installed: Kerberos 8. AD Machine Installed with hostname: ad.localdomain 9. Java 1.8 installed in Both Machines 10. Cluster Spark version 2.2.0
  • 6. Anaconda Download and installation su - root wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh chmod +x Anaconda3-2018.12-Linux-x86_64.sh ./Anaconda3-2018.12-Linux-x86_64.sh Note 1: Change with your hostname and domain in the highlighted field. Note 2: Due to the package SudoSpawner - that requires Anaconda be installed with the root user! Note 3: JupyterHub requires Python 3.X, therefore it will be installed Anaconda 3
  • 7. Anaconda Path environment variables export PATH=/opt/anaconda3/bin:$PATH Java environment variables export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64/; Spark environment variables export SPARK_HOME=/opt/spark; export SPARK_MASTER_IP=10.191.38.83; Yarn environment variables export YARN_CONF_DIR=/etc/hadoop/conf Yarn environment variables export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip; export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py; export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python; Note: Change with your values in the highlighted field. Hadoop environment variables export HADOOP_HOME=/etc/hadoop/conf; export HADOOP_CONF_DIR=/etc/hadoop/conf; Hive environment variables export HIVE_HOME=/etc/hadoop/conf;
  • 8. Anaconda Validate installation anaconda-navigator Update Conda (Only if needed) conda update -n base -c defaults conda Start Jupyter Notebook (If non root) jupyter-notebook --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1 Start Jupyter Notebook (if root) jupyter-notebook --ip='10.111.22.333' --port 9001 --debug --allow-root > /opt/anaconda3/log.txt 2>&1 Note: it’s only necessary to change the highlighted, ex: for your ip.
  • 9. Jupyter or JupyterHub? JupyterHub it’s a multi-purpose notebook that: ● Manages authentication. ● Spawns single-user notebook on-demand. ● Gives each user a complete notebook server. How to choose?
  • 10. JupyterHub Install JupyterHub Package (with Http-Proxy) conda install -c conda-forge jupyterhub Validate Installation jupyterhub -h Start JupyterHub Server jupyterhub --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1 Note: it’s only necessary to change the highlighted, ex: for your ip.
  • 11. JupyterHub With LDAP Install Simple LDAP Authenticator Plugin for JupyterHub conda install -c conda-forge jupyterhub-ldapauthenticator Install SudoSpawner conda install -c conda-forge sudospawner Install Package LDAP to be able to Create Users Locally pip install jupyterhub-ldapcreateusers Generate JupyterHub Config File jupyterhub --generate-config Note 1: it’s only necessary to change the highlighted, ex: for your ip. Note 2: Sudospawner enables JupyterHub to spawn single-user servers without being root
  • 12. JupyterHub With LDAP Configure JupyterHub Config File nano /opt/anaconda3/jupyterhub_config.py import os import pwd import subprocess # Function to Create User Home def create_dir_hook(spawner): if not os.path.exists(os.path.join('/home/', spawner.user.name)): subprocess.call(["sudo", "/sbin/mkhomedir_helper", spawner.user.name]) c.Spawner.pre_spawn_hook = create_dir_hook c.JupyterHub.authenticator_class = 'ldapcreateusers.LocalLDAPCreateUsers' c.LocalLDAPCreateUsers.server_address = 'ad.localdomain' c.LocalLDAPCreateUsers.server_port = 3268 c.LocalLDAPCreateUsers.use_ssl = False c.LocalLDAPCreateUsers.lookup_dn = True # Instructions to Define LDAP Search - Doesn't have in consideration possible group users c.LocalLDAPCreateUsers.bind_dn_template = ['CN={username},DC=ad,DC=localdomain'] c.LocalLDAPCreateUsers.user_search_base = 'DC=ad,DC=localdomain'
  • 13. JupyterHub With LDAP c.LocalLDAPCreateUsers.lookup_dn_search_user = 'admin' c.LocalLDAPCreateUsers.lookup_dn_search_password = 'passWord' c.LocalLDAPCreateUsers.lookup_dn_user_dn_attribute = 'CN' c.LocalLDAPCreateUsers.user_attribute = 'sAMAccountName' c.LocalLDAPCreateUsers.escape_userdn = False c.JupyterHub.hub_ip = '10.111.22.333’ c.JupyterHub.port = 9001 # Instructions Required to Add User Home c.LocalAuthenticator.add_user_cmd = ['useradd', '-m'] c.LocalLDAPCreateUsers.create_system_users = True c.Spawner.debug = True c.Spawner.default_url = 'tree/home/{username}' c.Spawner.notebook_dir = '/' c.PAMAuthenticator.open_sessions = True Start JupyterHub Server With Config File jupyterhub -f /opt/anaconda3/jupyterhub_config.py --debug Note: it’s only necessary to change the highlighted, ex: for your ip.
  • 14. JupyterHub with LDAP + ProxyUser Has a reminder, to have ProxyUser working, you will require on both Machines (Cluster and JupyterHub): Java 1.8 and same Spark version, for this example it will be used the 2.2.0. [Cluster] Confirm Cluster Spark & Hadoop Version spark-shell hadoop version [JupyterHub] Download Spark & Create Symbolic link cd /tmp/ wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz tar zxvf spark-2.2.0-bin-hadoop2.6.tgz mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0 ln -s /opt/spark-2.2.0 /opt/spark Note: change with your Spark and Hadoop version in the highlighted field.
  • 15. Jupyter Hub with LDAP + ProxyUser [Cluster] Copy Hadoop/Hive/Spark Config files cd /etc/spark2/conf.cloudera.spark2_on_yarn/ scp * root@10.111.22.333:/etc/hadoop/conf/ [Cluster] HDFS ProxyUser Note: change with your IP and directory’s in the highlighted field. [JupyterHub] Create hadoop config files directory mkdir -p /etc/hadoop/conf/ ln -s /etc/hadoop/conf/ conf.cloudera.yarn [JupyterHub] Create spark-events directory mkdir /tmp/spark-events chown spark:spark spark-events chmod 777 /tmp/spark-events [JupyterHub] Test Spark 2 spark-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 --deploy-mode cluster --proxy-user tpsimoes --keytab /root/jupyter.keytab --conf spark.eventLog.enabled=true /opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar 10;
  • 16. Check available kernel specs jupyter kernelspec list Install PySpark Kernel conda install -c conda-forge pyspark Confirm kernel installation jupyter kernelspec list Edit PySpark kernel nano /opt/anaconda3/share/jupyter/kernels/pyspark/kernel.json {"argv": ["/opt/anaconda3/share/jupyter/kernels/pyspark/python.sh", "-f", "{connection_file}"], "display_name": "PySpark (Spark 2.2.0)", "language":"python" } Create PySpark Script cd /opt/anaconda3/share/jupyter/kernels/pyspark; touch python.sh; chmod a+x python.sh; Jupyter Hub with LDAP + ProxyUser
  • 17. Jupyter Hub with LDAP + ProxyUser The python.sh script was elaborated due to the limitations on JupyterHub Kernel configurations that isn't able to get the Kerberos Credentials and also due to LDAP package that doesn't allow the proxyUser has is possible with Zeppelin. Therefore with this architecture solution you are able to: ● Add a new step of security, that requires the IDE keytab ● Enable the usage of proxyUser by using the flag from spark --proxy-user ${KERNEL_USERNAME} Edit PySpark Script touch /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh; nano /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh; # !/usr/bin/env bash # setup environment variable, etc. PROXY_USER="$(whoami)" export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64 export SPARK_HOME=/opt/spark export SPARK_MASTER_IP=10.111.22.333 export HADOOP_HOME=/etc/hadoop/conf
  • 18. JupyterHub with LDAP + ProxyUser Edit PySpark Script export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HIVE_HOME=/etc/hadoop/conf export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python export PYSPARK_SUBMIT_ARGS="-v --master yarn --deploy-mode client --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 2 --driver-memory 1024m --executor-memory 1024m --executor-cores 2 --proxy-user "${PROXY_USER}" --keytab /tmp/jupyter.keytab pyspark-shell" # Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM # run the ipykernel exec /opt/anaconda3/bin/python -m ipykernel $@ Note: change with your IP and directories in the highlighted field.
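With python.sh in place and the PySpark kernel restarted, a quick notebook cell can confirm that the session really went through YARN as the proxied user. A minimal sketch; the spark and sc objects are created by the shell.py referenced in PYTHONSTARTUP:
# spark / sc come from /opt/spark-2.2.0/python/pyspark/shell.py
print(spark.version)              # expected: 2.2.0
print(sc.master)                  # expected: yarn
print(sc.sparkUser())             # expected: your JupyterHub login (the proxied user)
print(spark.range(1000).count())  # runs a small job on the cluster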
  • 20. JupyterLab JupyterLab is the next-generation web-based interface for Jupyter. Install JupyterLab conda install -c conda-forge jupyterlab Install JupyterLab Launcher conda install -c conda-forge jupyterlab_launcher To use JupyterLab without it being the default interface, just swap "tree" with "lab" in your browser URL: http://10.111.22.333:9001/user/tpsimoes/lab
  • 21. JupyterLab To use the JupyterLab interface as the default on Jupyter, additional changes are required: ● Change the JupyterHub config file ● Additional extensions (for the Hub menu) ● Create a config file for JupyterLab Edit JupyterHub Config File nano /opt/anaconda3/jupyterhub_config.py ... # Change the values of these flags c.Spawner.default_url = '/lab' c.Spawner.notebook_dir = '/home/{username}' # Add this flag c.Spawner.cmd = ['jupyter-labhub']
  • 22. JupyterLab Install jupyterlab-hub extension jupyter labextension install @jupyterlab/hub-extension Create JupyterLab Config File cd /opt/anaconda3/share/jupyter/lab/settings/ nano page_config.json { "hub_prefix": "/jupyter" }
  • 24. R, Hive and Impala on JupyterHub This section focuses on R, Hive, Impala and a Kerberized kernel. The R kernel requires libraries on both machines (Cluster and Jupyter). [Cluster & Jupyter] Install R Libs yum install -y openssl-devel openssl libcurl-devel libssh2-devel [Jupyter] Create SymLinks for R libs ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0; ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0; [Cluster & Jupyter] Start R & Install Packages R install.packages('git2r') install.packages('devtools') install.packages('repr') install.packages('IRdisplay') install.packages('crayon') install.packages('pbdZMQ') [Cluster & Jupyter] To use SparkR devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg') Note: Change with your values in the highlighted field.
  • 25. R, Hive and Impala on JupyterHub To interact with Hive metadata and use its syntax directly, my recommendation is HiveQL. Install Developer Toolset Libs yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++ Install Python + Hive interface (SQLAlchemy interface for Hive) pip install pyhive Install HiveQL Kernel pip install --upgrade hiveqlKernel jupyter hiveql install Confirm HiveQL Kernel installation jupyter kernelspec list
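Since the HiveQL kernel builds on PyHive, a quick way to validate the Hive side independently of Jupyter is a direct PyHive connection. A minimal sketch, assuming a valid Kerberos ticket in the cache (kinit) and HiveServer2 listening on cm1.localdomain:10000 as used later in this deck:
from pyhive import hive

# auth='KERBEROS' requires a valid ticket in the cache
conn = hive.Connection(host='cm1.localdomain', port=10000,
                       auth='KERBEROS', kerberos_service_name='hive')
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
print(cursor.fetchall())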
  • 26. R, Hive and Impala on JupyterHub Edit HiveQL Kernel cd /usr/local/share/jupyter/kernels/hiveql nano kernel.json {"argv": ["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"], "display_name": "HiveQL", "language": "hiveql", "name": "hiveql"} Create and Edit HiveQL script touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh; nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh; #!/usr/bin/env bash # setup environment variables, etc. PROXY_USER="$(whoami)"
  • 27. R, Hive and Impala on JupyterHub Edit HiveQL script export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64 export SPARK_HOME=/opt/spark export HADOOP_HOME=/etc/hadoop/conf export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HIVE_HOME=/etc/hadoop/conf export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true" # Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM # run the HiveQL kernel exec /opt/anaconda3/bin/python -m hiveql $@ Note 1: change with your IP, directories and versions in the highlighted field. Note 2: add your user's keytab to a chosen directory so that it is possible to run with proxyUser.
  • 28. R, Hive and Impala on JupyterHub To interact with Impala metadata, my recommendation is Impyla, but there is a catch: because of a specific version of one lib (thrift_sasl), HiveQL will stop working, since hiveqlkernel 1.0.13 requires thrift-sasl==0.3.*. Install Developer Toolset Libs yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++ Install additional Libs for Impyla pip install thrift_sasl==0.2.1; pip install sasl; Install ipython-sql conda install -c conda-forge ipython-sql Install impyla pip install impyla==0.15a1 Note: an alpha version of impyla was installed due to an incompatibility with newer Python versions (3.7+).
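One way around the thrift_sasl conflict, hinted at later in the deck, is to keep Impyla in its own environment so the HiveQL kernel keeps its thrift-sasl==0.3.* requirement. A sketch with an assumed environment name:
conda create -n impyla-env python=3.7
source activate impyla-env
pip install thrift_sasl==0.2.1 sasl impyla==0.15a1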
  • 29. R, Hive and Impala on JupyterHub If you require access to Hive & Impala metadata, you can use Python + Hive with a kerberized custom kernel. Install Jaydebeapi package conda install -c conda-forge jaydebeapi Create Python Kerberized Kernel mkdir -p /usr/share/jupyter/kernels/pythonKerb cd /usr/share/jupyter/kernels/pythonKerb touch kernel.json touch pythonKerb.sh chmod a+x /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh Note: Change with your values in the highlighted field. Edit Kerberized Kernel nano /usr/share/jupyter/kernels/pythonKerb/kernel.json {"argv": ["/usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh", "-f", "{connection_file}"], "display_name": "PythonKerberized", "language": "python", "name": "pythonKerb"} Edit Kerberized Kernel script nano /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
  • 30. R, Hive and Impala on JupyterHub Edit Kerberized Kernel script PROXY_USER="$(whoami)" export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64 export SPARK_HOME=/opt/spark export HADOOP_HOME=/etc/hadoop/conf export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HIVE_HOME=/etc/hadoop/conf export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true" export CLASSPATH=$CLASSPATH:`hadoop classpath`:/etc/hadoop/*:/tmp/* export PYTHONPATH=$PYTHONPATH:/opt/anaconda3/lib/python3.7/site-packages/jaydebeapi # Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM # run the ipykernel exec /opt/anaconda3/bin/python -m ipykernel_launcher $@
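Since this kernel kinits /tmp/${PROXY_USER}.keytab, each user needs their own keytab on the Jupyter machine. From a PythonKerberized notebook, a minimal sketch to check that the per-user ticket was obtained (assumes klist is available on the Jupyter host):
import getpass
import subprocess

print(getpass.getuser())  # the user whose keytab the kernel script kinit'ed
print(subprocess.run(['klist'], capture_output=True, text=True).stdout)  # should show <user>@DOMAIN.COM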
  • 31. R, Hive and Impala on JupyterHub Assuming that you don't have Impyla installed, or, if you do, that you have created a separate environment for it: HiveQL is the best kernel for accessing Hive metadata, and it is actively supported. Install Developer Toolset Libs yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++ Install Hive interface & HiveQL Kernel pip install pyhive; pip install --upgrade hiveqlKernel; Jupyter Install Kernel jupyter hiveql install Check kernel installation jupyter kernelspec list
  • 32. R, Hive and Impala on JupyterHub To access a kerberized Cluster you need a Kerberos ticket in the cache, so the solution is the following: Edit Kerberized Kernel nano /usr/local/share/jupyter/kernels/hiveql/kernel.json {"argv": ["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"], "display_name": "HiveQL", "language": "hiveql", "name": "hiveql"} Edit Kerberized Kernel script touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh Note: Change with your values in the highlighted field.
  • 33. R, Hive and Impala on JupyterHub Edit Kerberized Kernel script PROXY_USER="$(whoami)" export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64 export SPARK_HOME=/opt/spark export HADOOP_HOME=/etc/hadoop/conf export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HIVE_HOME=/etc/hadoop/conf export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true" # Kinit User/Keytab defined for the ProxyUser on the Cluster/HDFS kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM # run the HiveQL kernel exec /opt/anaconda3/bin/python -m hiveql $@ Note: Change with your values in the highlighted field.
  • 34. Interact with JupyterHub Kernels The following information serves as a knowledge base on how to interact with the previously configured kernels against a kerberized Cluster. [HiveQL] Create Connection $$ url=hive://hive@cm1.localdomain:10000/ $$ connect_args={"auth": "KERBEROS","kerberos_service_name": "hive"} $$ pool_size=5 $$ max_overflow=10 [Impyla] Create Connection from impala.dbapi import connect conn = connect(host='cm1.localdomain', port=21050, kerberos_service_name='impala', auth_mechanism='GSSAPI') Note: Change with your values in the highlighted field.
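To go from the Impyla connection above to an actual result set, the standard DBAPI cursor can be used, optionally handing the rows to pandas with Impyla's helper. A minimal sketch; the database and table names are placeholders:
from impala.util import as_pandas

cursor = conn.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())

cursor.execute('SELECT * FROM db.some_table LIMIT 10')  # placeholder table
df = as_pandas(cursor)                                  # builds a pandas DataFrame from the cursor
print(df.head())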
  • 35. Interact with JupyterHub Kernels [Impyla] Create Connection via SQLMagic %load_ext sql %config SqlMagic.autocommit=False %sql impala://tpsimoes:welcome1@cm1.localdomain:21050/db?kerberos_service_name=impala&auth_mechanism=GSSAPI [Python] Create Connection import jaydebeapi import pandas as pd conn_hive = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver","jdbc:hive2://cm1.localdomain:10000/db;AuthMech=1;KrbRealm=DOMAIN.COM;KrbHostFQDN=cm1.localdomain;KrbServiceName=hive;KrbAuthType=2") [Python] Kinit Keytab import subprocess result = subprocess.run(['kinit', '-kt', '/tmp/tpsimoes.keytab', 'tpsimoes/cm1.localdomain@DOMAIN.COM'], stdout=subprocess.PIPE) result.stdout Note: Change with your values in the highlighted field.
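With the jaydebeapi connection in place, queries follow the usual DBAPI pattern and the rows can be turned into a pandas DataFrame. A minimal sketch; the table name is a placeholder and it assumes the Hive JDBC driver jars are on the CLASSPATH exported by the kernel script:
cursor = conn_hive.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())

cursor.execute('SELECT * FROM db.some_table LIMIT 10')  # placeholder table
df = pd.DataFrame(cursor.fetchall(), columns=[c[0] for c in cursor.description])
print(df.head())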