SlideShare a Scribd company logo
1 of 27
Download to read offline
CONNECTING PYTHON TO THE
SPARK ECOSYSTEM
Daniel Rodriguez
Software developer/data scientist
Continuum Analytics
Twitter: @danielfrg
Github: github.com/danielfrg
Spark Summit 2016:
Content
• PyData and Spark
• Python (PyData libs) package management
• Python package management in a cluster
• Multiple options
• Some usage of python (single node) libraries in a
cluster
• Future
2
PyData ecosystem
3
Image by Jake VanderPlas
PyData ecosystem: Numpy
4
• Numpy as the core for the ecosystem:
• Powerful N-dimensional array object
• Broadcasting functions
• Tools for integrating C/C++ and Fortran code
• Linear algebra, Fourier transform, and random
number capabilities
• Single node
• Python only
Spark Ecosystem
5
Image (c) Cloudera, Inc
Third party: spark-packages.org
Spark ecosystem
6
• Fast and general engine for large-scale data
processing
• High level libraries:
• Dataframes, MLlib, SQL, Graphs
• Similar as PyData
• Distributed by design: RDD as core
• JVM with multi language interfaces: Scala,
Java, Python, R
PyData vs Spark
7
• PyData for scientists - maybe they also use MPI
• Spark for a Big Data (after MapReduce)
• Both relatively fine and happy
• Scientist have to scale
• Big data people have to do data science, statistics,
ML
• Data scientists now do both
PySpark issues
8
• PySpark source is not very pythonic (imo)
• Packages
• Performance:
• Serialization and pipes
• RDD of pickle objects
• Py4J to communicate
• Driver and each worker: java <-> python
PyData package management
9
• Numpy and other have C/C++ and Fortran code
• Compiling. Windows
github.com/conda/conda +Conda
Conda-forge: Community powered packages
Package management in a cluster
10
• Lot (100s to 1000s) nodes
• Not really a new problem in DevOps or CM
Search on github for available conda modules
Cloudera Manager and Anaconda Parcel
11
• Most popular Hadoop distribution
• Anaconda parcel for CDH *:
• Static python distribution
• Super easy installation if you have CDH
*: Joint work between Cloudera and Continuum
More info:
- https://docs.continuum.io/anaconda/cloudera
- http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh
I don't use Cloudera
12
You manage your own Hadoop cluster?
For real?
Want to make this happen?
Call me maybe?
Anaconda for cluster management
13
• Agnostic: Cloud, on premise, Air-gap
• Dynamic package and environment management
• Based on salt
• Extra plugins: Jupyter Notebook and more
More info: https://docs.continuum.io/anaconda-cluster/index
Leverage Spark and YARN
14
• New ways to deploy environments:
• Sparkonda: https://github.com/moutai/sparkonda
Leverage Spark: Sparkonda
15
conda create -n sparkonda-test-env python=2.7 pip pandas scikit-learn numpy numba
source activate sparkonda-test-env
pip install sparkonda
sc.addPyFile('path/to/sparkonda_utils.py')
skon.CONDA_ENV_NAME = 'sparkonda-test-env'
skon.CONDA_ENV_LOCATION = ''.join([home_dir,'/miniconda/envs/',skon.CONDA_ENV_NAME])
skon.SC_NUM_EXECUTORS = 2
skon.SC_NUM_CORE_PER_EXECUTOR = 2
skon.pack_conda_env()
skon.distribute_conda_env(sc)
skon.list_cwd_files(sc)
skon.install_conda_env(sc)
skon.set_workers_python_interpreter(sc)
def check_pandas(x): import pandas as pd; return [pd.__version__]
skon.prun(sc, check_pandas, include_broadcast_vars=False)
More info: https://github.com/moutai/sparkonda
Leverage Spark and YARN
16
• Beta
• Conda + Spark: http://quasiben.github.io/blog/
2016/4/15/conda-spark/
• Security: Kerberos
PySpark
17
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
PySpark: NLTK
18
def word_tokenize(x):
import nltk
return nltk.word_tokenize(x)
def pos_tag(x):
import nltk
return nltk.pos_tag([x])
words = data.flatMap(word_tokenize)
pos_word = words.map(pos_tag)
pos_word.take(5)
[[('Address', 'NN')],
[('on', 'IN')],
[('the', 'DT')],
[('State', 'NNP')],
[('of', 'IN')]]
PySpark: Image processing GPUs
19
Jupyter Notebook:
https://gist.github.com/danielfrg/0afee63072793e9e9d6ebaee27865a4a
PySpark: Scikit-learn: spark-sklearn
20
from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"n_estimators": [10, 20, 40, 80]}
gs = grid_search.GridSearchCV(RandomForestClassifier(),
param_grid=param_grid)
gs.fit(X, y)
More info: https://github.com/databricks/spark-sklearn
PySpark: Scikit-learn: spark-sklearn
21
More info: https://github.com/databricks/spark-sklearn
from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"n_estimators": [10, 20, 40, 80]}
gs = grid_search.GridSearchCV(RandomForestClassifier(),
param_grid=param_grid)
gs.fit(X, y)
PySpark: Dataframes
22
Spark Dataframes
Performance happy
Future / Alternatives
23
• Apache Arrow
• Tensorflow
• Dask
Apache Arrow
24
More info:
- https://github.com/databricks/spark-sklearn
- https://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
TensorFlow
25
Dask
26
import pandas as pd import dask.dataframe as dd
df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()
Python parallel computing library for analytics
pip install dec2 OR conda install -c conda-forge dec2
dec2 up --keyname YOUR-AWS-KEY-NAME
--keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem
--count 9 # Provision nine nodes
--nprocs 8 # Use eight separate worker processes per node
More info: http://dask.pydata.org/en/latest/
CONNECTING PYTHON TO THE
SPARK ECOSYSTEM
Questions?
Spark Summit 2016:
Daniel Rodriguez
Software developer/data scientist
Continuum Analytics
Twitter: @danielfrg
Github: github.com/danielfrg

More Related Content

What's hot

Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Spark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaDatabricks
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationJen Aman
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
 
Deep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAIDeep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAIDatabricks
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentDatabricks
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
 

What's hot (20)

Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon Innovation
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
Deep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAIDeep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAI
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 

Viewers also liked

High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Business Rule Learning with Interactive Selection of Association Rules - Rule...
Business Rule Learning with Interactive Selection of Association Rules - Rule...Business Rule Learning with Interactive Selection of Association Rules - Rule...
Business Rule Learning with Interactive Selection of Association Rules - Rule...Stanislav Vojíř
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Quels facteurs de pertinence pour la recherche de produits e-commerce ?
Quels facteurs de pertinence pour la recherche de produits e-commerce ?Quels facteurs de pertinence pour la recherche de produits e-commerce ?
Quels facteurs de pertinence pour la recherche de produits e-commerce ?Lamjed Ben Jabeur
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Holden Karau
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaSpark Summit
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataTravis Oliphant
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudJen Aman
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 

Viewers also liked (20)

High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Business Rule Learning with Interactive Selection of Association Rules - Rule...
Business Rule Learning with Interactive Selection of Association Rules - Rule...Business Rule Learning with Interactive Selection of Association Rules - Rule...
Business Rule Learning with Interactive Selection of Association Rules - Rule...
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Quels facteurs de pertinence pour la recherche de produits e-commerce ?
Quels facteurs de pertinence pour la recherche de produits e-commerce ?Quels facteurs de pertinence pour la recherche de produits e-commerce ?
Quels facteurs de pertinence pour la recherche de produits e-commerce ?
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida Ha
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 

Similar to Connecting Python To The Spark Ecosystem

Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsTravis Oliphant
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonDaniel Rodriguez
 
data science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & Jupyterdata science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & JupyterRaj Singh
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemTuri, Inc.
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 
Hands on with Apache Spark
Hands on with Apache SparkHands on with Apache Spark
Hands on with Apache SparkDan Lynn
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
GPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierGPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierNVIDIA
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Similar to Connecting Python To The Spark Ecosystem (20)

PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
data science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & Jupyterdata science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & Jupyter
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData Ecosystem
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Data Science
Data ScienceData Science
Data Science
 
Hands on with Apache Spark
Hands on with Apache SparkHands on with Apache Spark
Hands on with Apache Spark
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
GPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierGPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next Frontier
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Recently uploaded

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...elinavihriala
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like BitcoinDOT TECH
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxNidaFaviankaNawawi
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 

Recently uploaded (20)

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoin
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 

Connecting Python To The Spark Ecosystem

  • 1. CONNECTING PYTHON TO THE SPARK ECOSYSTEM Daniel Rodriguez Software developer/data scientist Continuum Analytics Twitter: @danielfrg Github: github.com/danielfrg Spark Summit 2016:
  • 2. Content • PyData and Spark • Python (PyData libs) package management • Python package management in a cluster • Multiple options • Some usage of python (single node) libraries in a cluster • Future 2
  • 3. PyData ecosystem 3 Image by Jake VanderPlas
  • 4. PyData ecosystem: Numpy 4 • Numpy as the core for the ecosystem: • Powerful N-dimensional array object • Broadcasting functions • Tools for integrating C/C++ and Fortran code • Linear algebra, Fourier transform, and random number capabilities • Single node • Python only
  • 5. Spark Ecosystem 5 Image (c) Cloudera, Inc Third party: spark-packages.org
  • 6. Spark ecosystem 6 • Fast and general engine for large-scale data processing • High level libraries: • Dataframes, MLlib, SQL, Graphs • Similar as PyData • Distributed by design: RDD as core • JVM with multi language interfaces: Scala, Java, Python, R
  • 7. PyData vs Spark 7 • PyData for scientists - maybe they also use MPI • Spark for a Big Data (after MapReduce) • Both relatively fine and happy • Scientist have to scale • Big data people have to do data science, statistics, ML • Data scientists now do both
  • 8. PySpark issues 8 • PySpark source is not very pythonic (imo) • Packages • Performance: • Serialization and pipes • RDD of pickle objects • Py4J to communicate • Driver and each worker: java <-> python
  • 9. PyData package management 9 • Numpy and other have C/C++ and Fortran code • Compiling. Windows github.com/conda/conda +Conda Conda-forge: Community powered packages
  • 10. Package management in a cluster 10 • Lot (100s to 1000s) nodes • Not really a new problem in DevOps or CM Search on github for available conda modules
  • 11. Cloudera Manager and Anaconda Parcel 11 • Most popular Hadoop distribution • Anaconda parcel for CDH *: • Static python distribution • Super easy installation if you have CDH *: Joint work between Cloudera and Continuum More info: - https://docs.continuum.io/anaconda/cloudera - http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh
  • 12. I don't use Cloudera 12 You manage your own Hadoop cluster? For real? Want to make this happen? Call me maybe?
  • 13. Anaconda for cluster management 13 • Agnostic: Cloud, on premise, Air-gap • Dynamic package and environment management • Based on salt • Extra plugins: Jupyter Notebook and more More info: https://docs.continuum.io/anaconda-cluster/index
  • 14. Leverage Spark and YARN 14 • New ways to deploy environments: • Sparkonda: https://github.com/moutai/sparkonda
  • 15. Leverage Spark: Sparkonda 15 conda create -n sparkonda-test-env python=2.7 pip pandas scikit-learn numpy numba source activate sparkonda-test-env pip install sparkonda sc.addPyFile('path/to/sparkonda_utils.py') skon.CONDA_ENV_NAME = 'sparkonda-test-env' skon.CONDA_ENV_LOCATION = ''.join([home_dir,'/miniconda/envs/',skon.CONDA_ENV_NAME]) skon.SC_NUM_EXECUTORS = 2 skon.SC_NUM_CORE_PER_EXECUTOR = 2 skon.pack_conda_env() skon.distribute_conda_env(sc) skon.list_cwd_files(sc) skon.install_conda_env(sc) skon.set_workers_python_interpreter(sc) def check_pandas(x): import pandas as pd; return [pd.__version__] skon.prun(sc, check_pandas, include_broadcast_vars=False) More info: https://github.com/moutai/sparkonda
  • 16. Leverage Spark and YARN 16 • Beta • Conda + Spark: http://quasiben.github.io/blog/ 2016/4/15/conda-spark/ • Security: Kerberos
  • 17. PySpark 17 lines = sc.textFile("data.txt") lineLengths = lines.map(lambda s: len(s)) totalLength = lineLengths.reduce(lambda a, b: a + b)
  • 18. PySpark: NLTK 18 def word_tokenize(x): import nltk return nltk.word_tokenize(x) def pos_tag(x): import nltk return nltk.pos_tag([x]) words = data.flatMap(word_tokenize) pos_word = words.map(pos_tag) pos_word.take(5) [[('Address', 'NN')], [('on', 'IN')], [('the', 'DT')], [('State', 'NNP')], [('of', 'IN')]]
  • 19. PySpark: Image processing GPUs 19 Jupyter Notebook: https://gist.github.com/danielfrg/0afee63072793e9e9d6ebaee27865a4a
  • 20. PySpark: Scikit-learn: spark-sklearn 20 from sklearn import grid_search, datasets from sklearn.ensemble import RandomForestClassifier from sklearn.grid_search import GridSearchCV digits = datasets.load_digits() X, y = digits.data, digits.target param_grid = {"max_depth": [3, None], "max_features": [1, 3, 10], "min_samples_split": [1, 3, 10], "min_samples_leaf": [1, 3, 10], "bootstrap": [True, False], "criterion": ["gini", "entropy"], "n_estimators": [10, 20, 40, 80]} gs = grid_search.GridSearchCV(RandomForestClassifier(), param_grid=param_grid) gs.fit(X, y) More info: https://github.com/databricks/spark-sklearn
  • 21. PySpark: Scikit-learn: spark-sklearn 21 More info: https://github.com/databricks/spark-sklearn from sklearn import grid_search, datasets from sklearn.ensemble import RandomForestClassifier from spark_sklearn import GridSearchCV digits = datasets.load_digits() X, y = digits.data, digits.target param_grid = {"max_depth": [3, None], "max_features": [1, 3, 10], "min_samples_split": [1, 3, 10], "min_samples_leaf": [1, 3, 10], "bootstrap": [True, False], "criterion": ["gini", "entropy"], "n_estimators": [10, 20, 40, 80]} gs = grid_search.GridSearchCV(RandomForestClassifier(), param_grid=param_grid) gs.fit(X, y)
  • 23. Future / Alternatives 23 • Apache Arrow • Tensorflow • Dask
  • 24. Apache Arrow 24 More info: - https://github.com/databricks/spark-sklearn - https://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
  • 26. Dask 26 import pandas as pd import dask.dataframe as dd df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv') df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute() Python parallel computing library for analytics pip install dec2 OR conda install -c conda-forge dec2 dec2 up --keyname YOUR-AWS-KEY-NAME --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem --count 9 # Provision nine nodes --nprocs 8 # Use eight separate worker processes per node More info: http://dask.pydata.org/en/latest/
  • 27. CONNECTING PYTHON TO THE SPARK ECOSYSTEM Questions? Spark Summit 2016: Daniel Rodriguez Software developer/data scientist Continuum Analytics Twitter: @danielfrg Github: github.com/danielfrg