Which library should you choose for data-science? That's the question!

Anastasia Lieva
Data Scientist
@lievAnastazia

Agenda
1. What is Data-Science? How magic is it?
2. Python & Scala Data-Science ecosystem
3. Demonstration of some libraries on a real dataset
4. Your choice in the pocket?

Data-Science: the sexiest job of the 21st century

Data-Science: the most laborious job of the 21st century?

Frame the problem!
Time series analysis
Clustering
Classification
Regression
Descriptive statistics
...

Components that we need to solve the problem
Learning/optimization of algorithm
Mathematical analysis
Tuning/optimization of algorithm
Preprocessing
Evaluation
...
Visualisation

Preprocessing: feature engineering, feature selection, feature extraction
Machine Learning: algorithm, algorithm optimization, hyper-parameters tuning
Evaluation: evaluation strategies, evaluation metrics
Visualisation

Which aspects should we focus on?
Solution that works / Solution out of the box
Solution that is well documented
Solution that is easy & fast to test
Solution that is easy & fast to develop
Solution that is easy & fast to industrialize
Solution that is easy to maintain
Solution that is easy & fast to scale

R
Python
SQL
Scala
Scala.io 2016 Anastasia Lieva : “Big-Data-Science in Scala”

Frame your search: which language to pick?
Python
Scala

Frame your search: which library to pick?
Scala: Spark, Saddle, Smile, Breeze
Python: Spark, Statsmodels, Scikit-learn, Numpy, Pandas, SymPy, Matplotlib, Bokeh, Seaborn, Vispy, ggplot

Frame the problem!
Time series analysis
Clustering
Classification
Regression
Descriptive statistics
...
Python
Scala

Components that we need to solve the problem
Learning/optimisation of algorithm
Mathematical analysis
Tuning/optimisation of algorithm
Preprocessing
Evaluation
...
Visualisation

Frame your search: which library to pick? (Scala)
Libraries: Spark, SparkTS, Smile, Breeze, Saddle
Components: learning algorithms, mathematical analysis, algorithms tuning, preprocessing, evaluation, visualisation

Frame your search: which library to pick? (Python)
Libraries: PySpark, Scikit-learn, statsmodels, Scipy, SymPy, Pandas, Numpy
Components: learning algorithms, mathematical analysis, algorithms tuning/optimization, preprocessing, evaluation, visualisation

Frame your search: which library to pick? (Python, visualisation)
Pandas, matplotlib, seaborn, bokeh, vispy, ggplot

Frame your search (Python, Bayesian inference)
BayesPy, PyMC, libpgm, BNFinder, pebl

Frame your search (Python, deep learning)
TensorFlow, Keras, Theano, Caffe, Lasagne

Development environment matters
Python
Scala

Development environment matters: Python
AnacondaST3, a plugin for Sublime Text 3

Development environment matters
Scala

Problem:
Optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information

Decision Tree (figure): nodes os (Android, iOS), Category (Games, Music), City (Paris, Nantes); leaves Yes / No
{
"id":"951cb9f5-2bab-46ce-b759-8245cffxxxxx",
"time":"2016-06-09T0:25:28Z",
"bidfloor":2.88,
"appOrSite":"app",
"adType":"banner",
"categories":"games,news,football",
"publisherId":"11e281c1123139xxxxx",
"carrier":"208-10",
"os":"iOS",
"connectionType":3,
"coords":[48.929256439208984, 2.4255824089050293],
"adSize":[320, 50],
"exchange":"xxxxx",
[...],
"clicked":true
}
Raw data: 500 MB
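
Before any library comes into play, each raw record has to be flattened into simple columns. A minimal sketch with the standard library, assuming a shortened copy of the record above (elided fields are left out) and a hypothetical `flatten` helper:

```python
import json

# A shortened copy of the record above; elided fields are left out.
raw = '''{
  "time": "2016-06-09T0:25:28Z",
  "bidfloor": 2.88,
  "appOrSite": "app",
  "adType": "banner",
  "categories": "games,news,football",
  "os": "iOS",
  "connectionType": 3,
  "coords": [48.929256439208984, 2.4255824089050293],
  "adSize": [320, 50],
  "clicked": true
}'''


def flatten(record):
    """Turn one nested JSON record into a flat dict of simple columns."""
    width, height = record["adSize"]
    lat, lon = record["coords"]
    return {
        "os": record["os"],
        "bidfloor": record["bidfloor"],
        "adType": record["adType"],
        # The comma-separated string becomes a list of categories.
        "categories": record["categories"].split(","),
        "adWidth": width,
        "adHeight": height,
        "lat": lat,
        "lon": lon,
        # The label we want to predict.
        "clicked": record["clicked"],
    }


row = flatten(json.loads(raw))
print(row["os"], row["categories"], row["clicked"])
```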

Os BidPrice Time
Android 7.3 2016-06-09T0:25:28Z
iOS 4.55 2016-05-09T14:23:12Z
WindowsPhone 2.89 2016-06-09T11:35:11Z

Os            BidPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Os            BidPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Os   BidPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0
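
The second table replaces every raw value with a number. One possible way to get there, sketched under assumptions: strings are indexed in order of first appearance and numbers are put into fixed-width buckets. This is not the encoding used in the talk, so the resulting numbers differ from the table; `index_strings` and `bucketize` are hypothetical helpers.

```python
def index_strings(values):
    """Map each distinct string to a float index, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, float(len(mapping)))
    return [mapping[v] for v in values], mapping


def bucketize(values, width):
    """Replace each number by the index of its fixed-width bucket."""
    return [float(int(v // width)) for v in values]


os_column = ["Android", "iOS", "WindowsPhone"]
bid_prices = [7.3, 4.55, 2.89]

os_encoded, os_mapping = index_strings(os_column)
bid_encoded = bucketize(bid_prices, width=2.0)  # bucket width is an assumption

print(os_encoded)
print(bid_encoded)
```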

Spark (Scala)
Preprocessing: feature engineering, feature selection, feature extraction

Databricks Notebook
Display and download options

Spark (Scala)
1. Spark SQL optimized methods
2. MLlib out-of-the-box feature engineering / feature selection
3. Dataset performance & type safety
Execution time for preprocessing: 43 seconds

Saddle (Scala)
Preprocessing: feature engineering, feature selection, feature extraction

Saddle (Scala)
1. Out-of-the-box, easy-to-use structures: frame, matrix, series, vectors
2. Not under active development
Execution time for preprocessing: 3.5 minutes

Native Scala library (Scala)
1. Type-safe & very performant
2. You have to implement all preprocessing stages and methods yourself
Execution time for preprocessing: 3.1 seconds

Numpy, Pandas, Scikit-learn (Python)
Preprocessing: feature engineering, feature selection, feature extraction

Numpy, Pandas, Scikit-learn (Python)
1. Numpy arrays instead of Python lists for operations on sequences
2. Pandas DataFrame slicing methods to access values
3. Pandas DataFrame methods for data-structure transformations and access
4. Scikit-learn for feature engineering
Execution time for preprocessing: 20 minutes

Python
Numpy
- homogeneous multidimensional array with its indexing, slicing and reshaping tricks
- linear algebra
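
A short sketch of the NumPy features just listed (indexing, slicing, reshaping, linear algebra). The array contents are made up for illustration and are not taken from the ad dataset.

```python
import numpy as np

a = np.arange(12)            # homogeneous 1-D array: 0, 1, ..., 11
m = a.reshape(3, 4)          # same data viewed as 3 rows x 4 columns

first_column = m[:, 0]       # slicing: every row, column 0
evens = a[a % 2 == 0]        # boolean indexing: keep even values
doubled = m * 2              # element-wise arithmetic, no Python loop

# Linear algebra: matrix product of m (3x4) with its transpose (4x3)
gram = m @ m.T

print(first_column, evens, gram.shape)
```

Operating on whole arrays instead of looping over Python lists is what the first point of the preprocessing slide refers to.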

Python
Pandas
- DataFrame with its 425 methods: slicing, multi-indexing, merging, grouping, missing values imputation …
- Plotting
- Time Series analysis

Python
Scikit-learn
- Preprocessing (feature engineering, missing value imputation, feature selection)
- Decomposing signals into components (PCA, LDA, Factor analysis, matrix factorisation)

Compare execution time for preprocessing
on a laptop: Intel Core i5, 11 GB RAM, 4 cores
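
Execution times like the ones quoted on the previous slides can be collected with a small wall-clock timer. A minimal sketch, assuming a hypothetical `timed` helper and a stand-in `preprocess` function; it does not reproduce the benchmark from the talk.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def preprocess(rows):
    # Stand-in for a real preprocessing stage (hypothetical).
    return [r.strip().lower() for r in rows]


rows = ["  Android ", "iOS", " WindowsPhone"] * 1000
cleaned, seconds = timed(preprocess, rows)
print(f"preprocessing took {seconds:.4f} s for {len(cleaned)} rows")
```

For a fair comparison, every library should be timed on the same machine and the same input, and a single run is only a rough indication.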

Visualisation
Preprocessing: feature engineering, feature selection, feature extraction
Decision Tree (figure): nodes os (Android, iOS), Category (Games, Music), City (Paris, Nantes); leaves Yes / No

Scikit-learn (Python)
Machine Learning: algorithm, algorithm optimization, hyper-parameters tuning

Scikit-learn
String Indexer, Tokenizer, Bucketizer, PCA, Assembler

Smile (Scala)
Machine Learning: algorithm, algorithm optimization, hyper-parameters tuning

Model importance (Smile)
0.17041644829479835, 0.0, 0.24611540915530505, 1.1389295846602683,
0.07655364222388063, 0.0, 0.0, 0.009896625232551026, 4.57453119760533,
0.36047880690737855, 1.2020833333333334, 0.007662298205433167,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

Spark (Scala)
Machine Learning: algorithm, algorithm optimization, hyper-parameters tuning

Pipeline interface
String Indexer, Tokenizer, Bucketizer, PCA, Assembler
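
A pipeline interface chains stages so that the output of one stage becomes the input of the next. The plain-Python sketch below only illustrates the idea; it is not Spark's API, and `Pipeline`, `tokenize_categories`, `bucketize_bidfloor`, `assemble` and the bucket boundaries are all hypothetical.

```python
class Pipeline:
    """Apply a list of stages in order; each stage maps a row dict to a row dict."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, row):
        for stage in self.stages:
            row = stage(row)
        return row


def tokenize_categories(row):
    # "games,news" becomes ["games", "news"]
    return {**row, "tokens": row["categories"].split(",")}


def bucketize_bidfloor(row):
    # Assumed bucket boundaries: [0, 1), [1, 3), [3, inf)
    bucket = 0.0 if row["bidfloor"] < 1 else 1.0 if row["bidfloor"] < 3 else 2.0
    return {**row, "bidBucket": bucket}


def assemble(row):
    # Gather the numeric columns into one feature vector.
    return {**row, "features": [row["bidBucket"], float(len(row["tokens"]))]}


pipeline = Pipeline([tokenize_categories, bucketize_bidfloor, assemble])
result = pipeline.transform({"categories": "games,news,football", "bidfloor": 2.88})
print(result["features"])
```

Keeping every stage behind the same interface means stages can be reordered, swapped, or reused on new data without touching the others.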

Preprocessing: feature engineering, feature selection, feature extraction
Machine Learning: algorithm, algorithm optimization, hyper-parameters tuning
Evaluation: evaluation strategies, evaluation metrics
Visualisation

Spark Smile Scikit-learn
Regression
Binary Classification
Multiclass Classification
Regression
Classification
Clustering
Regression
Classification
evaluators
built-in methods for generating a classification report & confusion matrix
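
To make the evaluation step concrete, here is what a confusion matrix and the precision/recall figures of a classification report compute for a binary click model. This is a plain-Python sketch with made-up labels; the libraries above provide their own built-in versions, and the helper names here are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels given as booleans."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and pred:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: did the user click, and what did the model predict?
clicked = [False, True, False, True, True, False]
predicted = [False, True, True, False, True, False]

print(confusion_matrix(clicked, predicted))
print(precision_recall(clicked, predicted))
```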

Compare execution time for learning
on a laptop: Intel Core i5, 11 GB RAM, 4 cores

Which aspects should we focus on?
Scala Python
Solution that works / Solution out of the box
Solution well explained/supported
Solution easy & fast to test
Solution easy & fast to develop
Solution easy & fast to industrialize
Solution easy to maintain
Solution easy & fast to scale

Thank you for your attention!
And go do data-science to save the world
@lievAnastazia
