This talk presents the data-science ecosystem in two languages: Python and Scala. It demonstrates the use of their libraries on a real dataset to solve a binary classification problem with a decision tree algorithm.
Which library should you choose for data-science? That's the question!
1. Which library should you choose for
data-science?
That’s the question! (?)
Anastasia Lieva
Data Scientist
@lievAnastazia
2. Agenda
1. What is Data-Science? How magic is it?
2. Python & Scala Data-Science ecosystem
3. Demonstration of some libraries on a real dataset
4. Your choice in the pocket?
9. Components that we need to solve the problem
Learning/optimization of algorithm
Mathematical analysis
Tuning/optimization of algorithm
Preprocessing
Evaluation
...
Visualisation
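As a rough illustration of how these components fit together, here is a minimal Python sketch using scikit-learn. The dataset is synthetic (generated with `make_classification`), and the pipeline step names, the parameter grid and the choice of ROC AUC as the metric are assumptions made for the example, not details from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing + learning algorithm chained in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Tuning: cross-validated search over the tree depth (illustrative grid).
search = GridSearchCV(
    pipeline, {"tree__max_depth": [2, 4, 8]}, cv=3, scoring="roc_auc"
)
search.fit(X_train, y_train)

# Evaluation on held-out data.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(auc, 3))
```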
11. Which aspects should we focus on?
Solution that works / Solution out of box
Solution that is well documented
Solution that is easy & fast to test
Solution that is easy & fast to develop
Solution that is easy & fast to industrialize
Solution that is easy to maintain
Solution that is easy & fast to scale
18. Components that we need to solve the problem
Learning/optimisation of algorithm
Mathematical analysis
Tuning/optimisation of algorithm
Preprocessing
Evaluation
...
Visualisation
19. Frame your search
Which library to pick?
Scala
Spark SparkTS Smile Breeze Saddle
learning algorithms
mathematical analysis
algorithms tuning
preprocessing
evaluation
visualisation
20. Frame your search
Which library to pick?
PySpark Scikit-learn statsmodels SciPy SymPy Pandas NumPy
learning algorithms
mathematical analysis
algorithms tuning/optimization
preprocessing
evaluation
visualisation
Python
21. Frame your search
Which library to pick?
Pandas matplotlib seaborn bokeh vispy ggplot
Visualisation
Python
22. Frame your search
Which library to pick?
BayesPy PyMC libpgm BNFinder pebl
Bayesian inference
Python
23. Frame your search
Which library to pick?
TensorFlow Keras Theano Caffe Lasagne
deep learning
Python
27. Problem:
Optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
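The click-probability problem on this slide can be sketched in Python with a scikit-learn decision tree. Everything about the data here is an assumption: the four columns are hypothetical numeric stand-ins for the four input groups, and the labels are generated synthetically rather than taken from the talk's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical numeric encodings of the four input groups.
X = np.column_stack([
    rng.integers(0, 5, n),   # request configuration (e.g. an ad slot id)
    rng.integers(0, 10, n),  # proposed creative (a creative id)
    rng.random(n),           # user history (e.g. past click rate)
    rng.random(n),           # third-party information (e.g. a score)
])

# Synthetic labels: a click is more likely when the past click rate is high.
y = (rng.random(n) < 0.1 + 0.6 * X[:, 2]).astype(int)

# A shallow tree keeps the leaf probabilities from being all 0 or 1.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X, y)

# Column 1 of predict_proba is the estimated probability of a click.
click_proba = model.predict_proba(X[:5])[:, 1]
print(click_proba)
```

Limiting `max_depth` is a design choice for the sketch: a fully grown tree tends to produce pure leaves, which makes the estimated probabilities degenerate.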
52. 1. Out-of-the-box, easy-to-use structures:
frame, matrix, series, vectors
2. Not actively developed
Execution time for preprocessing: 3.5 minutes
Saddle
Scala
53. 1. Type-safe & very performant
2. You have to implement all preprocessing stages and methods yourself
Execution time for preprocessing: 3.1 seconds
Native Scala library
Scala
57. Python
1. NumPy arrays instead of Python lists for operations on sequences
2. Pandas DataFrame slicing methods to access values
3. Pandas DataFrame methods for data-structure transformations and access
4. Scikit-learn for feature engineering
Execution time for preprocessing: 20 minutes
NumPy, Pandas, Scikit-learn
58. Python
NumPy
- homogeneous multidimensional array with its indexing, slicing and reshaping tricks
- linear algebra
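A small Python sketch of the NumPy features listed above: vectorised operations on an array instead of a list, reshaping, slicing, boolean indexing and a linear algebra product. The values are arbitrary illustration data.

```python
import numpy as np

# A Python list converted to a homogeneous float array.
values = list(range(12))
arr = np.array(values, dtype=np.float64)

# Element-wise arithmetic without an explicit Python loop.
doubled = arr * 2

# Reshape the flat array into a 3x4 matrix.
grid = arr.reshape(3, 4)

# Slicing: the first column of every row.
first_col = grid[:, 0]

# Boolean indexing: all entries greater than 6.
large = grid[grid > 6]

# Linear algebra: 3x3 Gram matrix of the rows.
gram = grid @ grid.T
```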
59. Python
Pandas
- DataFrame with its 425 methods: slicing, multi-indexing, merging, grouping, missing-value imputation …
- Plotting
- Time Series analysis
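A few of the DataFrame operations named above, sketched in Python on a tiny made-up table. The column names (`creative_id`, `bid`, `clicked`, `format`) and all values are hypothetical and chosen only to echo the ad-click setting.

```python
import numpy as np
import pandas as pd

ads = pd.DataFrame({
    "creative_id": [1, 1, 2, 2, 3],
    "user_id": [10, 11, 10, 12, 11],
    "bid": [0.5, np.nan, 0.7, 0.2, np.nan],
    "clicked": [1, 0, 0, 1, 0],
})
creatives = pd.DataFrame({
    "creative_id": [1, 2, 3],
    "format": ["banner", "video", "banner"],
})

# Slicing: rows where the ad was clicked, restricted to two columns.
clicked_rows = ads.loc[ads["clicked"] == 1, ["creative_id", "bid"]]

# Missing-value imputation: replace NaN bids with the column mean.
ads["bid"] = ads["bid"].fillna(ads["bid"].mean())

# Merging: attach the creative format to every impression.
merged = ads.merge(creatives, on="creative_id", how="left")

# Grouping: click rate per format.
click_rate = merged.groupby("format")["clicked"].mean()

# Grouping on two keys yields a MultiIndex-ed Series.
clicks_by_key = merged.groupby(["format", "creative_id"])["clicked"].sum()
```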
60. Python
Scikit-learn
- Preprocessing (feature engineering, missing-value imputation, feature selection)
- Decomposing signals into components (PCA, LDA, factor analysis, matrix factorisation)
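As one possible illustration of these two bullet points, the Python sketch below chains missing-value imputation, scaling and PCA in a scikit-learn pipeline. The small input matrix is invented for the example, and the choice of mean imputation and two components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two missing entries.
X = np.array([
    [1.0, 200.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 180.0, np.nan],
    [4.0, 220.0, 2.0],
])

# Impute missing values, standardise, then project onto 2 components.
prep = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    PCA(n_components=2),
)
X_reduced = prep.fit_transform(X)
print(X_reduced.shape)
```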
82. Which aspects should we focus on?
Scala Python
Solution that works / Solution out of box
Solution well explained/supported
Solution easy & fast to test
Solution easy & fast to develop
Solution easy & fast to industrialize
Solution easy to maintain
Solution easy & fast to scale
83. Thank you for your attention!
And go do data science to save the world
@lievAnastazia