Scikit-learn The state of the union
Gaël Varoquaux
Open Source Innovation Spring 2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris
1 Some history
Scikit-learn canal historique
G Varoquaux 2
1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends
1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn
1 scikit-learn growth: contributors
Contributors:
759 contributors
https://www.openhub.net/p/scikit-learn
1 Started as David Cournapeau’s failed PhD project
David then preferred
improving numpy/scipy
That’s David sprinting in 2011
1 2009: We (Inria Parietal) need machine learning
My team takes over the development
Hire a young guy (Fabian Pedregosa)
Put post-docs and PhD students on it (Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented
1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”
1 2011: NIPS sprint
People that I didn’t know
were solving my problems
The project took off because of the community...
2 Upcoming cool stuff
Upcoming 0.18 release
2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ multiple dtype support in algorithms
= memory saver
Arthur Mensch
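To illustrate the memory argument, here is a minimal NumPy sketch. It is not scikit-learn's Cython code: `column_means` is a hypothetical stand-in for an algorithm that keeps the input dtype instead of upcasting everything to float64.

```python
import numpy as np

rng = np.random.RandomState(0)
X64 = rng.rand(1000, 50)          # NumPy defaults to float64
X32 = X64.astype(np.float32)      # same values, half the memory


def column_means(X):
    """Hypothetical dtype-preserving routine: the output dtype follows the input."""
    n_samples = X.dtype.type(X.shape[0])
    return X.sum(axis=0, dtype=X.dtype) / n_samples


means32 = column_means(X32)
means64 = column_means(X64)
print(X32.nbytes, X64.nbytes, means32.dtype, means64.dtype)
```

An algorithm that only accepts float64 would have to copy `X32` into a float64 array first, doubling the memory footprint of the data.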
2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice: randomized linear algebra, power iteration (arpack), or full (lapack)
For large data: up to 20× speed up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
Elkan’s K-means
For large data: ∼ 2× speed up.
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Müller
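A short sketch of what the merged PCA looks like from the user side, assuming the `svd_solver` parameter as it shipped in scikit-learn 0.18. The synthetic low-rank data is made up for illustration, and no timing is claimed here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Illustrative data: rank-5 signal plus a little noise
X = rng.randn(500, 5) @ rng.randn(5, 200) + 0.01 * rng.randn(500, 200)

# Exact decomposition through LAPACK
full = PCA(n_components=5, svd_solver="full").fit(X)
# Randomized linear algebra, the approach formerly exposed as RandomizedPCA
rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print(full.explained_variance_ratio_)
print(rand.explained_variance_ratio_)
```

Leaving `svd_solver` at its default (`"auto"`) lets the estimator pick a solver based on the data shape and the number of components.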
2 New cross-validation objects
Before (sklearn.cross_validation): the labels are passed to the constructor
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
After (sklearn.model_selection): the data is passed to split()
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)  # n_folds on the slide; n_splits in the released 0.18 API
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
Data-independent ⇒ nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
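Why data-independent splitters make nested CV possible, as a minimal sketch: the same kind of object can be handed to an inner grid search and to an outer evaluation loop, because neither is tied to a particular `y`. The dataset, estimator and parameter grid below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitters are built without data; they only see X and y inside split()
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyper-parameter selection on each outer training fold
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: unbiased estimate of the whole "search + fit" procedure
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```

With the old `sklearn.cross_validation` objects, `inner_cv` would have needed the labels of each outer training fold at construction time, which is what made this pattern awkward.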
2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian
optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
https://github.com/scikit-learn/scikit-learn/pull/5491
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
3 Vision(s): the future
Mission statement
Enable progress via data science
Lower the costs,
less technicalities
Machine learning
for everybody and
for everything
Small hardware,
medium data
3 Deep learning
sklearn.neural_network.MLPClassifier
architecture-specification language
GPUs unbound technicality
keras, caffe...
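For scale, a small sketch of what `MLPClassifier` offers: the whole "architecture" is a tuple of hidden-layer sizes and training runs on CPU. The dataset, layer size and iteration budget are illustrative assumptions, not recommendations from the talk.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16; rescale to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 units: no architecture language, no GPU
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
accuracy = mlp.score(X_test, y_test)
print(accuracy)
```

Anything beyond this (custom layers, GPU training) is where libraries such as keras or caffe take over.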
3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn
3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost: the paper is out!
Larger data
More partial_fit; online forests?
Fewer copies
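A minimal sketch of the `partial_fit` pattern behind "larger data": the model is updated one mini-batch at a time, so the full dataset never has to sit in memory at once. The synthetic data and batch size are assumptions for illustration; in practice the batches would be read from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(2000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # all classes must be declared up front

# Feed the first 1500 samples in batches of 100
for start in range(0, 1500, 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

# Evaluate on the held-out tail
accuracy = clf.score(X[1500:], y[1500:])
print(accuracy)
```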
3 Scaling up (out?)
I don’t want java/scala
Less fluid prototyping
Cross-VM debugging hard
Numerics in Java slower than LAPACK
Need C somewhere
3 Scaling up (out?)
I don’t want java/scala
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib parallel and storage
distributed, ssh
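For context on the joblib point, a minimal sketch of the `Parallel` API: the `backend` argument is the place where an execution backend is selected. Only the built-in `"threading"` backend is used here; the distributed and ssh backends mentioned on the slide are not shown.

```python
from math import sqrt

from joblib import Parallel, delayed

# The calling code stays the same whatever backend runs the tasks
results = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i ** 2) for i in range(10)
)
print(results)
```

Because algorithms only see `Parallel` and `delayed`, swapping the backend does not require rewriting them, unlike the Blaze, Ibis or dask route.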
Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing
sprint, GSOC management, tutorials...
Structure & stability
How to organize funding and governance?
process/meetings/reports/funding proposal...
= work on project
Passionate coders get a lot done
unless they get drowned by meetings
@GaelVaroquaux
Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC
