SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
Beware… For It’S THE...
Vowpal platypus
Peter HurforD
(With a little help from some friends)
WE OFTEN WANT TO PREDICT STUFF...
WE OFTEN WANT TO PREDICT STUFF…
...BUT WE RUN INTO LIMITATIONS.
WE OFTEN WANT TO PREDICT STUFF…
...BUT WE RUN INTO LIMITATIONS.
× ...Data set is too large, it doesn’t fit in RAM.
WE OFTEN WANT TO PREDICT STUFF…
...BUT WE RUN INTO LIMITATIONS.
× ...Data set is too large, it doesn’t fit in RAM.
× ...Data set is so large, it doesn’t fit on disk!
WE OFTEN WANT TO PREDICT STUFF…
...BUT WE RUN INTO LIMITATIONS.
× ...Data set is too large, it doesn’t fit in RAM.
× ...Data set is so large, it doesn’t fit on disk!
× ...Model train time is so slow, you can’t iterate
and try things.
“I want to use parallel
learning algorithms to
create fantastic learning
machines!”
- John Langford, 1997
YOU FOOL! THE ONLY
THING PARALLEL
MACHINES ARE USEFUL
FOR ARE COMPUTATIONAL
WINDTUNNELS!
TEN YEARS LATER...
VOWPAL
...Fast Online Learning
TEN YEARS LATER...
...WHAT’s WITH THE NAME?
...WHAT’s WITH THE NAME?
...WHAT’s WITH THE NAME?
+
...WHAT’s WITH THE NAME?
+
Traditional Approach
1. Load all training data
into RAM at once.
2. Fit model to training
dataset.
3. Load all predicting data
into RAM at once.
4. Use trained model to
make predictions.
WHAT DOES IT DO?
VW “Online” Approach
1. Train model on single
datapoints, one at a
time.
2. Do it again multiple
times.
3. Use trained model to
predict on new
datapoints, one at a
time.
Traditional Approach
1. Load all training data
into RAM at once.
2. Fit model to training
dataset.
3. Load all predicting data
into RAM at once.
4. Use trained model to
make predictions.
WHAT DOES IT DO?
× Online approach
eventually converges to
the same results as a
traditional (batch)
approach over enough
iterations.
WHAT DOES IT DO?
WHAT DOES IT DO?
× Online approach
eventually converges to
the same results as a
traditional (batch)
approach over enough
iterations.
× But you’re no longer
dependent on RAM!
Kaggle: World Data Science Competitions
× 3rd, 14th, and 29th / 718 on $16K Criteo ad click challenge
× 3rd / 472 on $2K KDD Cup Challenge
× 8th / 128 on $25K Avito.ru illicit content filtering challenge
IS IT ANY GOOD?
× szilard/benchm-ml: widely cited (1127 star) independent ML
speed benchmarks.
× Logistic Regression on 10M datapoints on a c3.8xlarge instance
(32 cores, 60GB RAM).
DID I MENTION IT’S FAST?
Engine Speed
Python Sklearn Crashed
R 90sec
Vowpal Wabbit 15sec
Spark 35sec
× szilard/benchm-ml: widely cited (1127 star) independent ML
speed benchmarks.
× Logistic Regression on 10M datapoints on a c3.8xlarge instance
(32 cores, 60GB RAM).
DID I MENTION IT’S FAST?
Engine Speed
Python Sklearn Crashed
R 90sec
Vowpal Wabbit 15sec
Spark 35sec
Yes, this was Spark 2.0, but it
was using MLLib. ML
performance is under testing
now.
× szilard/benchm-ml: widely cited (1127 star) independent ML
speed benchmarks.
× Logistic Regression on 10M datapoints on a c3.8xlarge instance
(32 cores, 60GB RAM).
DID I MENTION IT’S FAST?
Engine Speed
Python Sklearn Crashed
R 90sec
Vowpal Wabbit 15sec
Spark 35sec
But this benchmark was
only single core!
× szilard/benchm-ml: widely cited (1127 star) independent ML
speed benchmarks.
× Logistic Regression on 10M datapoints on a c3.8xlarge instance
(32 cores, 60GB RAM).
DID I MENTION IT’S FAST?
Engine Speed
Python Sklearn Crashed
R 90sec
Vowpal Wabbit 15sec
Spark 35sec
...and none of the
benchmarks include
data load time! (VP has
none.)
...But what’s THIS
ABOUT A PLATYPUS?
WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing
Vowpal Wabbit in Python.
WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing
Vowpal Wabbit in Python.
× Train and predict on Python dictionaries
instead of the obscure VW format.
WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing
Vowpal Wabbit in Python.
× Train and predict on Python dictionaries
instead of the obscure VW format.
× Easily use VW’s parallel features to go
multicore and multi-machine.
WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing
Vowpal Wabbit in Python.
× Train and predict on Python dictionaries
instead of the obscure VW format.
× Easily use VW’s parallel features to go
multicore and multi-machine.
VW has been used on
“terascale datasets, with
trillions of features,
billions of training
examples and millions of
parameters in an hour
using a cluster of 1000
machines.”
WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing
Vowpal Wabbit in Python.
× Train and predict on Python dictionaries
instead of the obscure VW format.
× Easily use VW’s parallel features to go
multicore and multi-machine.
...so far VP has only
been used on a
maximum of 3 machines
(combined 108 core),
but we’re getting there...
dEMo #1!
dEMo #2!
dEMo #2!
dEMo #2!27,279 MOVIES & 138,494 users
dEMo #2!27,279 MOVIES & 138,494 users
3,757,977,826PReDICTIONS...need to be made.
dEMo #2!27,279 MOVIES & 138,494 users
21m47s
3,757,977,826PReDICTIONS...need to be made.
Total runtime on
3x c4.8xlarge
(108 cores total)
342nanoseconds per prediction
(wall clock time)
THE END! (...OR IS IT?)

Más contenido relacionado

Destacado

Advanced Python : Static and Class Methods
Advanced Python : Static and Class Methods Advanced Python : Static and Class Methods
Advanced Python : Static and Class Methods Bhanwar Singh Meena
 
Object Oriented Programming in Python
Object Oriented Programming in PythonObject Oriented Programming in Python
Object Oriented Programming in PythonSujith Kumar
 
Advance OOP concepts in Python
Advance OOP concepts in PythonAdvance OOP concepts in Python
Advance OOP concepts in PythonSujith Kumar
 
Basics of Object Oriented Programming in Python
Basics of Object Oriented Programming in PythonBasics of Object Oriented Programming in Python
Basics of Object Oriented Programming in PythonSujith Kumar
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutAudrey Roy
 
Prepping the Analytics organization for Artificial Intelligence evolution
Prepping the Analytics organization for Artificial Intelligence evolutionPrepping the Analytics organization for Artificial Intelligence evolution
Prepping the Analytics organization for Artificial Intelligence evolutionRamkumar Ravichandran
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Paige Bailey
 
Python for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural NetsPython for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural NetsRoelof Pieters
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceLukas Masuch
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesMatt Harrison
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonNowell Strite
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
 

Destacado (13)

Advanced Python : Static and Class Methods
Advanced Python : Static and Class Methods Advanced Python : Static and Class Methods
Advanced Python : Static and Class Methods
 
Object Oriented Programming in Python
Object Oriented Programming in PythonObject Oriented Programming in Python
Object Oriented Programming in Python
 
Advance OOP concepts in Python
Advance OOP concepts in PythonAdvance OOP concepts in Python
Advance OOP concepts in Python
 
Basics of Object Oriented Programming in Python
Basics of Object Oriented Programming in PythonBasics of Object Oriented Programming in Python
Basics of Object Oriented Programming in Python
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live Without
 
Prepping the Analytics organization for Artificial Intelligence evolution
Prepping the Analytics organization for Artificial Intelligence evolutionPrepping the Analytics organization for Artificial Intelligence evolution
Prepping the Analytics organization for Artificial Intelligence evolution
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)
 
Python for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural NetsPython for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural Nets
 
Python Worst Practices
Python Worst PracticesPython Worst Practices
Python Worst Practices
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 

Similar a Fast Online Machine Learning with Vowpal Wabbit and Python

Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)Jen Waller
 
MongoDB & Machine Learning
MongoDB & Machine LearningMongoDB & Machine Learning
MongoDB & Machine LearningTom Maiaroto
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data storeJ On The Beach
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...NETWAYS
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseArangoDB Database
 
Dear compiler please don't be my nanny v2
Dear compiler  please don't be my nanny v2Dear compiler  please don't be my nanny v2
Dear compiler please don't be my nanny v2Dino Dini
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Metasepi team meeting #16: Safety on ATS language + MCU
Metasepi team meeting #16: Safety on ATS language + MCUMetasepi team meeting #16: Safety on ATS language + MCU
Metasepi team meeting #16: Safety on ATS language + MCUKiwamu Okabe
 
Sparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With SparkSparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With SparkIan Pointer
 
Lessons I Learned While Scaling to 5000 Puppet Agents
Lessons I Learned While Scaling to 5000 Puppet AgentsLessons I Learned While Scaling to 5000 Puppet Agents
Lessons I Learned While Scaling to 5000 Puppet AgentsPuppet
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Antti Haapala
 
Beat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmarkBeat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmarkPedro González Serrano
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right JobEmily Curtin
 

Similar a Fast Online Machine Learning with Vowpal Wabbit and Python (20)

Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)
 
MongoDB & Machine Learning
MongoDB & Machine LearningMongoDB & Machine Learning
MongoDB & Machine Learning
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data store
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
 
Aug 2012 HUG: Hug BigTop
Aug 2012 HUG: Hug BigTopAug 2012 HUG: Hug BigTop
Aug 2012 HUG: Hug BigTop
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Dear compiler please don't be my nanny v2
Dear compiler  please don't be my nanny v2Dear compiler  please don't be my nanny v2
Dear compiler please don't be my nanny v2
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Metasepi team meeting #16: Safety on ATS language + MCU
Metasepi team meeting #16: Safety on ATS language + MCUMetasepi team meeting #16: Safety on ATS language + MCU
Metasepi team meeting #16: Safety on ATS language + MCU
 
Sparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With SparkSparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With Spark
 
Lessons I Learned While Scaling to 5000 Puppet Agents
Lessons I Learned While Scaling to 5000 Puppet AgentsLessons I Learned While Scaling to 5000 Puppet Agents
Lessons I Learned While Scaling to 5000 Puppet Agents
 
Data science tutorial
Data science tutorialData science tutorial
Data science tutorial
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
 
Beat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmarkBeat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmark
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right Job
 

Último

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Fast Online Machine Learning with Vowpal Wabbit and Python

  • 1. Beware… For It’S THE... Vowpal platypus Peter HurforD (With a little help from some friends)
  • 2. WE OFTEN WANT TO PREDICT STUFF...
  • 3. WE OFTEN WANT TO PREDICT STUFF… ...BUT WE RUN INTO LIMITATIONS.
  • 4. WE OFTEN WANT TO PREDICT STUFF… ...BUT WE RUN INTO LIMITATIONS. × ...Data set is too large, it doesn’t fit in RAM.
  • 5. WE OFTEN WANT TO PREDICT STUFF… ...BUT WE RUN INTO LIMITATIONS. × ...Data set is too large, it doesn’t fit in RAM. × ...Data set is so large, it doesn’t fit on disk!
  • 6. WE OFTEN WANT TO PREDICT STUFF… ...BUT WE RUN INTO LIMITATIONS. × ...Data set is too large, it doesn’t fit in RAM. × ...Data set is so large, it doesn’t fit on disk! × ...Model train time is so slow, you can’t iterate and try things.
  • 7. “I want to use parallel learning algorithms to create fantastic learning machines!” - John Langford, 1997
  • 8. YOU FOOL! THE ONLY THING PARALLEL MACHINES ARE USEFUL FOR ARE COMPUTATIONAL WINDTUNNELS!
  • 15. Traditional Approach 1. Load all training data into RAM at once. 2. Fit model to training dataset. 3. Load all predicting data into RAM at once. 4. Use trained model to make predictions. WHAT DOES IT DO?
  • 16. VW “Online” Approach 1. Train model on single datapoints, one at a time. 2. Do it again multiple times. 3. Use trained model to predict on new datapoints, one at a time. Traditional Approach 1. Load all training data into RAM at once. 2. Fit model to training dataset. 3. Load all predicting data into RAM at once. 4. Use trained model to make predictions. WHAT DOES IT DO?
  • 17. × Online approach eventually converges to the same results as a traditional (batch) approach over enough iterations. WHAT DOES IT DO?
  • 18. WHAT DOES IT DO? × Online approach eventually converges to the same results as a traditional (batch) approach over enough iterations. × But you’re no longer dependent on RAM!
  • 19. Kaggle: World Data Science Competitions × 3rd, 14th, and 29th / 718 on $16K Criteo ad click challenge × 3rd / 472 on $2K KDD Cup Challenge × 8th / 128 on $25K Avito.ru illicit content filtering challenge IS IT ANY GOOD?
  • 20. × szilard/benchm-ml: widely cited (1127 star) independent ML speed benchmarks. × Logistic Regression on 10M datapoints on a c3.8xlarge instance (32 cores, 60GB RAM). DID I MENTION IT’S FAST? Engine Speed Python Sklearn Crashed R 90sec Vowpal Wabbit 15sec Spark 35sec
  • 21. × szilard/benchm-ml: widely cited (1127 star) independent ML speed benchmarks. × Logistic Regression on 10M datapoints on a c3.8xlarge instance (32 cores, 60GB RAM). DID I MENTION IT’S FAST? Engine Speed Python Sklearn Crashed R 90sec Vowpal Wabbit 15sec Spark 35sec Yes, this was Spark 2.0, but it was using MLLib. ML performance is under testing now.
  • 22. × szilard/benchm-ml: widely cited (1127 star) independent ML speed benchmarks. × Logistic Regression on 10M datapoints on a c3.8xlarge instance (32 cores, 60GB RAM). DID I MENTION IT’S FAST? Engine Speed Python Sklearn Crashed R 90sec Vowpal Wabbit 15sec Spark 35sec But this benchmark was only single core!
  • 23. × szilard/benchm-ml: widely cited (1127 star) independent ML speed benchmarks. × Logistic Regression on 10M datapoints on a c3.8xlarge instance (32 cores, 60GB RAM). DID I MENTION IT’S FAST? Engine Speed Python Sklearn Crashed R 90sec Vowpal Wabbit 15sec Spark 35sec ...and none of the benchmarks include data load time! (VP has none.)
  • 25. WHAT IS VOWPAL PLATYPUS? × An open source vehicle for productionizing Vowpal Wabbit in Python.
  • 26. WHAT IS VOWPAL PLATYPUS? × An open source vehicle for productionizing Vowpal Wabbit in Python. × Train and predict on Python dictionaries instead of the obscure VW format.
  • 27. WHAT IS VOWPAL PLATYPUS? × An open source vehicle for productionizing Vowpal Wabbit in Python. × Train and predict on Python dictionaries instead of the obscure VW format. × Easily use VW’s parallel features to go multicore and multi-machine.
  • 28. WHAT IS VOWPAL PLATYPUS? × An open source vehicle for productionizing Vowpal Wabbit in Python. × Train and predict on Python dictionaries instead of the obscure VW format. × Easily use VW’s parallel features to go multicore and multi-machine. VW has been used on “terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines.”
  • 29. WHAT IS VOWPAL PLATYPUS? × An open source vehicle for productionizing Vowpal Wabbit in Python. × Train and predict on Python dictionaries instead of the obscure VW format. × Easily use VW’s parallel features to go multicore and multi-machine. ...so far VP has only been used on a maximum of 3 machines (combined 108 core), but we’re getting there...
  • 31.
  • 32.
  • 33.
  • 34.
  • 37. dEMo #2!27,279 MOVIES & 138,494 users
  • 38. dEMo #2!27,279 MOVIES & 138,494 users 3,757,977,826PReDICTIONS...need to be made.
  • 39. dEMo #2!27,279 MOVIES & 138,494 users 21m47s 3,757,977,826PReDICTIONS...need to be made. Total runtime on 3x c4.8xlarge (108 cores total) 342nanoseconds per prediction (wall clock time)
  • 40. THE END! (...OR IS IT?)