SlideShare una empresa de Scribd logo
1 de 36
Online Learning
Bayesian bandits and more

©MapR Technologies - Confidential   1
whoami – Ted Dunning


     Ted Dunning
       tdunning@maprtech.com
       tdunning@apache.org
       @ted_dunning


     We’re hiring at MapR


     For slides and other info
        http://www.slideshare.net/tdunning



©MapR Technologies - Confidential   2
Online
                                Scalable
                                     Incremental
©MapR Technologies - Confidential      3
Scalability and Learning

     What does scalable mean?



     What are inherent characteristics of scalable learning?



     What are the logical implications?




©MapR Technologies - Confidential     4
Scalable ≈ On-line
                                    If you squint just right


©MapR Technologies - Confidential               5
unit of work ≈ unit of time



©MapR Technologies - Confidential   6
Infinite
        Data                        Learning
     Stream

                                     State



©MapR Technologies - Confidential     7
Pick One




©MapR Technologies - Confidential       8
©MapR Technologies - Confidential   9
©MapR Technologies - Confidential   10
Now pick again



©MapR Technologies - Confidential         11
A Quick Diversion

     You see a coin
       –   What is the probability of heads?
       –   Could it be larger or smaller than that?
     I flip the coin and while it is in the air ask again
     I catch the coin and ask again
     I look at the coin (and you don’t) and ask again
     Why does the answer change?
       –   And did it ever have a single value?




©MapR Technologies - Confidential             12
Which One to Play?

     One may be better than the other
     The better coin pays off at some rate
     Playing the other will pay off at a lesser rate
       –   Playing the lesser coin has “opportunity cost”


     But how do we know which is which?
       –   Explore versus Exploit!




©MapR Technologies - Confidential             13
A First Conclusion

     Probability as expressed by humans is subjective and depends on
      information and experience




©MapR Technologies - Confidential    14
A Second Conclusion

     A single number is a bad way to express uncertain knowledge



     A distribution of values might be better




©MapR Technologies - Confidential     15
I Dunno




©MapR Technologies - Confidential   16
5 and 5




©MapR Technologies - Confidential   17
2 and 10




©MapR Technologies - Confidential   18
The Cynic Among Us




©MapR Technologies - Confidential   19
Demo


©MapR Technologies - Confidential     20
An Example




©MapR Technologies - Confidential   21
An Example




©MapR Technologies - Confidential   22
The Cluster Proximity Features

     Every point can be described by the nearest cluster
       –   4.3 bits per point in this case
       –   Significant error that can be decreased (to a point) by increasing number of
           clusters
     Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign
      bit + 2 proximities)
       –   Error is negligible
       –   Unwinds the data into a simple representation




©MapR Technologies - Confidential             23
Diagonalized Cluster Proximity




©MapR Technologies - Confidential   24
Lots of Clusters Are Fine




©MapR Technologies - Confidential   25
Surrogate Method

     Start with sloppy clustering into κ = k log n clusters
     Use these clusters as a weighted surrogate for the data
     Cluster surrogate data using ball k-means


     Results are provably high quality for highly clusterable data
     Sloppy clustering can be done on-line
     Surrogate can be kept in memory
     Ball k-means pass can be done at any time




©MapR Technologies - Confidential       26
Algorithm Costs

     O(k d log n) per point for Lloyd’s algorithm
          … not so good for k = 2000, n = 108

     Surrogate methods
          …. O(d log κ) = O(d (log k + log log n)) per point


     This is a big deal:
       –   k d log n = 2000 x 10 x 26 = 500,000
       –   log k + log log n = 11 + 5 = 17
       –   30,000 times faster makes the grade as a bona fide big deal



©MapR Technologies - Confidential            27
30,000 times faster sounds good




©MapR Technologies - Confidential       28
30,000 times faster sounds good

                                    but that isn’t the big news




©MapR Technologies - Confidential                29
30,000 times faster sounds good

                                    but that isn’t the big news

                                      these algorithms do
                                       on-line clustering

©MapR Technologies - Confidential                30
Parallel Speedup?

                                        200


                                                                                     Non- threaded




                                                                  ✓
                                        100
                                                  2
                 Tim e per point (μs)




                                                                                      Threaded version
                                                          3

                                        50
                                                                    4
                                        40                                              6
                                                                             5

                                                                                              8
                                        30
                                                                                                  10        14
                                                                                                       12
                                        20                    Perfect Scaling                                    16




                                        10
                                              1       2       3         4        5                                    20


                                                                  Threads
©MapR Technologies - Confidential                                       31
What about deployment?




©MapR Technologies - Confidential      32
Infinite
        Data                        Learning
     Stream

                                     State



©MapR Technologies - Confidential     33
Data
                                      Mapper
                              Split

                                       State



©MapR Technologies - Confidential      34
Data
                                      Mapper
                                      Mapper
                              Split    Mapper

                   Need shared
                    memory!                 State



©MapR Technologies - Confidential      35
whoami – Ted Dunning


     We’re hiring at MapR


     Ted Dunning
       tdunning@maprtech.com
       tdunning@apache.org
       @ted_dunning



     For slides and other info
                                    http://www.slideshare.net/tdunning

©MapR Technologies - Confidential                   36

Más contenido relacionado

La actualidad más candente

Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningTed Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Ted Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...Edge AI and Vision Alliance
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 

La actualidad más candente (20)

Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06
 
Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Hcj 2013-01-21
Hcj 2013-01-21Hcj 2013-01-21
Hcj 2013-01-21
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 

Similar a Strata new-york-2012

Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRData Science London
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Real Time Learning
Real Time LearningReal Time Learning
Real Time LearningTed Dunning
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutMapR Technologies
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
News from Mahout
News from MahoutNews from Mahout
News from MahoutTed Dunning
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningMapR Technologies
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 

Similar a Strata new-york-2012 (20)

Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
London hug
London hugLondon hug
London hug
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Real Time Learning
Real Time LearningReal Time Learning
Real Time Learning
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
News from Mahout
News from MahoutNews from Mahout
News from Mahout
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted Dunning
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Big Data Analytics London
Big Data Analytics LondonBig Data Analytics London
Big Data Analytics London
 

Más de Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveTed Dunning
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesTed Dunning
 

Más de Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
 

Último

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Strata new-york-2012

  • 1. Online Learning Bayesian bandits and more ©MapR Technologies - Confidential 1
  • 2. whoami – Ted Dunning  Ted Dunning tdunning@maprtech.com tdunning@apache.org @ted_dunning  We’re hiring at MapR  For slides and other info http://www.slideshare.net/tdunning ©MapR Technologies - Confidential 2
  • 3. Online Scalable Incremental ©MapR Technologies - Confidential 3
  • 4. Scalability and Learning  What does scalable mean?  What are inherent characteristics of scalable learning?  What are the logical implications? ©MapR Technologies - Confidential 4
  • 5. Scalable ≈ On-line If you squint just right ©MapR Technologies - Confidential 5
  • 6. unit of work ≈ unit of time ©MapR Technologies - Confidential 6
  • 7. Infinite Data Learning Stream State ©MapR Technologies - Confidential 7
  • 8. Pick One ©MapR Technologies - Confidential 8
  • 9. ©MapR Technologies - Confidential 9
  • 10. ©MapR Technologies - Confidential 10
  • 11. Now pick again ©MapR Technologies - Confidential 11
  • 12. A Quick Diversion  You see a coin – What is the probability of heads? – Could it be larger or smaller than that?  I flip the coin and while it is in the air ask again  I catch the coin and ask again  I look at the coin (and you don’t) and ask again  Why does the answer change? – And did it ever have a single value? ©MapR Technologies - Confidential 12
  • 13. Which One to Play?  One may be better than the other  The better coin pays off at some rate  Playing the other will pay off at a lesser rate – Playing the lesser coin has “opportunity cost”  But how do we know which is which? – Explore versus Exploit! ©MapR Technologies - Confidential 13
  • 14. A First Conclusion  Probability as expressed by humans is subjective and depends on information and experience ©MapR Technologies - Confidential 14
  • 15. A Second Conclusion  A single number is a bad way to express uncertain knowledge  A distribution of values might be better ©MapR Technologies - Confidential 15
  • 16. I Dunno ©MapR Technologies - Confidential 16
  • 17. 5 and 5 ©MapR Technologies - Confidential 17
  • 18. 2 and 10 ©MapR Technologies - Confidential 18
  • 19. The Cynic Among Us ©MapR Technologies - Confidential 19
  • 20. Demo ©MapR Technologies - Confidential 20
  • 21. An Example ©MapR Technologies - Confidential 21
  • 22. An Example ©MapR Technologies - Confidential 22
  • 23. The Cluster Proximity Features  Every point can be described by the nearest cluster – 4.3 bits per point in this case – Significant error that can be decreased (to a point) by increasing number of clusters  Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – Error is negligible – Unwinds the data into a simple representation ©MapR Technologies - Confidential 23
  • 24. Diagonalized Cluster Proximity ©MapR Technologies - Confidential 24
  • 25. Lots of Clusters Are Fine ©MapR Technologies - Confidential 25
  • 26. Surrogate Method  Start with sloppy clustering into κ = k log n clusters  Use these clusters as a weighted surrogate for the data  Cluster surrogate data using ball k-means  Results are provably high quality for highly clusterable data  Sloppy clustering can be done on-line  Surrogate can be kept in memory  Ball k-means pass can be done at any time ©MapR Technologies - Confidential 26
  • 27. Algorithm Costs  O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 108  Surrogate methods …. O(d log κ) = O(d (log k + log log n)) per point  This is a big deal: – k d log n = 2000 x 10 x 26 = 500,000 – log k + log log n = 11 + 5 = 17 – 30,000 times faster makes the grade as a bona fide big deal ©MapR Technologies - Confidential 27
  • 28. 30,000 times faster sounds good ©MapR Technologies - Confidential 28
  • 29. 30,000 times faster sounds good but that isn’t the big news ©MapR Technologies - Confidential 29
  • 30. 30,000 times faster sounds good but that isn’t the big news these algorithms do on-line clustering ©MapR Technologies - Confidential 30
  • 31. Parallel Speedup? 200 Non- threaded ✓ 100 2 Tim e per point (μs) Threaded version 3 50 4 40 6 5 8 30 10 14 12 20 Perfect Scaling 16 10 1 2 3 4 5 20 Threads ©MapR Technologies - Confidential 31
  • 32. What about deployment? ©MapR Technologies - Confidential 32
  • 33. Infinite Data Learning Stream State ©MapR Technologies - Confidential 33
  • 34. Data Mapper Split State ©MapR Technologies - Confidential 34
  • 35. Data Mapper Mapper Split Mapper Need shared memory! State ©MapR Technologies - Confidential 35
  • 36. whoami – Ted Dunning  We’re hiring at MapR  Ted Dunning tdunning@maprtech.com tdunning@apache.org @ted_dunning  For slides and other info http://www.slideshare.net/tdunning ©MapR Technologies - Confidential 36