Bayes on your (Big)Couch




                      Mike Miller
                      _milleratmit
                      July 25, 2011
I want my app to do _this_




             Mike Miller, Oscon 2011   2
CouchDB in a slide
• Schema-free document database management system
 Documents are JSON objects
 Able to store binary attachments

• RESTful API
 http://wiki.apache.org/couchdb/reference

• Views: Custom, persistent representations of your data
 Incremental MapReduce with results persisted to disk
 Fast querying by primary key (views stored in a B-tree)

• Bi-Directional Replication
 Master-slave and multi-master topologies supported
 Optional ‘filters’ to replicate a subset of the data
 Edge devices (mobile phones, sensors, etc.)
BigCouch = Couch+Scaling
• Open Source, Apache License
• Horizontal Scalability
 Easily add storage capacity by adding more servers
 Computing power (views, compaction, etc.) scales with more servers

• No single point of failure (SPOF)
 Any node can handle any request
 Individual nodes can come and go

• Transparent to the Application
 All clustering operations take place “behind the curtain”
 Looks (mostly) like a single-server instance of CouchDB


...back to making my app smart




Sample Data
 [Figure: "Height vs. Weight" scatter plot of girls and boys; height in inches (35 to 80) against weight in lbs (80 to 220)]
Naive Bayes Classifier
 [Figure: Gaussian curve ("gaus") plotted from -3 to 3, with annotations for height, mean male height, and male height variance]
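For reference, the quantities annotated on this slide (a height, a per-class mean, a per-class variance) are the inputs to the usual Gaussian naive Bayes rule. The block below is the standard textbook form, not a transcription of the slide; the symbols (C for a class such as "male", x_i for the value of field i, and the mean and variance of field i within class C) are notation chosen here for illustration.

```latex
p(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_{C,i}^{2}}}
  \exp\!\left(-\frac{(x_i - \mu_{C,i})^{2}}{2\sigma_{C,i}^{2}}\right),
\qquad
P(C \mid x_1,\dots,x_n) \;\propto\; P(C)\prod_{i=1}^{n} p(x_i \mid C)
```

The "naive" part is the product: each field is treated as independent of the others once the class is known.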
Implementation Plan
 [Figure: the "Height vs. Weight" scatter plot of girls and boys, repeated from the Sample Data slide]

 Model people as documents in CouchDB
 Calculate means/variances with MapReduce
 Run the classifier inside CouchDB as a post-MapReduce hook (“_list”)

 • Note:
  no need to specify which fields to use in classification
  multi-class implementation
  continuous, incremental training! Results improve as training data trickles in.
3 ways to follow along

 • couchapp: Python tool to push/pull from other CouchDBs
  > sudo easy_install -U couchapp
  > couchapp clone 'http://millertime.cloudant.com/bitb'
  create an account at cloudant.com
  > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
  > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
 • github
  > git clone git@github.com:mlmiller/bayes.git
 • CouchDB replication to your cloudant account
  bonus, brings along the data, too!
The Code

 [Screenshot: file listing of the project, annotated as follows]
 post-MapReduce hook (“_list” method)
 classifier (probability calculator)
 client-side test via node.js
 view code to calculate means and variances
 you can ignore everything else
Data Model

 [Screenshot: example document, annotated as follows]
 Arbitrary number of numerical fields allowed
 ‘class’ field => training data
Training via MapReduce
 views/training/map.js
 [Screenshot: map function source]
 ‘class’ field => training data
 Calculate mean/variance for all numerical fields in a document
 emit: ([<class>, <field>], <value>)
 Reduce: _stats (Erlang builtin)
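The map step described above can be sketched roughly as follows. This is not the author's views/training/map.js, only a minimal guess at its shape: every numeric field of a document that carries a ‘class’ is emitted under the key [<class>, <field>]. The sample documents and the class labels 'male' and 'female' are made up for illustration, and emit() is stubbed here because CouchDB only provides it inside the view server.

```javascript
// Stand-in for CouchDB's emit(); a real view gets this from the view server.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Sketch of a training map function (hypothetical, not the original map.js).
function map(doc) {
  if (typeof doc['class'] === 'undefined') return; // no class: not training data
  for (var field in doc) {
    // Skip the label itself and CouchDB's own fields such as _id and _rev.
    if (field === 'class' || field.charAt(0) === '_') continue;
    var value = doc[field];
    if (typeof value === 'number' && isFinite(value)) {
      emit([doc['class'], field], value);
    }
  }
}

// Hypothetical documents, invented for this example.
[
  { _id: 'a', class: 'male', height: 70.2, weight: 181.5 },
  { _id: 'b', class: 'female', height: 64.1, weight: 132.0 },
  { _id: 'c', height: 66.0, weight: 150.0 } // unlabeled, so nothing is emitted
].forEach(map);

console.log(rows);
```

Because the function loops over whatever numeric fields it finds, nothing about "height" or "weight" is hard-coded, which matches the deck's point that field names need not be specified.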
Bayes: Trained State




 [Screenshot: pre-reduce output of the training view]
Bayes: Trained State




 [Screenshot: reduced output of the training view]
 Count, Min, Max, Mean, Variance
 Automatically updated as new training data arrives
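CouchDB's built-in _stats reduce returns sum, count, min, max, and sumsqr for each key, so the mean and variance listed above presumably have to be derived from those. A minimal sketch of that arithmetic follows; the function name meanAndVariance and the sample numbers are assumptions for illustration, and the variance shown is the population variance (dividing by count).

```javascript
// Derive mean and (population) variance from one _stats reduce value.
function meanAndVariance(stats) {
  const mean = stats.sum / stats.count;
  // E[x^2] - (E[x])^2
  const variance = stats.sumsqr / stats.count - mean * mean;
  return { mean: mean, variance: variance };
}

// Hypothetical _stats value for the three numbers 2, 4, 6.
const example = { sum: 12, count: 3, min: 2, max: 6, sumsqr: 56 };
const derived = meanAndVariance(example);
console.log(derived);
```

Since sum, count, and sumsqr are all simple running totals, they can be updated incrementally as new documents arrive, which is what makes the continuous training described in this deck cheap.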
Bayes Classifier
 lib/bayes_classifier.js
 [Screenshot: classifier source]
 Load state from DB
 No assumptions on field names
 Calculate prob. for all possible hypotheses
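A rough sketch of what such a classifier might look like is given below. It is not lib/bayes_classifier.js itself: the function names, the shape of the state object, the equal-prior assumption, and the trained numbers are all hypothetical. It takes per-class means and variances, evaluates a Gaussian likelihood for each supplied field, and normalizes the result across all classes.

```javascript
// Gaussian likelihood of x for a given mean and variance.
function gaussian(x, mean, variance) {
  const diff = x - mean;
  return Math.exp(-(diff * diff) / (2 * variance)) /
         Math.sqrt(2 * Math.PI * variance);
}

// state:  { <class>: { <field>: { mean, variance } } }
// sample: { <field>: <number> }
// Returns a probability per class, assuming equal priors (an assumption here).
function classify(state, sample) {
  const logScores = {};
  for (const cls of Object.keys(state)) {
    let logp = 0;
    for (const field of Object.keys(sample)) {
      const s = state[cls][field];
      if (!s) continue; // field never seen for this class: ignore it
      logp += Math.log(gaussian(sample[field], s.mean, s.variance));
    }
    logScores[cls] = logp;
  }
  // Normalize in log space to avoid underflow when many fields are used.
  const max = Math.max(...Object.values(logScores));
  const probs = {};
  let total = 0;
  for (const cls of Object.keys(logScores)) {
    probs[cls] = Math.exp(logScores[cls] - max);
    total += probs[cls];
  }
  for (const cls of Object.keys(probs)) probs[cls] /= total;
  return probs;
}

// Hypothetical trained state, invented for this example.
const state = {
  male:   { height: { mean: 70, variance: 9 }, weight: { mean: 185, variance: 400 } },
  female: { height: { mean: 64, variance: 9 }, weight: { mean: 135, variance: 400 } }
};

const result = classify(state, { height: 65.65, weight: 168.61 });
console.log(result);
```

The classifier only iterates over the keys it is given, so adding a third class or a new numeric field requires no code change.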
A brief aside...

 • Let's test our classifier
  Select 2000 documents for test
  Randomly choose 1000 documents for training sample
  Remaining documents used for validation

 • Simulate continuous training
  Add documents one at a time
  After each document addition, test on all 1000 of our validation sample
  Record and plot fraction of validation sample properly classified




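The procedure above can be simulated outside CouchDB with a short script. Everything in this sketch is an assumption made for illustration: the data is synthetic (drawn from invented Gaussian populations with a seeded generator, so the run is repeatable), and training is done with in-memory running sums rather than a view. It adds 1000 training documents one at a time and records the fraction of 1000 validation documents classified correctly after each addition.

```javascript
// Small seeded generator (LCG) so the simulation is repeatable.
function makeRng(seed) {
  let s = seed >>> 0;
  return function () {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Box-Muller transform: one normally distributed sample.
function normal(rng, mean, sd) {
  const u = 1 - rng(); // in (0, 1], keeps Math.log finite
  const v = rng();
  return mean + sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Invented population parameters: [mean, standard deviation].
const POPULATION = {
  male:   { height: [70, 3], weight: [185, 20] },
  female: { height: [64, 3], weight: [135, 20] }
};

function makeDoc(rng) {
  const cls = rng() < 0.5 ? 'male' : 'female';
  const p = POPULATION[cls];
  return {
    class: cls,
    height: normal(rng, p.height[0], p.height[1]),
    weight: normal(rng, p.weight[0], p.weight[1])
  };
}

// Update running count, sum, and sum of squares per class and field.
function train(state, doc) {
  const cls = state[doc.class] || (state[doc.class] = {});
  for (const field of ['height', 'weight']) {
    const s = cls[field] || (cls[field] = { count: 0, sum: 0, sumsqr: 0 });
    s.count += 1;
    s.sum += doc[field];
    s.sumsqr += doc[field] * doc[field];
  }
}

// Gaussian log-likelihood; the variance floor covers classes seen only once.
function logLikelihood(s, x) {
  const mean = s.sum / s.count;
  const variance = Math.max(s.sumsqr / s.count - mean * mean, 1e-6);
  return -0.5 * Math.log(2 * Math.PI * variance) -
         ((x - mean) * (x - mean)) / (2 * variance);
}

function predict(state, doc) {
  let best = null;
  let bestScore = -Infinity;
  for (const cls of Object.keys(state)) {
    let score = 0;
    for (const field of Object.keys(state[cls])) {
      score += logLikelihood(state[cls][field], doc[field]);
    }
    if (score > bestScore) { bestScore = score; best = cls; }
  }
  return best;
}

const rng = makeRng(42);
const training = Array.from({ length: 1000 }, () => makeDoc(rng));
const validation = Array.from({ length: 1000 }, () => makeDoc(rng));

const state = {};
const accuracy = []; // accuracy[i] = fraction correct after i + 1 training docs
for (const doc of training) {
  train(state, doc);
  const correct = validation.filter((v) => predict(state, v) === v.class).length;
  accuracy.push(correct / validation.length);
}

console.log('after 1 doc:', accuracy[0], 'after 1000 docs:', accuracy[999]);
```

Plotting the accuracy array against its index gives the kind of learning curve the procedure calls for; the exact numbers depend entirely on the invented populations and seed.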
A brief aside...


 [Figure: fraction of validation sample properly classified vs. number of documents in the training set]
 Dramatic improvement with additional training data
... and back to the code




test it yourself
• Client-side test via node.js
 > ./test.js height=<some number> weight=<some number>
 Classifier runs server side; the server is configured in line 6 of test.js
 (you can point this to your DB)
Running as CouchApp



 create a database (e.g., ‘bitb’) at cloudant.com
 add data
 then push your code
 > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
 HTML & CSS served directly from BigCouch to the browser
 Heavy lifting of classification done server side

 http://millertime.cloudant.com/bitb/_design/bayes/index.html
Running as API (_list)
 > curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
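The request above follows CouchDB's usual _list URL layout: database, design document (bayes), list function (index), view (training), then query parameters. A small helper for assembling such a URL from a set of field values might look like the sketch below; the helper name classifyUrl is made up here, and it only builds the string without contacting any server.

```javascript
// Build the _list URL for a set of numeric fields (hypothetical helper).
function classifyUrl(dbUrl, fields) {
  const url = new URL(dbUrl + '/_design/bayes/_list/index/training');
  for (const [name, value] of Object.entries(fields)) {
    url.searchParams.set(name, String(value)); // e.g. height=65.65
  }
  url.searchParams.set('format', 'json');
  url.searchParams.set('group', 'true'); // one reduced row per [class, field]
  return url.toString();
}

const requestUrl = classifyUrl('http://millertime.cloudant.com/bitb',
                               { height: 65.65, weight: 168.61 });
console.log(requestUrl);
```

Pointing the first argument at your own database URL yields the equivalent request against your copy of the app.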
Wrapping Up: Bayes on BigCouch
• Simple code, powerful results
 Light requirements on data model
  can be relaxed with more complex view code
 Continuous learning is very powerful
  e.g., time-based learning (automatically adapt to changing conditions)
 Classification can be performed client- or server-side
  push documents into DB and they are auto-tagged!
 More sophisticated classifiers easily implemented
  e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
 View Engine allows simple, massively parallel deployment of sophisticated domain libraries
  e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
Give it a spin




 Hosting, Management, Support for CouchDB and BigCouch
                  http://cloudant.com
        http://github.com/cloudant/bigcouch
