SlideShare a Scribd company logo
1 of 19
Download to read offline
Data-driven modeling
                                           APAM E4990


                                           Jake Hofman

                                          Columbia University


                                         February 6, 2012




Jake Hofman   (Columbia University)        Data-driven modeling   February 6, 2012   1 / 13
Diagnoses a la Bayes1



              • You’re testing for a rare disease:
                  • 1% of the population is infected
              • You have a highly sensitive and specific test:
                  • 99% of sick patients test positive
                  • 99% of healthy patients test negative
              • Given that a patient tests positive, what is probability the
                   patient is sick?




              1
                  Wiggins, SciAm 2006
Jake Hofman       (Columbia University)   Data-driven modeling       February 6, 2012   2 / 13
Diagnoses a la Bayes


                                           Population
                                           10,000 ppl



                              1% Sick                        99% Healthy
                              100 ppl                          9900 ppl

               99% Test +             1% Test -     1% Test +      99% Test -
                 99 ppl                 1 per         99 ppl        9801 ppl




Jake Hofman   (Columbia University)         Data-driven modeling            February 6, 2012   3 / 13
Diagnoses a la Bayes

                                           Population
                                           10,000 ppl



                              1% Sick                        99% Healthy
                              100 ppl                          9900 ppl

               99% Test +             1% Test -     1% Test +      99% Test -
                 99 ppl                 1 per         99 ppl        9801 ppl

          So given that a patient tests positive (198 ppl), there is a 50%
                        chance the patient is sick (99 ppl)!




Jake Hofman   (Columbia University)         Data-driven modeling            February 6, 2012   3 / 13
Diagnoses a la Bayes

                                            Population
                                            10,000 ppl



                               1% Sick                        99% Healthy
                               100 ppl                          9900 ppl

                99% Test +             1% Test -     1% Test +      99% Test -
                  99 ppl                 1 per         99 ppl        9801 ppl
              The small error rate on the large healthy population produces
                                   many false positives.




Jake Hofman    (Columbia University)         Data-driven modeling            February 6, 2012   3 / 13
Inverting conditional probabilities


       Bayes’ Theorem
       Equate the far right- and left-hand sides of product rule

                               p (y |x) p (x) = p (x, y ) = p (x|y ) p (y )

       and divide to get the probability of y given x from the probability
       of x given y :
                                        p (x|y ) p (y )
                             p (y |x) =
                                            p (x)

       where p (x) =                  y ∈ΩY   p (x|y ) p (y ) is the normalization constant.




Jake Hofman   (Columbia University)                Data-driven modeling           February 6, 2012   4 / 13
Diagnoses a la Bayes



       Given that a patient tests positive, what is probability the patient
       is sick?
                                           99/100       1/100

                                        p (+|sick) p (sick)              99   1
                  p (sick|+) =                                      =       =
                                              p (+)                     198   2
                                      99/1002 +99/1002 =198/1002

       where p (+) = p (+|sick) p (sick) + p (+|healthy ) p (healthy ).




Jake Hofman   (Columbia University)          Data-driven modeling                 February 6, 2012   5 / 13
(Super) Naive Bayes
       We can use Bayes’ rule to build a one-word spam classifier:

                                              p (word|spam) p (spam)
                            p (spam|word) =
                                                     p (word)

       where we estimate these probabilities with ratios of counts:
                                         # spam docs containing word
                      p (word|spam) =
                      ˆ
                                                # spam docs
                                         # ham docs containing word
                       p (word|ham)
                       ˆ               =
                                                # ham docs
                                         # spam docs
                           p (spam)
                           ˆ           =
                                           # docs
                                         # ham docs
                            p (ham)
                            ˆ          =
                                           # docs


Jake Hofman   (Columbia University)     Data-driven modeling           February 6, 2012   6 / 13
(Super) Naive Bayes

                 $ ./enron_naive_bayes.sh meeting
                 1500 spam examples
                 3672 ham examples
                 16 spam examples containing meeting
                 153 ham examples containing meeting

                 estimated            P(spam) = .2900
                 estimated            P(ham) = .7100
                 estimated            P(meeting|spam) = .0106
                 estimated            P(meeting|ham) = .0416

                 P(spam|meeting) = .0923




Jake Hofman   (Columbia University)         Data-driven modeling   February 6, 2012   7 / 13
(Super) Naive Bayes

                 $ ./enron_naive_bayes.sh money
                 1500 spam examples
                 3672 ham examples
                 194 spam examples containing money
                 50 ham examples containing money

                 estimated            P(spam) = .2900
                 estimated            P(ham) = .7100
                 estimated            P(money|spam) = .1293
                 estimated            P(money|ham) = .0136

                 P(spam|money) = .7957




Jake Hofman   (Columbia University)         Data-driven modeling   February 6, 2012   8 / 13
(Super) Naive Bayes

                 $ ./enron_naive_bayes.sh enron
                 1500 spam examples
                 3672 ham examples
                 0 spam examples containing enron
                 1478 ham examples containing enron

                 estimated            P(spam) = .2900
                 estimated            P(ham) = .7100
                 estimated            P(enron|spam) = 0
                 estimated            P(enron|ham) = .4025

                 P(spam|enron) = 0




Jake Hofman   (Columbia University)         Data-driven modeling   February 6, 2012   9 / 13
Naive Bayes


       Represent each document by a binary vector x where xj = 1 if the
       j-th word appears in the document (xj = 0 otherwise).

       Modeling each word as an independent Bernoulli random variable,
       the probability of observing a document x of class c is:
                                                          x
                                      p (x|c) =          θjcj (1 − θjc )1−xj
                                                     j


       where θjc denotes the probability that the j-th word occurs in a
       document of class c.




Jake Hofman   (Columbia University)           Data-driven modeling             February 6, 2012   10 / 13
Naive Bayes



       Using this likelihood in Bayes’ rule and taking a logarithm, we have:

                                          p (x|c) p (c)
          log p (c|x) = log
                                              p (x)
                                                   θjc                                       θc
                              =           xj log         +            log(1 − θjc ) + log
                                                 1 − θjc                                    p (x)
                                      j                           j

       where θc is the probability of observing a document of class c.




Jake Hofman   (Columbia University)            Data-driven modeling                  February 6, 2012   11 / 13
Naive Bayes



       We can eliminate p (x) by calculating the log-odds:

              p (1|x)                          θj1 (1 − θj0 )                      1 − θj1       θ1
        log           =               xj log                  +              log           + log
              p (0|x)                          θj0 (1 − θj1 )                      1 − θj0       θ0
                                 j                                       j
                                                  wj
                                                                                      w0

       which gives a linear classifier of the form w · x + w0




Jake Hofman   (Columbia University)               Data-driven modeling                      February 6, 2012   12 / 13
Naive Bayes
       We train by counting words and documents within classes to
       estimate θjc and θc :

                                               ˆ            njc
                                               θjc     =
                                                            nc
                                                  ˆ         nc
                                                  θc   =
                                                            n
       and use these to calculate the weights wj and bias w0 :
                                              ˆ           ˆ

                                                   ˆ        ˆ
                                                   θj1 (1 − θj0 )
                                      ˆ
                                      wj   = log
                                                   ˆ        ˆ
                                                   θj0 (1 − θj1 )
                                                             ˆ
                                                         1 − θj1      ˆ
                                                                      θ1
                                      w0 =
                                      ˆ            log           + log .
                                                             ˆ
                                                         1 − θj0      ˆ
                                                                      θ0
                                              j

       We we predict by simply adding the weights of the words that
       appear in the document to the bias term.
Jake Hofman   (Columbia University)           Data-driven modeling         February 6, 2012   13 / 13
Naive Bayes




              In practice, this works better than one might expect given its
                                        simplicity2




              2
                  http://www.jstor.org/pss/1403452
Jake Hofman       (Columbia University)   Data-driven modeling     February 6, 2012   14 / 13
Naive Bayes




         Training is computationally cheap and scalable, and the model is
                      easy to update given new observations2




              2
                  http://www.springerlink.com/content/wu3g458834583125/
Jake Hofman       (Columbia University)   Data-driven modeling      February 6, 2012   14 / 13
Naive Bayes




                     Performance varies with document representations and
                               corresponding likelihood models2




              2
                  http://ceas.cc/2006/15.pdf
Jake Hofman       (Columbia University)   Data-driven modeling      February 6, 2012   14 / 13
Naive Bayes




              It’s often important to smooth parameter estimates (e.g., by
                         adding pseudocounts) to avoid overfitting




Jake Hofman   (Columbia University)   Data-driven modeling        February 6, 2012   14 / 13

More Related Content

Viewers also liked

Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
Bitcoin's Killer App 2017 - Sean Walsh
Bitcoin's Killer App 2017 -  Sean WalshBitcoin's Killer App 2017 -  Sean Walsh
Bitcoin's Killer App 2017 - Sean WalshSean Walsh
 
Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Jorge Roldán
 
Cuadro de los pagos tributarios
Cuadro de los pagos tributariosCuadro de los pagos tributarios
Cuadro de los pagos tributariosselvagomez2872
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...Andreas Önnerfors
 
Charitable trust ppt
Charitable trust pptCharitable trust ppt
Charitable trust pptSweety Sharma
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboralalexander_hv
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศPetch Boonyakorn
 
Amazon alexa - building custom skills
Amazon alexa - building custom skillsAmazon alexa - building custom skills
Amazon alexa - building custom skillsAniruddha Chakrabarti
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook Total
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationGord Sissons
 

Viewers also liked (16)

Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Bitcoin's Killer App 2017 - Sean Walsh
Bitcoin's Killer App 2017 -  Sean WalshBitcoin's Killer App 2017 -  Sean Walsh
Bitcoin's Killer App 2017 - Sean Walsh
 
Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...
 
Social Media Policies
Social Media PoliciesSocial Media Policies
Social Media Policies
 
Cuadro de los pagos tributarios
Cuadro de los pagos tributariosCuadro de los pagos tributarios
Cuadro de los pagos tributarios
 
Blog
BlogBlog
Blog
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
 
Charitable trust ppt
Charitable trust pptCharitable trust ppt
Charitable trust ppt
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboral
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศ
 
Amazon alexa - building custom skills
Amazon alexa - building custom skillsAmazon alexa - building custom skills
Amazon alexa - building custom skills
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview Presentation
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIjakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part Ijakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIjakehofman
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part Ijakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Data-driven Modeling: Lecture 03

  • 1. Data-driven modeling APAM E4990 Jake Hofman Columbia University February 6, 2012 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 1 / 13
  • 2. Diagnoses a la Bayes1 • You’re testing for a rare disease: • 1% of the population is infected • You have a highly sensitive and specific test: • 99% of sick patients test positive • 99% of healthy patients test negative • Given that a patient tests positive, what is probability the patient is sick? 1 Wiggins, SciAm 2006 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 2 / 13
  • 3. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 99% Healthy 100 ppl 9900 ppl 99% Test + 1% Test - 1% Test + 99% Test - 99 ppl 1 per 99 ppl 9801 ppl Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 3 / 13
  • 4. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 99% Healthy 100 ppl 9900 ppl 99% Test + 1% Test - 1% Test + 99% Test - 99 ppl 1 per 99 ppl 9801 ppl So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)! Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 3 / 13
  • 5. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 99% Healthy 100 ppl 9900 ppl 99% Test + 1% Test - 1% Test + 99% Test - 99 ppl 1 per 99 ppl 9801 ppl The small error rate on the large healthy population produces many false positives. Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 3 / 13
  • 6. Inverting conditional probabilities Bayes’ Theorem Equate the far right- and left-hand sides of product rule p (y |x) p (x) = p (x, y ) = p (x|y ) p (y ) and divide to get the probability of y given x from the probability of x given y : p (x|y ) p (y ) p (y |x) = p (x) where p (x) = y ∈ΩY p (x|y ) p (y ) is the normalization constant. Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 4 / 13
  • 7. Diagnoses a la Bayes Given that a patient tests positive, what is probability the patient is sick? 99/100 1/100 p (+|sick) p (sick) 99 1 p (sick|+) = = = p (+) 198 2 99/1002 +99/1002 =198/1002 where p (+) = p (+|sick) p (sick) + p (+|healthy ) p (healthy ). Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 5 / 13
  • 8. (Super) Naive Bayes We can use Bayes’ rule to build a one-word spam classifier: p (word|spam) p (spam) p (spam|word) = p (word) where we estimate these probabilities with ratios of counts: # spam docs containing word p (word|spam) = ˆ # spam docs # ham docs containing word p (word|ham) ˆ = # ham docs # spam docs p (spam) ˆ = # docs # ham docs p (ham) ˆ = # docs Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 6 / 13
  • 9. (Super) Naive Bayes $ ./enron_naive_bayes.sh meeting 1500 spam examples 3672 ham examples 16 spam examples containing meeting 153 ham examples containing meeting estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(meeting|spam) = .0106 estimated P(meeting|ham) = .0416 P(spam|meeting) = .0923 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 7 / 13
  • 10. (Super) Naive Bayes $ ./enron_naive_bayes.sh money 1500 spam examples 3672 ham examples 194 spam examples containing money 50 ham examples containing money estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(money|spam) = .1293 estimated P(money|ham) = .0136 P(spam|money) = .7957 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 8 / 13
  • 11. (Super) Naive Bayes $ ./enron_naive_bayes.sh enron 1500 spam examples 3672 ham examples 0 spam examples containing enron 1478 ham examples containing enron estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(enron|spam) = 0 estimated P(enron|ham) = .4025 P(spam|enron) = 0 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 9 / 13
  • 12. Naive Bayes Represent each document by a binary vector x where xj = 1 if the j-th word appears in the document (xj = 0 otherwise). Modeling each word as an independent Bernoulli random variable, the probability of observing a document x of class c is: x p (x|c) = θjcj (1 − θjc )1−xj j where θjc denotes the probability that the j-th word occurs in a document of class c. Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 10 / 13
  • 13. Naive Bayes Using this likelihood in Bayes’ rule and taking a logarithm, we have: p (x|c) p (c) log p (c|x) = log p (x) θjc θc = xj log + log(1 − θjc ) + log 1 − θjc p (x) j j where θc is the probability of observing a document of class c. Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 11 / 13
  • 14. Naive Bayes We can eliminate p (x) by calculating the log-odds: p (1|x) θj1 (1 − θj0 ) 1 − θj1 θ1 log = xj log + log + log p (0|x) θj0 (1 − θj1 ) 1 − θj0 θ0 j j wj w0 which gives a linear classifier of the form w · x + w0 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 12 / 13
  • 15. Naive Bayes We train by counting words and documents within classes to estimate θjc and θc : ˆ njc θjc = nc ˆ nc θc = n and use these to calculate the weights wj and bias w0 : ˆ ˆ ˆ ˆ θj1 (1 − θj0 ) ˆ wj = log ˆ ˆ θj0 (1 − θj1 ) ˆ 1 − θj1 ˆ θ1 w0 = ˆ log + log . ˆ 1 − θj0 ˆ θ0 j We we predict by simply adding the weights of the words that appear in the document to the bias term. Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 13 / 13
  • 16. Naive Bayes In practice, this works better than one might expect given its simplicity2 2 http://www.jstor.org/pss/1403452 Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 14 / 13
  • 17. Naive Bayes Training is computationally cheap and scalable, and the model is easy to update given new observations2 2 http://www.springerlink.com/content/wu3g458834583125/ Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 14 / 13
  • 18. Naive Bayes Performance varies with document representations and corresponding likelihood models2 2 http://ceas.cc/2006/15.pdf Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 14 / 13
  • 19. Naive Bayes It’s often important to smooth parameter estimates (e.g., by adding pseudocounts) to avoid overfitting Jake Hofman (Columbia University) Data-driven modeling February 6, 2012 14 / 13