SlideShare a Scribd company logo
1 of 18
From Linguistic Rules to Machine Learning

              Cohan Sujay Carlos
                 Aiaioo Labs
               Bangalore, India
              cohan@aiaioo.com
What is a Classifier?

A machine learning tool used to apply a label to data.
Classification used in Text Categorization
        Politics                   Sports
   The UN Security            Warwickshire's Clarke
   Council adopts its first   equalled the first-class
   clear condemnation of      record of seven
   Syria for its continuing   catches for an
   crackdown on               outfielder in an
   protests, as the army      innings but Lancashire
   continues its advance      took control on day
   into Hama.                 three.
How to Build a Classifier for Text Categorization
          How do you tell which label (Politics/Sports) is suitable?


  The UN Security                         Warwickshire's Clarke
  Council adopts its first                equalled the first-class
  clear condemnation of                   record of seven
  Syria for its continuing                catches for an
  crackdown on                            outfielder in an
  protests, as the army                   innings but Lancashire
  continues its advance                   took control on day
  into Hama.                              three.
In this case it’s only words
    that you will need!
Classification used in Text Categorization
                       See the words?


   The UN Security                 Warwickshire's Clarke
   Council adopts its first        equalled the first-class
   clear condemnation of           record of seven
   Syria for its continuing        catches for an
   crackdown on                    outfielder in an
   protests, as the army           innings but Lancashire
   continues its advance           took control on day
   into Hama.                      three.
Rule-Based Text Categorization
                   Gazetteers (word lists)


UN Security Council                  Warwickshire
Adopts                               Clarke
Condemnation                         First-class
Syria                                Record
Crackdown                            Catches
Protests                             Outfielder
Army                                 Innings
Hama                                 Lancashire

        So you can just use word lists for classification?
        Yeah, but they won’t work very well.
        Can you see why word lists alone won’t work very well?
Rule-Based to Naïve Bayesian

How can you go:

from the starting point (word lists)

to a really cool classification algorithm

     All you need is weights!
Rule-Based with Weights
           Let’s improve the gazetteers with weights

   Politics                                 Sports
UN                 1.0              Warwickshire         0.3
Adopts             0.1              Clarke               0.1
Condemnation       0.2              First-class          0.6
Syria              0.3              Record               0.3
Crackdown          0.8              Catches              0.6
Protests           1.0              Outfielder           1.0
Army               0.8              Innings              0.9
Hama               1.0              Lancashire           0.5
       These weights are nothing but P(Category|Word).
Rule-Based with Weights
         Let’s improve the gazetteers with weights

   Politics                               Sports
UN               1.0              Warwickshire             0.3
Adopts           0.1              Clarke                   0.1
Condemnation     0.2              First-class              0.6
Syria            0.3              Record                   0.3
Crackdown        0.8              Catches                  0.6
Protests         1.0              Outfielder               1.0
Army             0.8              Innings                  0.9
Hama             1.0              Lancashire               0.5
          P(Politics|Word)                           P(Sports|Word)
Rule-Based with Weights

   Politics
UN             1.0   P(Politics|UN)
Adopts         0.1   P(Politics|Adopts)
Condemnation   0.2   P(Politics|Condemnation)
Syria          0.3   P(Politics|Syria)
Crackdown      0.8   P(Politics|Crackdown)
Protests       1.0   P(Politics|Protests)
Army           0.8   P(Politics|Army)
Hama           1.0   P(Politics|Hama)
Rule-Based with Weights

    Politics
UN          1.0      P(Politics | “UN”)
Adopts      0.1      P(Politics | “Adopts”)

         How can you learn these probabilities automatically?
Rule-Based with Weights

        Politics
   UN             1.0      P(Politics | “UN”)
   Adopts         0.1      P(Politics | “Adopts”)

               How can you learn these probabilities automatically?


        Estimation
    P(Politics | “UN”) = 20/20

Statistically not a very accurate estimator - denominator is small.
Rule-Based with Weights

    Politics
UN           1.0      P(Politics | “UN”)
Adopts       0.1      P(Politics | “Adopts”)

          How can you learn these probabilities automatically?


    Instead you Estimate
P(“UN”|Politics) = 20/40000
{ C(“UN” in politics) / C(all words in category politics) }

                          Statistically this is a better estimator
Rule-Based with Weights

      Politics
 UN            1.0      P(Politics | “UN”)
 Adopts        0.1      P(Politics | “Adopts”)

            How can you learn these probabilities automatically?



A Naïve Bayesian classifier uses a P(Politics | “UN”)
estimate calculated from P(“UN”|Politics).

       That’s so cool! Time to learn how to do that!
Use Bayesian Inversion!


In other words, we are looking to turn
P(F|E) into P(E|F).

There is an equation to do this :
P(E|F) = P(F|E) * P(E) / P(F) [Bayesian Inversion]
So finally … you have …

           Politics
   UN                  1.0      P(“UN”|Politics)* P(Politics)/P(“UN”)
   Adopts              0.1      P(“Adopts”|Politics)*P(Politics)/P(“Adopts”)




                          That was easy wasn’t it?!




These don’t have to be only words. They can be ANY sort of feature (word pairs, syntax).
You have just Learnt
               How to Build
        A Naïve Bayesian Classifier
Starting from Linguistic Rules (Word Lists)

              Cohan Sujay Carlos
                 Aiaioo Labs
               Bangalore, India
              cohan@aiaioo.com

More Related Content

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Rules engines to machine learning

  • 1. From Linguistic Rules to Machine Learning Cohan Sujay Carlos Aiaioo Labs Bangalore, India cohan@aiaioo.com
  • 2. What is a Classifier? A machine learning tool used to apply a label to data.
  • 3. Classification used in Text Categorization Politics Sports The UN Security Warwickshire's Clarke Council adopts its first equalled the first-class clear condemnation of record of seven Syria for its continuing catches for an crackdown on outfielder in an protests, as the army innings but Lancashire continues its advance took control on day into Hama. three.
  • 4. How to Build a Classifier for Text Categorization How do you tell which label (Politics/Sports) is suitable? The UN Security Warwickshire's Clarke Council adopts its first equalled the first-class clear condemnation of record of seven Syria for its continuing catches for an crackdown on outfielder in an protests, as the army innings but Lancashire continues its advance took control on day into Hama. three.
  • 5. In this case it’s only words that you will need!
  • 6. Classification used in Text Categorization See the words? The UN Security Warwickshire's Clarke Council adopts its first equalled the first-class clear condemnation of record of seven Syria for its continuing catches for an crackdown on outfielder in an protests, as the army innings but Lancashire continues its advance took control on day into Hama. three.
  • 7. Rule-Based Text Categorization Gazetteers (word lists) UN Security Council Warwickshire Adopts Clarke Condemnation First-class Syria Record Crackdown Catches Protests Outfielder Army Innings Hama Lancashire So you can just use word lists for classification? Yeah, but they won’t work very well. Can you see why word lists alone won’t work very well?
  • 8. Rule-Based to Naïve Bayesian How can you go: from the starting point (word lists) to a really cool classification algorithm All you need is weights!
  • 9. Rule-Based with Weights Let’s improve the gazetteers with weights Politics Sports UN 1.0 Warwickshire 0.3 Adopts 0.1 Clarke 0.1 Condemnation 0.2 First-class 0.6 Syria 0.3 Record 0.3 Crackdown 0.8 Catches 0.6 Protests 1.0 Outfielder 1.0 Army 0.8 Innings 0.9 Hama 1.0 Lancashire 0.5 These weights are nothing but P(Category|Word).
  • 10. Rule-Based with Weights Let’s improve the gazetteers with weights Politics Sports UN 1.0 Warwickshire 0.3 Adopts 0.1 Clarke 0.1 Condemnation 0.2 First-class 0.6 Syria 0.3 Record 0.3 Crackdown 0.8 Catches 0.6 Protests 1.0 Outfielder 1.0 Army 0.8 Innings 0.9 Hama 1.0 Lancashire 0.5 P(Politics|Word) P(Sports|Word)
  • 11. Rule-Based with Weights Politics UN 1.0 P(Politics|UN) Adopts 0.1 P(Politics|Adopts) Condemnation 0.2 P(Politics|Condemnation) Syria 0.3 P(Politics|Syria) Crackdown 0.8 P(Politics|Crackdown) Protests 1.0 P(Politics|Protests) Army 0.8 P(Politics|Army) Hama 1.0 P(Politics|Hama)
  • 12. Rule-Based with Weights Politics UN 1.0 P(Politics | “UN”) Adopts 0.1 P(Politics | “Adopts”) How can you learn these probabilities automatically?
  • 13. Rule-Based with Weights Politics UN 1.0 P(Politics | “UN”) Adopts 0.1 P(Politics | “Adopts”) How can you learn these probabilities automatically? Estimation P(Politics | “UN”) = 20/20 Statistically not a very accurate estimator - denominator is small.
  • 14. Rule-Based with Weights Politics UN 1.0 P(Politics | “UN”) Adopts 0.1 P(Politics | “Adopts”) How can you learn these probabilities automatically? Instead you Estimate P(“UN”|Politics) = 20/40000 { C(“UN” in politics) / C(all words in category politics) } Statistically this is a better estimator
  • 15. Rule-Based with Weights Politics UN 1.0 P(Politics | “UN”) Adopts 0.1 P(Politics | “Adopts”) How can you learn these probabilities automatically? A Naïve Bayesian classifier uses a P(Politics | “UN”) estimate calculated from P(“UN”|Politics). That’s so cool! Time to learn how to do that!
  • 16. Use Bayesian Inversion! In other words, we are looking to turn P(F|E) into P(E|F). There is an equation to do this : P(E|F) = P(F|E) * P(E) / P(F) [Bayesian Inversion]
  • 17. So finally … you have … Politics UN 1.0 P(“UN”|Politics)* P(Politics)/P(“UN”) Adopts 0.1 P(“Adopts”|Politics)*P(Politics)/P(“Adopts”) That was easy wasn’t it?! These don’t have to be only words. They can be ANY sort of feature (word pairs, syntax).
  • 18. You have just Learnt How to Build A Naïve Bayesian Classifier Starting from Linguistic Rules (Word Lists) Cohan Sujay Carlos Aiaioo Labs Bangalore, India cohan@aiaioo.com