600.465 Connecting the dots - I (NLP in Practice). Delip Rao, delip@jhu.edu
What is “Text”?
“Real” World: Tons of data on the web. A lot of it is text, in many languages and in many genres. Language by itself is complex. The Web further complicates language.
But we have 600.465
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
Forward-Backward, Gradient Descent, LBFGS, Simulated Annealing, Contrastive Estimation, … feature functions! f(w_i = off, w_{i+1} = the), f(w_i = obama, y_i = NP). Adapted from: Jason Eisner
NLP for fun and profit. Making NLP more accessible: provide APIs for common NLP tasks. var text = document.get(…); var entities = agent.markNE(text); Big $$$$. Backend to intelligent processing of text.
Desideratum: Multilinguality. Except for feature extraction, systems should be language agnostic.
In this lecture: Understand how to solve and ace NLP tasks. Learn general methodology or approaches. End-to-end development using an example task. Overview of (un)common NLP tasks.
Case study: Named Entity Recognition
Case study: Named Entity Recognition. Demo: http://viewer.opencalais.com
How do we find out how well we are doing?
How can we improve?
Case study: Named Entity Recognition. Collect data to learn from: sentences with words marked as PER, ORG, LOC, NONE. How do we get this data?
Pay the experts
Wisdom of the crowds
Getting the data: Annotation. Time consuming. Costs $$$. Need for quality control: inter-annotator agreement, Kappa score (Krippendorff, 1980). Smarter ways to annotate: get fewer annotations (Active Learning), Rationales (Zaidan, Eisner & Piatko, 2007).
Input x: Only France and Great Britain backed Fischler ’s proposal . Labels y: the same sentence with each token labeled.
Our recipe …
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
NER: Designing features. Need to segment sentences. Tokenize the sentences. Preprocessing: not as trivial as you think. Original text itself might be in ugly HTML. CleanEval!
NER: Designing features
NER: Designing features These are extracted during preprocessing!

NLP in Practice - Part I

  • 1. 600.465 Connecting the dots - I (NLP in Practice). Delip Rao, delip@jhu.edu
  • 6. “Real” World: Tons of data on the web. A lot of it is text, in many languages and in many genres. Language by itself is complex. The Web further complicates language.
  • 9. 2. Study the formalism mathematically
  • 10. 3. Develop & implement algorithms
  • 11. 4. Test on real data. Forward-Backward, Gradient Descent, LBFGS, Simulated Annealing, Contrastive Estimation, … feature functions! f(w_i = off, w_{i+1} = the), f(w_i = obama, y_i = NP). Adapted from: Jason Eisner
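The two feature functions on this slide can be read as binary indicators over a sentence, a label sequence, and a position. The following is a minimal sketch under that reading; the function names, the example sentence, and its labels are made up for illustration and are not from the lecture.

```python
from typing import Callable, List

# Assumed signature: (words, labels, position) -> 0 or 1
FeatureFn = Callable[[List[str], List[str], int], int]


def f_off_the(words: List[str], labels: List[str], i: int) -> int:
    """f(w_i = off, w_{i+1} = the): fires when 'off' is followed by 'the'."""
    return int(
        i + 1 < len(words)
        and words[i].lower() == "off"
        and words[i + 1].lower() == "the"
    )


def f_obama_np(words: List[str], labels: List[str], i: int) -> int:
    """f(w_i = obama, y_i = NP): fires when 'obama' carries the label NP."""
    return int(words[i].lower() == "obama" and labels[i] == "NP")


def feature_counts(words: List[str], labels: List[str],
                   fns: List[FeatureFn]) -> List[int]:
    """Sum each indicator over all positions of the sentence."""
    return [sum(fn(words, labels, i) for i in range(len(words))) for fn in fns]


# Hypothetical sentence and labels, for illustration only.
words = ["Obama", "took", "off", "the", "jacket"]
labels = ["NP", "VP", "VP", "NP", "NP"]
counts = feature_counts(words, labels, [f_off_the, f_obama_np])
print(counts)
```

Summed counts of this kind are what a log-linear model multiplies by its weights, which is where the training algorithms listed on the slide come in.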
  • 12. NLP for fun and profit. Making NLP more accessible: provide APIs for common NLP tasks. var text = document.get(…); var entities = agent.markNE(text); Big $$$$. Backend to intelligent processing of text.
  • 13. Desideratum: Multilinguality. Except for feature extraction, systems should be language agnostic.
  • 14. In this lecture: Understand how to solve and ace NLP tasks. Learn general methodology or approaches. End-to-end development using an example task. Overview of (un)common NLP tasks.
  • 15. Case study: Named Entity Recognition
  • 17. How do we find out how well we are doing?
  • 19. Case study: Named Entity Recognition. Collect data to learn from: sentences with words marked as PER, ORG, LOC, NONE. How do we get this data?
  • 21. Wisdom of the crowds
  • 22. Getting the data: Annotation. Time consuming. Costs $$$. Need for quality control: inter-annotator agreement, Kappa score (Krippendorff, 1980). Smarter ways to annotate: get fewer annotations (Active Learning), Rationales (Zaidan, Eisner & Piatko, 2007).
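As one concrete way to quantify inter-annotator agreement, the sketch below computes Cohen's kappa for two annotators. This is a swapped-in, commonly used kappa variant and may not be the exact statistic the slide has in mind; the two annotation lists are invented for illustration.

```python
from collections import Counter
from typing import List


def cohen_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six tokens by two annotators.
annotator_1 = ["LOC", "NONE", "LOC", "PER", "NONE", "NONE"]
annotator_2 = ["LOC", "NONE", "ORG", "PER", "NONE", "NONE"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on 5 of 6 tokens, but because NONE dominates, chance agreement is already 1/3, so kappa comes out lower than raw agreement.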
  • 23. Input x: Only France and Great Britain backed Fischler ’s proposal . Labels y: the same sentence with each token labeled.
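A minimal sketch of how the input x and labels y might be stored: one label per token, using the PER/ORG/LOC/NONE scheme from the earlier slide. The specific labels below are an assumption made for illustration, since the slide's label markup did not survive extraction.

```python
from itertools import groupby
from typing import List, Tuple

# Input x: the tokenized sentence from the slide.
tokens = "Only France and Great Britain backed Fischler 's proposal .".split()

# Labels y: ASSUMED for illustration (France/Great Britain as LOC, Fischler as PER).
labels = ["NONE", "LOC", "NONE", "LOC", "LOC",
          "NONE", "PER", "NONE", "NONE", "NONE"]


def labeled_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group adjacent tokens that share a non-NONE label into entity spans."""
    spans = []
    i = 0
    for label, group in groupby(labels):
        n = len(list(group))
        if label != "NONE":
            spans.append((" ".join(tokens[i:i + n]), label))
        i += n
    return spans


spans = labeled_spans(tokens, labels)
print(spans)
```

One limitation of this plain scheme is that two adjacent entities of the same type would be merged into one span, which is a reason to mark entity beginnings explicitly, as the B-PER label on the evaluation slide does.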
  • 25. 2. Study the formalism mathematically
  • 26. 3. Develop & implement algorithms
  • 27. 4. Test on real data. Our recipe …
  • 28. NER: Designing features. Need to segment sentences. Tokenize the sentences. Preprocessing: not as trivial as you think. Original text itself might be in ugly HTML. CleanEval!
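The preprocessing steps named on this slide (clean the HTML, segment sentences, tokenize) can be sketched with the standard library alone. The regular expressions below are deliberately naive assumptions, not the lecture's method; for example, the sentence splitter breaks on abbreviations such as "Mr. Smith", which is one reason the task is "not as trivial as you think".

```python
import html
import re
from typing import List


def strip_html(raw: str) -> str:
    """Drop script/style blocks and tags, decode entities, collapse whitespace."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def split_sentences(text: str) -> List[str]:
    """Naive rule: split after . ! ? when followed by space and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def tokenize(sentence: str) -> List[str]:
    """Words (with internal hyphens), clitics like 's, and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|'\w+|[^\w\s]", sentence)


raw = "<p>Only France &amp; Great Britain backed Fischler's proposal. Others did not.</p>"
sentences = split_sentences(strip_html(raw))
first_tokens = tokenize(sentences[0])
print(sentences)
print(first_tokens)
```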
  • 33. NER: Designing features. These are extracted during preprocessing!
  • 36. NER: Designing features. Can you think of other features? HAS_DIGITS, IS_HYPHENATED, IS_ALLCAPS, FREQ_WORD, RARE_WORD, USEFUL_UNIGRAM_PER, USEFUL_BIGRAM_PER, USEFUL_UNIGRAM_LOC, USEFUL_BIGRAM_LOC, USEFUL_UNIGRAM_ORG, USEFUL_BIGRAM_ORG, USEFUL_SUFFIX_PER, USEFUL_SUFFIX_LOC, USEFUL_SUFFIX_ORG, WORD, PREV_WORD, NEXT_WORD, PREV_BIGRAM, NEXT_BIGRAM, POS, PREV_POS, NEXT_POS, PREV_POS_BIGRAM, NEXT_POS_BIGRAM, IN_LEXICON_PER, IN_LEXICON_LOC, IN_LEXICON_ORG, IS_CAPITALIZED
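A few of the listed features need nothing beyond the tokens themselves. The sketch below computes that subset for one position; the function name and the "<S>" / "</S>" boundary symbols are assumptions, and features that need external resources (POS tags, lexicons, frequency counts) are left out.

```python
from typing import Dict, List, Union

FeatureValue = Union[str, bool]


def token_features(tokens: List[str], i: int) -> Dict[str, FeatureValue]:
    """Surface features for the token at position i (a subset of the slide's list)."""
    word = tokens[i]
    return {
        "WORD": word.lower(),
        # "<S>" and "</S>" are assumed padding symbols for sentence boundaries.
        "PREV_WORD": tokens[i - 1].lower() if i > 0 else "<S>",
        "NEXT_WORD": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
        "IS_CAPITALIZED": word[:1].isupper(),
        "IS_ALLCAPS": word.isupper(),
        "HAS_DIGITS": any(ch.isdigit() for ch in word),
        "IS_HYPHENATED": "-" in word,
    }


tokens = ["Only", "France", "and", "Great", "Britain", "backed",
          "Fischler", "'s", "proposal", "."]
feats = token_features(tokens, 1)  # features for "France"
print(feats)
```

Note that everything except these feature definitions can stay language agnostic, which matches the multilinguality desideratum earlier in the deck.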
  • 37. Case: Named Entity Recognition. Evaluation metrics. Token accuracy: what percent of the tokens got labeled correctly. Problem with accuracy. Precision-Recall-F. Example: president/O Barack/B-PER Obama/O
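The slide's example can be worked through to see why token accuracy is misleading. The sketch below treats the slide's labels as a system prediction and assumes a gold labeling in which "Obama" is I-PER; that gold sequence and the exact-match span scoring are assumptions for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Collect entity spans from B-/I-/O labels; stray I- labels are ignored."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a final span
        inside = label.startswith("I-") and kind == label[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return spans


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only if it matches exactly."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["O", "B-PER", "I-PER"]   # ASSUMED gold labels for: president Barack Obama
pred = ["O", "B-PER", "O"]       # labels shown on the slide
accuracy = token_accuracy(gold, pred)
scores = precision_recall_f1(gold, pred)
print(accuracy, scores)
```

Two of three tokens are correct, yet the only entity is not recovered, so the entity-level precision, recall, and F are all zero under this scoring.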
  • 38. NER: How can we improve? Engineer better features. Design better models: Conditional Random Fields. [Figure: chain of labels Y1 Y2 Y3 Y4 over inputs x1 x2 x3 x4]
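For reference, a linear-chain Conditional Random Field over labels y_1 … y_T and inputs x is usually written as below. This is the standard textbook form, not taken from the slides; the weights θ_k, feature functions f_k, and normalizer Z(x) are notation introduced here.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each f_k plays the role of the feature functions from the recipe slide, with the difference that it may also look at the previous label y_{t-1}.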
  • 39. NER: How else can we improve? Unlabeled data! (Example from Jerry Zhu.)
  • 40. NER: Challenges. Domain transfer: WSJ → NYT, WSJ → Blogs ??, WSJ → Twitter ??!? Tough nut: Organizations. Non-textual data? Entity Extraction is a Boring Solved Problem – or is it? (Vilain, Su and Lubar, 2007)
  • 41. NER: Related application. Extracting real estate information from Craigslist ads. Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.
  • 42. NER: Related application. BioNLP: Annotation of chemical entities (Corbett, Batchelor & Teufel, 2007)
  • 43. Shared Tasks: NLP in practice. Shared task: everybody works on a (mostly) common dataset. Evaluation measures are defined. Participants get ranked on the evaluation measures. Advance the state of the art. Set benchmarks. Tasks involve common hard problems or new interesting problems.