SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
The pope is catholic.
language as an interfacelanguage as data
introduction
(@spencermountain)
introduction
problem
problem
problem
london in the rain london in the
in the rain
london in
in the
the rain
london
in
the
rain
4-gram: 3-gram: 2-gram: 1-gram:
10 requests per keystroke
4-words -
5-words : 15 6-words : 21 7-words : 28 8-words : 36
problem
london in the rain
london in the
in the rain
london in
in the
the rain
London
in
the
rain
in
the
london in the
in the rain
london in
the rain
Stopwords
blacklist:
Edge gram
filter:
Redundancy
check:
#1
#2
#3
in the
“rain”
“london”
“london in the rain”
problem
● NLTK - excellent, huge, python
● Stanford parser - excellent, huge, java
Alchemy,
TextRazor,
OpenCalais,
Embedly,
Zemanta
Or an offsite API?● Freeling - excellent, huge, C++
● Illinois tagger - excellent, huge, java
When all you’ve got is a jackhammer..
niche
Can it be hacked?
tldr: yes.
niche
How big is a language?
Shakespeare - 35,000
niche
Zipfs law
The top 10 words account for 25% of language.
The top 100 words account for 50% of language.
The top 50,000 words account for 95% of language.
niche
602 kb
uncompressed
50,000
different words
An average person will ever hear
~4 lookups in binary search
niche
first, let’s kill the
nouns 70%
process
180 kb
uncompressed
Noun Verb Adjective Adverb
Tomato
Tomatoes
Toronto
Torontonian
Speak
Spoke
Speaking
will speak
have spoken
had spoken
...
nice
nicer
nicest
quickly
quicklier
quickliest
“awesome”“awesomeify”
improveify your vocabularies
“quickly”“quick”
n/2.3
Each word
*not handsome *not truly*not is*not economics
“tomatoey”“tomato”
“speak”“speaker”
“aggressive”“agressiveness”
“civil”“civilize”
niche
then, let’s conjugate
our verbs
process
110 kb
uncompressed
process
jQuery
256kb
d3js
330kb
react
653kb
110 kb
uncompressed
lodash
503kb
the whole
english
language
110kb
Ok, let’s roll our own POS tagger..
(what could go rong?)
process
1) ☑ Lexicon
2) ◻ Suffix regexes
3) ◻ Sentence
Suffix rules
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ◻ Sentence
Grammar rules - markov
She could walk the walk .
before: Verb - Det - Verb
after: Verb - Det - Noun
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ☑ Sentence
“Unreasonable effectiveness” of rule-based taggers-
● a 1,000 word lexicon - 45% precision
● fallback to [Noun] - 70% precision
● a little regex - 74% precision
● a little grammar in it - 81% precision
process
t.text(“keep on rocking in the free world”)
t.negate()
//“don’t keep on rocking in the free world.”
outcome
t.text(“it is a cool library”)
t.toValleyGirl()
//“so, it is like, a cool library.”
outcome
We gave the monkeys the
bananas,
..because they were ripe. ..because they were hungry.
outcome
We gave the monkeys the bananas
[Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
list of letters
POS-tagging
Dependency parser We give [Noun] [Noun]
[act / transfer / voluntary] [genus / monkey]
Knowledge engine
[plant / banana]
outcome
#TODOFML
● Mutable/Immutable API
● Speed, performance testing
● Romantic-language verb conjugations
● ‘bl.ocks.org’ of demos and docs
outcome
npm install --wooyeah
@spencermountain
Slack group, mailing list, github, Toronto/coffee

Más contenido relacionado

Similar a nlp_compromise

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...Apache OpenNLP
 
Text generation and_advanced_topics
Text generation and_advanced_topicsText generation and_advanced_topics
Text generation and_advanced_topicsankit_ppt
 
Collatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin textsCollatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin textsEquipex Biblissima
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Pythontswr
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Sequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RASTSequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RASTwltrimbl
 
An overview of JavaScript and Node.js
An overview of JavaScript and Node.jsAn overview of JavaScript and Node.js
An overview of JavaScript and Node.jsLuciano Mammino
 
Intuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondIntuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondC4Media
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPInsoo Chung
 
Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.⌨️ Steven Proctor
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4DigiGurukul
 
Here be dragons
Here be dragonsHere be dragons
Here be dragonsdeelay1
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonClare Corthell
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text ProcessingSuneel Marthi
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextDataWorks Summit
 
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vecword2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec👋 Christopher Moody
 
Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)Omar Abdelhafith
 
Learning for sequences - Adam Mathias
Learning for sequences  - Adam MathiasLearning for sequences  - Adam Mathias
Learning for sequences - Adam MathiasDataFest Tbilisi
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 

Similar a nlp_compromise (20)

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
About programming languages
About programming languagesAbout programming languages
About programming languages
 
Text generation and_advanced_topics
Text generation and_advanced_topicsText generation and_advanced_topics
Text generation and_advanced_topics
 
Collatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin textsCollatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin texts
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Sequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RASTSequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RAST
 
An overview of JavaScript and Node.js
An overview of JavaScript and Node.jsAn overview of JavaScript and Node.js
An overview of JavaScript and Node.js
 
Intuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondIntuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyond
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Here be dragons
Here be dragonsHere be dragons
Here be dragons
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in Python
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text Processing
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vecword2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
 
Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)
 
Learning for sequences - Adam Mathias
Learning for sequences  - Adam MathiasLearning for sequences  - Adam Mathias
Learning for sequences - Adam Mathias
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 

Más de TorontoNodeJS

Node.js API pitfalls
Node.js API pitfallsNode.js API pitfalls
Node.js API pitfallsTorontoNodeJS
 
Safely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages TodaySafely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages TodayTorontoNodeJS
 
Understanding the Single Thread Event Loop
Understanding the Single Thread Event LoopUnderstanding the Single Thread Event Loop
Understanding the Single Thread Event LoopTorontoNodeJS
 
Avoiding callback hell with promises
Avoiding callback hell with promisesAvoiding callback hell with promises
Avoiding callback hell with promisesTorontoNodeJS
 
Building your own slack bot on the AWS stack
Building your own slack bot on the AWS stackBuilding your own slack bot on the AWS stack
Building your own slack bot on the AWS stackTorontoNodeJS
 

Más de TorontoNodeJS (7)

Node.js API pitfalls
Node.js API pitfallsNode.js API pitfalls
Node.js API pitfalls
 
Safely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages TodaySafely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages Today
 
KoNote
KoNoteKoNote
KoNote
 
Understanding the Single Thread Event Loop
Understanding the Single Thread Event LoopUnderstanding the Single Thread Event Loop
Understanding the Single Thread Event Loop
 
Avoiding callback hell with promises
Avoiding callback hell with promisesAvoiding callback hell with promises
Avoiding callback hell with promises
 
Node as an API shim
Node as an API shimNode as an API shim
Node as an API shim
 
Building your own slack bot on the AWS stack
Building your own slack bot on the AWS stackBuilding your own slack bot on the AWS stack
Building your own slack bot on the AWS stack
 

Último

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

nlp_compromise

  • 1. The pope is catholic. language as an interfacelanguage as data introduction
  • 6. london in the rain london in the in the rain london in in the the rain london in the rain 4-gram: 3-gram: 2-gram: 1-gram: 10 requests per keystroke 4-words - 5-words : 15 6-words : 21 7-words : 28 8-words : 36 problem
  • 7. london in the rain london in the in the rain london in in the the rain London in the rain in the london in the in the rain london in the rain Stopwords blacklist: Edge gram filter: Redundancy check: #1 #2 #3 in the “rain” “london” “london in the rain” problem
  • 8. ● NLTK - excellent, huge, python ● Stanford parser - excellent, huge, java Alchemy, TextRazor, OpenCalais, Embedly, Zemanta Or an offsite API?● Freeling - excellent, huge, C++ ● Illinois tagger - excellent, huge, java When all you’ve got is a jackhammer.. niche
  • 9. Can it be hacked? tldr: yes. niche
  • 10. How big is a language? Shakespeare - 35,000 niche
  • 11. Zipfs law The top 10 words account for 25% of language. The top 100 words account for 50% of language. The top 50,000 words account for 95% of language. niche
  • 12. 602 kb uncompressed 50,000 different words An average person will ever hear ~4 lookups in binary search niche
  • 13. first, let’s kill the nouns 70% process 180 kb uncompressed
  • 14. Noun Verb Adjective Adverb Tomato Tomatoes Toronto Torontonian Speak Spoke Speaking will speak have spoken had spoken ... nice nicer nicest quickly quicklier quickliest “awesome”“awesomeify” improveify your vocabularies “quickly”“quick” n/2.3 Each word *not handsome *not truly*not is*not economics “tomatoey”“tomato” “speak”“speaker” “aggressive”“agressiveness” “civil”“civilize” niche
  • 15. then, let’s conjugate our verbs process 110 kb uncompressed
  • 17. Ok, let’s roll our own POS tagger.. (what could go rong?) process 1) ☑ Lexicon 2) ◻ Suffix regexes 3) ◻ Sentence
  • 18. Suffix rules process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ◻ Sentence
  • 19. Grammar rules - markov She could walk the walk . before: Verb - Det - Verb after: Verb - Det - Noun process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ☑ Sentence
  • 20. “Unreasonable effectiveness” of rule-based taggers- ● a 1,000 word lexicon - 45% precision ● fallback to [Noun] - 70% precision ● a little regex - 74% precision ● a little grammar in it - 81% precision process
  • 21. t.text(“keep on rocking in the free world”) t.negate() //“don’t keep on rocking in the free world.” outcome
  • 22. t.text(“it is a cool library”) t.toValleyGirl() //“so, it is like, a cool library.” outcome
  • 23. We gave the monkeys the bananas, ..because they were ripe. ..because they were hungry. outcome
  • 24. We gave the monkeys the bananas [Pr] [Verb] [Dt] [Noun] [Dt] [Noun] list of letters POS-tagging Dependency parser We give [Noun] [Noun] [act / transfer / voluntary] [genus / monkey] Knowledge engine [plant / banana] outcome
  • 25. #TODOFML ● Mutable/Immutable API ● Speed, performance testing ● Romantic-language verb conjugations ● ‘bl.ocks.org’ of demos and docs outcome
  • 26. npm install --wooyeah @spencermountain Slack group, mailing list, github, Toronto/coffee