An introduction to performing natural language processing (NLP) tasks in Ruby. Video is here: https://skillsmatter.com/skillscasts/4883-how-to-parse-go#video
6. document
sentence
word
example
Chunking & segmenting
Breaking text into paragraphs, sentences and other zones
Start with a document/some text:
“The second nonabsolute number is the given time of
arrival, which is now known to be one of those most bizarre
of mathematical concepts, a recipriversexclusion, a number
whose existence can only be defined as being anything other
than itself…”
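A minimal sketch of the segmenting step in plain Ruby, assuming that a sentence ends at `.`, `!` or `?` followed by whitespace and a capital letter, and that paragraphs are separated by blank lines. The method names `segment_paragraphs` and `segment_sentences` are made up for this illustration; a trained segmenter such as Punkt copes with abbreviations and other edge cases that this naive rule gets wrong.

```ruby
# Split text into paragraphs on blank lines.
def segment_paragraphs(text)
  text.split(/\n\s*\n/).map(&:strip).reject(&:empty?)
end

# Naive sentence splitter: break on whitespace that follows ., ! or ?
# and precedes a capital letter. Abbreviations such as "Dr. Smith"
# will be split incorrectly; this is only a sketch.
def segment_sentences(paragraph)
  paragraph.strip.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

text = "It was late. The ship landed!\n\nWhere were we?"
segment_paragraphs(text).each do |paragraph|
  p segment_sentences(paragraph)
end
```

The lookbehind and lookahead keep the punctuation attached to the sentence it ends, so nothing is lost when splitting.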
16. A couple of methods
Regex tagger
/.*ing/ => VBG
/.*ed/ => VBD
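The regex tagger idea can be sketched in a few lines of Ruby. The rule table below uses the two suffixes from the slide, anchored to the end of the word; the `regex_tag` name and the `NN` fallback for unmatched words are assumptions made for this example.

```ruby
# Suffix rules, checked in order. The first matching pattern wins.
SUFFIX_RULES = [
  [/ing\z/, "VBG"], # gerund / present participle
  [/ed\z/,  "VBD"]  # past tense
].freeze

# Returns the tag of the first matching rule, or the default tag
# (an assumption here: unknown words are treated as nouns).
def regex_tag(word, default = "NN")
  rule = SUFFIX_RULES.find { |pattern, _tag| word =~ pattern }
  rule ? rule.last : default
end

puts regex_tag("calculating") # VBG
puts regex_tag("parsed")      # VBD
puts regex_tag("orange")      # NN
```

Suffix rules are cheap but crude: "thing" and "bed" would be tagged as verbs, which is why this is usually only a fallback.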
Lookup on words
E.g.
calculating: { VBG: 6 }
orange: { JJ: 2, NN: 5 }
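A rough sketch of the lookup approach, using the two entries from the slide as the whole lexicon. Each word maps to the number of times it was seen with each tag, and the most frequent tag wins. The `LEXICON` constant and `lookup_tag` method are illustrative names, not part of any gem.

```ruby
# Tag counts per word, as they might be gathered from a tagged corpus.
LEXICON = {
  "calculating" => { "VBG" => 6 },
  "orange"      => { "JJ" => 2, "NN" => 5 }
}.freeze

# Returns the most frequently seen tag for the word,
# or nil when the word is not in the lexicon.
def lookup_tag(word, lexicon = LEXICON)
  counts = lexicon[word.downcase]
  return nil unless counts

  counts.max_by { |_tag, count| count }.first
end

puts lookup_tag("orange")      # NN
puts lookup_tag("Calculating") # VBG
p lookup_tag("zaphod")         # nil
```

Returning `nil` for unknown words makes it easy to fall back to another method, such as suffix rules.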
17. A tale of two taggers
EngTagger
• Probabilistic (uses the lookup table from the previous slide)
• Pure Ruby
rb-brill-tagger
• Rule based
• C extensions
• Brown corpus trained
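To make the rule-based idea concrete, here is a toy sketch in the style of a Brill tagger: give every word its most likely tag first, then let contextual rules correct mistakes. This is not the rb-brill-tagger API; the lexicon, the single rule and the `brill_style_tag` method are all invented for the illustration.

```ruby
# Hypothetical starting lexicon: the most frequent tag for each word.
INITIAL_TAGS = {
  "I" => "PRP", "want" => "VBP", "to" => "TO", "the" => "DT", "run" => "NN"
}.freeze

# Contextual rules: [from_tag, to_tag, required_previous_tag].
# "Change NN to VB when the previous tag is TO."
RULES = [["NN", "VB", "TO"]].freeze

def brill_style_tag(words)
  # Step 1: initial guess from the lexicon (unknown words default to NN).
  tags = words.map { |word| INITIAL_TAGS.fetch(word, "NN") }

  # Step 2: apply each correction rule across the sentence.
  RULES.each do |from, to, previous|
    tags.each_index do |i|
      next if i.zero?
      tags[i] = to if tags[i] == from && tags[i - 1] == previous
    end
  end

  words.zip(tags)
end

p brill_style_tag(%w[I want to run])
# [["I", "PRP"], ["want", "VBP"], ["to", "TO"], ["run", "VB"]]
```

A probabilistic tagger would instead score whole tag sequences using counts like those in the lookup table.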
18. Treat gem
Bundles many of the gems shown
Wraps them in a DSL
s = sentence("A really good sentence.")
s.do(:chunk, :segment, :tokenize, :parse)
Stemming; tokenising; chunking; serialising; tagging; text extraction from PDFs and HTML
19. LRUG Sentiments
A tag
{NN}
Pass in regex => /({JJ}|{JJS})({NNS}|{NNP})/
And some tagged tokens
#=> [(Word @tag="JJ", @text="jolly"),
(Word @tag="NN", @text="face")]
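One possible way to implement matching a tag regex against tagged tokens, sketched under assumptions: the tags are joined into a string such as `<DT><JJ><NN>`, the `{TAG}` placeholders are rewritten to `<TAG>`, and an ordinary Ruby regex does the rest. The `Word` struct, `compile_tag_pattern` and `match_tags` are hypothetical names, and the pattern used below adds `{NN}` so that the "jolly face" example matches.

```ruby
Word = Struct.new(:text, :tag)

# Turn "{JJ}" placeholders into "<JJ>" so they only match whole tags.
def compile_tag_pattern(source)
  Regexp.new(source.gsub(/\{(\w+)\}/) { "<#{$1}>" })
end

# Returns an array of matches, each an array of the Words it covers.
def match_tags(words, pattern_source)
  pattern = compile_tag_pattern(pattern_source)
  tag_string = words.map { |word| "<#{word.tag}>" }.join

  matches = []
  tag_string.scan(pattern) do
    match = Regexp.last_match
    # Count the tags before the match to find the first word index.
    first = tag_string[0...match.begin(0)].count("<")
    length = match[0].count("<")
    matches << words[first, length]
  end
  matches
end

words = [Word.new("a", "DT"), Word.new("jolly", "JJ"), Word.new("face", "NN")]
match_tags(words, "({JJ}|{JJS})({NN}|{NNS}|{NNP})").each do |match|
  puts match.map(&:text).join(" ") # jolly face
end
```

Wrapping each tag in angle brackets stops `JJ` from matching inside `JJS`. Tags containing non-word characters, such as `PRP$`, would need a wider placeholder pattern.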
22. Gems
Text - Paul Battley’s box of tricks
Treat
Tokenizer
Punkt segmenter
Chronic - for extracting dates
23. Other things you can do/I didn’t talk about
Calculate text edit distance
Extract entities using the Stanford libraries via RJB (Ruby Java Bridge)
Extract topic words (LDA)
Keyword extraction - TF-IDF
JRuby
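For the text edit distance item, here is a small pure-Ruby sketch of Levenshtein distance: the number of single-character insertions, deletions and substitutions needed to turn one string into another. The `edit_distance` name is chosen for this example; the Text gem listed on the Gems slide ships a Levenshtein implementation that would be the practical choice.

```ruby
# Levenshtein distance using two rows of the dynamic-programming table.
def edit_distance(a, b)
  # Distance from the empty prefix of a to each prefix of b.
  previous = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    current = [i] # distance from a[0, i] to the empty string
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end

  previous.last
end

puts edit_distance("kitten", "sitting") # 3
puts edit_distance("ruby", "ruby")      # 0
```

Keeping only two rows uses memory proportional to the length of the second string rather than the full table.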
24. Thank you for processing.
Questions?
@tomcartwrightuk
Thanks to Tim Cowlishaw and the HT dev
team for specialised rubber duck support