Exploring Natural Language Processing in Ruby - Tokyo Rubyist Meetup (April 9th, 2015)
This presentation will cover 3 natural language processing gems I’ve released over the past year:
* Pragmatic Segmenter (a sentence boundary detection gem)
* Chat Correct (a gem for English teachers/students that provides error analysis when an incorrect sentence is diffed with a correct sentence)
* Word Count Analyzer (a gem that analyzes a string for potential “word count gray areas” which cause tools to report different word counts)
The talk will cover various aspects of building these gems including working from first principles, testing edge cases, and getting comfortable with regular expressions. I’ll also introduce a project that is currently in-progress - a new algorithm for parallel text alignment and some of the related challenges with building it.
A rule-based sentence boundary detection gem that works out-of-the-box across many languages.
What is segmentation?
Segmentation is the process of splitting a text into segments or sentences. In other words, deciding where sentences begin and end.
text = "Hello Tokyo Rubyists. Let's try segmentation."
segment #1: Hello Tokyo Rubyists.
segment #2: Let's try segmentation.
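A first pass at this can be sketched with nothing but the Ruby standard library (a naive illustration, not Pragmatic Segmenter's approach):

```ruby
# Naive sketch: split after sentence-final punctuation followed by
# whitespace. Works for simple text; edge cases come later.
def naive_segment(text)
  text.split(/(?<=[.!?])\s+/)
end

naive_segment("Hello Tokyo Rubyists. Let's try segmentation.")
# => ["Hello Tokyo Rubyists.", "Let's try segmentation."]
```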
Why care about segmentation?
Sentence segmentation is the foundation of many common NLP tasks:
• Machine translation
• Bitext alignment
• Part-of-speech tagging
• Grammar parsing
Errors in segmentation compound into errors in these other NLP tasks.
Why reinvent the wheel?
• Most segmentation libraries are built to support only English (or English plus a few other languages)
• Current solutions do not handle ill-formatted text
• Some libraries perform really well when trained with data in a specific language and a specific domain, but what happens when your data could come from any language or domain?
How can we achieve the following output?
string = "Hello world. Let's try segmentation."
Desired output: ["Hello world.", "Let's try segmentation."]
Pragmatic Segmenter
Using the core or standard library (no gems)
Time to check your solutions
Let’s brainstorm other edge cases
that will make our first solution fail
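To make the failure concrete, here is the same naive regex split (an illustration only, not any library's actual code) tripping over abbreviations:

```ruby
# The naive approach: split after sentence-final punctuation plus whitespace.
naive_split = ->(text) { text.split(/(?<=[.!?])\s+/) }

# Abbreviations like "a.m." and "Mr." end with a period, so they are
# wrongly treated as sentence boundaries.
naive_split.call("At 5 a.m. Mr. Smith went to the bank.")
# => ["At 5 a.m.", "Mr.", "Smith went to the bank."]
```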
Currently 52 English Golden Rules covering edge cases such as:
• abbreviations at the end of a sentence
• email addresses
• web addresses
• geo coordinates
Rubyists like to keep it DRY
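The Golden Rules lend themselves to a DRY test setup: each rule is just an input/expected-output pair. A schematic sketch (the pairs below reuse examples from this talk; the `passes?` helper is hypothetical, not the gem's actual spec code):

```ruby
# Each Golden Rule pairs an input string with its expected segments.
GOLDEN_RULES = [
  {
    text: "Hello Tokyo Rubyists. Let's try segmentation.",
    expected: ["Hello Tokyo Rubyists.", "Let's try segmentation."]
  },
  {
    # Golden Rule #18: A.M. / P.M. as both non-boundary and boundary.
    text: "At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.",
    expected: [
      "At 5 a.m. Mr. Smith went to the bank.",
      "He left the bank at 6 P.M.",
      "Mr. Smith then went to the store."
    ]
  }
]

# A segmenter passes a rule when it reproduces the expected segments.
def passes?(rule, &segmenter)
  segmenter.call(rule[:text]) == rule[:expected]
end
```

Any candidate segmenter — a naive regex or a full library — can then be run against the whole rule set in one loop.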
Most researchers use either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithms.
There are limits to using these corpora:
1. The corpora may be too expensive for some people ($1,700)
2. The majority of the sentences in the corpora end with a regular word followed by a period, thus testing the same thing over and over again
In the Brown Corpus, 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer in abbreviations, and only 83% of sentences end with a regular word followed by a period.
Andrei Mikheev - Periods, Capitalized Words, etc.
A comparison of segmentation libraries
Name                 Language  License    Golden Rule Score (English)  Golden Rule Score (Russian)  Speed†
Pragmatic Segmenter  Ruby      MIT        98.08%                       100.00%                      3.84 s
TactfulTokenizer     Ruby      GNU GPLv3  65.38%                       48.57%                       46.32 s
Open NLP             Java      APLv2      59.62%                       45.71%                       1.27 s
Stanford CoreNLP     Java      GNU GPLv3  59.62%                       31.43%                       0.92 s
Splitta              Python    APLv2      55.77%                       37.14%                       N/A
Punkt                Python    APLv2      46.15%                       48.57%                       1.79 s
SRX English          Ruby      GNU GPLv3  30.77%                       28.57%                       6.19 s
Scapel               Ruby      GNU GPLv3  28.85%                       20.00%                       0.13 s
† The performance test takes the 50 English Golden Rules combined into one string and runs it 100 times through each library. The number is an average of 10 runs.
The Holy Grail
A.M. / P.M. as both a non-sentence boundary and a sentence boundary
At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.
Golden Rule #18
All tested segmentation libraries failed this spec
["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
A Ruby gem that shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence.
I was giving a weekly Skype English lesson and the student was focusing on writing practice for the TOEFL test.
I would correct the student's sentence, but it would often seem as if he was missing some of my corrections - even if I read it with a LOT OF STRESS!!
A color-coded way to show a student's mistake(s)
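Conceptually, the diff can be sketched as a word-level longest-common-subsequence comparison (an illustration of the idea only; Chat Correct's actual implementation and its error-type analysis are more involved):

```ruby
# Word-level diff: words in the LCS are unchanged, everything else is
# labeled deleted (from the incorrect sentence) or inserted (from the
# correct sentence). A UI could then color-code each label.
def word_diff(incorrect, correct)
  a = incorrect.split
  b = correct.split
  # Build the LCS length table.
  lcs = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_index do |i|
    b.each_index do |j|
      lcs[i + 1][j + 1] =
        a[i] == b[j] ? lcs[i][j] + 1 : [lcs[i][j + 1], lcs[i + 1][j]].max
    end
  end
  # Backtrack to produce [label, word] pairs.
  ops = []
  i = a.size
  j = b.size
  while i > 0 || j > 0
    if i > 0 && j > 0 && a[i - 1] == b[j - 1]
      ops.unshift([:same, a[i - 1]])
      i -= 1
      j -= 1
    elsif j > 0 && (i.zero? || lcs[i][j - 1] >= lcs[i - 1][j])
      ops.unshift([:inserted, b[j - 1]])
      j -= 1
    else
      ops.unshift([:deleted, a[i - 1]])
      i -= 1
    end
  end
  ops
end

word_diff("He go to school", "He goes to school")
# => [[:same, "He"], [:deleted, "go"], [:inserted, "goes"],
#     [:same, "to"], [:same, "school"]]
```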
Word Count Analyzer
Analyzes a string for potential areas
of the text that might cause word
count discrepancies depending on
the tool used.
• Translation is typically billed on a per-word basis
• Different tools often report different word counts
I wanted to understand what was
causing these differences in word count
Word count gray areas
Common word count gray areas include:
• Hyphenated Words
• Numbered Lists
• XML and HTML tags
• Forward slashes and backslashes
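The effect is easy to reproduce: two reasonable counting strategies disagree on the same string (a stdlib illustration, not Word Count Analyzer's code):

```ruby
text = "A well-known example."

# Strategy 1: split on whitespace -- "well-known" counts as one word.
whitespace_count = text.split.size                 # => 3

# Strategy 2: count runs of letters -- "well" and "known" count separately.
letter_run_count = text.scan(/[[:alpha:]]+/).size  # => 4
```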
? ? ?
A bitext alignment (aka parallel text alignment) tool with a focus on high accuracy.
What’s it used for?
• Translation memory
• Machine translation
Current commercial state-of-the-art:
• Gale-Church sentence-length information plus a dictionary if available (e.g. hunalign)
Areas for improvement
• Early misalignment compounds into further misalignment later in the document
• Accuracy may suffer for non-Roman languages unless the algorithm is tuned for them
• Does not handle cross alignments or uneven alignments
A method for higher accuracy
• Machine translate A → B and B → A
• Relative sentence length
• Order or position in the document
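One way such features could be combined is a simple additive score per candidate sentence pair (purely illustrative; the function names and weighting below are my assumptions, not the in-progress algorithm):

```ruby
# Fraction of distinct words the source sentence shares with the machine
# translation of a candidate target sentence.
def word_overlap(source, mt_of_candidate)
  wa = source.downcase.scan(/[[:alpha:]]+/).uniq
  wb = mt_of_candidate.downcase.scan(/[[:alpha:]]+/).uniq
  return 0.0 if wa.empty? || wb.empty?
  (wa & wb).size.to_f / [wa.size, wb.size].max
end

# Relative sentence length: 1.0 when lengths match, smaller otherwise.
def length_ratio(a, b)
  [a.length, b.length].min.to_f / [a.length, b.length].max
end

# Penalty for sentences that sit at very different relative positions
# in their respective documents.
def position_penalty(i, j, total_a, total_b)
  (i.to_f / total_a - j.to_f / total_b).abs
end

def alignment_score(source, mt_of_candidate, i, j, total_a, total_b)
  word_overlap(source, mt_of_candidate) +
    length_ratio(source, mt_of_candidate) -
    position_penalty(i, j, total_a, total_b)
end
```

Higher scores would then favor candidate pairs during alignment, including crossing and uneven matches.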
• better accuracy
• can handle crossing alignments
• can handle uneven segment matches (1 to 2, 2 to 1, 1 to 3, 3 to 1, 2 to 3, and 3 to 2)
• potential data privacy issues (depending on the method used to obtain machine translations)
Small framework for thinking about new problems
Use your ignorance as a weapon to think about a problem from first principles (you aren't yet weighed down with any preconceptions).
Do your research.
Diff your conceptual framework and your research. Look at where they diverge and try to understand why.
Has tech changed/advanced? Were you missing something?