A Study in (P)rose
1. A STUDY IN (P)ROSE
NLP Applied to Sherlock Holmes Stories
Stefano Bragaglia
2. The shadow was seated in a chair,
black outline upon the luminous
screen of the window.
• Corpora
• Basic Statistics
• Content & Word Frequency
• Readability
• Characters & Centrality
• Automatic Summarisation
• Word Vectors & Clustering
• Sentiment & Subjectivity
• Latent Topics
221B Baker Street
3. “I only require a few missing links to have
an entirely connected case.”
• http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
• http://brandonrose.org/clustering
• https://theinvisibleevent.wordpress.com/2015/11/08/35-the-language-of-sherlock-holmes-a-study-in-consistency/
• http://www.christianpeccei.com/holmes/
• https://github.com/sgsinclair/alta/blob/master/ipynb/Python.ipynb
• http://data-mining.philippe-fournier-viger.com/tutorial-how-to-discover-hidden-patterns-in-text-documents/
• http://sujitpal.blogspot.co.uk/2015/07/discovering-entity-relationships-in.html
• All the pictures are copyright of the respective authors.
4. “I had an idea that he might, and I took the liberty
of bringing the tools with me.”
• matplotlib – http://matplotlib.org
• newspaper3k – https://github.com/codelucas/newspaper
• python-igraph – http://igraph.org/python/#pyinstallosx
• pyclustering – https://github.com/annoviko/pyclustering
• spaCy – https://spacy.io
• sumy – https://github.com/miso-belica/sumy
• textaCy – https://textacy.readthedocs.io/en/latest/index.html
• textblob – https://textblob.readthedocs.io/en/dev/
• word_cloud – https://github.com/amueller/word_cloud
5. CORPORA
“I have some documents here,” said my friend Sherlock
Holmes, as we sat one winter's night on either side of the
fire, “which I really think, Watson, that it would be
worth your while to glance over.
6. “I seem to have heard some queer stories about him.”
• In linguistics, a corpus (plural corpora) or text corpus is a large and
structured set of texts (nowadays usually electronically stored and
processed).
• The texts may be in a single language (monolingual corpus) or in
multiple languages (multilingual corpus). If formatted for side-by-side
comparison, they are called aligned parallel corpora (translation
corpus for translations, else comparable corpus).
• They are often annotated to make them more useful, e.g. POS-tagging:
information about each word’s part of speech is added as a tag. If they
contain further structured levels of analysis, they are called Treebanks
or Parsed Corpora.
9. “I seem to have heard some
queer stories about him.”
The complete Sherlock Holmes Canon:
• 60 adventures in 9 books:
• 4 novels
• 56 short stories in 5 collections
• Freely available in several formats:
• https://sherlock-holm.es/
10. “I seem to have heard some queer stories about him.”
The Novels
- STUD  A Study in Scarlet  1887-10
- SIGN  The Sign of the Four  1890-02
- HOUN  The Hound of the Baskervilles  1901-08
- VALL  The Valley of Fear  1914-09
The Adventures of Sherlock Holmes
- SCAN  A Scandal in Bohemia  1891-07
- REDH  The Red-Headed League  1891-08
- IDEN  A Case of Identity  1891-09
- BOSC  The Boscombe Valley Mystery  1891-10
- FIVE  The Five Orange Pips  1891-11
- TWIS  The Man with the Twisted Lip  1891-12
- BLUE  The Adventure of the Blue Carbuncle  1892-01
- SPEC  The Adventure of the Speckled Band  1892-02
- ENGR  The Adventure of the Engineer’s Thumb  1892-03
- NOBL  The Adventure of the Noble Bachelor  1892-04
- BERY  The Adventure of the Beryl Coronet  1892-05
- COPP  The Adventure of the Copper Beeches  1892-06
The Memoirs of Sherlock Holmes
- SILV  Silver Blaze  1892-12
- YELL  Yellow Face  1893-02
- STOC  The Stockbroker’s Clerk  1893-03
- GLOR  The “Gloria Scott”  1893-04
- MUSG  The Musgrave Ritual  1893-05
- REIG  The Reigate Puzzle  1893-06
- CROO  The Crooked Man  1893-07
- RESI  The Resident Patient  1893-08
- GREE  The Greek Interpreter  1893-09
- NAVA  The Naval Treaty  1893-10
- FINA  The Final Problem  1893-12
11. “I seem to have heard some queer stories about him.”
The Return of Sherlock Holmes
- EMPT  The Adventure of the Empty House  1903-09
- NORW  The Adventure of the Norwood Builder  1903-10
- DANC  The Adventure of the Dancing Men  1903-12
- SOLI  The Adventure of the Solitary Cyclist  1903-12
- PRIO  The Adventure of the Priory School  1904-01
- BLAC  The Adventure of Black Peter  1904-02
- CHAS  The Adventure of Charles Augustus Milverton  1904-03
- SIXN  The Adventure of the Six Napoleons  1904-04
- 3STU  The Adventure of the Three Students  1904-06
- GOLD  The Adventure of the Golden Pince-Nez  1904-07
- MISS  The Adventure of the Missing Three-Quarter  1904-08
- ABBE  The Adventure of the Abbey Grange  1904-09
- SECO  The Adventure of the Second Stain  1904-12
His Last Bow
- WIST  The Adventure of Wisteria Lodge  1908-08
- CARD  The Adventure of the Cardboard Box  1893-01
- REDC  The Adventure of the Red Circle  1911-03
- BRUC  The Adventure of the Bruce-Partington Plans  1908-12
- DYIN  The Adventure of the Dying Detective  1913-11
- LADY  The Disappearance of Lady Frances Carfax  1911-12
- DEVI  The Adventure of the Devil’s Foot  1910-12
- LAST  His Last Bow  1917-09
The Case-Book of Sherlock Holmes
- ILLU  The Illustrious Client  1924-11
- BLAN  The Blanched Soldier  1926-10
- MAZA  The Adventure of the Mazarin Stone  1921-10
- 3GAB  The Adventure of the Three Gables  1926-09
- SUSS  The Adventure of the Sussex Vampire  1924-01
- 3GAR  The Adventure of the Three Garridebs  1924-10
- THOR  The Problem of Thor Bridge  1922-02
- CREE  The Adventure of the Creeping Man  1923-03
- LION  The Adventure of the Lion’s Mane  1926-11
- VEIL  The Adventure of the Veiled Lodger  1927-01
- SHOS  Adventure of Shoscombe Old Place  1927-03
- RETI  The Adventure of the Retired Colourman  1926-12
12. BASIC STATISTICS
“You can, for example, never foretell what any one man
will do, but you can say with precision what an average
number will be up to. Individuals vary, but percentages
remain constant. So says the statistician.”
13. Now, I am counting upon joining it
here…
for document in corpus:
    statistics = {'words': 0, 'sentences': 0, 'characters': 0}
    for sentence in document:
        statistics['sentences'] += 1
        for word in sentence:
            if word is not punctuation:
                statistics['words'] += 1
                statistics['characters'] += len(word)
    statistics['length'] = statistics['characters'] / statistics['words']
    statistics['size'] = statistics['words'] / statistics['sentences']
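The loop above is pseudocode (`word is not punctuation` stands in for a real filter). A runnable standard-library sketch of the same counts, with naive sentence and word tokenisation, might look like:

```python
import re

def basic_stats(text):
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    stats = {'words': 0, 'sentences': len(sentences), 'characters': 0}
    for sentence in sentences:
        # Alphanumeric tokens only, so bare punctuation is not counted as words.
        for word in re.findall(r"[\w']+", sentence):
            stats['words'] += 1
            stats['characters'] += len(word)
    stats['length'] = stats['characters'] / stats['words']   # avg word length
    stats['size'] = stats['words'] / stats['sentences']      # avg sentence length
    return stats

stats = basic_stats("Elementary, my dear Watson. The game is afoot!")
# 2 sentences, 8 words, avg word length 4.5, avg sentence length 4.0
```

A real pipeline (spaCy/textacy, as used later) would tokenise far more carefully; this only illustrates the bookkeeping.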
14. Now, I am counting upon joining it
here…
meta = { 'stud.txt': {'author': 'Arthur Conan Doyle',
'collection': 'The Novels',
'title': 'A Study in Scarlet',
'code': 'STUD',
'pub_date': '1887-11'}, … }
docs = []
for filename in os.listdir(folder):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    doc = textacy.Doc(content, metadata=meta[filename])
    docs.append(doc)
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for document in corpus:
    print(textacy.text_stats.readability_stats(document))
print(textacy.text_stats.readability_stats(corpus))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
16. Now, I am counting upon joining it here…
A. C. Doyle
4 novels,
56 short stories
Total words 730,000
Unique words 20,000
Unique lemmas 15,000
Total sentences 39,000
Avg word length 3.88
Avg sentence length 18.68
17. Now, I am counting upon joining it here…
A. C. Doyle W. Shakespeare
4 novels,
56 short stories
38 plays,
154 sonnets
Total words 730,000 1,035,000
Unique words 20,000 27,000
Unique lemmas 15,000 22,000
Total sentences 39,000 93,000
Avg word length 3.88 4.41
Avg sentence length 18.68 11.08
18. Now, I am counting upon joining it here…
A. C. Doyle W. Shakespeare %
4 novels,
56 short stories
38 plays,
154 sonnets
Total words 730,000 1,035,000 -29.5
Unique words 20,000 27,000 -25.9
Unique lemmas 15,000 22,000 -31.8
Total sentences 39,000 93,000 -58.0
Avg word length 3.88 4.41 -12.0
Avg sentence length 18.68 11.08 +68.6
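The percentage column can be reproduced directly from the two preceding columns (Doyle's figure relative to Shakespeare's):

```python
def delta(doyle, shakespeare):
    # Percentage difference of Doyle's figure relative to Shakespeare's.
    return round((doyle - shakespeare) / shakespeare * 100, 1)

deltas = {
    'total words': delta(730_000, 1_035_000),      # -29.5
    'unique words': delta(20_000, 27_000),         # -25.9
    'unique lemmas': delta(15_000, 22_000),        # -31.8
    'total sentences': delta(39_000, 93_000),      # -58.1 (the slide rounds to -58.0)
    'avg word length': delta(3.88, 4.41),          # -12.0
    'avg sentence length': delta(18.68, 11.08),    # +68.6
}
```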
19. Now, I am counting upon joining it here…
• Prodigious vocabulary: only about a third fewer total words (a quarter
fewer unique words) than Shakespeare, despite the much shorter corpus
• English used to have far fewer words (Shakespeare had to “invent” many
of them, e.g. eyeball), whereas contemporary authors have many more
conventions to obey
• As a term of comparison, consider that modern English contains around
250,000 terms, many of them neologisms like:
robot, computer, internet, unleaded, twerking, …
20. CONTENT
& WORD FREQUENCY
“Because I made a blunder, my dear Watson—which is, I
am afraid, a more common occurrence than any one would
think who only knew me through your memoirs.
21. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word] = f.get(word, 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
22. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word] = f.get(word, 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“You broke the thread of my thoughts; but perhaps it is as well.”
“Perhaps that is why we are so subtly influenced by it.”
23. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lower()] = f.get(word.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
24. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lower()] = f.get(word.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“Funny, she didn't say good-bye.”
“Your correspondent says two friends.”
25. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
26. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
Standing at the window, I watched her walking briskly down the street,
until the gray turban and white feather were but a speck in the sombre crowd.
27. The furniture and pictures
were of the most common
and vulgar description.
• Extremely common words have little or no
use when retrieving information from
documents.
• Such words are called stop-words and are
usually excluded completely from the
vocabulary.
• The general strategy for determining stop words is to sort terms by
frequency and pick the N most frequent (often hand-filtered for their
semantic content relative to the domain).
• It is possible to use other sources, such as:
• https://en.wikipedia.org/wiki/Most_common_words_in_English
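The top-N-by-frequency strategy above can be sketched with the standard library; the sample text and cutoff below are toy values:

```python
from collections import Counter

def stopword_candidates(text, n):
    # The n most frequent lowercase tokens: a first cut at a stop-word
    # list, to be hand-filtered for domain-relevant words afterwards.
    words = text.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]

candidates = stopword_candidates(
    "The dog saw the cat and the cat saw the dog run", 2)
# → ['the', 'dog']
```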
28. I frequently found my thoughts turning
in her direction and wondering…
stopwords = […]
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation and word not in stopwords:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
29. I frequently found my thoughts turning
in her direction and wondering…
stopwords = […]
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation and word not in stopwords:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“A friend of Mr. Sherlock is always welcome!”
Sherlock Holmes rubbed his hands with delight.
It's all very well for you to laugh, Mr. Sherlock Holmes.
30. I frequently found my thoughts turning
in her direction and wondering…
• In linguistics, an n-gram is a contiguous sequence of n items (such
as phonemes, syllables, letters, words or base pairs) from a given
sequence of text or speech.
• An n-gram model is a type of probabilistic language model that predicts
the next item in such a sequence with an (n-1)-order Markov model
(e.g. predictive keyboards).
• They provide a measure of collocation frequency, so they may help identify:
• Syntagmatic Associations (e.g. cold + weather, Burkina + Faso, etc.)
• Paradigmatic Associations (e.g. synonyms, co-reference resolution, etc.)
A STUDY IN SCARLET
VI. Tobias Gregson Shows What He Can Do
Arthur Conan Doyle (1887-11)
"Look here, Mr. Sherlock Holmes," he said.
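Counting word n-grams, like the repeated "Look here, Mr. Sherlock Holmes" above, needs only a few lines; tokenisation below is a naive lowercase split, unlike the spaCy pipeline used elsewhere in these slides:

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token sequences, in document order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "look here mr sherlock holmes he said look here mr holmes".split()
bigrams = Counter(ngrams(tokens, 2))
# bigrams[('look', 'here')] == 2
```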
31. I frequently found my thoughts turning
in her direction and wondering…
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for doc in corpus:
    bot = doc.to_bag_of_terms(ngrams={1, 2, 3},
                              drop_determiners=True,
                              filter_stops=True,
                              filter_punct=True,
                              filter_nums=False,
                              as_strings=True)
    print({term: bot[term]
           for term in sorted(bot, key=bot.get, reverse=True)})
bot = corpus.to_bag_of_terms(ngrams={1, 2, 3},
                             drop_determiners=True,
                             filter_stops=True,
                             filter_punct=True,
                             filter_nums=False,
                             as_strings=True)
print({term: bot[term]
       for term in sorted(bot, key=bot.get, reverse=True)})
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Occurrences:
- holmes (54)
- mr. holmes (18)
- masser holmes (15)
- susan (15)
- one (14)
- say holmes (13)
- maberley (10)
- watson (9)
- mrs. maberley (8)
- steve (6)
- first (6)
- be not (6)
- douglas (5)
- london (5)
...
32. “It is the brightest rift which
I can at present see in the clouds.”
stopwords = […]
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for doc in corpus:
    wordcloud = WordCloud(max_words=1000, margin=0,
                          random_state=1).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.figure()
    wordcloud = WordCloud(max_words=1000, margin=0, random_state=1,
                          stopwords=stopwords).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.show()
33. “It is the brightest rift which I can at present see in the clouds.”
34. I frequently found my thoughts turning
in her direction and wondering…
• Holmes never ranks lower than 4th among the most frequent words; it
drops to 4th only in The Hound of the Baskervilles, in which he is
rarely on stage
• man (and synonyms) is much more frequent than woman:
Victorian misogyny?
• say is definitely Doyle’s favourite speech-attribution verb
• The language is very concrete (say, see, come, know, go and think), with
almost no room for emotions (cry): the scientific approach vs. spiritualism
• Only little and time (more subjective words) make it into the top 15
• The word the alone accounts for 6% of the whole corpus
35. READABILITY
My companion gave a sudden chuckle of comprehension.
“And not a very obscure cipher, Watson,” said he. “Why,
of course, it is Italian! The A means that it is addressed to
a woman. ‘Beware! Beware! Beware!’ How's that,
Watson?
36. Now, I am counting upon joining it here…
meta = { 'stud.txt': {'author': 'Arthur Conan Doyle',
'collection': 'The Novels',
'title': 'A Study in Scarlet',
'code': 'STUD',
'pub_date': '1887-11'}, … }
docs = []
for filename in os.listdir(folder):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    doc = textacy.Doc(content, metadata=meta[filename])
    docs.append(doc)
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for document in corpus:
    print(textacy.text_stats.readability_stats(document))
print(textacy.text_stats.readability_stats(corpus))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
37. Now, I am counting
upon joining it here…
• Readability is the ease with which a reader
can understand a written text.
• The readability depends on content (the
complexity of its vocabulary and syntax) and
presentation (typographic aspects such as font
size, line height, and line length).
• Researchers have proposed several formulas
that estimate the readability of a text from
features like the average sentence length in
words (ASL), the average number of syllables
per word (ASW), etc.
• For instance:
FLESCH = 206.835 − (1.015 × ASL) − (84.6 × ASW)
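As a check, plugging The Three Gables' counts (6,188 words, 460 sentences, 7,704 syllables, from the statistics slide) into this formula reproduces its Flesch score of ~87.85:

```python
def flesch_reading_ease(words, sentences, syllables):
    asl = words / sentences    # average sentence length, in words
    asw = syllables / words    # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

# Counts for The Three Gables from the statistics slide.
score = flesch_reading_ease(words=6188, sentences=460, syllables=7704)
# → 87.85 (to two decimals)
```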
38. Now, I am counting upon joining it here…
• These stories are still very popular: a familiar vocabulary, easy to
read, with ideas easy to grasp
• The series ran for over 40 years (not continuously), yet Doyle
maintained the same focus on basic language
• The density of new words in this corpus is 8-11%, which is considered
ideal for an 8-year-old reader (3rd grade)
• Excluding the first 2 novels (shorter and less prone to repetition),
the other 7 books fall squarely in that interval
39. CHARACTERS & CENTRALITY
We tied Toby to the hall table, and reascended the stairs.
The room was as we had left it, save that a sheet had been
draped over the central figure. A weary-looking police-
sergeant reclined in the corner.
40. nlp = spacy.load('en')
for doc in corpus:
    names = []
    tuples = []
    for par in re.split(r'(\r?\n){2}', doc.text):
        parsed = nlp(par)
        entities = []
        for ent in parsed.ents:
            if ent.label_ in ('PERSON', 'LOC', 'GPE'):
                name = re.sub('[^0-9a-zA-Z]+', ' ', ent.text)
                if name not in names:
                    names.append(name)
                for entity in entities:
                    tuples.append((entity, name))
                entities.append(name)
“Of course, we do not yet know
what the relations may have been…”
41. “Of course, we do not yet know
what the relations may have been…”
ig = igraph.Graph.TupleList(tuples)
vector = ig.eigenvector_centrality()
colors = []
label_colors = []
for value in vector:
    color = colorsys.hsv_to_rgb(2.0 * (1.0 - value) / 3.0, 1.0, 1.0)
    label_colors.append('gray' if value < 0.5 else 'black')
    colors.append('#%02x%02x%02x' % (int(color[0] * 255),
                                     int(color[1] * 255),
                                     int(color[2] * 255)))
ig.vs['label'] = names
ig.vs['color'] = colors
ig.vs['label_color'] = label_colors
layout = ig.layout('kk')
ig.write_svg('%s.svg' % doc.metadata['code'], margin=50, layout=layout,
             border=50, width=1280, height=800)
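ig.eigenvector_centrality() can be approximated by hand: power iteration over the adjacency structure, normalised each step so the most central vertex scores 1.0, as igraph does. A standard-library sketch with a toy character graph:

```python
def eigenvector_centrality(edges, iterations=100):
    # Power iteration on an undirected graph given as (u, v) pairs,
    # normalised each step so the most central vertex scores 1.0.
    nodes = sorted({v for edge in edges for v in edge})
    neighbours = {v: [] for v in nodes}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new = {v: sum(score[w] for w in neighbours[v]) for v in nodes}
        norm = max(new.values())
        score = {v: s / norm for v, s in new.items()}
    return score

# A hub connected to everyone is maximally central.
centrality = eigenvector_centrality([
    ('Holmes', 'Watson'), ('Holmes', 'Lestrade'),
    ('Holmes', 'Adler'), ('Watson', 'Lestrade')])
# centrality['Holmes'] == 1.0
```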
42. “Of course, we do not yet know what the relations may have been…”
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
43. AUTOMATIC
SUMMARISATION
It was in the summer of '89, not long after my marriage,
that the events occurred which I am now about to
summarise.
44. He knitted his brows as though determined
not to omit anything in his narrative.
• Automatic Summarisation might either be Extraction-based or
Abstraction-based. Best results come when both are applied.
• TextRank and LexRank are graph-based algorithms where
sentences are vertices and edges model the similarity between
them.
• While LexRank uses TF-IDF and cosine similarity, TextRank uses
PageRank (a word appearing in two sentences is like a link
between them) to measure the similarity between sentences.
• Roughly speaking, a sentence containing many keywords that also
appear in other sentences is a hub and receives a higher score.
• The sentences are sorted by this score: since the top N most
likely cover all the topics (keywords) in the document, they are
taken as the summary.
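A minimal TextRank-style scorer, assuming word overlap as the sentence-similarity measure and plain damped PageRank-style iteration; real implementations (such as sumy, used later) refine both:

```python
import math

def textrank(sentences, damping=0.85, iterations=50):
    # Sentences are the graph's vertices; edge weights are shared words,
    # normalised by sentence length (TextRank's similarity measure).
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = len(words[i] & words[j]) / (
                    math.log(len(words[i]) + 1) + math.log(len(words[j]) + 1))
    out = [sum(row) for row in sim]
    score = [1.0] * n
    for _ in range(iterations):
        # PageRank update: each sentence passes its score to its neighbours.
        score = [(1 - damping) + damping * sum(
                     sim[j][i] / out[j] * score[j]
                     for j in range(n) if out[j] > 0)
                 for i in range(n)]
    return score

sentences = ["sherlock holmes solved the case",
             "watson admired holmes",
             "the case was closed"]
scores = textrank(sentences)
# The first sentence overlaps with both others, so it is the hub.
```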
49. He knitted his brows as though determined
not to omit anything in his narrative.
# Imports as in sumy; LexRank is one of several available summarizers
# (TextRankSummarizer, LsaSummarizer, … are used the same way).
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.utils import get_stop_words

path = '…'
language = 'english'
tokenizer = Tokenizer(language)
parser = PlaintextParser.from_file(path, tokenizer)
stemmer = Stemmer(language)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(language)
summary = summarizer(parser.document, 10)
for sentence in summary:
    print(sentence)
50. He knitted his brows as though determined
not to omit anything in his narrative.
THE DYING DETECTIVE
His Last Bow
Arthur Conan Doyle (1913-11)
Not only was her first-floor flat invaded at all hours by throngs of singular and often undesirable characters but her remarkable lodger
showed an eccentricity and irregularity in his life which must have sorely tried her patience.
His incredible untidiness, his addiction to music at strange hours, his occasional revolver practice within doors, his weird and often
malodorous scientific experiments, and the atmosphere of violence and danger which hung around him made him the very worst tenant in London.
Knowing how genuine was her regard for him, I listened earnestly to her story when she came to my rooms in the second year of my married life
and told me of the sad condition to which my poor friend was reduced.
In the dim light of a foggy November day the sick room was a gloomy spot, but it was that gaunt, wasted face staring at me from the bed which
sent a chill to my heart.
His eyes had the brightness of fever, there was a hectic flush upon either cheek, and dark crusts clung to his lips; the thin hands upon the
coverlet twitched incessantly, his voice was croaking and spasmodic.
Then, unable to settle down to reading, I walked slowly round the room, examining the pictures of celebrated criminals with which every wall
was adorned.
I saw a great yellow face, coarse-grained and greasy, with heavy, double-chin, and two sullen, menacing gray eyes which glared at me from
under tufted and sandy brows.
The skull was of enormous capacity, and yet as I looked down I saw to my amazement that the figure of the man was small and frail, twisted in
the shoulders and back like one who has suffered from rickets in his childhood.
Then in an instant his sudden access of strength departed, and his masterful, purposeful talk droned away into the low, vague murmurings of a
semi-delirious man.
You will realize that among your many talents dissimulation finds no place, and that if you had shared my secret you would never have been
able to impress Smith with the urgent necessity of his presence, which was the vital point of the whole scheme.
51. WORD VECTORS &
CLUSTERING
Our coming was evidently a great event, for station-master
and porters clustered round us to carry out our luggage.
52. For every step increased the distance between them…
• The term frequency–inverse document frequency (TF–IDF) is a numerical statistic that reflects
how important a word is to a document in a corpus.
• It is often used as a weighting factor in information retrieval, text mining and user modelling.
• It is the product of two terms:
• the term frequency captures the importance of a term for a document,
• the inverse document frequency measures the specificity of a term for a document in a corpus.
• There are various ways of computing these values; the simplest uses:
• the raw frequency f(t,d) for TF,
• the logarithm of the ratio between N = |D| and the number n(t) = |{d∈D: t∈d}| of documents containing the term t for IDF.
• In combination with cosine similarity, a measure of similarity between two non-zero vectors
given by the cosine of the angle between them, it provides a crude measure of the distance
between documents:
• similarity = cos(θ) = (A · B) / (||A|| ||B||)
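Both formulas fit in a few lines of standard-library Python; this sketch uses the raw term frequency and the log(N/n_t) IDF described above, on toy documents:

```python
import math

def tfidf_vectors(docs):
    # Raw term frequency times log(N / n_t), over the shared vocabulary.
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    idf = {t: math.log(n / sum(1 for doc in docs if t in doc)) for t in vocab}
    return [[doc.count(t) * idf[t] for t in vocab] for doc in docs]

def cosine(a, b):
    # similarity = cos(theta) = (A . B) / (||A|| ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = [['holmes', 'case', 'holmes'], ['watson', 'case'], ['holmes', 'watson']]
vecs = tfidf_vectors(docs)
```

textacy's `math_utils.cosine_similarity`, used two slides below, computes the same quantity over much larger vocabularies.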
57. For every step increased the distance between them…
idf = corpus.word_doc_freqs(weighting='idf')
tfs = {doc.metadata['code']: doc.to_bag_of_words(weighting='freq')
       for doc in corpus.docs}
tfidfs = {code: [] for code in tfs}
for key in sorted(idf.keys()):
    for code in tfidfs:
        if key in tfs[code]:
            tfidfs[code].append(tfs[code][key] * idf[key])
        else:
            tfidfs[code].append(0.0)
for i, k_i in enumerate(tfidfs.keys()):
    for j, k_j in enumerate(tfidfs.keys()):
        v = textacy.math_utils.cosine_similarity(tfidfs[k_i], tfidfs[k_j])
        print('%s vs. %s : %.3f' %
              (METADATA[k_i]['title'], METADATA[k_j]['title'], v))
Lady Frances Carfax vs. His Last Bow : 0.905
The Greek Interpreter vs. Lady Frances Carfax : 0.938
The Greek Interpreter vs. The Bruce-Partington Plans : 0.957
The Bruce-Partington Plans vs. The Greek Interpreter : 0.957
The Greek Interpreter vs. The Greek Interpreter : 1.000
58. For every step increased the distance between them…
corpus = textacy.Corpus('en', docs=documents)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')
sample = tfidf.toarray()
sample_pca = mlab.PCA(sample)
sample_cutoff = sample_pca.fracs[1]
sample_2d = sample_pca.project(sample, minfrac=sample_cutoff)
instance = optics(sample, 0.8125, 2)
instance.process()
clusters = instance.get_clusters()
noise = instance.get_noise()
visualizer = cluster_visualizer()
visualizer.append_cluster(noise, sample_2d, marker='x')
visualizer.append_clusters(clusters, sample_2d)
visualizer.show()
59. For every step increased the distance between them…
• A word embedding (GloVe, word2vec) is a family of related models that map the words in a corpus to vectors.
• These models are simple two-layer neural networks that are trained to reconstruct the linguistic context of words.
• They take large corpora as input and produce a vector space of several hundred dimensions as output.
• In such vector spaces, each word is assigned a precise position, to which a vector corresponds, so that special spatial properties are maintained.
• QUEEN = KING – MAN + WOMAN
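The famous KING – MAN + WOMAN analogy can be sketched with a few lines of NumPy. The four-word "vocabulary" and its three-dimensional vectors below are hand-made for illustration only; a real model (word2vec, GloVe, or spaCy's vectors) learns hundreds of dimensions from a large corpus.

```python
import numpy as np

# Toy vectors: dimensions loosely read as (royalty, maleness, femaleness).
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'moor':  np.array([0.0, 0.5, 0.5]),   # unrelated distractor word
}

def nearest(target, exclude):
    """Vocabulary word whose vector is closest (by cosine) to target."""
    def cos(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

analogy = vectors['king'] - vectors['man'] + vectors['woman']
print(nearest(analogy, exclude={'king', 'man', 'woman'}))
```

As in the real evaluation protocol, the three query words are excluded from the candidates, and the nearest remaining vector is QUEEN.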
69. SENTIMENT &
SUBJECTIVITY
I felt of all Holmes's criminals this was the one whom
he would find it hardest to face.
However, he was immune from sentiment.
70. When this deduction is confirmed point by point, then the subjective becomes objective.
• Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing to systematically identify affective states and subjective information in a text.
• Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic.
• Alternatively, sentiment analysis aims at identifying the overall polarity and subjectivity of, or emotional reaction to, a document.
• More sophisticated approaches are able to distinguish among a wider selection of emotional states.
74. When this deduction is confirmed point by point, then the subjective becomes objective.
for document in corpus:
    blob = TextBlob(document.text)
    for i, sentence in enumerate(blob.sentences):
        print('%s)\tpol: %.3f, sub: %.3f' %
              (i, sentence.sentiment.polarity,
               sentence.sentiment.subjectivity))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
0) pol: -0.125, sub: 1.000
1) pol: 0.136, sub: 0.455
2) pol: -0.052, sub: 0.196
3) pol: -0.625, sub: 1.000
4) pol: 0.200, sub: 0.700
5) pol: 0.127, sub: 0.833
6) pol: -0.071, sub: 0.362
7) pol: 0.000, sub: 0.000
8) pol: 0.000, sub: 0.000
9) pol: 0.300, sub: 0.100
10) pol: 0.000, sub: 0.000
11) pol: 0.000, sub: 0.000
12) pol: -0.425, sub: 0.675
13) pol: -0.125, sub: 0.375
14) pol: 0.600, sub: 1.000
15) pol: 0.000, sub: 0.000
16) pol: 0.000, sub: 0.000
17) pol: 0.417, sub: 0.500
18) pol: 0.000, sub: 0.000
19) pol: 0.417, sub: 0.500
20) pol: 0.000, sub: 0.000
...
76. When this deduction is confirmed point by point, then the subjective becomes objective.
for doc in corpus:
    for i, sent in enumerate(doc.sents):
        scores = textacy.lexicon_methods.emotional_valence(sent)
        values = ['%s: %.3f' % (k, scores[k]) for k in sorted(scores.keys())]
        print('%s)\t%s' % (i, '\n\t'.join(values)))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
I don't think that any of my adventures with Mr.
Sherlock Holmes opened quite so abruptly, or so
dramatically, as that which I associate with The
Three Gables. I had not seen Holmes for some
days and had no idea of the new channel into
which his activities had been directed. He was
in a chatty mood that morning, however, and had
just settled me into the well-worn low armchair
on one side of the fire, while he had curled
down with his pipe in his mouth upon the
opposite chair, when our visitor arrived. If I
had said that a mad bull had arrived it would
give a clearer impression of what occurred.
77. LATENT TOPICS
“I have known him for some time,” said I,
“but I never knew him do anything yet without
a very good reason,” and with that our conversation
drifted off on to other topics.
78. He was face to face with an infinite possibility of latent evil…
• Latent Dirichlet Allocation (LDA) is a generative model that automatically discovers the topics that a collection of documents contains.
• It represents documents as mixtures of topics from which words are drawn with certain probabilities.
• It assumes that each document
- has a number N of words (according to a Poisson distribution),
- has a topic mixture over a fixed set of K topics (according to a Dirichlet distribution).
• Then, for each word in each document:
- a topic is picked randomly (according to the mixture sampled above),
- the word itself is generated randomly (according to that topic's word distribution).
• Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find the set of topics that most likely generated the collection (e.g. via Gibbs sampling).
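The generative story above can be simulated directly. In this sketch the vocabulary and the two topic-word distributions are invented for illustration; real LDA works the other way round, inferring such distributions from observed documents.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['hound', 'moor', 'letter', 'train', 'violin', 'pipe']
# Two hand-made topics: each row is a word distribution over the vocabulary.
topics = np.array([
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],   # topic 0: out on the moor
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],   # topic 1: back in Baker Street
])

def generate_document(avg_len=12, alpha=(0.5, 0.5)):
    n_words = rng.poisson(avg_len)            # document length ~ Poisson
    mixture = rng.dirichlet(alpha)            # topic mixture ~ Dirichlet
    words = []
    for _ in range(n_words):
        # Pick a topic from the document's mixture, then a word from that topic.
        topic = rng.choice(len(topics), p=mixture)
        words.append(rng.choice(vocab, p=topics[topic]))
    return words

print(generate_document())
```

Running the function several times shows documents leaning towards one topic or the other, depending on the sampled mixture; LDA's inference step recovers `topics` and the per-document mixtures from such output alone.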
83. He was face to face with an infinite possibility of latent evil…
corpus = textacy.Corpus('en', docs=documents)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')
model = textacy.tm.TopicModel('lda', n_topics=60)
model.fit(tfidf)
for topic_idx, top_terms in model.top_topic_terms(idx, top_n=5):
    print('Topic #%s: %s' % (topic_idx, '\t'.join(top_terms)))
topics = model.transform(tfidf)
for doc_idx, top_topics in model.top_doc_topics(topics):
    print('%s: %s' % (corpus.docs[doc_idx].metadata['title'],
                      '\t'.join(['Topic #%s (%.2f)' % (t[0], 100 * t[1])
                                 for t in top_topics])))
model.termite_plot(tfidf, idx)
84. He was face to face with an infinite possibility of latent evil…
Topic #0: lestrade london woman window lady miss street inspector hour sherlock
Topic #6: jones wilson hopkins inspector sholto trevor league office birmingham pinner
Topic #9: gregson mycroft mcmurdo warren garcia douglas barker susan inspector greek
Topic #10: moor mortimer henry duke grace american charles bicycle hopkins wilder
Topic #11: mcmurdo douglas susan barker robert steve barney jones smith sholto
Topic #12: robert ferguson smith trevor woodley carruthers jones mason sholto gregson
...
The Sign of the Four: Topic #0 (46.77) Topic #12 (25.02) Topic #6 (23.45)
A Study in Scarlet: Topic #0 (53.95) Topic #52 (35.67) Topic #51 (33.71)
The Hound of the Baskervilles: Topic #10 (50.89) Topic #0 (44.51) Topic #54 (38.52)
The Valley of Fear: Topic #11 (49.42) Topic #9 (28.17) Topic #0 (27.12)
...
86. “You are very welcome to put any
questions that you like to me now,
and there is no danger that I will
refuse to answer them.”