A STUDY IN (P)ROSE
NLP Applied to Sherlock Holmes Stories
Stefano Bragaglia
The shadow was seated in a chair,
black outline upon the luminous
screen of the window.
• Corpora
• Basic Statistics
• Content & Word Frequency
• Readability
• Characters & Centrality
• Automatic Summarisation
• Word Vectors & Clustering
• Sentiment & Subjectivity
• Latent Topics
221B Baker Street
“I only require a few missing links to have
an entirely connected case.”
• http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
• http://brandonrose.org/clustering
• https://theinvisibleevent.wordpress.com/2015/11/08/35-the-language-of-sherlock-holmes-a-study-in-consistency/
• http://www.christianpeccei.com/holmes/
• https://github.com/sgsinclair/alta/blob/master/ipynb/Python.ipynb
• http://data-mining.philippe-fournier-viger.com/tutorial-how-to-discover-hidden-patterns-in-text-documents/
• http://sujitpal.blogspot.co.uk/2015/07/discovering-entity-relationships-in.html
• All pictures are copyright of their respective authors.
“I had an idea that he might, and I took the liberty
of bringing the tools with me.”
• matplotlib – http://matplotlib.org
• newspaper3k – https://github.com/codelucas/newspaper
• python-igraph – http://igraph.org/python/#pyinstallosx
• pyclustering – https://github.com/annoviko/pyclustering
• spaCy – https://spacy.io
• sumy – https://github.com/miso-belica/sumy
• textacy – https://textacy.readthedocs.io/en/latest/index.html
• textblob – https://textblob.readthedocs.io/en/dev/
• word_cloud – https://github.com/amueller/word_cloud
CORPORA
“I have some documents here,” said my friend Sherlock
Holmes, as we sat one winter's night on either side of the
fire, “which I really think, Watson, that it would be
worth your while to glance over.
“I seem to have heard some queer stories about him.”
• In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).
• The texts may be in a single language (monolingual corpus) or in multiple languages (multilingual corpus). If formatted for side-by-side comparison, they are called aligned parallel corpora (a translation corpus if the texts are translations of each other, a comparable corpus otherwise).
• They are often subjected to annotation to make them more useful, e.g. POS-tagging, where information about each word's part of speech is added as a tag (a minimal sketch follows). If they contain further structured levels of analysis, they are called Treebanks or Parsed Corpora.
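As a tiny illustration of POS-tagging, here is a minimal sketch using spaCy (one of the tools listed earlier); the sentence is just an example:

import spacy

# Minimal POS-tagging sketch: each token is annotated with its part of
# speech, turning plain text into an annotated corpus fragment.
nlp = spacy.load('en')
for token in nlp("The shadow was seated in a chair."):
    print(token.text, token.pos_, token.tag_)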
“I seem to have heard some
queer stories about him.”
The complete Sherlock Holmes Canon:
• 60 adventures in 9 books:
• 4 novels
• 56 short stories in 5 collections
• Freely available in several formats:
• https://sherlock-holm.es/
“I seem to have heard some queer stories about him.”
The Novels:
STUD A Study in Scarlet (1887-10)
SIGN The Sign of the Four (1890-02)
HOUN The Hound of the Baskervilles (1901-08)
VALL The Valley of Fear (1914-09)

The Adventures of Sherlock Holmes:
SCAN A Scandal in Bohemia (1891-07)
REDH The Red-Headed League (1891-08)
IDEN A Case of Identity (1891-09)
BOSC The Boscombe Valley Mystery (1891-10)
FIVE The Five Orange Pips (1891-11)
TWIS The Man with the Twisted Lip (1891-12)
BLUE The Adventure of the Blue Carbuncle (1892-01)
SPEC The Adventure of the Speckled Band (1892-02)
ENGR The Adventure of the Engineer's Thumb (1892-03)
NOBL The Adventure of the Noble Bachelor (1892-04)
BERY The Adventure of the Beryl Coronet (1892-05)
COPP The Adventure of the Copper Beeches (1892-06)

The Memoirs of Sherlock Holmes:
SILV Silver Blaze (1892-12)
YELL The Yellow Face (1893-02)
STOC The Stockbroker's Clerk (1893-03)
GLOR The "Gloria Scott" (1893-04)
MUSG The Musgrave Ritual (1893-05)
REIG The Reigate Puzzle (1893-06)
CROO The Crooked Man (1893-07)
RESI The Resident Patient (1893-08)
GREE The Greek Interpreter (1893-09)
NAVA The Naval Treaty (1893-10)
FINA The Final Problem (1893-12)
“I seem to have heard some queer stories about him.”
The Return of Sherlock Holmes:
EMPT The Adventure of the Empty House (1903-09)
NORW The Adventure of the Norwood Builder (1903-10)
DANC The Adventure of the Dancing Men (1903-12)
SOLI The Adventure of the Solitary Cyclist (1903-12)
PRIO The Adventure of the Priory School (1904-01)
BLAC The Adventure of Black Peter (1904-02)
CHAS The Adventure of Charles Augustus Milverton (1904-03)
SIXN The Adventure of the Six Napoleons (1904-04)
3STU The Adventure of the Three Students (1904-06)
GOLD The Adventure of the Golden Pince-Nez (1904-07)
MISS The Adventure of the Missing Three-Quarter (1904-08)
ABBE The Adventure of the Abbey Grange (1904-09)
SECO The Adventure of the Second Stain (1904-12)

His Last Bow:
WIST The Adventure of Wisteria Lodge (1908-08)
CARD The Adventure of the Cardboard Box (1893-01)
REDC The Adventure of the Red Circle (1911-03)
BRUC The Adventure of the Bruce-Partington Plans (1908-12)
DYIN The Adventure of the Dying Detective (1913-11)
LADY The Disappearance of Lady Frances Carfax (1911-12)
DEVI The Adventure of the Devil's Foot (1910-12)
LAST His Last Bow (1917-09)

The Case-Book of Sherlock Holmes:
ILLU The Illustrious Client (1924-11)
BLAN The Blanched Soldier (1926-10)
MAZA The Adventure of the Mazarin Stone (1921-10)
3GAB The Adventure of the Three Gables (1926-09)
SUSS The Adventure of the Sussex Vampire (1924-01)
3GAR The Adventure of the Three Garridebs (1924-10)
THOR The Problem of Thor Bridge (1922-02)
CREE The Adventure of the Creeping Man (1923-03)
LION The Adventure of the Lion's Mane (1926-11)
VEIL The Adventure of the Veiled Lodger (1927-01)
SHOS The Adventure of Shoscombe Old Place (1927-03)
RETI The Adventure of the Retired Colourman (1926-12)
BASIC STATISTICS
“You can, for example, never foretell what any one man
will do, but you can say with precision what an average
number will be up to. Individuals vary, but percentages
remain constant. So says the statistician.”
Now, I am counting upon joining it
here…
for document in corpus:
    statistics = {'words': 0, 'sentences': 0, 'characters': 0}
    for sentence in document.sents:
        statistics['sentences'] += 1
        for word in sentence:
            if not word.is_punct:
                statistics['words'] += 1
                statistics['characters'] += len(word.text)
    statistics['length'] = statistics['characters'] / statistics['words']
    statistics['size'] = statistics['words'] / statistics['sentences']
Now, I am counting upon joining it
here…
meta = {'stud.txt': {'author': 'Arthur Conan Doyle',
                     'collection': 'The Novels',
                     'title': 'A Study in Scarlet',
                     'code': 'STUD',
                     'pub_date': '1887-11'}, … }

docs = []
for filename in os.listdir(folder):
    with open(os.path.join(folder, filename), 'r', encoding='utf-8') as file:
        content = file.read()
    docs.append(textacy.Doc(content, metadata=meta[filename]))

corpus = textacy.Corpus('en', docs=docs)
for document in corpus:
    print(textacy.text_stats.readability_stats(document))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
Now, I am counting upon joining it here…
A. C. Doyle (4 novels, 56 short stories) vs. W. Shakespeare (38 plays, 154 sonnets):

Total words: 730,000 vs. 1,035,000 (-29.5%)
Unique words: 20,000 vs. 27,000 (-25.9%)
Unique lemmas: 15,000 vs. 22,000 (-31.8%)
Total sentences: 39,000 vs. 93,000 (-58.0%)
Avg word length: 3.88 vs. 4.41 (-12.0%)
Avg sentence length: 18.68 vs. 11.08 (+68.6%)
Now, I am counting upon joining it here…
• Prodigious vocabulary: only about 1/3 fewer total words (1/4 fewer unique words) than Shakespeare, despite the much shorter corpus
• English simply had far fewer words back then (Shakespeare had to "invent" many of them, e.g. eyeball), whereas contemporary authors have many more conventions to obey
• As a term of comparison, consider that modern English contains around 250,000 terms, many of them neologisms such as: robot, computer, internet, unleaded, twerking, …
CONTENT
& WORD FREQUENCY
“Because I made a blunder, my dear Watson—which is, I
am afraid, a more common occurrence than any one would
think who only knew me through your memoirs.
I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document.sents:
        for word in sentence:
            if not word.is_punct:
                f[word.text] = f.get(word.text, 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“You broke the thread of my thoughts; but perhaps it is as well.”
“Perhaps that is why we are so subtly influenced by it.”
I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document.sents:
        for word in sentence:
            if not word.is_punct:
                f[word.text.lower()] = f.get(word.text.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“Funny, she didn't say good-bye.”
“Your correspondent says two friends.”
I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document.sents:
        for word in sentence:
            if not word.is_punct:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
Standing at the window, I watched her walking briskly down the street,
until the gray turban and white feather were but a speck in the sombre crowd.
The furniture and pictures
were of the most common
and vulgar description.
• Extremely common words have little or no use when retrieving information from documents.
• Such words are called stop-words and are usually excluded completely from the vocabulary.
• The general strategy to determine stop-words is to sort all words by frequency and pick the top N (often hand-filtered for their semantic content relative to the domain).
• It is possible to use other sources, such as: https://en.wikipedia.org/wiki/Most_common_words_in_English
I frequently found my thoughts turning
in her direction and wondering…
stopwords = […]
for document in corpus:
    f = {}
    for sentence in document.sents:
        for word in sentence:
            if not word.is_punct and word.lemma_.lower() not in stopwords:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“A friend of Mr. Sherlock is always welcome!”
Sherlock Holmes rubbed his hands with delight.
It's all very well for you to laugh, Mr. Sherlock Holmes.
I frequently found my thoughts turning
in her direction and wondering…
• In linguistics, an n-gram is a contiguous sequence of n items (such as phonemes, syllables, letters, words or base pairs) from a given sequence of text or speech.
• An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n-1)-order Markov model (e.g. predictive keyboards).
• N-grams provide a measure of collocation frequency, therefore they may help identify:
• Syntagmatic associations (e.g. cold + weather, Burkina + Faso, etc.)
• Paradigmatic associations (e.g. synonyms, co-reference resolution, etc.); a minimal bigram-counting sketch follows.
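As a rough sketch of n-gram extraction (not from the deck; the sentence and the naive tokenisation are illustrative only):

from collections import Counter

# Count word bigrams (2-grams) with a sliding window over the tokens.
text = "it was the same tall man it was the same face"
tokens = text.split()
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(3))  # e.g. [(('it', 'was'), 2), (('was', 'the'), 2), ...]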
A STUDY IN SCARLET
VI. Tobias Gregson Shows What He Can Do
Arthur Conan Doyle (1887-11)
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
"Look here, Mr. Sherlock Holmes," he said.
I frequently found my thoughts turning
in her direction and wondering…
from collections import Counter

corpus = textacy.Corpus('en', docs=docs)
for doc in corpus:
    bot = doc.to_bag_of_terms(ngrams={1, 2, 3},
                              drop_determiners=True,
                              filter_stops=True,
                              filter_punct=True,
                              filter_nums=False,
                              as_strings=True)
    print({term: bot[term]
           for term in sorted(bot, key=bot.get, reverse=True)})

# corpus-level counts: aggregate the per-document bags
corpus_bot = Counter()
for doc in corpus:
    corpus_bot.update(doc.to_bag_of_terms(ngrams={1, 2, 3},
                                          drop_determiners=True,
                                          filter_stops=True,
                                          filter_punct=True,
                                          filter_nums=False,
                                          as_strings=True))
print(dict(corpus_bot.most_common()))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Occurrences:
- holmes (54)
- mr. holmes (18)
- masser holmes (15)
- susan (15)
- one (14)
- say holmes (13)
- maberley (10)
- watson (9)
- mrs. maberley (8)
- steve (6)
- first (6)
- be not (6)
- douglas (5)
- london (5)
...
“It is the brightest rift which
I can at present see in the clouds.”
stopwords = […]
corpus = textacy.Corpus('en', docs=docs)
for doc in corpus:
    wordcloud = WordCloud(max_words=1000, margin=0,
                          random_state=1).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.figure()
    wordcloud = WordCloud(max_words=1000, margin=0, random_state=1,
                          stopwords=stopwords).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.show()
“It is the brightest rift which I can at present see in the clouds.”
I frequently found my thoughts turning
in her direction and wondering…
• Holmes is never lower than the 4th most frequent word; it is 4th only in The Hound of the Baskervilles, in which he is rarely on stage
• man (& synonyms) is much more frequent than woman: Victorian misogyny?
• say is definitely Doyle's favourite speech-attribution verb
• The language is very concrete (say, see, come, know, go and think), with almost no room for emotions (cry): scientific approach vs. spiritism
• Only little and time (more subjective words) make the top 15
• The word the alone accounts for 6% of the whole corpus
READABILITY
My companion gave a sudden chuckle of comprehension.
“And not a very obscure cipher, Watson,” said he. “Why,
of course, it is Italian! The A means that it is addressed to
a woman. ‘Beware! Beware! Beware!’ How's that,
Watson?
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
Now, I am counting
upon joining it here…
• Readability is the ease with which a reader can understand a written text.
• Readability depends on content (the complexity of the vocabulary and syntax) and on presentation (typographic aspects such as font size, line height, and line length).
• Researchers have proposed several formulas to determine the readability of a text by means of features like the average sentence length in words (ASL) and the average number of syllables per word (ASW).
• For instance (a minimal sketch follows):
FLESCH = 206.835 − (1.015 × ASL) − (84.6 × ASW)
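To make the formula concrete, here is a minimal sketch of the Flesch score; the whitespace tokenisation and vowel-group syllable counter are crude assumptions, not how textacy computes it:

import re

def flesch_reading_ease(text):
    # ASL: average sentence length in words; ASW: average syllables per word
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = text.split()
    syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower())))
                    for w in words)
    asl = len(words) / sentences
    asw = syllables / len(words)
    return 206.835 - 1.015 * asl - 84.6 * asw

print(flesch_reading_ease("I had an idea that he might. I took the tools with me."))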
Now, I am counting upon joining it here…
• These stories are still very popular: everyday vocabulary, easy to read, with ideas that are easy to grasp
• The series ran for over 40 years (not continuously), yet Doyle maintained the same focus on basic language throughout
• The density of new words in this corpus is 8-11%, which is considered ideal for an 8-year-old (3rd grade)
• Excluding the first 2 novels (shorter and less prone to repetition), the other 7 books fall squarely within that interval
CHARACTERS & CENTRALITY
We tied Toby to the hall table, and reascended the stairs.
The room was as we had left it, save that a sheet had been
draped over the central figure. A weary-looking police-
sergeant reclined in the corner.
nlp = spacy.load('en')
for doc in corpus:
    names = []
    tuples = []
    for par in re.split(r'(?:\r?\n){2}', doc.text):
        parsed = nlp(par)
        entities = []
        for ent in parsed.ents:
            if ent.label_ in ('PERSON', 'LOC', 'GPE'):
                name = re.sub('[^0-9a-zA-Z]+', ' ', ent.text)
                if name not in names:
                    names.append(name)
                for entity in entities:  # pair with entities already seen in this paragraph
                    tuples.append((entity, name))
                entities.append(name)
“Of course, we do not yet know
what the relations may have been…”
ig = igraph.Graph.TupleList(tuples)
vector = ig.eigenvector_centrality()
colors = []
label_colors = []
for value in vector:
    color = colorsys.hsv_to_rgb(2.0 * (1.0 - value) / 3.0, 1.0, 1.0)
    label_colors.append('gray' if value < 0.5 else 'black')
    colors.append('#%02x%02x%02x' % (int(color[0] * 255), int(color[1] * 255), int(color[2] * 255)))
ig.vs['label'] = names
ig.vs['color'] = colors
ig.vs['label_color'] = label_colors
layout = ig.layout('kk')
ig.write_svg('%s.svg' % doc.metadata['code'], layout=layout, width=1280, height=800)
“Of course, we do not yet know what the relations may have been…”
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
AUTOMATIC
SUMMARISATION
It was in the summer of '89, not long after my marriage,
that the events occurred which I am now about to
summarise.
He knitted his brows as though determined
not to omit anything in his narrative.
• Automatic summarisation may be either extraction-based or abstraction-based; the best results come when both are applied.
• TextRank and LexRank are graph-based algorithms in which sentences are vertices and edges model the similarity between them.
• LexRank measures sentence similarity with TF-IDF and cosine similarity, while TextRank uses shared words (a word appearing in two sentences acts like a link between them); both then rank sentences with a PageRank-style algorithm.
• Roughly speaking, a sentence containing many keywords that also appear in other sentences is a hub and receives a higher score.
• The sentences are sorted by this score: since the top N most likely cover all the topics (keywords) in the document, they are taken as the summary (see the sketch below).
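To make the graph intuition concrete, here is a minimal TextRank-style sketch (assumptions: toy sentences, naive word-overlap similarity, damping factor 0.85):

# Sentences are nodes, shared words are weighted edges, and a few
# power-iteration steps yield the ranking.
sentences = ["Holmes examined the letter closely.",
             "Watson read the letter aloud.",
             "The hound howled on the moor."]
words = [set(s.lower().strip('.').split()) for s in sentences]
n = len(sentences)
sim = [[len(words[i] & words[j]) if i != j else 0 for j in range(n)]
       for i in range(n)]
scores = [1.0] * n
for _ in range(20):  # power iteration
    scores = [0.15 + 0.85 * sum(sim[j][i] * scores[j] / max(1, sum(sim[j]))
                                for j in range(n))
              for i in range(n)]
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print('%.3f  %s' % (score, sentence))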
He knitted his brows as though determined
not to omit anything in his narrative.
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer  # assuming LexRank; the deck does not name the summarizer
from sumy.utils import get_stop_words

path = '…'
language = 'english'
tokenizer = Tokenizer(language)
parser = PlaintextParser.from_file(path, tokenizer)
stemmer = Stemmer(language)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(language)
summary = summarizer(parser.document, 10)
for sentence in summary:
    print(sentence)
He knitted his brows as though determined
not to omit anything in his narrative.
THE DYING DETECTIVE
His Last Bow
Arthur Conan Doyle (1913-11)
Not only was her first-floor flat invaded at all hours by throngs of singular and often undesirable characters but her remarkable lodger
showed an eccentricity and irregularity in his life which must have sorely tried her patience.
His incredible untidiness, his addiction to music at strange hours, his occasional revolver practice within doors, his weird and often
malodorous scientific experiments, and the atmosphere of violence and danger which hung around him made him the very worst tenant in London.
Knowing how genuine was her regard for him, I listened earnestly to her story when she came to my rooms in the second year of my married life
and told me of the sad condition to which my poor friend was reduced.
In the dim light of a foggy November day the sick room was a gloomy spot, but it was that gaunt, wasted face staring at me from the bed which
sent a chill to my heart.
His eyes had the brightness of fever, there was a hectic flush upon either cheek, and dark crusts clung to his lips; the thin hands upon the
coverlet twitched incessantly, his voice was croaking and spasmodic.
Then, unable to settle down to reading, I walked slowly round the room, examining the pictures of celebrated criminals with which every wall
was adorned.
I saw a great yellow face, coarse-grained and greasy, with heavy, double-chin, and two sullen, menacing gray eyes which glared at me from
under tufted and sandy brows.
The skull was of enormous capacity, and yet as I looked down I saw to my amazement that the figure of the man was small and frail, twisted in
the shoulders and back like one who has suffered from rickets in his childhood.
Then in an instant his sudden access of strength departed, and his masterful, purposeful talk droned away into the low, vague murmurings of a
semi-delirious man.
You will realize that among your many talents dissimulation finds no place, and that if you had shared my secret you would never have been
able to impress Smith with the urgent necessity of his presence, which was the vital point of the whole scheme.
WORD VECTORS &
CLUSTERING
Our coming was evidently a great event, for station-master
and porters clustered round us to carry out our luggage.
For every step increased the distance between them…
• The term frequency–inverse document frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a corpus.
• It is often used as a weighting factor in information retrieval, text mining and user modelling.
• It consists of the product of two terms:
• the term frequency captures the importance of a term for a document,
• the inverse document frequency measures the specificity of a term for a document in a corpus.
• There are various ways of computing these values; the simplest one uses:
• the raw frequency ft,d for TF,
• the logarithm of the ratio between the number of documents N = |D| and the number of documents containing the term, nt = |{d ∈ D : t ∈ d}|, for IDF.
• In combination with cosine similarity, a measure of similarity between two non-zero vectors that computes the cosine of the angle between them, it provides a crude measure of the distance between documents (a toy example follows):
similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
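A toy worked example (assumed data, using the raw-frequency TF and log-ratio IDF just described):

import math

docs = [['holmes', 'watson', 'holmes'],
        ['watson', 'moor'],
        ['moor', 'hound']]
vocab = sorted({w for d in docs for w in d})
N = len(docs)

def tfidf(doc):
    # TF: raw frequency; IDF: log(N / number of documents containing t)
    return [doc.count(t) / len(doc) * math.log(N / sum(t in d for d in docs))
            for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[1])))  # > 0: they share 'watson'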
For every step increased
the distance between them…
idf = corpus.word_doc_freqs(weighting='idf')
tfs = {doc.metadata['code']: doc.to_bag_of_words(weighting='freq')
       for doc in corpus.docs}
tfidfs = {code: [] for code in tfs}
for key in sorted(idf.keys()):
    for code in tfidfs:
        if key in tfs[code]:
            tfidfs[code].append(tfs[code][key] * idf[key])
        else:
            tfidfs[code].append(0.0)
for i, k_i in enumerate(tfidfs.keys()):
    for j, k_j in enumerate(tfidfs.keys()):
        v = textacy.math_utils.cosine_similarity(tfidfs[k_i], tfidfs[k_j])
        print('%s vs. %s : %.3f' %
              (METADATA[k_i]['title'], METADATA[k_j]['title'], v))
Lady Frances Carfax
vs.
His Last Bow : 0.905
The Greek Interpreter
vs.
Lady Frances Carfax : 0.938
The Greek Interpreter
vs.
The Bruce-Partington Plans : 0.957
The Bruce-Partington Plans
vs.
The Greek Interpreter : 0.957
The Greek Interpreter
vs.
The Greek Interpreter : 1.000
For every step increased
the distance between them…
from matplotlib import mlab
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.optics import optics

corpus = textacy.Corpus('en', docs=docs)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')

sample = tfidf.toarray()
sample_pca = mlab.PCA(sample)  # project the documents to 2D for plotting
sample_cutoff = sample_pca.fracs[1]
sample_2d = sample_pca.project(sample, minfrac=sample_cutoff)

instance = optics(sample, 0.8125, 2)  # OPTICS: radius and minimum neighbours
instance.process()
clusters = instance.get_clusters()
noise = instance.get_noise()

visualizer = cluster_visualizer()
visualizer.append_cluster(noise, sample_2d, marker='x')
visualizer.append_clusters(clusters, sample_2d)
visualizer.show()
For every step increased
the distance between them…
• A word embedding (GloVe, word2vec) is a family of related models that map the words of a corpus to vectors.
• These models are shallow, two-layer neural networks trained to reconstruct the linguistic context of words.
• They take large corpora as input and produce a vector space of several hundred dimensions as output.
• In such vector spaces, each word is assigned a precise position (its vector), so that useful spatial regularities are preserved (sketched below), e.g.:
• KING – MAN + WOMAN = QUEEN
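A minimal sketch of the famous analogy using spaCy word vectors (assumptions: a model shipping real vectors, such as en_core_web_md, is installed, and the vocabulary scan is brute force):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')  # assumed: a model with word vectors
king, man, woman = (nlp.vocab[w].vector for w in ('king', 'man', 'woman'))
target = king - man + woman

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# brute-force search of the vocabulary for the word nearest to the target
candidates = (w for w in nlp.vocab
              if w.has_vector and w.is_lower
              and w.orth_ not in ('king', 'man', 'woman'))
best = max(candidates, key=lambda w: cosine(w.vector, target))
print(best.orth_)  # ideally: queen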
SENTIMENT &
SUBJECTIVITY
I felt of all Holmes's criminals this was the one whom
he would find it hardest to face.
However, he was immune from sentiment.
When this deduction is confirmed point by point,
then the subjective becomes objective.
• Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing to systematically identify the affective states and subjective information in a text.
• Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic.
• Alternatively, sentiment analysis aims at identifying the overall polarity and subjectivity of, or emotional reaction to, a document.
• More sophisticated approaches are able to distinguish among a wider selection of emotional states.
When this deduction is confirmed point by
point, then the subjective becomes objective.
from textblob import TextBlob

for document in corpus:
    blob = TextBlob(document.text)
    for i, sentence in enumerate(blob.sentences):
        print('%s)\tpol: %.3f, sub: %.3f' %
              (i, sentence.sentiment.polarity,
               sentence.sentiment.subjectivity))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
0) pol: -0.125, sub: 1.000
1) pol: 0.136, sub: 0.455
2) pol: -0.052, sub: 0.196
3) pol: -0.625, sub: 1.000
4) pol: 0.200, sub: 0.700
5) pol: 0.127, sub: 0.833
6) pol: -0.071, sub: 0.362
7) pol: 0.000, sub: 0.000
8) pol: 0.000, sub: 0.000
9) pol: 0.300, sub: 0.100
10) pol: 0.000, sub: 0.000
11) pol: 0.000, sub: 0.000
12) pol: -0.425, sub: 0.675
13) pol: -0.125, sub: 0.375
14) pol: 0.600, sub: 1.000
15) pol: 0.000, sub: 0.000
16) pol: 0.000, sub: 0.000
17) pol: 0.417, sub: 0.500
18) pol: 0.000, sub: 0.000
19) pol: 0.417, sub: 0.500
20) pol: 0.000, sub: 0.000
...
When this deduction is confirmed point by point, then the subjective becomes objective.
When this deduction is confirmed point by
point, then the subjective becomes objective.
for doc in corpus:
    for i, sent in enumerate(doc.sents):
        scores = textacy.lexicon_methods.emotional_valence(sent)
        values = ['%s: %.3f' % (k, scores[k]) for k in sorted(scores.keys())]
        print('%s)\t%s' % (i, '\n\t'.join(values)))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
I don't think that any of my adventures with Mr.
Sherlock Holmes opened quite so abruptly, or so
dramatically, as that which I associate with The
Three Gables. I had not seen Holmes for some
days and had no idea of the new channel into
which his activities had been directed. He was
in a chatty mood that morning, however, and had
just settled me into the well-worn low armchair
on one side of the fire, while he had curled
down with his pipe in his mouth upon the
opposite chair, when our visitor arrived. If I
had said that a mad bull had arrived it would
give a clearer impression of what occurred.
LATENT TOPICS
“I have known him for some time,” said I,
“but I never knew him do anything yet without
a very good reason,” and with that our conversation
drifted off on to other topics.
He was face to face with an infinite possibility
of latent evil…
• Latent Dirichlet Allocation (LDA) is a generative model that automatically discovers the topics that a collection of documents contains.
• It represents documents as mixtures of topics from which words are drawn with certain probabilities.
• It assumes that each document
- has a number N of words (according to a Poisson distribution),
- has a topic mixture over a fixed set of K topics (according to a Dirichlet distribution).
• Then, for each word in each document:
- a topic is picked at random (according to the mixture sampled above),
- the word itself is generated at random (according to that topic's word distribution).
• Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find the set of topics that most likely generated the collection (Gibbs sampling). A sketch of the generative story follows.
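To make the generative story concrete, here is a minimal sketch with assumed toy topics and a fixed mixture (a real LDA would sample the mixture from a Dirichlet, and topics would not be uniform):

import random

# two toy topics, each a (uniform stand-in for a) word distribution
topics = {0: ['holmes', 'watson', 'case', 'deduction'],
          1: ['moor', 'hound', 'night', 'fog']}
mixture = [0.7, 0.3]  # per-document topic mixture

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        topic = random.choices([0, 1], weights=mixture)[0]  # pick a topic
        words.append(random.choice(topics[topic]))          # then a word from it
    return words

print(generate_document(10))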
He was face to face with
an infinite possibility of latent evil…
corpus = textacy.Corpus('en', docs=docs)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')

model = textacy.tm.TopicModel('lda', n_topics=60)
model.fit(tfidf)

for topic_idx, top_terms in model.top_topic_terms(idx, top_n=5):
    print('Topic #%s: %s' % (topic_idx, '\t'.join(top_terms)))

topics = model.transform(tfidf)
for doc_idx, top_topics in model.top_doc_topics(topics):
    print('%s: %s' % (corpus.docs[doc_idx].metadata['title'],
                      '\t'.join(['Topic #%s (%.2f)' % (t[0], 100 * t[1])
                                 for t in top_topics])))

model.termite_plot(tfidf, idx)
He was face to face with an infinite possibility of latent evil…
Topic #0: lestrade london woman window lady
miss street inspector hour sherlock
Topic #6: jones wilson hopkins inspector sholto
trevor league office birmingham pinner
Topic #9: gregson mycroft mcmurdo warren garcia
douglas barker susan inspector greek
Topic #10: moor mortimer henry duke grace
american charles bicycle hopkins wilder
Topic #11: mcmurdo douglas susan barker robert
steve barney jones smith sholto
Topic #12: robert ferguson smith trevor woodley
carruthers jones mason sholto gregson
...
The Sign of the Four: Topic #0 (46.77) Topic #12 (25.02) Topic #6 (23.45)
A Study in Scarlet: Topic #0 (53.95) Topic #52 (35.67) Topic #51 (33.71)
The Hound of the Baskervilles: Topic #10 (50.89) Topic #0 (44.51) Topic #54 (38.52)
The Valley of Fear: Topic #11 (49.42) Topic #9 (28.17) Topic #0 (27.12)
...
He was face to face with an infinite possibility of latent evil…
“You are very welcome to put any
questions that you like to me now,
and there is no danger that I will
refuse to answer them.”

 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

A Study in (P)rose

  • 1. A STUDY IN (P)ROSE NLP Applied to Sherlock Holmes Stories Stefano Bragaglia
  • 2. The shadow was seated in a chair, black outline upon the luminous screen of the window. • Corpora • Basic Statistics • Content & Word Frequency • Readability • Characters & Centrality • Automatic Summarisation • Word Vectors & Clustering • Sentiment & Subjectivity • Latent Topics (221B Baker Street)
  • 3. “I only require a few missing links to have an entirely connected case.” • http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb • http://brandonrose.org/clustering • https://theinvisibleevent.wordpress.com/2015/11/08/35-the-language-of-sherlock-holmes-a-study-in-consistency/ • http://www.christianpeccei.com/holmes/ • https://github.com/sgsinclair/alta/blob/master/ipynb/Python.ipynb • http://data-mining.philippe-fournier-viger.com/tutorial-how-to-discover-hidden-patterns-in-text-documents/ • http://sujitpal.blogspot.co.uk/2015/07/discovering-entity-relationships-in.html • All the pictures are copyright of the respective authors.
  • 4. “I had an idea that he might, and I took the liberty of bringing the tools with me.” • matplotlib – http://matplotlib.org • newspaper3k – https://github.com/codelucas/newspaper • python-igraph – http://igraph.org/python/#pyinstallosx • pyclustering – https://github.com/annoviko/pyclustering • spaCy – https://spacy.io • sumy – https://github.com/miso-belica/sumy • textaCy – https://textacy.readthedocs.io/en/latest/index.html • textblob – https://textblob.readthedocs.io/en/dev/ • word_cloud – https://github.com/amueller/word_cloud
  • 5. CORPORA “I have some documents here,” said my friend Sherlock Holmes, as we sat one winter's night on either side of the fire, “which I really think, Watson, that it would be worth your while to glance over.
  • 6. “I seem to have heard some queer stories about him.” • In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). • The texts may be in a single language (monolingual corpus) or in multiple languages (multilingual corpus). If formatted for side-by-side comparison, they are called aligned parallel corpora (a translation corpus for translations, else a comparable corpus). • They are often subjected to annotation to make them more useful, e.g. POS-tagging: information about each word’s part of speech is added as a tag (a tagging sketch follows below). If they contain further structured levels of analysis, they are called Treebanks or Parsed Corpora.
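A minimal sketch of the POS-tagging annotation just described, using spaCy; the `en_core_web_sm` model name is an assumption (any English pipeline would do):

import spacy

nlp = spacy.load('en_core_web_sm')   # small English pipeline, assumed installed
doc = nlp('I seem to have heard some queer stories about him.')
for token in doc:
    print(token.text, token.pos_, token.tag_)   # coarse- and fine-grained POS tags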
  • 9. “I seem to have heard some queer stories about him.” The complete Sherlock Holmes Canon: • 60 adventures in 9 books: • 4 novels • 56 short stories in 5 collections • Freely available in several formats: • https://sherlock-holm.es/
  • 10. “I seem to have heard some queer stories about him.”
  The Novels:
  STUD  A Study in Scarlet             1887-10
  SIGN  The Sign of the Four           1890-02
  HOUN  The Hound of the Baskervilles  1901-08
  VALL  The Valley of Fear             1914-09
  The Adventures of Sherlock Holmes:
  SCAN  A Scandal in Bohemia                   1891-07
  REDH  The Red-Headed League                  1891-08
  IDEN  A Case of Identity                     1891-09
  BOSC  The Boscombe Valley Mystery            1891-10
  FIVE  The Five Orange Pips                   1891-11
  TWIS  The Man with the Twisted Lip           1891-12
  BLUE  The Adventure of the Blue Carbuncle    1892-01
  SPEC  The Adventure of the Speckled Band     1892-02
  ENGR  The Adventure of the Engineer’s Thumb  1892-03
  NOBL  The Adventure of the Noble Bachelor    1892-04
  BERY  The Adventure of the Beryl Coronet     1892-05
  COPP  The Adventure of the Copper Beeches    1892-06
  The Memoirs of Sherlock Holmes:
  SILV  Silver Blaze             1892-12
  YELL  The Yellow Face          1893-02
  STOC  The Stockbroker’s Clerk  1893-03
  GLOR  The “Gloria Scott”       1893-04
  MUSG  The Musgrave Ritual      1893-05
  REIG  The Reigate Puzzle       1893-06
  CROO  The Crooked Man          1893-07
  RESI  The Resident Patient     1893-08
  GREE  The Greek Interpreter    1893-09
  NAVA  The Naval Treaty         1893-10
  FINA  The Final Problem        1893-12
  • 11. “I seem to have heard some queer stories about him.”
  The Return of Sherlock Holmes:
  EMPT  The Adventure of the Empty House             1903-09
  NORW  The Adventure of the Norwood Builder         1903-10
  DANC  The Adventure of the Dancing Men             1903-12
  SOLI  The Adventure of the Solitary Cyclist        1903-12
  PRIO  The Adventure of the Priory School           1904-01
  BLAC  The Adventure of Black Peter                 1904-02
  CHAS  The Adventure of Charles Augustus Milverton  1904-03
  SIXN  The Adventure of the Six Napoleons           1904-04
  3STU  The Adventure of the Three Students          1904-06
  GOLD  The Adventure of the Golden Pince-Nez        1904-07
  MISS  The Adventure of the Missing Three-Quarter   1904-08
  ABBE  The Adventure of the Abbey Grange            1904-09
  SECO  The Adventure of the Second Stain            1904-12
  His Last Bow:
  WIST  The Adventure of Wisteria Lodge              1908-08
  CARD  The Adventure of the Cardboard Box           1893-01
  REDC  The Adventure of the Red Circle              1911-03
  BRUC  The Adventure of the Bruce-Partington Plans  1908-12
  DYIN  The Adventure of the Dying Detective         1913-11
  LADY  The Disappearance of Lady Frances Carfax     1911-12
  DEVI  The Adventure of the Devil’s Foot            1910-12
  LAST  His Last Bow                                 1917-09
  The Case-Book of Sherlock Holmes:
  ILLU  The Illustrious Client                       1924-11
  BLAN  The Blanched Soldier                         1926-10
  MAZA  The Adventure of the Mazarin Stone           1921-10
  3GAB  The Adventure of the Three Gables            1926-09
  SUSS  The Adventure of the Sussex Vampire          1924-01
  3GAR  The Adventure of the Three Garridebs         1924-10
  THOR  The Problem of Thor Bridge                   1922-02
  CREE  The Adventure of the Creeping Man            1923-03
  LION  The Adventure of the Lion’s Mane             1926-11
  VEIL  The Adventure of the Veiled Lodger           1927-01
  SHOS  The Adventure of Shoscombe Old Place         1927-03
  RETI  The Adventure of the Retired Colourman       1926-12
  • 12. BASIC STATISTICS “You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician.”
  • 13. Now, I am counting upon joining it here…
  for document in corpus:
      statistics = {'words': 0, 'sentences': 0, 'characters': 0}
      for sentence in document:
          statistics['sentences'] += 1
          for word in sentence:
              if not word.is_punct:                 # skip punctuation tokens
                  statistics['words'] += 1
                  statistics['characters'] += len(word)
      statistics['length'] = statistics['characters'] / statistics['words']    # avg word length
      statistics['size'] = statistics['words'] / statistics['sentences']       # avg sentence length
  • 14. Now, I am counting upon joining it here…
  meta = {
      'stud.txt': {'author': 'Arthur Conan Doyle', 'collection': 'The Novels',
                   'title': 'A Study in Scarlet', 'code': 'STUD', 'pub_date': '1887-11'},
      …
  }
  docs = []
  for filename in os.listdir(folder):
      with open(filename, 'r', encoding='utf-8') as file:
          content = file.read()
      doc = textacy.Doc(content, metadata=meta[filename])
      docs.append(doc)
  corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
  for document in corpus:
      print(textacy.text_stats.readability_stats(document))
  print(textacy.text_stats.readability_stats(corpus))
  Output: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09)
  Statistics:
  - Characters: 24,704
  - Syllables: 7,704
  - Words: 6,188
  - Unique words: 1,421
  - Polysyllable words: 284
  - Sentences: 460
  - Avg characters per word: 3.99
  - Avg words per sentence: 13.45
  Indexes:
  - Automated Readability: 4.10
  - Coleman-Liau: 5.47
  - Flesch-Kincaid: 4.35
  - Flesch Ease Readability: 87.85
  - Gunning-Fog: 7.22
  - SMOG: 7.62
  • 15. Now, I am counting upon joining it here…
  • 16. Now, I am counting upon joining it here…
                       A. C. Doyle                 W. Shakespeare             %
                       4 novels, 56 short stories  38 plays, 154 sonnets
  Total words          730,000                     1,035,000                  -29.5
  Unique words         20,000                      27,000                     -25.9
  Unique lemmas        15,000                      22,000                     -31.8
  Total sentences      39,000                      93,000                     -58.0
  Avg word length      3.88                        4.41                       -12.0
  Avg sentence length  18.68                       11.08                      +68.6
  • 19. Now, I am counting upon joining it here… • Prodigious vocabulary: only about a third fewer total words (a quarter fewer unique words) than Shakespeare, despite the much shorter corpus • In Shakespeare's time there were far fewer words (he had to “invent” many of them, e.g. eyeball), whereas contemporary authors have many more conventions to obey • As a term of comparison, consider that modern English contains around 250,000 terms, many of them neologisms like: robot, computer, internet, unleaded, twerking, …
  • 20. CONTENT & WORD FREQUENCY “Because I made a blunder, my dear Watson—which is, I am afraid, a more common occurrence than any one would think who only knew me through your memoirs.”
  • 21. I frequently found my thoughts turning in her direction and wondering…
  from collections import defaultdict
  for document in corpus:
      f = defaultdict(int)
      for sentence in document:
          for word in sentence:
              if not word.is_punct:
                  f[word.text] += 1                 # count raw token strings
      print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
  • 22. I frequently found my thoughts turning in her direction and wondering… Counting raw tokens treats differently-cased forms as distinct words: “You broke the thread of my thoughts; but perhaps it is as well.” “Perhaps that is why we are so subtly influenced by it.”
  • 23. I frequently found my thoughts turning in her direction and wondering…
  from collections import defaultdict
  for document in corpus:
      f = defaultdict(int)
      for sentence in document:
          for word in sentence:
              if not word.is_punct:
                  f[word.text.lower()] += 1         # fold case: 'Perhaps' and 'perhaps' count together
      print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
  • 24. I frequently found my thoughts turning in her direction and wondering… Lower-casing is not enough: inflected forms such as say/says are still counted apart: “Funny, she didn't say good-bye.” “Your correspondent says two friends.”
  • 25. I frequently found my thoughts turning in her direction and wondering…
  from collections import defaultdict
  for document in corpus:
      f = defaultdict(int)
      for sentence in document:
          for word in sentence:
              if not word.is_punct:
                  f[word.lemma_.lower()] += 1       # count lemmas: 'say', 'says', 'said' collapse
      print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
  • 26. I frequently found my thoughts turning in her direction and wondering… Standing at the window, I watched her walking briskly down the street, until the gray turban and white feather were but a speck in the sombre crowd.
  • 27. The furniture and pictures were of the most common and vulgar description. • Extremely common words have little or no use when retrieving information from documents. • Such words are called stop words and are usually excluded completely from the vocabulary. • The general strategy to determine stop words is to sort all words by frequency and pick the first N, often hand-filtering them for their semantic content relative to the domain (a sketch of this strategy follows below). • It is possible to use other sources, such as: • https://en.wikipedia.org/wiki/Most_common_words_in_English
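A minimal sketch of the frequency-based strategy above; the cut-off N = 50 is an arbitrary assumption, and in practice the resulting list would still be hand-filtered:

from collections import Counter

def stopword_candidates(corpus, n=50):
    counts = Counter(word.text.lower()
                     for document in corpus     # spaCy-style docs iterate over tokens
                     for word in document
                     if not word.is_punct)
    return [w for w, _ in counts.most_common(n)]   # top-N most frequent words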
  • 28. I frequently found my thoughts turning in her direction and wondering…
  from collections import defaultdict
  stopwords = […]
  for document in corpus:
      f = defaultdict(int)
      for sentence in document:
          for word in sentence:
              if not word.is_punct and word.lemma_.lower() not in stopwords:
                  f[word.lemma_.lower()] += 1
      print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
  • 29. I frequently found my thoughts turning in her direction and wondering… With stop words removed, the content words surface: “A friend of Mr. Sherlock is always welcome!” Sherlock Holmes rubbed his hands with delight. It's all very well for you to laugh, Mr. Sherlock Holmes.
  • 30. I frequently found my thoughts turning in her direction and wondering… • In linguistics, an n-gram is a contiguous sequence of n items (such as phonemes, syllables, letters, words or base pairs) from a given sequence of text or speech. • An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n-1)-order Markov model (e.g. predictive keyboards); a bigram sketch follows below. • N-grams provide a measure of collocation frequency, and therefore may help identify: • Syntagmatic associations (e.g. cold + weather, Burkina + Faso, etc.) • Paradigmatic associations (e.g. synonyms, co-reference resolution, etc.) A STUDY IN SCARLET, VI. Tobias Gregson Shows What He Can Do, Arthur Conan Doyle (1887-11): "Look here, Mr. Sherlock Holmes," he said.
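A minimal bigram sketch of the Markov idea above; the toy sentence is just the slide's running example, tokenised naively:

from collections import Counter, defaultdict

def bigram_model(tokens):
    nxt = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):   # every adjacent pair of tokens is a bigram
        nxt[a][b] += 1
    return nxt

model = bigram_model('look here mr sherlock holmes he said'.split())
print(model['sherlock'].most_common(1))   # [('holmes', 1)]: the likeliest next word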
  • 31. I frequently found my thoughts turning in her direction and wondering…
  corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
  for doc in corpus:
      bot = doc.to_bag_of_terms(ngrams={1, 2, 3}, drop_determiners=True,
                                filter_stops=True, filter_punct=True,
                                filter_nums=False, as_strings=True)
      print({term: bot[term] for term in sorted(bot, key=bot.get, reverse=True)})
  bot = corpus.to_bag_of_terms(ngrams={1, 2, 3}, drop_determiners=True,
                               filter_stops=True, filter_punct=True,
                               filter_nums=False, as_strings=True)
  print({term: bot[term] for term in sorted(bot, key=bot.get, reverse=True)})
  Output: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09)
  Occurrences:
  - holmes (54)
  - mr. holmes (18)
  - masser holmes (15)
  - susan (15)
  - one (14)
  - say holmes (13)
  - maberley (10)
  - watson (9)
  - mrs. maberley (8)
  - steve (6)
  - first (6)
  - be not (6)
  - douglas (5)
  - london (5)
  ...
  • 32. “It is the brightest rift which I can at present see in the clouds.”
  stopwords = […]
  corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
  for doc in corpus:
      wordcloud = WordCloud(max_words=1000, margin=0,
                            random_state=1).generate(doc.text)
      matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
      matplotlib.pyplot.axis('off')
      matplotlib.pyplot.figure()
      wordcloud = WordCloud(max_words=1000, margin=0, random_state=1,
                            stopwords=stopwords).generate(doc.text)
      matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
      matplotlib.pyplot.axis('off')
      matplotlib.pyplot.show()
  • 33. “It is the brightest rift which I can at present see in the clouds.”
  • 34. I frequently found my thoughts turning in her direction and wondering… • Holmes is never lower than the 4th most frequent word; it drops to 4th only in The Hound of the Baskervilles, in which he is rarely on stage • man (and its synonyms) is much more frequent than woman: Victorian misogyny? • say is by far Doyle's favourite speech-attribution verb • The language is very concrete (say, see, come, know, go and think), with almost no place for emotions (cry): the scientific approach vs. spiritism • Only little and time (more subjective words) make the top 15 • The word the alone accounts for 6% of the whole corpus
  • 35. READABILITY My companion gave a sudden chuckle of comprehension. “And not a very obscure cipher, Watson,” said he. “Why, of course, it is Italian! The A means that it is addressed to a woman. ‘Beware! Beware! Beware!’ How's that, Watson?
  • 36. Now, I am counting upon joining it here…
  meta = {
      'stud.txt': {'author': 'Arthur Conan Doyle', 'collection': 'The Novels',
                   'title': 'A Study in Scarlet', 'code': 'STUD', 'pub_date': '1887-11'},
      …
  }
  docs = []
  for filename in os.listdir(folder):
      with open(filename, 'r', encoding='utf-8') as file:
          content = file.read()
      doc = textacy.Doc(content, metadata=meta[filename])
      docs.append(doc)
  corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
  for document in corpus:
      print(textacy.text_stats.readability_stats(document))
  print(textacy.text_stats.readability_stats(corpus))
  Output: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09)
  Statistics:
  - Characters: 24,704
  - Syllables: 7,704
  - Words: 6,188
  - Unique words: 1,421
  - Polysyllable words: 284
  - Sentences: 460
  - Avg characters per word: 3.99
  - Avg words per sentence: 13.45
  Indexes:
  - Automated Readability: 4.10
  - Coleman-Liau: 5.47
  - Flesch-Kincaid: 4.35
  - Flesch Ease Readability: 87.85
  - Gunning-Fog: 7.22
  - SMOG: 7.62
  • 37. Now, I am counting upon joining it here… • Readability is the ease with which a reader can understand a written text. • Readability depends on content (the complexity of its vocabulary and syntax) and presentation (typographic aspects such as font size, line height, and line length). • Researchers have proposed several formulas that estimate the readability of a text from features such as the average sentence length in words (ASL) and the average number of syllables per word (ASW). • For instance: FLESCH = 206.835 − (1.015 × ASL) − (84.6 × ASW) (a worked example follows below)
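A minimal sketch of the Flesch formula above, plugged with the figures reported for The Three Gables on slide 14, so the result can be checked against the 87.85 shown there:

def flesch_reading_ease(n_sentences, n_words, n_syllables):
    asl = n_words / n_sentences   # average sentence length, in words
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

print(flesch_reading_ease(460, 6188, 7704))   # ~87.85, matching the slide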
  • 38. Now, I am counting upon joining it here… • These stories are still very popular: a very common vocabulary, easy to read, with ideas easy to grasp • The series ran for over 40 years (not continuously), yet Doyle maintained the same focus on plain language • The density of new words in this corpus is 8-11%, which is considered ideal for an 8-year-old (3rd-grade) reader • Excluding the first 2 novels (shorter and less prone to repetition), the other 7 books fall squarely within that interval
  • 39. CHARACTERS & CENTRALITY We tied Toby to the hall table, and reascended the stairs. The room was as we had left it, save that a sheet had been draped over the central figure. A weary-looking police-sergeant reclined in the corner.
  • 40. “Of course, we do not yet know what the relations may have been…”
  nlp = spacy.load('en')
  for doc in corpus:
      names = []
      tuples = []
      for par in re.split(r'(?:\r?\n){2}', doc.text):       # split the text into paragraphs
          parser = nlp(par)
          entities = []
          for ent in parser.ents:
              if ent.label_ in ('PERSON', 'LOC', 'GPE'):    # keep people and places only
                  name = re.sub('[^0-9a-zA-Z]+', ' ', ent.text)
                  if name not in names:
                      names.append(name)
                  for entity in entities:
                      tuples.append((entity, name))         # co-occurrence within the paragraph
                  entities.append(name)
  • 41. “Of course, we do not yet know what the relations may have been…”
  ig = igraph.Graph.TupleList(tuples)
  vector = ig.eigenvector_centrality()
  colors = []
  label_colors = []
  for value in vector:
      color = colorsys.hsv_to_rgb(2.0 * (1.0 - value) / 3.0, 1.0, 1.0)   # red = central, blue = peripheral
      label_colors.append('gray' if value < 0.5 else 'black')
      colors.append('#%02x%02x%02x' % (int(color[0] * 255),
                                       int(color[1] * 255),
                                       int(color[2] * 255)))
  ig.vs['label'] = names
  ig.vs['color'] = colors
  ig.vs['label_color'] = label_colors
  layout = ig.layout('kk')                                  # Kamada-Kawai layout
  ig.write_svg('%s.svg' % doc.metadata['code'], layout=layout,
               width=1280, height=800)
  • 42. “Of course, we do not yet know what the relations may have been…” [Character graph: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09)]
  • 43. AUTOMATIC SUMMARISATION It was in the summer of '89, not long after my marriage, that the events occurred which I am now about to summarise.
  • 44. He knitted his brows as though determined not to omit anything in his narrative. • Automatic summarisation may be either extraction-based or abstraction-based; the best results come when both are applied. • TextRank and LexRank are graph-based algorithms in which sentences are vertices and edges model the similarity between them. • LexRank measures similarity with TF-IDF and cosine similarity, while TextRank uses word overlap (a word appearing in two sentences acts like a link between them); both then score sentences with PageRank. • Roughly speaking, a sentence containing many keywords that also appear in other sentences is a hub and receives a higher score. • The sentences are sorted by this score: since the top N most likely cover all the topics (keywords) in the document, they are taken as the summary (a sketch follows below).
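A minimal graph-based sketch in the LexRank spirit described above (TF-IDF cosine similarities as edge weights, PageRank as the sentence score); scikit-learn and networkx are assumptions, not tools from the deck's list, and the deck's own sumy code follows on the next slide:

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summary(sentences, n=3):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)              # sentence-to-sentence similarity matrix
    graph = nx.from_numpy_array(sim)            # weighted graph: sentences as vertices
    scores = nx.pagerank(graph)                 # hub sentences receive higher scores
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # return the top-N in original order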
  • 49. He knitted his brows as though determined not to omit anything in his narrative.
  # Imports assumed (the slide omits them); the slide also leaves the summarizer
  # class generic, and LexRankSummarizer is one concrete choice.
  from sumy.parsers.plaintext import PlaintextParser
  from sumy.nlp.tokenizers import Tokenizer
  from sumy.nlp.stemmers import Stemmer
  from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
  from sumy.utils import get_stop_words

  path = '…'
  language = 'english'
  tokenizer = Tokenizer(language)
  parser = PlaintextParser.from_file(path, tokenizer)
  stemmer = Stemmer(language)
  summarizer = Summarizer(stemmer)
  summarizer.stop_words = get_stop_words(language)
  summary = summarizer(parser.document, 10)     # 10-sentence summary
  for sentence in summary:
      print(sentence)
  • 50. He knitted his brows as though determined not to omit anything in his narrative. THE DYING DETECTIVE His Last Bow Arthur Conan Doyle (1913-11) Not only was her first-floor flat invaded at all hours by throngs of singular and often undesirable characters but her remarkable lodger showed an eccentricity and irregularity in his life which must have sorely tried her patience. His incredible untidiness, his addiction to music at strange hours, his occasional revolver practice within doors, his weird and often malodorous scientific experiments, and the atmosphere of violence and danger which hung around him made him the very worst tenant in London. Knowing how genuine was her regard for him, I listened earnestly to her story when she came to my rooms in the second year of my married life and told me of the sad condition to which my poor friend was reduced. In the dim light of a foggy November day the sick room was a gloomy spot, but it was that gaunt, wasted face staring at me from the bed which sent a chill to my heart. His eyes had the brightness of fever, there was a hectic flush upon either cheek, and dark crusts clung to his lips; the thin hands upon the coverlet twitched incessantly, his voice was croaking and spasmodic. Then, unable to settle down to reading, I walked slowly round the room, examining the pictures of celebrated criminals with which every wall was adorned. I saw a great yellow face, coarse-grained and greasy, with heavy, double-chin, and two sullen, menacing gray eyes which glared at me from under tufted and sandy brows. The skull was of enormous capacity, and yet as I looked down I saw to my amazement that the figure of the man was small and frail, twisted in the shoulders and back like one who has suffered from rickets in his childhood. Then in an instant his sudden access of strength departed, and his masterful, purposeful talk droned away into the low, vague murmurings of a semi-delirious man. You will realize that among your many talents dissimulation finds no place, and that if you had shared my secret you would never have been able to impress Smith with the urgent necessity of his presence, which was the vital point of the whole scheme.
  • 51. WORD VECTORS & CLUSTERING Our coming was evidently a great event, for station-master and porters clustered round us to carry out our luggage.
  • 52. For every step increased the distance between them… • The term frequency–inverse document frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a corpus. • It is often used as a weighting factor in information retrieval, text mining and user modelling. • It is the product of two terms: • the term frequency captures the importance of a term for a document, • the inverse document frequency measures the specificity of a term for a document in a corpus. • There are various ways of computing these values; the simplest uses: • the raw frequency f(t, d) for TF, • the logarithm of the ratio between N = |D| and n_t = |{d ∈ D : t ∈ d}|, the number of documents containing the term t, for IDF. • In combination with cosine similarity, a measure of similarity between two non-zero vectors given by the cosine of the angle between them, it provides a crude measure of the distance between documents: similarity = cos(θ) = (A · B) / (||A|| ||B||) (a worked toy example follows below)
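A toy worked example of the definitions above (raw-count TF, log(N/n_t) IDF, cosine similarity); the three miniature "documents" are invented for illustration:

import math

docs = [['holmes', 'watson', 'case'],
        ['holmes', 'pipe'],
        ['watson', 'case', 'case']]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

def tfidf(doc):
    vec = []
    for t in vocab:
        tf = doc.count(t)                         # raw term frequency
        n_t = sum(1 for d in docs if t in d)      # documents containing t
        vec.append(tf * math.log(N / n_t))
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[2])))     # > 0: they share 'watson' and 'case'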
  • 57. For every step increased the distance between them…
  idf = corpus.word_doc_freqs(weighting='idf')
  tfs = {doc.metadata['code']: doc.to_bag_of_words(weighting='freq')
         for doc in corpus.docs}
  tfidfs = {code: [] for code in tfs}
  for key in sorted(idf.keys()):
      for code in tfidfs:
          if key in tfs[code]:
              tfidfs[code].append(tfs[code][key] * idf[key])
          else:
              tfidfs[code].append(0.0)
  for i, k_i in enumerate(tfidfs.keys()):
      for j, k_j in enumerate(tfidfs.keys()):
          v = textacy.math_utils.cosine_similarity(tfidfs[k_i], tfidfs[k_j])
          print('%s vs. %s : %.3f' % (METADATA[k_i]['title'], METADATA[k_j]['title'], v))
  Output:
  Lady Frances Carfax vs. His Last Bow : 0.905
  The Greek Interpreter vs. Lady Frances Carfax : 0.938
  The Greek Interpreter vs. The Bruce-Partington Plans : 0.957
  The Bruce-Partington Plans vs. The Greek Interpreter : 0.957
  The Greek Interpreter vs. The Greek Interpreter : 1.000
  • 58. For every step increased the distance between them…
  corpus = textacy.Corpus('en', docs=documents)
  terms = (doc.to_terms_list(ngrams={1}, normalize='lemma') for doc in corpus)
  tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')
  sample = tfidf.toarray()
  sample_pca = mlab.PCA(sample)                 # project documents to 2D for plotting
  sample_cutoff = sample_pca.fracs[1]
  sample_2d = sample_pca.project(sample, minfrac=sample_cutoff)
  instance = optics(sample, 0.8125, 2)          # OPTICS clustering: radius and min neighbours
  instance.process()
  clusters = instance.get_clusters()
  noise = instance.get_noise()
  visualizer = cluster_visualizer()
  visualizer.append_cluster(noise, sample_2d, marker='x')
  visualizer.append_clusters(clusters, sample_2d)
  visualizer.show()
  • 59. For every step increased the distance between them… • A word embedding (GloVe, word2vec) is a family of related models that map the words of a corpus into a vector space. • These models are simple two-layer neural networks trained to reconstruct the linguistic context of words. • They take large corpora as input and produce a vector space of several hundred dimensions as output. • In such vector spaces each word is assigned a precise position, i.e. a vector, in a way that preserves meaningful spatial relations: • KING – MAN + WOMAN = QUEEN (a sketch follows below)
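A minimal sketch of the analogy above with gensim (an assumption: gensim is not in the deck's tool list); training word2vec on the Holmes corpus alone would be too small for clean analogies, so pretrained vectors are loaded instead:

import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-50')   # small pretrained GloVe vectors
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# [('queen', ...)]: the arithmetic KING - MAN + WOMAN lands nearest QUEEN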
  • 69. SENTIMENT & SUBJECTIVITY I felt of all Holmes's criminals this was the one whom he would find it hardest to face. However, he was immune from sentiment.
  • 70. When this deduction is confirmed point by point, then the subjective becomes objective. • Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing to systematically identify the affective states and subjective information in a text. • Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic. • Alternatively, sentiment analysis aims at identifying the overall polarity and subjectivity or emotional reaction to a document. • More sophisticated approaches are able to distinguish among a wider selection of emotional states.
  • 74. When this deduction is confirmed point by point, then the subjective becomes objective.
  from textblob import TextBlob
  for document in corpus:
      blob = TextBlob(document.text)
      for i, sentence in enumerate(blob.sentences):
          print('%s)\tpol: %.3f, sub: %.3f' % (i, sentence.sentiment.polarity,
                                               sentence.sentiment.subjectivity))
  Output: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09)
  0)  pol: -0.125, sub: 1.000
  1)  pol: 0.136, sub: 0.455
  2)  pol: -0.052, sub: 0.196
  3)  pol: -0.625, sub: 1.000
  4)  pol: 0.200, sub: 0.700
  5)  pol: 0.127, sub: 0.833
  6)  pol: -0.071, sub: 0.362
  7)  pol: 0.000, sub: 0.000
  8)  pol: 0.000, sub: 0.000
  9)  pol: 0.300, sub: 0.100
  10) pol: 0.000, sub: 0.000
  11) pol: 0.000, sub: 0.000
  12) pol: -0.425, sub: 0.675
  13) pol: -0.125, sub: 0.375
  14) pol: 0.600, sub: 1.000
  15) pol: 0.000, sub: 0.000
  16) pol: 0.000, sub: 0.000
  17) pol: 0.417, sub: 0.500
  18) pol: 0.000, sub: 0.000
  19) pol: 0.417, sub: 0.500
  20) pol: 0.000, sub: 0.000
  ...
  • 75. When this deduction is confirmed point by point, then the subjective becomes objective.
  • 76. When this deduction is confirmed point by point, then the subjective becomes objective.
  for doc in corpus:
      for i, sent in enumerate(doc.sents):
          scores = textacy.lexicon_methods.emotional_valence(sent)
          values = ['%s: %.3f' % (k, scores[k]) for k in sorted(scores.keys())]
          print('%s)\t%s' % (i, '\n\t'.join(values)))
  Input: THE THREE GABLES, The Case-Book of Sherlock Holmes, Arthur Conan Doyle (1926-09) I don't think that any of my adventures with Mr. Sherlock Holmes opened quite so abruptly, or so dramatically, as that which I associate with The Three Gables. I had not seen Holmes for some days and had no idea of the new channel into which his activities had been directed. He was in a chatty mood that morning, however, and had just settled me into the well-worn low armchair on one side of the fire, while he had curled down with his pipe in his mouth upon the opposite chair, when our visitor arrived. If I had said that a mad bull had arrived it would give a clearer impression of what occurred.
  • 77. LATENT TOPICS “I have known him for some time,” said I, “but I never knew him do anything yet without a very good reason,” and with that our conversation drifted off on to other topics.
  • 78. He was face to face with an infinite possibility of latent evil… • Latent Dirichlet Allocation (LDA) is a generative model that automatically discovers the topics that documents contain. • It represents documents as mixtures of topics from which words are drawn with certain probabilities. • It assumes that each document: - has a number N of words (drawn from a Poisson distribution), - has a topic mixture over a fixed set of K topics (drawn from a Dirichlet distribution). • Then, for each word in each document: - a topic is picked at random (according to the document's topic mixture), - the word itself is generated at random (according to that topic's word distribution). • Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find the set of topics that most likely generated the collection (e.g. via Gibbs sampling); a sketch of the generative process follows below.
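To make the generative story concrete, here is a toy sketch of it in plain numpy. The vocabulary, topic–word probabilities, and Dirichlet prior are hypothetical illustrations, not values learned from the canon:

import numpy as np

# Toy sketch of LDA's generative story: a hypothetical 2-topic,
# 6-word vocabulary; alpha and the topic-word rows are illustrative.
rng = np.random.default_rng(221)
vocabulary = ['holmes', 'watson', 'pipe', 'moor', 'hound', 'fog']
topic_word = np.array([[0.40, 0.30, 0.20, 0.05, 0.03, 0.02],   # topic 0: Baker Street
                       [0.05, 0.05, 0.05, 0.35, 0.30, 0.20]])  # topic 1: the moor
alpha = [0.5, 0.5]

n_words = rng.poisson(lam=10)   # document length ~ Poisson
theta = rng.dirichlet(alpha)    # per-document topic mixture ~ Dirichlet
for _ in range(n_words):
    z = rng.choice(len(theta), p=theta)                # pick a topic for this word
    w = rng.choice(len(vocabulary), p=topic_word[z])   # pick a word from that topic
    print(vocabulary[w], end=' ')
print()

Inference (e.g. Gibbs sampling) runs this process in reverse: given only the observed words, it estimates the per-document mixtures and the per-topic word distributions.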
• 83. He was face to face with an infinite possibility of latent evil…

import textacy

corpus = textacy.Corpus('en', docs=documents)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma') for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')

model = textacy.tm.TopicModel('lda', n_topics=60)
model.fit(tfidf)

for topic_idx, top_terms in model.top_topic_terms(idx, top_n=10):
    print('Topic #%s: %s' % (topic_idx, '\t'.join(top_terms)))

topics = model.transform(tfidf)
for doc_idx, top_topics in model.top_doc_topics(topics, top_n=3, weights=True):
    print('%s: %s' % (corpus.docs[doc_idx].metadata['title'],
                      '\t'.join(['Topic #%s (%.2f)' % (t[0], 100 * t[1]) for t in top_topics])))

model.termite_plot(tfidf, idx)
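As a usage note: the closing termite_plot call renders the term–topic matrix as a grid in which circle size reflects a term's weight within each topic, which makes it easy to spot at a glance which terms characterise which topics.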
• 84. He was face to face with an infinite possibility of latent evil…
Topic #0: lestrade london woman window lady miss street inspector hour sherlock
Topic #6: jones wilson hopkins inspector sholto trevor league office birmingham pinner
Topic #9: gregson mycroft mcmurdo warren garcia douglas barker susan inspector greek
Topic #10: moor mortimer henry duke grace american charles bicycle hopkins wilder
Topic #11: mcmurdo douglas susan barker robert steve barney jones smith sholto
Topic #12: robert ferguson smith trevor woodley carruthers jones mason sholto gregson
...
The Sign of the Four: Topic #0 (46.77)  Topic #12 (25.02)  Topic #6 (23.45)
A Study in Scarlet: Topic #0 (53.95)  Topic #52 (35.67)  Topic #51 (33.71)
The Hound of the Baskervilles: Topic #10 (50.89)  Topic #0 (44.51)  Topic #54 (38.52)
The Valley of Fear: Topic #11 (49.42)  Topic #9 (28.17)  Topic #0 (27.12)
...
  • 85. He was face to face with an infinite possibility of latent evil…
  • 86. “You are very welcome to put any questions that you like to me now, and there is no danger that I will refuse to answer them.”