A Study in (P)rose
1. A STUDY IN (P)ROSE
NLP Applied to Sherlock Holmes Stories
Stefano Bragaglia
2. The shadow was seated in a chair,
black outline upon the luminous
screen of the window.
• Corpora
• Basic Statistics
• Content & Word Frequency
• Readability
• Characters & Centrality
• Automatic Summarisation
• Word Vectors & Clustering
• Sentiment & Subjectivity
• Latent Topics
221B Baker Street
3. “I only require a few missing links to have
an entirely connected case.”
• http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
• http://brandonrose.org/clustering
• https://theinvisibleevent.wordpress.com/2015/11/08/35-the-language-of-sherlock-holmes-a-study-in-consistency/
• http://www.christianpeccei.com/holmes/
• https://github.com/sgsinclair/alta/blob/master/ipynb/Python.ipynb
• http://data-mining.philippe-fournier-viger.com/tutorial-how-to-discover-hidden-patterns-in-text-documents/
• http://sujitpal.blogspot.co.uk/2015/07/discovering-entity-relationships-in.html
• All the pictures are copyright of the respective authors.
4. “I had an idea that he might, and I took the liberty
of bringing the tools with me.”
• matplotlib – http://matplotlib.org
• newspaper3k – https://github.com/codelucas/newspaper
• python-igraph – http://igraph.org/python/#pyinstallosx
• pyclustering – https://github.com/annoviko/pyclustering
• spaCy – https://spacy.io
• sumy – https://github.com/miso-belica/sumy
• textaCy – https://textacy.readthedocs.io/en/latest/index.html
• textblob – https://textblob.readthedocs.io/en/dev/
• word_cloud – https://github.com/amueller/word_cloud
5. CORPORA
“I have some documents here,” said my friend Sherlock
Holmes, as we sat one winter's night on either side of the
fire, “which I really think, Watson, that it would be
worth your while to glance over.
6. “I seem to have heard some queer stories about him.”
• In linguistics, a corpus (plural corpora) or text corpus is a large and
structured set of texts (nowadays usually electronically stored and
processed).
• The texts may be in a single language (monolingual corpus) or in
multiple languages (multilingual corpus). If formatted for side-by-side
comparison, they are called aligned parallel corpora (translation
corpus for translations, else comparable corpus).
• They are often annotated to make them more useful, e.g. POS-tagging:
information about each word’s part of speech is added as a tag. If they
contain further structured levels of analysis, they are called Treebanks
or Parsed Corpora.
9. “I seem to have heard some
queer stories about him.”
The complete Sherlock Holmes Canon:
• 60 adventures in 9 books:
• 4 novels
• 56 short stories in 5 collections
• Freely available in several formats:
• https://sherlock-holm.es/
10. “I seem to have heard some queer stories about him.”
The Novels
- STUD  A Study in Scarlet  1887-10
- SIGN  The Sign of the Four  1890-02
- HOUN  The Hound of the Baskervilles  1901-08
- VALL  The Valley of Fear  1914-09
The Adventures of Sherlock Holmes
- SCAN  A Scandal in Bohemia  1891-07
- REDH  The Red-Headed League  1891-08
- IDEN  A Case of Identity  1891-09
- BOSC  The Boscombe Valley Mystery  1891-10
- FIVE  The Five Orange Pips  1891-11
- TWIS  The Man with the Twisted Lip  1891-12
- BLUE  The Adventure of the Blue Carbuncle  1892-01
- SPEC  The Adventure of the Speckled Band  1892-02
- ENGR  The Adventure of the Engineer’s Thumb  1892-03
- NOBL  The Adventure of the Noble Bachelor  1892-04
- BERY  The Adventure of the Beryl Coronet  1892-05
- COPP  The Adventure of the Copper Beeches  1892-06
The Memoirs of Sherlock Holmes
- SILV  Silver Blaze  1892-12
- YELL  Yellow Face  1893-02
- STOC  The Stockbroker’s Clerk  1893-03
- GLOR  The “Gloria Scott”  1893-04
- MUSG  The Musgrave Ritual  1893-05
- REIG  The Reigate Puzzle  1893-06
- CROO  The Crooked Man  1893-07
- RESI  The Resident Patient  1893-08
- GREE  The Greek Interpreter  1893-09
- NAVA  The Naval Treaty  1893-10
- FINA  The Final Problem  1893-12
11. “I seem to have heard some queer stories about him.”
The Return of Sherlock Holmes
- EMPT  The Adventure of the Empty House  1903-09
- NORW  The Adventure of the Norwood Builder  1903-10
- DANC  The Adventure of the Dancing Men  1903-12
- SOLI  The Adventure of the Solitary Cyclist  1903-12
- PRIO  The Adventure of the Priory School  1904-01
- BLAC  The Adventure of Black Peter  1904-02
- CHAS  The Adventure of Charles Augustus Milverton  1904-03
- SIXN  The Adventure of the Six Napoleons  1904-04
- 3STU  The Adventure of the Three Students  1904-06
- GOLD  The Adventure of the Golden Pince-Nez  1904-07
- MISS  The Adventure of the Missing Three-Quarter  1904-08
- ABBE  The Adventure of the Abbey Grange  1904-09
- SECO  The Adventure of the Second Stain  1904-12
His Last Bow
- WIST  The Adventure of Wisteria Lodge  1908-08
- CARD  The Adventure of the Cardboard Box  1893-01
- REDC  The Adventure of the Red Circle  1911-03
- BRUC  The Adventure of the Bruce-Partington Plans  1908-12
- DYIN  The Adventure of the Dying Detective  1913-11
- LADY  The Disappearance of Lady Frances Carfax  1911-12
- DEVI  The Adventure of the Devil’s Foot  1910-12
- LAST  His Last Bow  1917-09
The Case-Book of Sherlock Holmes
- ILLU  The Illustrious Client  1924-11
- BLAN  The Blanched Soldier  1926-10
- MAZA  The Adventure of the Mazarin Stone  1921-10
- 3GAB  The Adventure of the Three Gables  1926-09
- SUSS  The Adventure of the Sussex Vampire  1924-01
- 3GAR  The Adventure of the Three Garridebs  1924-10
- THOR  The Problem of Thor Bridge  1922-02
- CREE  The Adventure of the Creeping Man  1923-03
- LION  The Adventure of the Lion’s Mane  1926-11
- VEIL  The Adventure of the Veiled Lodger  1927-01
- SHOS  Adventure of Shoscombe Old Place  1927-03
- RETI  The Adventure of the Retired Colourman  1926-12
12. BASIC STATISTICS
“You can, for example, never foretell what any one man
will do, but you can say with precision what an average
number will be up to. Individuals vary, but percentages
remain constant. So says the statistician.”
13. Now, I am counting upon joining it
here…
for document in corpus:
    statistics = {'words': 0, 'sentences': 0, 'characters': 0}
    for sentence in document:
        statistics['sentences'] += 1
        for word in sentence:
            if word is not punctuation:
                statistics['words'] += 1
                statistics['characters'] += len(word)
    statistics['length'] = statistics['characters'] / statistics['words']
    statistics['size'] = statistics['words'] / statistics['sentences']
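The loop above is pseudocode (`word is not punctuation` stands in for a real filter). A runnable standard-library sketch of the same counts, with naive sentence and word tokenisation, might look like:

```python
import re

def basic_stats(text):
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    stats = {'words': 0, 'sentences': len(sentences), 'characters': 0}
    for sentence in sentences:
        # Alphanumeric tokens only, so bare punctuation is not counted as words.
        for word in re.findall(r"[\w']+", sentence):
            stats['words'] += 1
            stats['characters'] += len(word)
    stats['length'] = stats['characters'] / stats['words']   # avg word length
    stats['size'] = stats['words'] / stats['sentences']      # avg sentence length
    return stats

stats = basic_stats("Elementary, my dear Watson. The game is afoot!")
# 2 sentences, 8 words, avg word length 4.5, avg sentence length 4.0
```

A real pipeline (spaCy/textacy, as used later) would tokenise far more carefully; this only illustrates the bookkeeping.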
14. Now, I am counting upon joining it
here…
meta = { 'stud.txt': {'author': 'Arthur Conan Doyle',
'collection': 'The Novels',
'title': 'A Study in Scarlet',
'code': 'STUD',
'pub_date': '1887-11'}, … }
docs = []
for filename in os.listdir(folder):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    doc = textacy.Doc(content, metadata=meta[filename])
    docs.append(doc)
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for document in corpus:
    print(textacy.text_stats.readability_stats(document))
print(textacy.text_stats.readability_stats(corpus))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
16. Now, I am counting upon joining it here…
A. C. Doyle
4 novels,
56 short stories
Total words 730,000
Unique words 20,000
Unique lemmas 15,000
Total sentences 39,000
Avg word length 3.88
Avg sentence length 18.68
17. Now, I am counting upon joining it here…
A. C. Doyle W. Shakespeare
4 novels,
56 short stories
38 plays,
154 sonnets
Total words 730,000 1,035,000
Unique words 20,000 27,000
Unique lemmas 15,000 22,000
Total sentences 39,000 93,000
Avg word length 3.88 4.41
Avg sentence length 18.68 11.08
18. Now, I am counting upon joining it here…
A. C. Doyle W. Shakespeare %
4 novels,
56 short stories
38 plays,
154 sonnets
Total words 730,000 1,035,000 -29.5
Unique words 20,000 27,000 -25.9
Unique lemmas 15,000 22,000 -31.8
Total sentences 39,000 93,000 -58.0
Avg word length 3.88 4.41 -12.0
Avg sentence length 18.68 11.08 +68.6
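The percentage column can be reproduced directly from the two preceding columns (Doyle's figure relative to Shakespeare's):

```python
def delta(doyle, shakespeare):
    # Percentage difference of Doyle's figure relative to Shakespeare's.
    return round((doyle - shakespeare) / shakespeare * 100, 1)

deltas = {
    'total words': delta(730_000, 1_035_000),      # -29.5
    'unique words': delta(20_000, 27_000),         # -25.9
    'unique lemmas': delta(15_000, 22_000),        # -31.8
    'total sentences': delta(39_000, 93_000),      # -58.1 (the slide rounds to -58.0)
    'avg word length': delta(3.88, 4.41),          # -12.0
    'avg sentence length': delta(18.68, 11.08),    # +68.6
}
```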
19. Now, I am counting upon joining it here…
• Prodigious vocabulary: only about a third fewer total words (a quarter
fewer unique words) than Shakespeare, despite the much shorter corpus
• English used to have far fewer words (Shakespeare had to “invent” many
of them, e.g. eyeball), whereas contemporary authors have many more
conventions to obey
• As a term of comparison, consider that modern English contains around
250,000 terms, many of them neologisms like:
robot, computer, internet, unleaded, twerking, …
20. CONTENT
& WORD FREQUENCY
“Because I made a blunder, my dear Watson—which is, I
am afraid, a more common occurrence than any one would
think who only knew me through your memoirs.
21. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word] = f.get(word, 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
22. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word] = f.get(word, 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“You broke the thread of my thoughts; but perhaps it is as well.”
“Perhaps that is why we are so subtly influenced by it.”
23. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lower()] = f.get(word.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
24. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lower()] = f.get(word.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“Funny, she didn't say good-bye.”
“Your correspondent says two friends.”
25. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
26. I frequently found my thoughts turning
in her direction and wondering…
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
Standing at the window, I watched her walking briskly down the street,
until the gray turban and white feather were but a speck in the sombre crowd.
27. The furniture and pictures
were of the most common
and vulgar description.
• Extremely common words have little or no
use when retrieving information from
documents.
• Such words are called stop-words and are
usually excluded completely from the
vocabulary.
• The general strategy for determining stop words is to sort terms by
frequency and pick the N most frequent (often hand-filtered for their
semantic content relative to the domain).
• It is possible to use other sources, such as:
• https://en.wikipedia.org/wiki/Most_common_words_in_English
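The top-N-by-frequency strategy above can be sketched with the standard library; the sample text and cutoff below are toy values:

```python
from collections import Counter

def stopword_candidates(text, n):
    # The n most frequent lowercase tokens: a first cut at a stop-word
    # list, to be hand-filtered for domain-relevant words afterwards.
    words = text.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]

candidates = stopword_candidates(
    "The dog saw the cat and the cat saw the dog run", 2)
# → ['the', 'dog']
```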
28. I frequently found my thoughts turning
in her direction and wondering…
stopwords = […]
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation and word not in stopwords:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
29. I frequently found my thoughts turning
in her direction and wondering…
stopwords = […]
for document in corpus:
    f = {}
    for sentence in document:
        for word in sentence:
            if word is not punctuation and word not in stopwords:
                f[word.lemma_.lower()] = f.get(word.lemma_.lower(), 0) + 1
    print([(w, f[w]) for w in sorted(f, key=f.get, reverse=True)][:10])
“A friend of Mr. Sherlock is always welcome!”
Sherlock Holmes rubbed his hands with delight.
It's all very well for you to laugh, Mr. Sherlock Holmes.
30. I frequently found my thoughts turning
in her direction and wondering…
• In linguistics, an n-gram is a contiguous sequence of n items (such
as phonemes, syllables, letters, words or base pairs) from a given
sequence of text or speech.
• An n-gram model is a type of probabilistic language model that predicts
the next item in such a sequence with an (n-1)-order Markov model
(e.g. predictive keyboards).
• They provide a measure of collocation frequency, so they may help identify:
• Syntagmatic Associations (e.g. cold + weather, Burkina + Faso, etc.)
• Paradigmatic Associations (e.g. synonyms, co-reference resolution, etc.)
A STUDY IN SCARLET
VI. Tobias Gregson Shows What He Can Do
Arthur Conan Doyle (1887-11)
"Look here, Mr. Sherlock Holmes," he said.
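Counting word n-grams, like the repeated "Look here, Mr. Sherlock Holmes" above, needs only a few lines; tokenisation below is a naive lowercase split, unlike the spaCy pipeline used elsewhere in these slides:

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token sequences, in document order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "look here mr sherlock holmes he said look here mr holmes".split()
bigrams = Counter(ngrams(tokens, 2))
# bigrams[('look', 'here')] == 2
```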
31. I frequently found my thoughts turning
in her direction and wondering…
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for doc in corpus:
    bot = doc.to_bag_of_terms(ngrams={1, 2, 3},
                              drop_determiners=True,
                              filter_stops=True,
                              filter_punct=True,
                              filter_nums=False,
                              as_strings=True)
    print({term: bot[term]
           for term in sorted(bot, key=bot.get, reverse=True)})
bot = corpus.to_bag_of_terms(ngrams={1, 2, 3},
                             drop_determiners=True,
                             filter_stops=True,
                             filter_punct=True,
                             filter_nums=False,
                             as_strings=True)
print({term: bot[term]
       for term in sorted(bot, key=bot.get, reverse=True)})
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Occurrences:
- holmes (54)
- mr. holmes (18)
- masser holmes (15)
- susan (15)
- one (14)
- say holmes (13)
- maberley (10)
- watson (9)
- mrs. maberley (8)
- steve (6)
- first (6)
- be not (6)
- douglas (5)
- london (5)
...
32. “It is the brightest rift which
I can at present see in the clouds.”
stopwords = […]
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for doc in corpus:
    wordcloud = WordCloud(max_words=1000, margin=0,
                          random_state=1).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.figure()
    wordcloud = WordCloud(max_words=1000, margin=0, random_state=1,
                          stopwords=stopwords).generate(doc.text)
    matplotlib.pyplot.imshow(wordcloud, interpolation='bilinear')
    matplotlib.pyplot.axis('off')
    matplotlib.pyplot.show()
33. “It is the brightest rift which I can at present see in the clouds.”
34. I frequently found my thoughts turning
in her direction and wondering…
• Holmes never ranks lower than 4th among the most frequent words; it
drops to 4th only in The Hound of the Baskervilles, in which he is
rarely on stage
• man (and synonyms) is much more frequent than woman:
Victorian misogyny?
• say is definitely Doyle’s favourite speech-attribution verb
• The language is very concrete (say, see, come, know, go and think), with
almost no room for emotions (cry): the scientific approach vs. spiritualism
• Only little and time (more subjective words) make it into the top 15
• The word the alone accounts for 6% of the whole corpus
35. READABILITY
My companion gave a sudden chuckle of comprehension.
“And not a very obscure cipher, Watson,” said he. “Why,
of course, it is Italian! The A means that it is addressed to
a woman. ‘Beware! Beware! Beware!’ How's that,
Watson?
36. Now, I am counting upon joining it here…
meta = { 'stud.txt': {'author': 'Arthur Conan Doyle',
'collection': 'The Novels',
'title': 'A Study in Scarlet',
'code': 'STUD',
'pub_date': '1887-11'}, … }
docs = []
for filename in os.listdir(folder):
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()
    doc = textacy.Doc(content, metadata=meta[filename])
    docs.append(doc)
corpus = textacy.Corpus('en', docs=docs, metadatas=meta)
for document in corpus:
    print(textacy.text_stats.readability_stats(document))
print(textacy.text_stats.readability_stats(corpus))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
Statistics:
- Characters: 24,704
- Syllables: 7,704
- Words: 6,188
- Unique words: 1,421
- Polysyllable words: 284
- Sentences: 460
- Avg characters per word: 3.99
- Avg words per sentence: 13.45
Indexes:
- Automated Readability: 4.10
- Coleman-Liau: 5.47
- Flesch-Kincaid: 4.35
- Flesch Ease Readability: 87.85
- Gunning-Fog: 7.22
- SMOG: 7.62
37. Now, I am counting
upon joining it here…
• Readability is the ease with which a reader
can understand a written text.
• The readability depends on content (the
complexity of its vocabulary and syntax) and
presentation (typographic aspects such as font
size, line height, and line length).
• Researchers have proposed several formulas
that estimate the readability of a text from
features like the average sentence length in
words (ASL), the average number of syllables
per word (ASW), etc.
• For instance:
FLESCH = 206.835 − (1.015 × ASL) − (84.6 × ASW)
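As a check, plugging The Three Gables' counts (6,188 words, 460 sentences, 7,704 syllables, from the statistics slide) into this formula reproduces its Flesch score of ~87.85:

```python
def flesch_reading_ease(words, sentences, syllables):
    asl = words / sentences    # average sentence length, in words
    asw = syllables / words    # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

# Counts for The Three Gables from the statistics slide.
score = flesch_reading_ease(words=6188, sentences=460, syllables=7704)
# → 87.85 (to two decimals)
```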
38. Now, I am counting upon joining it here…
• These stories are still very popular: a familiar vocabulary, easy to
read, with ideas easy to grasp
• The series ran for over 40 years (not continuously), yet Doyle
maintained the same focus on basic language
• The density of new words in this corpus is 8-11%, which is considered
ideal for an 8-year-old reader (3rd grade)
• Excluding the first 2 novels (shorter and less prone to repetition),
the other 7 books fall squarely in that interval
39. CHARACTERS & CENTRALITY
We tied Toby to the hall table, and reascended the stairs.
The room was as we had left it, save that a sheet had been
draped over the central figure. A weary-looking police-
sergeant reclined in the corner.
40. nlp = spacy.load('en')
for doc in corpus:
    names = []
    tuples = []
    for par in re.split(r'(\r?\n){2}', doc.text):
        parsed = nlp(par)
        entities = []
        for ent in parsed.ents:
            if ent.label_ in ('PERSON', 'LOC', 'GPE'):
                name = re.sub('[^0-9a-zA-Z]+', ' ', ent.text)
                if name not in names:
                    names.append(name)
                for entity in entities:
                    tuples.append((entity, name))
                entities.append(name)
“Of course, we do not yet know
what the relations may have been…”
41. “Of course, we do not yet know
what the relations may have been…”
ig = igraph.Graph.TupleList(tuples)
vector = ig.eigenvector_centrality()
colors = []
label_colors = []
for value in vector:
    color = colorsys.hsv_to_rgb(2.0 * (1.0 - value) / 3.0, 1.0, 1.0)
    label_colors.append('gray' if value < 0.5 else 'black')
    colors.append('#%02x%02x%02x' % (int(color[0] * 255),
                                     int(color[1] * 255),
                                     int(color[2] * 255)))
ig.vs['label'] = names
ig.vs['color'] = colors
ig.vs['label_color'] = label_colors
layout = ig.layout('kk')
ig.write_svg('%s.svg' % doc.metadata['code'], margin=50, layout=layout,
             border=50, width=1280, height=800)
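ig.eigenvector_centrality() can be approximated by hand: power iteration over the adjacency structure, normalised each step so the most central vertex scores 1.0, as igraph does. A standard-library sketch with a toy character graph:

```python
def eigenvector_centrality(edges, iterations=100):
    # Power iteration on an undirected graph given as (u, v) pairs,
    # normalised each step so the most central vertex scores 1.0.
    nodes = sorted({v for edge in edges for v in edge})
    neighbours = {v: [] for v in nodes}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new = {v: sum(score[w] for w in neighbours[v]) for v in nodes}
        norm = max(new.values())
        score = {v: s / norm for v, s in new.items()}
    return score

# A hub connected to everyone is maximally central.
centrality = eigenvector_centrality([
    ('Holmes', 'Watson'), ('Holmes', 'Lestrade'),
    ('Holmes', 'Adler'), ('Watson', 'Lestrade')])
# centrality['Holmes'] == 1.0
```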
42. “Of course, we do not yet know what the relations may have been…”
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
43. AUTOMATIC
SUMMARISATION
It was in the summer of '89, not long after my marriage,
that the events occurred which I am now about to
summarise.
44. He knitted his brows as though determined
not to omit anything in his narrative.
• Automatic Summarisation might either be Extraction-based or
Abstraction-based. Best results come when both are applied.
• TextRank and LexRank are graph-based algorithms where
sentences are vertices and edges model the similarity between
them.
• While LexRank uses TF-IDF and cosine similarity, TextRank uses
PageRank (a word appearing in two sentences is like a link
between them) to measure the similarity between sentences.
• Roughly speaking, a sentence containing many keywords that also
appear in other sentences is a hub and receives a higher score.
• The sentences are sorted by this score: since the top N most
likely cover all the topics (keywords) in the document, they are
taken as the summary.
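A minimal TextRank-style scorer, assuming word overlap as the sentence-similarity measure and plain damped PageRank-style iteration; real implementations (such as sumy, used later) refine both:

```python
import math

def textrank(sentences, damping=0.85, iterations=50):
    # Sentences are the graph's vertices; edge weights are shared words,
    # normalised by sentence length (TextRank's similarity measure).
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = len(words[i] & words[j]) / (
                    math.log(len(words[i]) + 1) + math.log(len(words[j]) + 1))
    out = [sum(row) for row in sim]
    score = [1.0] * n
    for _ in range(iterations):
        # PageRank update: each sentence passes its score to its neighbours.
        score = [(1 - damping) + damping * sum(
                     sim[j][i] / out[j] * score[j]
                     for j in range(n) if out[j] > 0)
                 for i in range(n)]
    return score

sentences = ["sherlock holmes solved the case",
             "watson admired holmes",
             "the case was closed"]
scores = textrank(sentences)
# The first sentence overlaps with both others, so it is the hub.
```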
49. He knitted his brows as though determined
not to omit anything in his narrative.
# Imports as in sumy; LexRank is one of several available summarizers
# (TextRankSummarizer, LsaSummarizer, … are used the same way).
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.utils import get_stop_words

path = '…'
language = 'english'
tokenizer = Tokenizer(language)
parser = PlaintextParser.from_file(path, tokenizer)
stemmer = Stemmer(language)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(language)
summary = summarizer(parser.document, 10)
for sentence in summary:
    print(sentence)
50. He knitted his brows as though determined
not to omit anything in his narrative.
THE DYING DETECTIVE
His Last Bow
Arthur Conan Doyle (1913-11)
Not only was her first-floor flat invaded at all hours by throngs of singular and often undesirable characters but her remarkable lodger
showed an eccentricity and irregularity in his life which must have sorely tried her patience.
His incredible untidiness, his addiction to music at strange hours, his occasional revolver practice within doors, his weird and often
malodorous scientific experiments, and the atmosphere of violence and danger which hung around him made him the very worst tenant in London.
Knowing how genuine was her regard for him, I listened earnestly to her story when she came to my rooms in the second year of my married life
and told me of the sad condition to which my poor friend was reduced.
In the dim light of a foggy November day the sick room was a gloomy spot, but it was that gaunt, wasted face staring at me from the bed which
sent a chill to my heart.
His eyes had the brightness of fever, there was a hectic flush upon either cheek, and dark crusts clung to his lips; the thin hands upon the
coverlet twitched incessantly, his voice was croaking and spasmodic.
Then, unable to settle down to reading, I walked slowly round the room, examining the pictures of celebrated criminals with which every wall
was adorned.
I saw a great yellow face, coarse-grained and greasy, with heavy, double-chin, and two sullen, menacing gray eyes which glared at me from
under tufted and sandy brows.
The skull was of enormous capacity, and yet as I looked down I saw to my amazement that the figure of the man was small and frail, twisted in
the shoulders and back like one who has suffered from rickets in his childhood.
Then in an instant his sudden access of strength departed, and his masterful, purposeful talk droned away into the low, vague murmurings of a
semi-delirious man.
You will realize that among your many talents dissimulation finds no place, and that if you had shared my secret you would never have been
able to impress Smith with the urgent necessity of his presence, which was the vital point of the whole scheme.
51. WORD VECTORS &
CLUSTERING
Our coming was evidently a great event, for station-master
and porters clustered round us to carry out our luggage.
52. For every step increased the distance between them…
• The term frequency–inverse document frequency (TF–IDF) is a numerical statistic that reflects
how important a word is to a document in a corpus.
• It is often used as a weighting factor in information retrieval, text mining and user modelling.
• It is the product of two terms:
• the term frequency captures the importance of a term for a document,
• the inverse document frequency measures the specificity of a term for a document in a corpus.
• There are various ways of computing these values; the simplest uses:
• the raw frequency f(t,d) for TF,
• the logarithm of the ratio between N = |D| and the number n(t) = |{d∈D: t∈d}| of documents containing the term t for IDF.
• In combination with cosine similarity, a measure of similarity between two non-zero vectors
given by the cosine of the angle between them, it provides a crude measure of the distance
between documents:
• similarity = cos(θ) = (A · B) / (||A|| ||B||)
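Both formulas fit in a few lines of standard-library Python; this sketch uses the raw term frequency and the log(N/n_t) IDF described above, on toy documents:

```python
import math

def tfidf_vectors(docs):
    # Raw term frequency times log(N / n_t), over the shared vocabulary.
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    idf = {t: math.log(n / sum(1 for doc in docs if t in doc)) for t in vocab}
    return [[doc.count(t) * idf[t] for t in vocab] for doc in docs]

def cosine(a, b):
    # similarity = cos(theta) = (A . B) / (||A|| ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = [['holmes', 'case', 'holmes'], ['watson', 'case'], ['holmes', 'watson']]
vecs = tfidf_vectors(docs)
```

textacy's `math_utils.cosine_similarity`, used two slides below, computes the same quantity over much larger vocabularies.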
57. For every step increased the distance between them…
idf = corpus.word_doc_freqs(weighting='idf')
tfs = {doc.metadata['code']: doc.to_bag_of_words(weighting='freq')
       for doc in corpus.docs}
tfidfs = {code: [] for code in tfs}
for key in sorted(idf.keys()):
    for code in tfidfs:
        if key in tfs[code]:
            tfidfs[code].append(tfs[code][key] * idf[key])
        else:
            tfidfs[code].append(0.0)
for i, k_i in enumerate(tfidfs.keys()):
    for j, k_j in enumerate(tfidfs.keys()):
        v = textacy.math_utils.cosine_similarity(tfidfs[k_i], tfidfs[k_j])
        print('%s vs. %s : %.3f' %
              (METADATA[k_i]['title'], METADATA[k_j]['title'], v))
Lady Frances Carfax vs. His Last Bow : 0.905
The Greek Interpreter vs. Lady Frances Carfax : 0.938
The Greek Interpreter vs. The Bruce-Partington Plans : 0.957
The Bruce-Partington Plans vs. The Greek Interpreter : 0.957
The Greek Interpreter vs. The Greek Interpreter : 1.000
58. For every step increased the distance between them…
corpus = textacy.Corpus('en', docs=documents)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')
sample = tfidf.toarray()
sample_pca = mlab.PCA(sample)
sample_cutoff = sample_pca.fracs[1]
sample_2d = sample_pca.project(sample, minfrac=sample_cutoff)
instance = optics(sample, 0.8125, 2)
instance.process()
clusters = instance.get_clusters()
noise = instance.get_noise()
visualizer = cluster_visualizer()
visualizer.append_cluster(noise, sample_2d, marker='x')
visualizer.append_clusters(clusters, sample_2d)
visualizer.show()
59. For every step increased the distance between them…
• A word embedding (GloVe, word2vec) is a family of related models that map the words in a corpus to vectors.
• These models are simple two-layer neural networks that are trained to reconstruct the linguistic context of words.
• They take large corpora as input and produce a vector space of several hundred dimensions as output.
• In such vector spaces, each word is assigned a precise position, to which a vector corresponds, so that special spatial properties are maintained.
• QUEEN = KING – MAN + WOMAN
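The famous KING – MAN + WOMAN analogy can be sketched with a few lines of NumPy. The four-word "vocabulary" and its three-dimensional vectors below are hand-made for illustration only; a real model (word2vec, GloVe, or spaCy's vectors) learns hundreds of dimensions from a large corpus.

```python
import numpy as np

# Toy vectors: dimensions loosely read as (royalty, maleness, femaleness).
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'moor':  np.array([0.0, 0.5, 0.5]),   # unrelated distractor word
}

def nearest(target, exclude):
    """Vocabulary word whose vector is closest (by cosine) to target."""
    def cos(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

analogy = vectors['king'] - vectors['man'] + vectors['woman']
print(nearest(analogy, exclude={'king', 'man', 'woman'}))
```

As in the real evaluation protocol, the three query words are excluded from the candidates, and the nearest remaining vector is QUEEN.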
69. SENTIMENT &
SUBJECTIVITY
I felt of all Holmes's criminals this was the one whom
he would find it hardest to face.
However, he was immune from sentiment.
70. When this deduction is confirmed point by point, then the subjective becomes objective.
• Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing to systematically identify affective states and subjective information in a text.
• Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic.
• Alternatively, sentiment analysis aims at identifying the overall polarity and subjectivity of, or emotional reaction to, a document.
• More sophisticated approaches are able to distinguish among a wider selection of emotional states.
74. When this deduction is confirmed point by point, then the subjective becomes objective.
for document in corpus:
    blob = TextBlob(document.text)
    for i, sentence in enumerate(blob.sentences):
        print('%s)\tpol: %.3f, sub: %.3f' %
              (i, sentence.sentiment.polarity,
               sentence.sentiment.subjectivity))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
0) pol: -0.125, sub: 1.000
1) pol: 0.136, sub: 0.455
2) pol: -0.052, sub: 0.196
3) pol: -0.625, sub: 1.000
4) pol: 0.200, sub: 0.700
5) pol: 0.127, sub: 0.833
6) pol: -0.071, sub: 0.362
7) pol: 0.000, sub: 0.000
8) pol: 0.000, sub: 0.000
9) pol: 0.300, sub: 0.100
10) pol: 0.000, sub: 0.000
11) pol: 0.000, sub: 0.000
12) pol: -0.425, sub: 0.675
13) pol: -0.125, sub: 0.375
14) pol: 0.600, sub: 1.000
15) pol: 0.000, sub: 0.000
16) pol: 0.000, sub: 0.000
17) pol: 0.417, sub: 0.500
18) pol: 0.000, sub: 0.000
19) pol: 0.417, sub: 0.500
20) pol: 0.000, sub: 0.000
...
76. When this deduction is confirmed point by point, then the subjective becomes objective.
for doc in corpus:
    for i, sent in enumerate(doc.sents):
        scores = textacy.lexicon_methods.emotional_valence(sent)
        values = ['%s: %.3f' % (k, scores[k]) for k in sorted(scores.keys())]
        print('%s)\t%s' % (i, '\n\t'.join(values)))
THE THREE GABLES
The Case-Book of Sherlock Holmes
Arthur Conan Doyle (1926-09)
I don't think that any of my adventures with Mr.
Sherlock Holmes opened quite so abruptly, or so
dramatically, as that which I associate with The
Three Gables. I had not seen Holmes for some
days and had no idea of the new channel into
which his activities had been directed. He was
in a chatty mood that morning, however, and had
just settled me into the well-worn low armchair
on one side of the fire, while he had curled
down with his pipe in his mouth upon the
opposite chair, when our visitor arrived. If I
had said that a mad bull had arrived it would
give a clearer impression of what occurred.
77. LATENT TOPICS
“I have known him for some time,” said I,
“but I never knew him do anything yet without
a very good reason,” and with that our conversation
drifted off on to other topics.
78. He was face to face with an infinite possibility of latent evil…
• Latent Dirichlet Allocation (LDA) is a generative model that automatically discovers the topics that a collection of documents contains.
• It represents documents as mixtures of topics from which words are drawn with certain probabilities.
• It assumes that each document
- has a number N of words (according to a Poisson distribution),
- has a topic mixture over a fixed set of K topics (according to a Dirichlet distribution).
• Then, for each word in each document:
- a topic is picked randomly (according to the mixture sampled above),
- the word itself is generated randomly (according to that topic's word distribution).
• Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find the set of topics that most likely generated the collection (e.g. via Gibbs sampling).
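The generative story above can be simulated directly. In this sketch the vocabulary and the two topic-word distributions are invented for illustration; real LDA works the other way round, inferring such distributions from observed documents.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['hound', 'moor', 'letter', 'train', 'violin', 'pipe']
# Two hand-made topics: each row is a word distribution over the vocabulary.
topics = np.array([
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],   # topic 0: out on the moor
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],   # topic 1: back in Baker Street
])

def generate_document(avg_len=12, alpha=(0.5, 0.5)):
    n_words = rng.poisson(avg_len)            # document length ~ Poisson
    mixture = rng.dirichlet(alpha)            # topic mixture ~ Dirichlet
    words = []
    for _ in range(n_words):
        # Pick a topic from the document's mixture, then a word from that topic.
        topic = rng.choice(len(topics), p=mixture)
        words.append(rng.choice(vocab, p=topics[topic]))
    return words

print(generate_document())
```

Running the function several times shows documents leaning towards one topic or the other, depending on the sampled mixture; LDA's inference step recovers `topics` and the per-document mixtures from such output alone.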
83. He was face to face with an infinite possibility of latent evil…
corpus = textacy.Corpus('en', docs=documents)
terms = (doc.to_terms_list(ngrams={1}, normalize='lemma')
         for doc in corpus)
tfidf, idx = textacy.vsm.doc_term_matrix(terms, weighting='tfidf')
model = textacy.tm.TopicModel('lda', n_topics=60)
model.fit(tfidf)
for topic_idx, top_terms in model.top_topic_terms(idx, top_n=5):
    print('Topic #%s: %s' % (topic_idx, '\t'.join(top_terms)))
topics = model.transform(tfidf)
for doc_idx, top_topics in model.top_doc_topics(topics):
    print('%s: %s' % (corpus.docs[doc_idx].metadata['title'],
                      '\t'.join(['Topic #%s (%.2f)' % (t[0], 100 * t[1])
                                 for t in top_topics])))
model.termite_plot(tfidf, idx)
84. He was face to face with an infinite possibility of latent evil…
Topic #0: lestrade london woman window lady miss street inspector hour sherlock
Topic #6: jones wilson hopkins inspector sholto trevor league office birmingham pinner
Topic #9: gregson mycroft mcmurdo warren garcia douglas barker susan inspector greek
Topic #10: moor mortimer henry duke grace american charles bicycle hopkins wilder
Topic #11: mcmurdo douglas susan barker robert steve barney jones smith sholto
Topic #12: robert ferguson smith trevor woodley carruthers jones mason sholto gregson
...
The Sign of the Four: Topic #0 (46.77) Topic #12 (25.02) Topic #6 (23.45)
A Study in Scarlet: Topic #0 (53.95) Topic #52 (35.67) Topic #51 (33.71)
The Hound of the Baskervilles: Topic #10 (50.89) Topic #0 (44.51) Topic #54 (38.52)
The Valley of Fear: Topic #11 (49.42) Topic #9 (28.17) Topic #0 (27.12)
...
86. “You are very welcome to put any
questions that you like to me now,
and there is no danger that I will
refuse to answer them.”