Unsupervised document classification addresses the problem of assigning categories to documents without a training set or predefined categories. It is useful for enhancing information retrieval, under the basic assumption that documents with similar content are relevant to the same queries. A similar assumption is made in literary studies to define genres and sub-genres, where works that share specific conventions of form and content are described by the same genre.
The talk gives an overview of document clustering and its challenges, with a focus on high dimensionality and how to address it with topic-modelling techniques such as LDA (Latent Dirichlet Allocation). Using Shakespeare’s body of work as a case study, the talk describes how to use nltk, sklearn and gensim to process and analyse theatrical works, with the final goal of testing whether document clustering yields the same classification given by literature experts.
Deck as presented at PyData Amsterdam 2016
Data driven literary analysis: an unsupervised approach to text analysis and classification
1. DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO TEXT ANALYSIS AND CLASSIFICATION
Serena Peruzzo
PhD candidate at TU/e
@sereprz
s.peruzzo@tue.nl
github.com/sereprz
2. WHY AND WHAT?
➤ Natural Language Processing (NLP)
➤ interaction between natural and artificial languages
➤ e.g., machine translation, spam filters
CAN NLP IDENTIFY DIFFERENT GENRES?
6. FEATURE EXTRACTION
➤ a lot of information needs to be compressed and represented in simple data types
tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 × ln(28/25) ≈ 11.33
tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 × ln(28/1) ≈ 333.22
term frequency: how often the term occurs in the document (here, 100 occurrences)
inverse document frequency: ln of the number of plays (28) over the number of plays containing the term (25 for ‘love’, 1 for ‘Juliet’)
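The arithmetic on this slide can be checked with a small helper. This is a minimal sketch assuming raw term counts for tf and a natural-log idf, as the ln in the formulas suggests; the inputs (100 occurrences, 28 plays, 25 and 1 matching plays) are the slide's own figures.

```python
import math

def tfidf(term_count, n_docs, n_docs_with_term):
    """tf-idf = term frequency * ln(total documents / documents containing the term)."""
    return term_count * math.log(n_docs / n_docs_with_term)

# 'love' occurs 100 times in Romeo and Juliet; 25 of the 28 plays contain it
print(round(tfidf(100, 28, 25), 2))  # 11.33
# 'Juliet' also occurs 100 times, but in only 1 of the 28 plays
print(round(tfidf(100, 28, 1), 2))   # 333.22
```

A frequent but corpus-wide word like ‘love’ is thus down-weighted, while a distinctive word like ‘Juliet’ dominates the play's feature vector.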
7. LATENT DIRICHLET ALLOCATION
➤ N documents
➤ K probability distributions over a collection of words (topics)
➤ a formal statistical (generative) relationship between documents, topics and words
➤ bag-of-words assumption
8. LDA - GENERATIVE MODEL
➤ For each document:
1. Select the number of words
2. Draw a distribution of topics
3. For each word in the document:
i. Draw a specific topic
ii. Draw a word from a multinomial probability conditioned on the topic
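The generative story above can be sketched with numpy. The topic-word probabilities and the Poisson document-length prior are hypothetical, chosen to echo the cute animals / food example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 2
vocab = ['broccoli', 'panda', 'baby', 'apple', 'eating']
# hypothetical topic-word distributions (each row sums to 1)
topic_word = np.array([
    [0.40, 0.00, 0.10, 0.20, 0.30],  # topic 0: food
    [0.05, 0.50, 0.40, 0.05, 0.00],  # topic 1: cute animals
])

def generate_document(alpha=1.0, mean_length=5):
    # 1. select the number of words
    n_words = rng.poisson(mean_length)
    # 2. draw the document's distribution over topics
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(n_words):
        # 3i. draw a topic for this word position
        z = rng.choice(n_topics, p=theta)
        # 3ii. draw a word from the chosen topic's word distribution
        words.append(rng.choice(vocab, p=topic_word[z]))
    return words

print(generate_document())
```

Fitting LDA runs this story in reverse: given only the documents, infer plausible topic-word distributions and per-document topic mixtures.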
9. LDA - EXAMPLE
➤ d is a 5-word document
➤ Decide d will be 1/2 about cute animals and 1/2 about food
➤ topic: food, word: ‘broccoli’
➤ topic: cute animals, word: ‘panda’
➤ topic: cute animals, word: ‘baby’
➤ topic: food, word: ‘apple’
➤ topic: food, word: ‘eating’
➤ d = {broccoli, panda, baby, apple, eating}
12. Comedy conventions:
➤ complex plot (twists)
➤ mistaken identities
➤ language (puns, creative insults)
➤ love
➤ happy ending
Tragedy conventions:
➤ noble hero with a tragic flaw that leads to a tragic fall
➤ supernatural element
➤ death
17. K-MEANS GROUPING VS TRADITIONAL CLASSIFICATION
Group 0: Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, The Taming of the Shrew, As You Like It, The Merry Wives of Windsor, A Midsummer Night’s Dream, Romeo and Juliet, The Comedy of Errors, The Two Gentlemen of Verona
Group 1: Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, The Tempest, The Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles, Prince of Tyre
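The grouping above can be reproduced in outline with sklearn: tf-idf features followed by k-means with k = 2. A minimal sketch with hypothetical stand-in texts; the real pipeline would feed in the full play texts after preprocessing with nltk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical stand-ins for the plays' full texts
plays = {
    'Romeo and Juliet': 'love marriage juliet romeo love feud kiss',
    'Comedy of Errors': 'love mistaken identity twins laughter wedding',
    'Macbeth': 'death king blood witch murder ghost',
    'Hamlet': 'death ghost revenge king murder grave',
}

# tf-idf features, then k-means with two clusters
X = TfidfVectorizer().fit_transform(plays.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for title, label in zip(plays, labels):
    print(f'Group {label}: {title}')
```

Comparing the resulting cluster labels with the traditional comedy/tragedy classification is the talk's final test of whether the unsupervised grouping matches the experts'.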