DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO
TEXT ANALYSIS AND CLASSIFICATION
Serena Peruzzo
PhD candidate at TU/e
@sereprz
s.peruzzo@tue.nl
github.com/sereprz
WHY AND WHAT?
➤ Natural Language Processing (NLP)
➤ interaction between computers and natural (human) languages
➤ e.g., machine translators, spam filters
CAN NLP IDENTIFY DIFFERENT GENRES?
SHAKESPEARE ANALYSIS
➤ 18 comedies
➤ 10 tragedies
➤ 11000+ words
➤ Two-stage unsupervised approach
➤ Trial and error
SUPERVISED DOCUMENT CLASSIFICATION
UNSUPERVISED APPROACH
FEATURE EXTRACTION
➤ a lot of information needs to be compressed and represented in simple data types
tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/25) = 11.33
tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/1) = 333.22
➤ 100 is the term frequency; ln(28/n) is the inverse document frequency, with 28 plays in total and n plays containing the word
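The two tf-idf values above can be checked with a few lines of Python. This is a minimal sketch, not the author's code: the function name `tfidf` and its argument names are chosen here for illustration, and the inputs (a term frequency of 100, 28 plays, 25 and 1 plays containing the word) are read off the slide.

```python
import math

def tfidf(term_frequency, n_documents, n_documents_with_term):
    """Term frequency times inverse document frequency (natural log)."""
    idf = math.log(n_documents / n_documents_with_term)
    return term_frequency * idf

# Inputs taken from the slide: 28 plays, term frequency 100,
# 'love' in 25 plays and 'Juliet' in 1 play.
love = tfidf(100, 28, 25)
juliet = tfidf(100, 28, 1)
print(round(love, 2), round(juliet, 2))  # 11.33 333.22
```

A word that appears in almost every play ('love') gets a low score, while a word tied to a single play ('Juliet') gets a high one.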
LATENT DIRICHLET ALLOCATION
➤ N documents
➤ K probability distributions over a collection of words (topics)
➤ Formal statistical relationship between documents, topics and words
➤ bag-of-words assumption
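The bag-of-words assumption means that only word counts matter, not word order. A minimal sketch of that idea, using the standard library only (the helper `bag_of_words` and the sample sentences are illustrative assumptions, and real pre-processing would do more than lowercase and split):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; word order is discarded.
    return Counter(text.lower().split())

a = bag_of_words("to be or not to be")
b = bag_of_words("be to not or be to")  # same words, different order
print(a == b)  # True
```

Under this assumption the two strings are the same document as far as the model is concerned.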
LDA - GENERATIVE MODEL
➤ For each document:
1. Select the number of words
2. Draw a distribution of topics
3. For each word in the document:
i. Draw a specific topic
ii. Draw a word from a multinomial distribution conditioned on the topic
LDA - EXAMPLE
➤ d is a 5-word document
➤ Decide d will be 1/2 about cute animals and 1/2 about food
➤ topic: food, word: ‘broccoli’
➤ topic: cute animals, word: ‘panda’
➤ topic: cute animals, word: ‘baby’
➤ topic: food, word: ‘apple’
➤ topic: food, word: ‘eating’
➤ d = {broccoli, panda, baby, apple, eating}
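The generative steps and the example above can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: the word probabilities inside each topic, the extra word 'kitten', the Dirichlet parameters and the fixed document length of 5 are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a set of independent Gamma draws, normalised to sum to 1.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topics, alpha, rng):
    topic_names = list(topics)
    # Step 2: draw the document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 3.i: draw a topic for this word position.
        topic = rng.choices(topic_names, weights=theta)[0]
        # Step 3.ii: draw a word from that topic's word distribution.
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append((topic, word))
    return theta, words

# Hypothetical topics: word probabilities are invented for illustration.
TOPICS = {
    "food": {"broccoli": 0.3, "apple": 0.3, "eating": 0.3, "baby": 0.1},
    "cute animals": {"panda": 0.4, "baby": 0.3, "kitten": 0.3},
}

rng = random.Random(0)
# Step 1: the number of words is fixed at 5, as in the example slide.
theta, doc = generate_document(5, TOPICS, alpha=[1.0, 1.0], rng=rng)
for topic, word in doc:
    print(f"topic: {topic}, word: {word}")
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.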
K-MEANS CLUSTERING
➤ Unsupervised
➤ K groups
➤ minimise variability within each cluster
➤ maximise variability between clusters
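The idea can be sketched with a small, dependency-free version of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function names and the six toy points are assumptions made for this illustration; they are not data from the talk.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    # Label each point with the index of its nearest centroid.
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    # Start from k distinct points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids

# Toy 2-D points forming two well-separated groups (illustrative only).
points = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
          [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other; which cluster is called 0 and which is called 1 depends on the random start.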
Comedies:
➤ Complex plot (twists)
➤ Mistaken identities
➤ Language (puns, creative insults)
➤ Love
➤ Happy ending
Tragedies:
➤ Noble hero with a tragic flaw that leads to a tragic fall
➤ Supernatural element
➤ Death
PRE-PROCESSING AND ANALYSIS
➤ Pre-processing: nltk
➤ Analysis: lda + scikit-learn
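A rough sketch of how the two analysis stages could be chained is shown below. It is not the code from the talk: it swaps the lda package for scikit-learn's own LatentDirichletAllocation, skips the nltk pre-processing, and runs on four invented toy documents instead of the plays. The choice of 4 topics and 2 groups follows the slides; every other setting is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented placeholder documents, standing in for the pre-processed plays.
docs = [
    "love love marriage wedding joy",
    "death blood revenge murder",
    "love wedding joy song",
    "murder death ghost revenge",
]

# Stage 0: bag-of-words counts (documents x vocabulary).
counts = CountVectorizer().fit_transform(docs)

# Stage 1: topic model; one row per document, one column per topic.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(counts)

# Stage 2: cluster the documents by their topic proportions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(doc_topics)

print(doc_topics.shape)  # (4, 4)
print(groups)
```

Clustering on topic proportions instead of raw word counts keeps the second stage low-dimensional, which is one plausible reason for the two-stage design.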
[Figure: topics by play (common words, death, love, hero)]
TOPIC AVERAGES WITHIN GROUPS
[Figure: averages of the death, common, love and hero topics within each group]
K-MEANS GROUPING VS TRADITIONAL CLASSIFICATION
Group 0: Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, Taming of the Shrew, As You Like It, Merry Wives of Windsor, Midsummer Night’s Dream, Romeo and Juliet, Comedy of Errors, Two Gentlemen of Verona
Group 1: Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, Tempest, Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles Prince of Tyre
YEARS THE PLAYS WERE PERFORMED FOR THE FIRST TIME
WRAP UP
➤ The grouping does not separate comedies from tragedies
➤ Can use NLP for literary analysis
➤ Let the data tell their story
code: github.com/sereprz/ShakespeareTextAnalysis
THANKS FOR LISTENING
QUESTIONS?
