Unsupervised document classification addresses the problem of assigning categories to documents without a training set or predefined categories. It is useful for enhancing information retrieval, under the basic assumption that documents with similar content are relevant to the same queries. A similar assumption is made in literary studies to define genres and sub-genres, where works that share specific conventions of form and content are described by the same genre.
The talk gives an overview of document clustering and its challenges, with a focus on high dimensionality and how to address it with topic-modelling techniques such as LDA (Latent Dirichlet Allocation). Using Shakespeare’s body of work as a case study, the talk describes how to use nltk, sklearn and gensim to process and analyse theatrical works, with the final goal of testing whether document clustering yields the same classification given by literature experts.
Deck as presented at PyData Amsterdam 2016
Data driven literary analysis: an unsupervised approach to text analysis and classification
1. DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO TEXT ANALYSIS AND CLASSIFICATION
Serena Peruzzo
PhD candidate at TU/e
@sereprz
s.peruzzo@tue.nl
github.com/sereprz
2. WHY AND WHAT?
➤ Natural Language Processing (NLP)
➤ interaction between natural and artificial languages
➤ e.g., machine translation, spam filters
CAN NLP IDENTIFY DIFFERENT GENRES?
6. FEATURE EXTRACTION
➤ a lot of information needs to be compressed and represented in simple data types
tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 × ln(28/25) ≈ 11.33
tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 × ln(28/1) ≈ 333.22
term frequency: how often the term occurs in the document (here, 100 occurrences)
inverse document frequency: ln of the number of plays (28) over the number of plays containing the term (25 for ‘love’, 1 for ‘Juliet’)
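The arithmetic on this slide can be checked with a small helper. This is a minimal sketch assuming raw term counts for tf and a natural-log idf, as the ln in the formulas suggests; the inputs (100 occurrences, 28 plays, 25 and 1 matching plays) are the slide's own figures.

```python
import math

def tfidf(term_count, n_docs, n_docs_with_term):
    """tf-idf = term frequency * ln(total documents / documents containing the term)."""
    return term_count * math.log(n_docs / n_docs_with_term)

# 'love' occurs 100 times in Romeo and Juliet; 25 of the 28 plays contain it
print(round(tfidf(100, 28, 25), 2))  # 11.33
# 'Juliet' also occurs 100 times, but in only 1 of the 28 plays
print(round(tfidf(100, 28, 1), 2))   # 333.22
```

A frequent but corpus-wide word like ‘love’ is thus down-weighted, while a distinctive word like ‘Juliet’ dominates the play's feature vector.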
7. LATENT DIRICHLET ALLOCATION
➤ N documents
➤ K probability distributions over a collection of words (topics)
➤ a formal statistical (generative) relationship between documents, topics and words
➤ bag-of-words assumption
8. LDA - GENERATIVE MODEL
➤ For each document:
1. Select the number of words
2. Draw a distribution of topics
3. For each word in the document:
i. Draw a specific topic
ii. Draw a word from a multinomial probability conditioned on the topic
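The generative story above can be sketched with numpy. The topic-word probabilities and the Poisson document-length prior are hypothetical, chosen to echo the cute animals / food example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 2
vocab = ['broccoli', 'panda', 'baby', 'apple', 'eating']
# hypothetical topic-word distributions (each row sums to 1)
topic_word = np.array([
    [0.40, 0.00, 0.10, 0.20, 0.30],  # topic 0: food
    [0.05, 0.50, 0.40, 0.05, 0.00],  # topic 1: cute animals
])

def generate_document(alpha=1.0, mean_length=5):
    # 1. select the number of words
    n_words = rng.poisson(mean_length)
    # 2. draw the document's distribution over topics
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(n_words):
        # 3i. draw a topic for this word position
        z = rng.choice(n_topics, p=theta)
        # 3ii. draw a word from the chosen topic's word distribution
        words.append(rng.choice(vocab, p=topic_word[z]))
    return words

print(generate_document())
```

Fitting LDA runs this story in reverse: given only the documents, infer plausible topic-word distributions and per-document topic mixtures.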
9. LDA - EXAMPLE
➤ d is a 5-word document
➤ Decide d will be 1/2 about cute animals and 1/2 about food
➤ topic: food, word: ‘broccoli’
➤ topic: cute animals, word: ‘panda’
➤ topic: cute animals, word: ‘baby’
➤ topic: food, word: ‘apple’
➤ topic: food, word: ‘eating’
➤ d = {broccoli, panda, baby, apple, eating}
12. Comedy conventions:
➤ complex plot (twists)
➤ mistaken identities
➤ language (puns, creative insults)
➤ love
➤ happy ending
Tragedy conventions:
➤ noble hero with a tragic flaw that leads to a tragic fall
➤ supernatural element
➤ death
17. K-MEANS GROUPING VS TRADITIONAL CLASSIFICATION
Group 0: Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, The Taming of the Shrew, As You Like It, The Merry Wives of Windsor, A Midsummer Night’s Dream, Romeo and Juliet, The Comedy of Errors, The Two Gentlemen of Verona
Group 1: Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, The Tempest, The Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles, Prince of Tyre
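The grouping above can be reproduced in outline with sklearn: tf-idf features followed by k-means with k = 2. A minimal sketch with hypothetical stand-in texts; the real pipeline would feed in the full play texts after preprocessing with nltk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical stand-ins for the plays' full texts
plays = {
    'Romeo and Juliet': 'love marriage juliet romeo love feud kiss',
    'Comedy of Errors': 'love mistaken identity twins laughter wedding',
    'Macbeth': 'death king blood witch murder ghost',
    'Hamlet': 'death ghost revenge king murder grave',
}

# tf-idf features, then k-means with two clusters
X = TfidfVectorizer().fit_transform(plays.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for title, label in zip(plays, labels):
    print(f'Group {label}: {title}')
```

Comparing the resulting cluster labels with the traditional comedy/tragedy classification is the talk's final test of whether the unsupervised grouping matches the experts'.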