2. Introduction
• To learn from collections of text documents like books,
newspapers, emails, etc.
Important Terms:
• Tokenization
• Tagging (Noun/Verb/…)
• Chunking (Noun Phrase)
• Stemming (-ing/-s/-ed)
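Two of the terms above, tokenization and stemming, can be illustrated with a minimal sketch. The sample sentence is made up for illustration, and splitting on whitespace is only one simple way to tokenize; tagging and chunking need additional tooling and are not shown.

```r
library(SnowballC)  # provides wordStem()

# Hypothetical sample sentence.
sentence <- "The miners were mining and mined data"

# Tokenization: lower-case the sentence and split it on whitespace.
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Stemming: reduce each token to its stem (removes suffixes such as -ing/-s/-ed).
stems <- wordStem(tokens, language = "english")

print(data.frame(token = tokens, stem = stems))
```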
3. Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of
transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for
plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
4. Corpus
• Collection of text
• Each corpus will have separate articles, stories, volumes,
each treated as a separate entity or record.
• Any file format can be converted to a text file for the corpus
E.g.:
• PDF to Text File
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk \"$f\"; done")
• Word Document to Text File
• system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")
6. Loading Corpus
• Loading Corpus
** DirSource() creates the source object, which is passed on to Corpus() to load the documents.
• In case of PDF Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
** xpdf application needs to be installed for readPDF()
• In case of Word Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC("-r -s")))
** -r requests that removed text be included in the output
** -s requests that text hidden by Word be included
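For plain text files no special reader is needed. The following is a minimal sketch of that case; the directory name and the two sample files are hypothetical and are created in a temporary folder only so that the example is self-contained.

```r
library(tm)

# Hypothetical corpus directory with two small text files.
cname <- file.path(tempdir(), "corpus_txt")
dir.create(cname, showWarnings = FALSE)
writeLines("Text mining learns from documents.", file.path(cname, "a.txt"))
writeLines("A corpus is a collection of text.",  file.path(cname, "b.txt"))

# DirSource() lists the files; Corpus() reads each file as one document.
docs <- Corpus(DirSource(cname))
length(docs)
```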
7. Exploration of Corpus
• inspect()
• Preparing the corpus
• Transformation types
• tm_map() is used to apply one of these transformations
• Other transformations can be implemented using R functions and wrapped
within content_transformer()
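A minimal sketch of the exploration step, assuming a tiny in-memory corpus built with VCorpus(VectorSource(...)) in place of a directory of files (the two sample strings are made up):

```r
library(tm)

# Hypothetical two-document corpus held in memory.
docs <- VCorpus(VectorSource(c("First document, with numbers 1 2 3.",
                               "Second document/with a slash.")))

inspect(docs[1])                 # summary of the first document
as.character(docs[[1]])          # raw text of the first document

# Names of the transformations that tm provides out of the box.
available <- getTransformations()
print(available)
```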
8. Transformation Example
• replace “/”, “@” and “|” with a space
• Alternate method
• Conversion to lower case
• Remove Numbers
• Remove Punctuation
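The transformations listed above might be written as follows. This is a sketch on a made-up one-document corpus; toSpace is a helper name chosen here for illustration, not a tm function.

```r
library(tm)

# Hypothetical sample text containing "/", "@", "|", digits and punctuation.
docs <- VCorpus(VectorSource("Mail me@example.com or visit a/b|c, 24 Hours a Day!"))

# Custom transformer: replace every match of `pattern` with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Alternate method: a single pattern covering all three characters.
# docs <- tm_map(docs, toSpace, "/|@|\\|")

docs <- tm_map(docs, content_transformer(tolower))  # conversion to lower case
docs <- tm_map(docs, removeNumbers)                 # remove numbers
docs <- tm_map(docs, removePunctuation)             # remove punctuation

out <- as.character(docs[[1]])
print(out)
```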
9. Contd...
• Remove English Stop Words
• Remove Own Stop Words
• Strip Whitespace
• Specific Transformations
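A sketch of these four steps on a made-up sentence. The custom stop-word list and the replaceText helper are hypothetical names introduced only for this example.

```r
library(tm)

# Hypothetical sample text with stop words and irregular spacing.
docs <- VCorpus(VectorSource("the   department of  mining and the data   team"))

docs <- tm_map(docs, removeWords, stopwords("english"))      # English stop words
docs <- tm_map(docs, removeWords, c("department", "team"))   # own stop words
docs <- tm_map(docs, stripWhitespace)                        # collapse runs of whitespace

# Specific transformation: replace one string with another.
replaceText <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replaceText, "data", "DATA")

out <- as.character(docs[[1]])
print(out)
```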
10. Contd...
• Stemming
• Creating a Document Term Matrix
A matrix with documents as the rows,
terms as the columns, and
the frequency count of each term as the cells of the matrix.
• Term frequency
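Stemming, the document term matrix and term frequencies can be sketched together on a tiny made-up corpus. The two sample documents are assumptions for illustration only.

```r
library(tm)
library(SnowballC)  # stemDocument() relies on the Snowball stemmer

# Hypothetical two-document corpus.
docs <- VCorpus(VectorSource(c("mining mines data",
                               "data miners mine text data")))

docs <- tm_map(docs, stemDocument)     # stemming

dtm <- DocumentTermMatrix(docs)        # documents as rows, terms as columns
dim(dtm)

# Term frequency: total count of each term over all documents.
freq <- colSums(as.matrix(dtm))
print(freq)
```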
11. Contd...
• Ordering terms by frequency
• ord <- order(freq)
• Least frequent terms
• freq[head(ord)]
• Most frequent terms
• freq[tail(ord)]
• Document Term matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
12. Contd...
• Removing Sparse Terms
• dtms <- removeSparseTerms(dtm, 0.1) # sparse factor
• The resulting matrix contains only terms whose sparsity is below the sparse factor.
• Frequent terms and associations
** lowfreq = 1000 keeps terms that occur at least 1000 times
• Associations of a word, subject to a correlation limit
• # associations of “data” with other words
• # if two words always appear together, the correlation is 1.0
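A sketch of sparse-term removal, frequent terms and associations on a toy corpus. The four sample documents are made up, and the thresholds (0.6, 2, 0.5) are scaled down from the values on the slide so that they produce visible results on such a small corpus.

```r
library(tm)

# Hypothetical four-document corpus.
docs <- VCorpus(VectorSource(c("data mining text", "data mining corpus",
                               "data analysis", "text corpus")))
dtm <- DocumentTermMatrix(docs)

# Keep only terms whose sparsity is below 0.6;
# "analysis" is absent from 3 of 4 documents (sparsity 0.75) and is dropped.
dtms <- removeSparseTerms(dtm, 0.6)
kept <- Terms(dtms)
print(kept)

# Terms that occur at least twice in the corpus.
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(frequent)

# Terms whose correlation with "data" is at least 0.5.
assoc <- findAssocs(dtm, "data", corlimit = 0.5)
print(assoc)
```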
13. Correlation
• 50 of the more frequent words
• With minimum correlation of 0.5
• Words occurring at least 100 times
• By default:
• 20 random terms
• With minimum correlation of 0.7
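The correlation plot described above might be produced as in the sketch below. The toy corpus and the reduced lowfreq value are assumptions made so the example runs on a few sentences; the plot itself needs the Rgraphviz package, so the call is guarded and skipped when that package is not installed.

```r
library(tm)

# Hypothetical four-document corpus.
docs <- VCorpus(VectorSource(c("data mining text", "data mining corpus",
                               "data analysis", "text corpus")))
dtm <- DocumentTermMatrix(docs)

# At most 50 of the terms that occur at least min_count times
# (the slide uses 100; 2 is used here because the corpus is tiny).
min_count <- 2
top_terms <- head(findFreqTerms(dtm, lowfreq = min_count), 50)

if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  # Draw an edge between terms whose correlation is at least 0.5.
  plot(dtm, terms = top_terms, corThreshold = 0.5)
  # plot(dtm) alone uses the defaults: 20 random terms, corThreshold = 0.7.
}
```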
14. Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # words that occur at least 500 times in the corpus
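A sketch of the frequency plot with ggplot2. The term counts below are invented stand-ins for colSums(as.matrix(dtm)) so that the example runs without a corpus; the bar-chart styling is one possible choice.

```r
library(ggplot2)

# Hypothetical term counts standing in for colSums(as.matrix(dtm)).
freq <- sort(c(data = 1200, mining = 800, text = 650, corpus = 300),
             decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

# Keep words that occur at least 500 times and plot them as bars.
wf500 <- subset(wf, freq >= 500)
p <- ggplot(wf500, aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```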
16. Size of Word & Frequency
• For word limitation (wordcloud() requires library(wordcloud))
• wordcloud(names(freq), freq, max.words=100)
• For term frequency limitation
• wordcloud(names(freq), freq, min.freq=100)
• Adding Color
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
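The options above can be combined in one call. In this sketch the term counts are invented, the seed is fixed only to make the layout repeatable, and scale (largest and smallest font size) is shown as one way to control word size.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical term counts standing in for colSums(as.matrix(dtm)).
freq <- c(data = 1200, mining = 800, text = 650, corpus = 300, word = 90)

set.seed(123)                      # word placement is random; fix it for repeatability
pal <- brewer.pal(6, "Dark2")      # six colours from the Dark2 palette

wordcloud(names(freq), freq,
          min.freq  = 100,         # drop terms occurring fewer than 100 times
          max.words = 100,         # plot at most 100 words
          scale     = c(4, 0.5),   # font size range, largest to smallest
          colors    = pal)
```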
17. Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter
than 20 characters
• Generate frequencies and percentages
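A sketch of both steps. The term list is a made-up stand-in for the column names of the document term matrix, and the frequency table is built with base R's table() and prop.table() as a substitute for the qdap helpers, which are not shown in the slide.

```r
library(dplyr)  # provides the %>% pipe

# Hypothetical terms standing in for colnames(as.matrix(dtm)).
all_terms <- c("data", "mining", "text", "corpus",
               "averyveryverylongtokenfromapdf")

# Retain the terms shorter than 20 characters.
words <- all_terms %>% (function(x) x[nchar(x) < 20])

# Frequency and percentage of each word length.
len_tab  <- table(nchar(words))
len_dist <- data.frame(len_tab,
                       percent = round(100 * as.vector(prop.table(len_tab)), 2))
print(len_dist)
```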