Introduction to Text Mining
1. Class Outline
• Introduction: Unstructured Data Analysis
• Word-level Analysis
– Vector Space Model
– TF-IDF
• Beyond Word-level Analysis: Natural Language Processing (NLP)
• Text Mining Demonstration in R: Mining Twitter Data
2. Background: Text Mining – New MR Tool!
• Text data is everywhere – books, news articles, financial analyses,
blogs, social networks, etc.
• According to estimates, 80% of the world’s data is in “unstructured text
format”
• We need methods to extract, summarize, and analyze useful
information from unstructured text data
• Text mining seeks to automatically discover useful knowledge from
this massive amount of data
• Text mining is an active research area in both industry and academia
3. What is Text Mining?
• Use of computational techniques to extract high quality
information from text
• Extract and discover knowledge hidden in text automatically
• KDD definition: “discovery by computer of new previously unknown
information, by automatically extracting information from a usually
large amount of different unstructured textual resources”
5. Features of Text Data
• High dimensionality
• Large number of features
• Multiple ways to represent the same concept
• Highly redundant data
• Unstructured data
• Easy for humans, hard for machines
• Abstract ideas hard to represent
• Huge amount of data to be processed
– Automation is required
6. Acquiring Texts
• Existing digital corpora: e.g. XML (high quality text and metadata)
– http://www.hathitrust.org/htrc
• Other digital sources (e.g. the Web, Twitter, Amazon consumer reviews)
– Through an API: e.g. tweets
– Websites without APIs can be “scraped”
– Generally requires custom programming (Perl, Python, etc.) or software tools
(e.g. Web extractor pro)
• Undigitized text
– Scanned and subjected to Optical Character Recognition (OCR)
– Time and labor intensive
– Error-prone
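The “scraping” step above can be sketched in a few lines. The slides mention Perl or Python for this; below is a minimal Python illustration using only the standard library’s `html.parser`. The `TextExtractor` class name and the sample HTML string are hypothetical, for illustration only – a real scraper would also fetch pages over HTTP and handle encodings, scripts, and site terms of service.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal scraping sketch: pull visible text out of raw HTML.

    Only the parsing step is shown; fetching the page and cleaning
    the result are left out.
    """
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes
        if data.strip():
            self.chunks.append(data.strip())

html = "<html><body><h1>Review</h1><p>Great camera, fast shipping.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Review Great camera, fast shipping.
```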
7. Word-level Analysis: Vector Space Model
• Documents are treated as a “bag” of words or terms
• Any document can be represented as a vector: a list of terms and
their associated weights
– D = {(t1, w1), (t2, w2), …, (tn, wn)}
– ti: i-th term
– wi: weight of the i-th term
• A weight measures a term’s importance or information content
8. Vector Space Model: Bag of Words Representation
• Each document: Sparse high-dimensional vector!
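A bag-of-words vector like the D = {(t1, w1), …, (tn, wn)} representation above can be sketched in a few lines of Python (the slides’ demos use R; this stdlib sketch is for illustration, with raw counts as the weights and a deliberately naive whitespace tokenizer):

```python
from collections import Counter

def bag_of_words(document):
    """Represent a document as a sparse {term: count} vector.

    Toy tokenization: lowercase and split on whitespace. Real systems
    also strip punctuation, remove stop words, and stem.
    """
    return dict(Counter(document.lower().split()))

doc = "the cat sat on the mat"
vec = bag_of_words(doc)
print(vec)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

Only terms that actually occur get an entry, which is what makes the vector sparse: over a real vocabulary of tens of thousands of terms, almost every weight is zero.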
10. TF-IDF: Example
• TF: Consider a document containing 100 words wherein the word cow
appears 3 times. Following the previously defined formulas, what is
the term frequency (TF) for cow?
– TF(cow, d1) = 3 (using the raw count as the term frequency).
• IDF: Now assume we have 10 million documents and cow appears in
one thousand of these. What is the inverse document frequency of
the term, cow?
– IDF(cow) = log10(10,000,000/1,000) = log10(10,000) = 4
• TF-IDF score?
– TF-IDF = 3 x 4 = 12 (Product of TF and IDF)
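The slide’s arithmetic can be checked directly. This is a minimal sketch of the example above, assuming raw-count term frequency and a base-10 logarithm (which is what makes the numbers come out to 3, 4, and 12):

```python
import math

def tf(term_count):
    # Raw term frequency, as in the slide's example
    return term_count

def idf(total_docs, docs_with_term):
    # Base-10 log of total documents over documents containing the term
    return math.log10(total_docs / docs_with_term)

tf_cow = tf(3)                    # "cow" appears 3 times in the document
idf_cow = idf(10_000_000, 1_000)  # "cow" appears in 1,000 of 10 million docs
print(tf_cow * idf_cow)           # 12.0
```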
12. Application 2: Word Frequencies – Zipf’s Law
• Idea: We use a few words very often, and most words very rarely,
because it’s more effort to use a rare word.
• Zipf’s Law: Product of frequency of word and its rank is [reasonably]
constant
• Empirically demonstrable; holds up over different languages
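The rank-times-frequency product is easy to compute. The mini-corpus below is a hypothetical toy example just to show the mechanics; Zipf’s law only emerges clearly on large corpora, so no constancy should be expected from a text this small:

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs over the hill")
counts = Counter(text.split())

# Rank words by descending frequency and compute rank x frequency;
# on a large corpus this product stays roughly constant
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```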
14. Application 3: Word Cloud - Budweiser Example
http://people.duke.edu/~el113/Visualizations.html
15. Problems with Word-level Analysis: Sentiment
• Sentiment can often be expressed in a more subtle manner, making it
difficult to be identified by any of a sentence or document’s terms
when considered in isolation
– A positive or negative sentiment word may have opposite orientations in
different application domains. (“This camera sucks.” -> negative; “This vacuum
cleaner really sucks.” -> positive)
– A sentence containing sentiment words may not express any sentiment. (e.g.
“Can you tell me which Sony camera is good?”)
– Sarcastic sentences, with or without sentiment words, are hard to deal with. (e.g.
“What a great car! It stopped working in two days.”)
– Many sentences without sentiment words can also imply opinions. (e.g. “This
washer uses a lot of water.” -> negative)
• We have to consider the overall context (semantics of each sentence
or document)
16. Natural Language Processing (NLP) to the Rescue!
• NLP is a field of computer science, artificial intelligence, and
linguistics concerned with the interactions between computers and
human (natural) languages.
• Key idea: Use statistical “machine learning” to automatically learn
the language from data!
• Major tasks in NLP
– Automatic summarization
– Part-of-speech (POS) tagging
– Relationship extraction
– Sentiment analysis
– Topic segmentation and recognition
– Machine translation
22. Text Mining Demonstration in R: Mining Twitter Data
23. Twitter Mining in R – 1/2
Step 0) Install “R” and Packages
R program: http://www.r-project.org/
Package: http://cran.r-project.org/web/packages/tm/index.html
Package: http://cran.r-project.org/web/packages/twitteR/index.html
Package: http://cran.r-project.org/web/packages/wordcloud/index.html
Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Step 1) Retrieving Text from Twitter: Twitter API (using twitteR)
24. Twitter Mining in R – 2/2
Step 2) Transforming Text
Step 3) Stemming Words
Step 4) Build a Term-Document Matrix
Step 5) Frequent Terms and Associations
Step 6) Word Cloud
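Steps 2–5 above are done in R with the tm package in the demonstration; as a language-neutral illustration of what those steps compute, here is a minimal Python sketch using only the standard library. The document strings and the suffix-stripping “stemmer” are hypothetical toys (real pipelines use e.g. Porter stemming):

```python
import re
from collections import Counter

def preprocess(text):
    # Step 2: transform - lowercase and drop punctuation
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def stem(word):
    # Step 3: toy stemmer that strips a plural "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

docs = ["Text mining finds patterns in texts",
        "Mining text data reveals hidden patterns"]

# Step 4: term-document matrix as {term: [count per document]}
tokenized = [[stem(w) for w in preprocess(d)] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
tdm = {t: [doc.count(t) for doc in tokenized] for t in vocab}

# Step 5: frequent terms across the whole corpus
totals = Counter({t: sum(row) for t, row in tdm.items()})
print(totals.most_common(3))
```

Step 6 would then feed these frequencies to a word-cloud renderer (the wordcloud package in the R demonstration), sizing each word by its total count.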
25. Software for Text Mining
• A number of academic and commercial software packages are available:
– 1. Open source packages in R – e.g. tm
• R program: http://www.r-project.org/
• Package: http://cran.r-project.org/web/packages/tm/index.html
• Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
– 2. Stanford CoreNLP
• http://nlp.stanford.edu/software/corenlp.shtml
– 3. SAS Text Miner
– 4. IBM SPSS
– 5. BoosTexter
– 6. StatSoft
– 7. AeroText
• Text Data is everywhere – you can mine it to gain insights!