Introduction to Text Mining
1. Class Outline
• Introduction: Unstructured Data Analysis
• Word-level Analysis
– Vector Space Model
– TF-IDF
• Beyond Word-level Analysis: Natural Language Processing (NLP)
• Text Mining Demonstration in R: Mining Twitter Data
2. Background: Text Mining – New MR Tool!
• Text data is everywhere – books, news articles, financial analyses,
blogs, social networks, etc.
• According to estimates, 80% of the world’s data is in “unstructured text
format”
• We need methods to extract, summarize, and analyze useful
information from unstructured text data
• Text mining seeks to automatically discover useful knowledge from
this massive amount of data
• Text mining is an active research area in both industry and academia
3. What is Text Mining?
• Use of computational techniques to extract high quality
information from text
• Extract and discover knowledge hidden in text automatically
• KDD definition: “discovery by computer of new previously unknown
information, by automatically extracting information from a usually
large amount of different unstructured textual resources”
5. Features of Text Data
• High dimensionality
• Large number of features
• Multiple ways to represent the same concept
• Highly redundant data
• Unstructured data
• Easy for humans, hard for machines
• Abstract ideas hard to represent
• Huge amount of data to be processed
– Automation is required
6. Acquiring Texts
• Existing digital corpora: e.g. XML (high quality text and metadata)
– http://www.hathitrust.org/htrc
• Other digital sources (e.g. the Web, Twitter, Amazon consumer reviews)
– Through an API: e.g. tweets
– Websites without APIs can be “scraped”
– Generally requires custom programming (Perl, Python, etc.) or software tools
(e.g. Web extractor pro)
• Undigitized text
– Scanned and subjected to Optical Character Recognition (OCR)
– Time and labor intensive
– Error-prone
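The “scraping” step above can be sketched in a few lines. The slides mention Perl or Python for this; below is a minimal Python illustration using only the standard library’s `html.parser`. The `TextExtractor` class name and the sample HTML string are hypothetical, for illustration only – a real scraper would also fetch pages over HTTP and handle encodings, scripts, and site terms of service.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal scraping sketch: pull visible text out of raw HTML.

    Only the parsing step is shown; fetching the page and cleaning
    the result are left out.
    """
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes
        if data.strip():
            self.chunks.append(data.strip())

html = "<html><body><h1>Review</h1><p>Great camera, fast shipping.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Review Great camera, fast shipping.
```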
7. Word-level Analysis: Vector Space Model
• Documents are treated as a “bag” of words or terms
• Any document can be represented as a vector: a list of terms and
their associated weights
– D = {(t1, w1), (t2, w2), …, (tn, wn)}
– ti: i-th term
– wi: weight of the i-th term
• A weight measures a term’s importance or information content
8. Vector Space Model: Bag of Words Representation
• Each document: Sparse high-dimensional vector!
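A bag-of-words vector like the D = {(t1, w1), …, (tn, wn)} representation above can be sketched in a few lines of Python (the slides’ demos use R; this stdlib sketch is for illustration, with raw counts as the weights and a deliberately naive whitespace tokenizer):

```python
from collections import Counter

def bag_of_words(document):
    """Represent a document as a sparse {term: count} vector.

    Toy tokenization: lowercase and split on whitespace. Real systems
    also strip punctuation, remove stop words, and stem.
    """
    return dict(Counter(document.lower().split()))

doc = "the cat sat on the mat"
vec = bag_of_words(doc)
print(vec)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

Only terms that actually occur get an entry, which is what makes the vector sparse: over a real vocabulary of tens of thousands of terms, almost every weight is zero.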
10. TF-IDF: Example
• TF: Consider a document containing 100 words wherein the word cow
appears 3 times. Following the previously defined formulas, what is
the term frequency (TF) for cow?
– TF(cow, d1) = 3 (using the raw count as the term frequency).
• IDF: Now assume we have 10 million documents and cow appears in
one thousand of these. What is the inverse document frequency of
the term, cow?
– IDF(cow) = log10(10,000,000/1,000) = log10(10,000) = 4
• TF-IDF score?
– TF-IDF = 3 x 4 = 12 (Product of TF and IDF)
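The slide’s arithmetic can be checked directly. This is a minimal sketch of the example above, assuming raw-count term frequency and a base-10 logarithm (which is what makes the numbers come out to 3, 4, and 12):

```python
import math

def tf(term_count):
    # Raw term frequency, as in the slide's example
    return term_count

def idf(total_docs, docs_with_term):
    # Base-10 log of total documents over documents containing the term
    return math.log10(total_docs / docs_with_term)

tf_cow = tf(3)                    # "cow" appears 3 times in the document
idf_cow = idf(10_000_000, 1_000)  # "cow" appears in 1,000 of 10 million docs
print(tf_cow * idf_cow)           # 12.0
```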
12. Application 2: Word Frequencies – Zipf’s Law
• Idea: We use a few words very often, and most words very rarely,
because it’s more effort to use a rare word.
• Zipf’s Law: Product of frequency of word and its rank is [reasonably]
constant
• Empirically demonstrable; holds up over different languages
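The rank-times-frequency product is easy to compute. The mini-corpus below is a hypothetical toy example just to show the mechanics; Zipf’s law only emerges clearly on large corpora, so no constancy should be expected from a text this small:

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs over the hill")
counts = Counter(text.split())

# Rank words by descending frequency and compute rank x frequency;
# on a large corpus this product stays roughly constant
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```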
14. Application 3: Word Cloud - Budweiser Example
http://people.duke.edu/~el113/Visualizations.html
15. Problems with Word-level Analysis: Sentiment
• Sentiment can often be expressed in a more subtle manner, making it
difficult to be identified by any of a sentence or document’s terms
when considered in isolation
– A positive or negative sentiment word may have opposite orientations in
different application domains. (“This camera sucks.” -> negative; “This vacuum
cleaner really sucks.” -> positive)
– A sentence containing sentiment words may not express any sentiment. (e.g.
“Can you tell me which Sony camera is good?”)
– Sarcastic sentences, with or without sentiment words, are hard to deal with. (e.g.
“What a great car! It stopped working in two days.”)
– Many sentences without sentiment words can also imply opinions. (e.g. “This
washer uses a lot of water.” -> negative)
• We have to consider the overall context (semantics of each sentence
or document)
16. Natural Language Processing (NLP) to the Rescue!
• NLP is a field of computer science, artificial intelligence, and
linguistics concerned with the interactions between computers and
human (natural) languages.
• Key idea: Use statistical “machine learning” to automatically learn
the language from data!
• Major tasks in NLP
– Automatic summarization
– Part-of-speech (POS) tagging
– Relationship extraction
– Sentiment analysis
– Topic segmentation and recognition
– Machine translation
22. Text Mining Demonstration in R: Mining Twitter Data
23. Twitter Mining in R – 1/2
Step 0) Install “R” and Packages
R program: http://www.r-project.org/
Package: http://cran.r-project.org/web/packages/tm/index.html
Package: http://cran.r-project.org/web/packages/twitteR/index.html
Package: http://cran.r-project.org/web/packages/wordcloud/index.html
Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Step 1) Retrieving Text from Twitter: Twitter API (using twitteR)
24. Twitter Mining in R – 2/2
Step 2) Transforming Text
Step 3) Stemming Words
Step 4) Build a Term-Document Matrix
Step 5) Frequent Terms and Associations
Step 6) Word Cloud
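Steps 2–5 above are done in R with the tm package in the demonstration; as a language-neutral illustration of what those steps compute, here is a minimal Python sketch using only the standard library. The document strings and the suffix-stripping “stemmer” are hypothetical toys (real pipelines use e.g. Porter stemming):

```python
import re
from collections import Counter

def preprocess(text):
    # Step 2: transform - lowercase and drop punctuation
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def stem(word):
    # Step 3: toy stemmer that strips a plural "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

docs = ["Text mining finds patterns in texts",
        "Mining text data reveals hidden patterns"]

# Step 4: term-document matrix as {term: [count per document]}
tokenized = [[stem(w) for w in preprocess(d)] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
tdm = {t: [doc.count(t) for doc in tokenized] for t in vocab}

# Step 5: frequent terms across the whole corpus
totals = Counter({t: sum(row) for t, row in tdm.items()})
print(totals.most_common(3))
```

Step 6 would then feed these frequencies to a word-cloud renderer (the wordcloud package in the R demonstration), sizing each word by its total count.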
25. Software for Text Mining
• A number of academic and commercial software packages are available:
– 1. Open source packages in R – e.g. tm
• R program: http://www.r-project.org/
• Package: http://cran.r-project.org/web/packages/tm/index.html
• Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
– 2. Stanford CoreNLP
• http://nlp.stanford.edu/software/corenlp.shtml
– 3. SAS Text Miner
– 4. IBM SPSS
– 5. BoosTexter
– 6. StatSoft
– 7. AeroText
• Text Data is everywhere – you can mine it to gain insights!