2. Abstract
Overview of the data mining and analysis of social media, exploring the
application of various data mining, textual mining and analytical
techniques to social media data sources. The focus will be on the
practical application of these techniques for the purposes of:
• Monitoring of social media sources
• Analyzing content to identify leading issues and sentiment
• Analyzing and forecasting trends
• Identifying and profiling influential participants, subgroups and
communities
2
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
3. Agenda: Part 1
• My Biography
• Resources
• Social Media Defined
• Data Mining Example
• Text Mining Processes
• Using Text Mining for Prediction
• Brief Look at Programming for Prediction
3
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
4. Agenda: Part 2
• Sentiment Analysis & Opinion Mining Defined
– Business Interest & Software Packages
– Levels of Analysis
– Automated Classification
• Social Network Analysis
– Defined
– History
– Basic techniques and measures
– Ego and Social-Centric Analysis
4
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
5. Biography: Dave King
• EVP of Product Development and
Management at JDA Software
• 30 years in enterprise package
software business
• 15 years as university professor
• 15 years as Co-Chair of the Internet &
Digital Economy Track (HICSS)
• Long time interest in various aspects of
E-Commerce, Business Intelligence,
Analytics (including Text Analytics)
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
6. Personal Experiences with
Analytics
• Taught applied statistics and math modeling
• In software R&D
– Optimization in the 80s
– Natural Language Frontends
• NLI Query & CMU Robotics Lab
– EIS Competitive Analysis
• Dow Jones and Reuters
• Verity Topics
• NewsAlert
– InXight’s Hyperbolic Tree
– Supply Chain Analytics
– Sentiment Analysis for Retailers
• In the case of many of these advanced techniques, often the
audiences have been small, sometimes bewildered, and often
fleeting.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
9. What is Social Media?
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
9
10. Defined
are online technologies and practices for social
interaction enabling sharing opinions, insights,
experiences, perspectives and media itself.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
10
11. Defined
is the media we use to
be social. That’s it.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
11
12. Social Media Types: Take Your Pick
12
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
13. Social Media is Still Huge!
Alexa Traffic Oct 6, 2012
Rank Website Type
1 Facebook Social
2 Google Search
3 YouTube Social
4 Yahoo! Search
5 Baidu.com Search
6 Wikipedia Social
7 Windows Live Search
8 Twitter Social
9 QQ.COM Portal
10 Amazon.com E-Commerce
11 Blogspot.com Social
12 LinkedIn Social
13 Taobao.com E-Commerce
14 Google India Search
15 Yahoo! Japan Search
13
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
14. Social Media is Still Huge!
Growth in Registered Users 2011 to 2012
Facebook: 750M -1B
Twitter: 200M - 500M
LinkedIn: 100M – 175M
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
14
15. Social Media is Still Huge!
If Social Media sites were countries…
China: 1.4B
India: 1.2B
Facebook: 1.0B
Twitter: 500M
US: 310M
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
15
16. Social Media is Still Huge!
Usage Per Day
Facebook: 3.2B Likes & Comments
Twitter: 340M Tweets
LinkedIn: 14M Searches
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
16
17. Analyzing Social Media:
Two Paths
Media - Content
Social - Network
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
17
19. social Media Formats
• Articles • Pictures
• Comments • Videos
• Messages • Music
• Reviews • Locations
• Ratings • Tags
• Rankings •…
19
19
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
20. social Media Data: One Commonality
20
20
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
21. Data Mining: Defined
Discovering meaningful
patterns from large data
sets using pattern
recognition technologies.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 21
22. Data Mining: CRISP-DM
Real-World
Data
Data Consolidation
Business Data
Understanding Understanding
Data
Preparation
Data Cleaning
Deployment
Modeling
Data Transformation
Evaluation
Data Reduction
Well-Formed
Cross-Industry Standard Process for Data Mining Data
22
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
23. Data Mining:
General Data Assumptions
Structured
Transformed
Well-Formed
23
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
24. Data Mining: Simple Example
Market Basket
Analysis
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 24
26. Market Basket Analysis:
The Ole Beer and Diaper Legend…
Transaction Item Binary Representation
1 Bread
1 Milk Transaction Beer Bread Cola Diapers Eggs Milk
2 Beer 1 0 1 0 0 0 1
2 Bread 2 1 1 0 1 1 0
2 Diapers 3 1 0 1 1 0 1
2 Eggs 4 1 1 0 1 0 1
3 Beer 5 0 1 1 1 0 1
3 Cola
3 Diapers
3 Milk
4 Beer Goal: Empirically determine those itemsets
4 Bread
4 Diapers that occur frequently together in a set of
4 Milk transactions, producing a set of
5 Bread
5 Cola Association Rules of the form LHS -> RHS
5 Diapers
5 Milk
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 26
27. Market Basket Analysis:
The Ole Beer and Diaper Legend…
Transaction Beer Bread Cola Diapers Eggs Milk
1 0 1 0 0 0 1
2 1 1 0 1 1 0
3 1 0 1 1 0 1
4 1 1 0 1 0 1
5 0 1 1 1 0 1
Concept Definition Example
Itemset A specific collection of items in transaction {Diapers, Beer}
Support Count Number of transactions with itemset Support {Diapers,Beer} = 3
Transactions No of transactions = N N=5
Association RuleImplication rule of form LHS->RHS where LHS & RHS are {Diapers} -> {Beer}
itemsets
Rule Support No. of times rule appears in dataset 3/5 = .6
#tuples(LHS & RHS}/N
Rule Confidence No. of times RHS occurs in transactions with LHS 3/4 = .75
#tuples(LHS, RHS)/#tuples(LHS)
Rule Lift Strength of Association over random co-occurrence of LHS (3/5)/((4/5)x(3/5)) = 1.25
and RHS
#tuples(LHS,RHS}/N)/(#tuples(LHS)/N x #tuples(RHS)/N)
Confidence(RHS/LHS)/Support(RHS)
Support(LHS,RHS)/Support(LHS)xSupport(RHS)
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 27
28. Market Basket Analysis:
What if it were a text analysis problem?
Joe went to the 7-11 to pick up some cigarettes.
While he was there he also bought some dipers
and beer.
Sally was on her weekly shopping run at Wal-Mart.
She had picked up some diapers and formula for
her infant. She also thought about buying beer for
her husband, but the they were out of the brand he
liked.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 28
29. Market Basket Analysis:
What if it were a text analysis problem?
• No specified format
• Variable length
• Variable spelling
• Punctuation and non-alphanumeric characters
• Contents are not predefined and no predefined set
of values
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 29
30. Text Mining (aka Text Analytics):
Defined
Using natural language processing
& data mining to discover patterns
in a collection of “documents”
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 30
31. Text Mining:
Document Collections
• Word Documents • Blogs
• PDFs • Tweets
• Emails • Open ended
• IM Chat surveys
• Web Pages • Transcripts of
Helpline calls
31
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
32. Text Mining:
CRISP-Like Processes
Real-World
Text Data
Document
Business
Understanding
Document
Understanding
Consolidation
Document Establish the
Preparation
Corpus
Deployment
Documents
Modeling
Corpus Refinement
(Token, Stem, Stop…)
Feature Selection
Evaluation
& Weighting
Doc-Term
Matrix*
32
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
33. Text Mining:
Creating the corpus
A large and “structured”
or “organized” collection
of text
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 33
34. Text Mining Process:
Corpus Refinement
Common representation of tokens within and between documents
Eliminate
Tokenization Normalize Stemming
Stop Words
• Tokenization —Parse the text to generate terms. Sophisticated
analyzers can also extract phrases from the text.
• Normalize — Convert them to lowercase.
• Eliminate stop words — Eliminate terms that appear very often
(e.g. the, and, …).
• Stemming — Convert the terms into their stemmed form—remove
plurals and different word forms (e.g. achieve, achieves, achieved
– achiev) [note: word about synonyms – WordNet Synset]
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 34
35. Text Mining Processes:
Feature Extraction & Weighting
Feature
Extraction “Bag of Words, Terms
or Tokens”
Vector Representation:
Word, Term or Token/Doc Matrix
“Bag of Words” (BOW) or
Vector Space Model (VSM):
Words or Tokens are
attributes and documents
are examples
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 35
36. Text Mining Processes:
Transforming Frequencies
• Binary Frequencies: tf =1 for tf>0; otherwise 0
• Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K
• Log Frequencies: 1 + log(tf) for tf>0; otherwise 0
• Normalized Frequencies: Divide each frequency by
SQRT of Sum of Squares of the frequencies within the
vector (column)
• Term Frequency–Inverse Document Frequency
– TF * IDF
– Inverse Document Frequency: log(N/(1+D)) where N is total
number of docs and D is number with term
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 36
37. Text Mining Processes:
Twitter Example – Problem Features
• Each tweet <= 140
characters (avg. 10-
15 words/message)
• Heavy presence of
non-alpha symbols,
abbrevs, misspellings
and slang
• Tweets often include
retweets (original
tweet repeated)
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 37
38. Text Mining Processes:
Twitter Example
One of the things I love and adore about Twitter …
is how its open API has lit a fierce fire of innovation
when it comes to analytics. Anyone and their
brother and ma-in-law can develop a tool, and they
have! Much to the benefit of the rest of us.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 38
39. Text Mining Processes:
Twitter Example – Twitter API
Get Search
• http://search.twitter.com/search.json?q=<query>
• search.twitter.com/search.json?q=Obama&rpp=100&page=5
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 39
40. Text Mining Processes:
Twitter Example – JSON
• JSON (JavaScript Object Notation)
• Lightweight, Text-Based, Data-Interchange Format
• Built on Two Structures:
– A collection of name:value pairs. In various languages, this is
realized as an object, record, struct, dictionary, hash table,
keyed list, or associative array.
– An ordered list of values. In most programming languages,
this is realized as an array, vector, list, or sequence
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 40
41. Text Mining Processes:
Twitter Example – Establish Corpus
Query API Result
search.twitter.com/ {…
search.json?q=%3A)+ results:[
feel+ feeling& {iso_language_code: en,
rpp=100&page=5 to_user_name: Andrea,
...'
search.twitter.com/sea text: u”Love is everything in this
rch.json?q=%3A(+ world! Its a feeling like no other. I
feel+feeling& can't wait 2 feel that emotion
rpp=100&page=5 again..but patience is key :-)”
...
created_at: Thu, 18 Oct 2012
20:11:22 +0000,
…'}}
...}
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 41
42. Text Mining Processes:
Twitter Example – Simple Question
Are there any language differences between
“feeling” tweets containing and symbols?
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 42
43. Text Mining Processes:
Twitter Example – Establish Corpus
text: value
1. Love is everything in this world! Its a feeling like no other. I
pairs in
can't wait 2 feel that emotion again..but patience is key :-)
JSON object
2. @mzxAmaZiiN Whats up ma. How ya feeling? Lemme make
that soul feel better. :)
Remove 3. "..I've got a good feeling about today :P Something makes
RTs me feel i might make a sale or two :) *fingers crossed* #etsy
#shop #seller #cra...
4. @IWontForgetDemi Awww poor thing :( hate feeling sick!
Hope you feel better soon
5. @IzabelaLeafsfan no the worst feeling ever is when u feel
like total crap. cuz u think no one luvs u :(
6. …
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 43
44. Text Mining Processes:
Twitter Example – Doc Preparation
Eliminate
Tokenization Normalize Stemming
Stop Words
Tweet:
Love is everything in this world! Its a feeling like no other. I can't wait 2 feel
that emotion again..but patience is key :-)
Words:
['Love', 'is', 'everything', 'in', 'this', 'world!', 'Its', 'a', 'feeling', 'like', 'no', 'other.',
'I', "can't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':-)']
Tokens:
['Love', 'is', 'everything', 'in', 'this', 'world', '!', 'Its', 'a', 'feeling', 'like', 'no',
'other.', 'I', 'ca', "n't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but',
'patience', 'is', 'key', ':', '-', ')']
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 44
50. Prediction
Information + Epidemiology =
Infodemiology
Monitoring and analyzing
queries from Internet search
engines or peoples' status
updates on microblogs for
syndromic surveillance to
predict disease outbreaks
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 50
51. Prediction
Syndromic Survelliance
Surveillance using health-related data
that precede diagnosis and signal a
sufficient probability of a case or an
outbreak to warrant further public
health response
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 51
52. Monitoring the Onset of the Flu Season
The Official Standard Way
Sentinel ILI - influenza-like-illness
Physician
Public Health
Authority
ILI Report Public Reports
Aggregated
Costly Data
1-2 Week Lag
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 52
53. Prediction
Infodemiology
What is the first thing
search
some people do before
they see a doctor or
tweet
take OTC medicines?
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 53
54. Prediction - What are the terms,
keywords, phrases…?
What is the first thing
search • Flu
people do before they
• Flu symptoms
• H1N1
• Swine Flu
• Cold
• Fever
see a doctor or take
• Headache
OTC medicines?
tweet
• …
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 54
55. Prediction
Infodemiology Studies
Authors Title Date Type Data Source Dependent Explanatory Variables Model Results
(M-Y) Variables
Eysenbach Infodemiology: Tracking Nov-06 Search Impressions and clicks from a Number of Number of clicks on Linear correlation between Correlation of .81 with ILI data and
Flu-Related Searches on Google Ad displaying the influenza lab Google Ad dealing with clicks and 3 measures of .90 with lab tests and positive
the Web for Syndromic question:“Do you have the flu? tests, number of influenza publically reported influenza cases.
Surveillance Fever, Chest discomfort, positive influenza
Weakness, Aches, Headache, lab tests and
Cough.” Covers Oct. 2004-May number of ILI
2005 reports from
Sentinel GPs
reported by PHA
in Canada
Hulth et al. Web Queries as a Feb-09 Search Queries to Swedish heath Weekly Lab Weekly ratio of influenza Two linear regression Average R squared was .90 for the
Source for Syndromic advice site (Varguiden) from diagnosed cases of related Web queries to all models, one predicting lab two years.
Survelliance June 2005 to June 2007 influenza and % of queries at Varguiden site diagnosed cases and the
ILI reports of other ILI reports where the
influenza from explanatory variable based
GPs based on a composite of the
best predicting query terms
Ginsberg et al. Detecting influenza Feb-09 Search Google Search: Historical web % US ILI Visits in ILI-related search queries logit(P)=b0 + b1xlogit(Q)+e Mean correlation of .90 between P
epidemics using search logs of ILI related Google week reported to where P is % visits and Q is & Q for 9 CDC US healthcare
engine query data Searchs 2003-2008 CDC normalized number of regions.
queries
Lampos & Tracking the flu Jun-10 Tweets Twitter: 24 weeks of Twitter Weekly ILI UK Average daily "flu-score" linear least squares Average correlation for 5 UK health
Cristianini pandemic monitoring corpus in UK from June '09 to Health Protection for all tweets. Flu score for regression time series of HPA care regions was about .92.
the Social Web Dec'09 Agency smoothed single tweet is proportion flu rates on aggregated tweet
for daily values of all ILI "stem" markers flu-scores
(ngrams) that appear in
the tweet.
Culotta Towards detecting Jul-10 Tweets 575K flu-related Twitter % ILI weekly % of messages reporting Compares simple logit model Aggregating keyword frequencies
influenza epidemics by messages and % ILI reports reports from CDC an ILI or related symptom with multiple regression using separate keywords (multi-
analyzing Twitter from CDC for period Feb 2010 for specified (based on detailed model having different reg) works better than single
messages to Apr 2010. period classification and counts for separate Tweet aggregated (simple logit) predictor.
statistical procedures to keywords & phrases Simple BOW classifier can be used
determine whether ILI or to filter ILI messages. Achieved
not) r=.78 for 5 weeks of validation data.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 55
56. Prediction
Infodemiology Studies
Authors Title Date Data Source Dependent Explanatory Variables Model Results
(M-Y) Variables
Acherekar Predicting Flu Trends Apr-11 Tweets ILI "influenza" visits to Sentinel ILI influenza visits Previous weeks ILI visits Auto-regression model with: Analysis shows time-lagged ILI is
using Twitter Data physicians reported weekly by and aggregate number of ILI(t) = a1*ILI(t-2) + not a strong predictor but the
CDC from Oct 2009 to Oct 2010 tweets mentioning flu-like b*Tweets(t) + e aggregate number of tweets with
and tweets mentioning "flu- symptoms flu-like systems is very strong with
like" symptoms during same an r=.9846.
time period
Chan et al. Using Web Search May-11 Search Google Queries in select Official weekly Computed daily counts of O=b0+b1Q+e where O is Very strong correlation (~.9) except
Query Data to Monitor countries from 2003-10 dengue case count queries referencing official weekly count of cases for Singapore (.82).
Dengue Epidemics: A including Bolivia, Brazil, India, from Ministry of Dengue fever for separate in specified countries and S is
New Model for Indonesia and Singapore Heath or WHO countries dengue related search query
Neglected Tropical in those countries
Disease Survelliance
Signorini et al. The Use of Twitter to May-11 Tweets Several panels of tweets and % ILI weekly Fraction of influenza Support Vector Regression Strong SVR fit for both national and
Track Levels of Disease ILI reports. Prediction panel reports from CDC related tweets for total US utilizing SVM feature sets regional figures.
Activity and Public includes weekly ILI %s and US for specified and CDC health regions. (collection of terms occuring
Concern in the US tweets from Oct 2009 thru May period. Reports Utilizes complex Support 10X per week) to estimate
during the Influenza A 2010. include Vector Machine using term- weekly ILI.
H1N1 Pandemic nationwide and 10 frequency statistics.
regional reports.
Paul You are what you tweet: Jul-11 Tweets Covers a number of ailments % ILI weekly Utilizes an Ailment Topic Correlation between the Two types of ATAM/ATAM+ models
Analyzing Twitter for including influenza. Based on 2 reports from CDC Aspect Model (ATAM) to probability the "flu ailment" were compared. The correlations
Public Health billion tweets from Many 2009 for specified determine the type of designation for each week with ILI were .934 and .958.
to October 2010. period ailment associated with a with the ILI rate in the US.
tweet (in this case Flu)
Lampos et al. Flu detector - Tracking Sep-12 Tweets Twitter: 40 weeks of Twitter Weekly ILI UK Sum across all tweets of linear least squares Correlations around .90 for 3 UK
epidemics on Twitter corpus in UK from June '09 to Health Protection daily "flu-score" for each regression time series of HPA health care regions.
Mar '10 Agency smoothed tweet based on number of flu rates on aggregated tweet
for daily values ILI "stem" markers that flu-scores
appear in the tweet.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 56
57. Infodemiology
Example Prediction Analysis
Nowcasting Events from the Social Web with Statistical
Learning,” Lampos and Cristianini, ACM IS&T, 9/11
N-Gram Stems
ILI Markers Avg. Weekly
Twitter 50M Tweets Wght. Flu-
API 5 UK Regions Score by
6’09-12’09 Region
(time t)
Weekly
ILI Reports by r ~.89-.96
health region
(time t)
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 57
58. Infodemiology
Example Prediction Analysis
50M Tweets
Corpus
5 Region UK, 6/09-4/10
Corpus Lower Stop
Tokens Stems
Refinement Case Words
Feature 1- 2- Hybrid
Selection Grams Grams Grams
Tweet- For each tweet, number of times each hybrid
Hybrid Matrix occurs times hybrid weights. Then sum of
weighted values equal flu-score for tweet.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 58
59. Infodemiology
Example Prediction Analysis
flu :(
RT @OMGFacts: The US death toll from the 1918 flu epidemic was so high that it created a coffin
shortage
News_SwineFlu: #swineflu Delayed treatment for swine flu can lead to... http://t.co/ZSOjIvAW'
The last thing I want right now is the flu
Much better with that lovely message from u,thank u! (heavy flu since 4 days ;
Supposedly only about 1% of people in the world present flu-like symptoms after flu shot.
Unfortunately 66% of the people in my house do'
Back on track...Fuck you flu!
Sounds like either the flu or Killian walked into the room.
I'll try haha, I got the flu!
It costs an average company $135 a day for every employee sick at home with the flu. Encourage
your employees to get vaccinated....
i wouldn't recommend it, but a couple of days with violent stomach flu does make you appreciate
the small things, nature's reset
Is it possible to get the flu from the flu jab?
Sedatives + cold/flu + ana = can't get up without having to hold a wall for 2 minutes",
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 59
64. Programming Text Mining for
Prediction with Python
64
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
65. Text Mining for Prediction:
Programming with Python & NLTK
# Utilizes “nltk”, a Python “natural language toolkit”
# Step 1: Initialize modules, stopwords and stemmer
import simplejson
import urllib
import re
import nltk
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
porter = nltk.PorterStemmer()
def remove_stopwords(text):
stopwords = nltk.corpus.stopwords.words('english')
content = [w for w in text if w.lower() not in stopwords]
return content
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 65
66. Text Mining for Prediction:
Programming with Python & NLTK
# Step 2: Utilize Twitter Search API to collect “flu” tweets in JSON format.
# Then extract the “text” fields associated with each tweet.
itemsFlu = []
for i in range(0,14):
urlFlu = 'http://search.twitter.com/search.json?q=flu&rpp=100&page=1'
resultFlu = simplejson.load(urllib.urlopen(urlFlu))
itemsFlu = itemsFlu + resultFlu['results']
tweetsFlu = [ item['text'] for item in itemsFlu]
# Step 3. Eliminate Retweets
FluTweetsNoRTs= []
for tweetText in tweetsFlu:
if not re.search('RT ',tweetText):
tweetTextList = [tweetText]
FluTweetsNoRTs = FluTweetsNoRTs + tweetTextList
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 66
67. Text Mining for Prediction:
Programming with Python & NLTK
# Step 3: Preparing to do Text analysis
noTweets = 0; Flu_tmp_word = []; Flu_tot_word = []
Flu_tmp_low = []; Flu_tot_low = []; Flu_tmp_alpha = []; Flu_tot_alpha = []
Flu_tmp_stop = []; Flu_tot_stop = []; Flu_tmp_stem = []; Flu_tot_stem = []
stems_dict = {} # dictionary holding "text" broken into words for 1..N tweets
# Step 4: Text processing. Produces lowercase, alpha stems devoid of stopwords
for tweet in FluTw eetsNoRTs :
noTweets = noTweets + 1
Flu_tmp_word = [ w for w in tweet.split()]; Flu_tot_word = Flu_tot_word + Flu_tmp_word
Flu_tmp_low = [w.lower() for w in Flu_tmp_word] ; Flu_tot_low = Flu_tot_low + Flu_tmp_low
Flu_tmp_alpha = [cv for w in Flu_tmp_low for cv in re.findall(r'^[a-z]+[a-z]+$', w)]
Flu_tot_alpha = Flu_tot_alpha + Flu_tmp_alpha
Flu_tmp_stop = remove_stopwords(Flu_tmp_alpha); Flu_tot_stop = Flu_tot_stop + Flu_tmp_stop
Flu_tmp_stem = [porter.stem(t) for t in Flu_tmp_stop]; Flu_tot_stem = Flu_tot_stem + Flu_tmp_stem
Flu_stems_dict[noTweets] = Flu_tmp_stem
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 67
68. Text Mining for Prediction:
Programming with Python & NLTK
# Step 5: Analyzing frequency distributions
fdist_Flu_stems = nltk.FreqDist(Flu_tot_stem)
fdist_Flu_stems.plot(25)
fdist_Flu_stems.items()[0:25]
# Step 6. Produce doc-matrix “wc” (a dictionary hold the counts associated
# with each word).
for dCnt in range(1,len(Flu_stems_dict)):
wlist = Flu_stems_dict[dCnt]
for word in wlist:
wc.setdefault(word,0)
wc[word] += 1
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 68
69. Text Mining for Prediction:
Programming with Python & NLTK
# Step 7. Eliminate all words that occur less than 3 times
apcount = {}
for word,wcnt in wc.items():
apcount.setdefault(word,0)
if wcnt > 3: apcount[word] = wcnt
# Step 8. Write out doc-term matrix to a file
out = file('flu-doc-matrix.txt','w'); out.write('Tweets')
for word,count in apcount.items():
out.write(","); out.write(word) ; out.write('n')
for tweetno, twlist in Flu_stems_dict.items():
tweetname = "Tweet" + str(tweetno); out.write(tweetname)
for word, count in apcount.items():
if word in twlist: out.write(","); out.write("1")
else: out.write(","); out.write("0"); out.write("n")
out.close()
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 69
70. Text Mining for Prediction:
Programming with R, tm and RTextTools
70
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
71. Text Mining for Prediction:
Programming with R, tm and RTextTools
# initialize libraries # text preprocessing function
library(twitteR) toLowerCaseAlpha <- function(x){
library(tm) removeHTTP <- gsub("http:[/a-zA-Z0-9._]+","",x)
library(RTextTools) removeNonAlpha <- gsub("[^a-zA-Z]"," ",removeHTTP)
library(plyr) removeMultipleSpaces <- gsub(" +"," ",removeNonAlpha)
library(stringr) changeToLowerCase <- tolower(removeMultipleSpaces)
return(changeToLowerCase) }
# utilize Twitter Search API to collect ―flu‖ Tweets
# in JSON format # clean tweet text, convert lower case, remove stop words
fluTweets = searchTwitter('flu', n=1500, lang='en') # create corpus for analysis
fluTweetTextCleaned <- toLowerCaseAlpha(fluTweetTextNonRTs)
# extract text from JSON objects corpusFluTweets <- Corpus(VectorSource(fluTweetTextCleaned))
fluTweetText = laply(fluTweets, function(t) t$getText()) corpusFluText <- tm_map(corpusFluTweets, removeWords,
corpusStopwords)
# remove retweets
indxFluRetweets <- grep('RT',fluTweetText)
indxFluTweetText <- c(1:length(fluTweetText))
indxFluTweetNonRTs <- !(indxFluTweetText %in%
indxFluRetweets)
fluTweetTextNonRTs <-
fluTweetText[indxFluTweetNonRTs]
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 71
72. Text Mining for Prediction:
Programming with R, tm and RTextTools
# function to convert tokens/words to stems # create document –term matrix
convertToStems <- function(x){ corpusFluStemsDTM <-
tokens <- strsplit(x,' +') DocumentTermMatrix(corpusFluStems)
listToVect <- unlist(tokens)
stems <- wordStem(listToVect) # create vector of individual stems and the associated
return(stems)} # frequencies with which they occur across all tweets
fluStemsDTMMat <- as.matrix(corpusFluStemsDTM)
# convert to stems and create new corpus fluStemsDTMMatSorted <-
FluStems <- sapply(corpusFluText, convertToStems) sort(colSums(fluStemsDTMMat))
corpusFluStems <- Corpus(VectorSource(FluStems))
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 72
73. Text Mining for Prediction:
Programming with R, tm and RTextTools
# plot frequency disttribution of stems
noOfPlotPnts = 50
numOfStems =
length(fluStemsDTMMatSorted)
plotStart = numOfStems - noOfPlotPnts +1
plotEnd = numOfStems
top50FluStems =
fluStemsDTMMatSorted[plotStart:plotStart]
barplot(top50,horiz=TRUE,cex.names=0.5,
space = .5,las=1, xlim=c(0,100),
main="Freq of Terms")
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 73
74. Text Mining for Prediction:
Programming with R, tm and RTextTools
# create wordcloud of stems based on frequency
library(wordcloud)
library(RColorBrewer)
fluStemNames <- names(fluStemsDTMMatSorted)
dfFluStemDTMMat <- data.frame(word=fluStemNames,
freq=fluStemsDTMMatSorted)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(dfFluStemDTMMat$word,
dfFluStemDTMMat$freq, min.freq=10, colors=pal2)
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 74
75. Infodemiology is a form of
Weather in the next 6 hours
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 75
76. Now+Forecasting
Predicting the present by
analyzing large volumes of data
that can be used to "forecast"
current events for which official
analysis has not been released
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 76
77. Basic Regression Analysis
Dependent Dependent Traditional, Aggregate
Variable at Variable at Publicly Search
Time t Time t - n Available at Index or
(Standard = b0 + b1 (Standard + b2 Time t - n + b3 Social +e
Publicly Publicly Explanatory Media
Available Available Variable Freq.
Measure) Measure) Count
at Time t
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 77
78. Authors
Examples
Title Date (M- Type Data Source Dependent Variables Explanatory Variables Model Results
Y)
Kholodilin et al. Do Google Searches Help Apr-10 Search Fed reserve data on US private Year-on-Year Growth Rate 220 Google Trend/Insights Search Y-o-Y monthly URPC growth rates Query term principal components
in Nowcasting Private consumption and related Google search of Monthly US Real Private terms related to Priv for 3 sets of regressors -- Sentiment outperform standard Sentiment and
Consumption? A Real-Time terms from Jan '05-Dec '09. Consumption, ALFRED db Consumption reduced to 10 (consumer sentiment and Financial Indicators. A combination of two
Evidence for the US of Fed Rsrv of St. Louis principal components for montly confidence); Financial (short term of the factors work best -- those related to
periods from Jan 2005 to Dec and long term interest rates and S&P mobility and health care consumption.
2009 500); Query (combinations of
principal components of query
terms)
Sakaki et al. Earthquake Shakes Twitter Apr-10 Tweets Earthquake occurrences/intensity and Occurrence, intensity and Tweets that contain the query Utilized a support vector machine Of those earthquakes occurring in the Aug-
Users: Real-time Event tweets containing the words location of an earthquate words "earthquake" and/or (SVM) to determine whether tweet Sept time frame in Japan, 96% of 24 quakes
Detection by Social Sensors "earthquake" or "shaking" in Japan in Japan "shaking"by location reports refers to an earthquake above an intensity of 3 were reported in a
from Aug '09-Sep '09 occurrence. The reports are then tweet. Of the 24 quakes, 80% were
matched against actual occurrence reported with a minute of occurrence. This
at a particular location to see if it is is much faster the reports issued by the
detected within 1 min of occurrence. Japan Meteorological Agency.
Carrière-Swallow & Nowcasting With Google Jul-10 Search Auto sales and Google search indices % year-on-year change in Google Search index of interest in % change in y-on-y autosales in Despite relatively low rates of Internet
Labbé Trends in an Emerging Market for specific automobiles in Chile from auto sales in Chile automobile purchases in Chile Chile regressed against a usage in Chile, models incorporating
2005 thru 2010. compositive auto Google Search Google Trends Automotive Index
index based on queries about 9 outperform benchmark specifications in
leading automobile manufactures in both in-sample and out-of- sample
Chile. nowcasts of y-on-y % changes in autosales
while providing substantial gains in
information delivery times
Ciulla Beating the news using social May-12 Tweets Tweets from US users that related to Contestants who were Number of hashtags with Tweets Contestants with the fewest In general the simple tweet frequencies are
media: the case study of American Idol contestants and tweeted eliminated or won final with hastags signifying mentions predicted to be the strong predictor of the contestant who will
American Idol during the voting time window for each episode contestants candidate eliminated. be eliminated.
episode during 11th season from Jan
'12 - May '12
Chadwick & Nowcasting Unemployment Jun-12 Search Monthly Turkey non-agricultural Monthly Turkey non- Google Search Index in Turkey for Linear auto regression models andModels with Google Search Indicators
SengulCiu Rate in Turkey: Let's Ask unemployment rate from Jan '05-Dec agricultural unemployment terms directly ("looking for job") Bayesian Model Averaging perform better in nowcasting the 1 period,
Google '11 rate or indirectly (job procedure to investigate whether 2 periods and 3 periods ahead
announcements) related to Google search query data can unemployment rate than the benchmark
unemployment. improve where we use only the lag values of the
unemployment rate.
Song, Pan, Ng Forecasting hotel room Sep-12 Search Weekly data on hotel bookings in Weekly Hotel Bookings in Indexed Search Volumes from Log of Room Nights for Log of Test various statistical models; all gave
demand using search Charleston, SC and Google trend data Charleston, SC Google Trends/Insights Jan 2008- Search Volumes - Charleston, Travel reasonable forecasts. Best fit model was
engine data for specific travel tourism search terms Aug 2009 Charleston, Charleston Hotels, Autoregressive Distributed Lag (ADLM)
from Jan '08-Aug '09 Charleston Restaurants, Charleston with a lag period of 6 weeks.
Tourism
McLaren, Using internet search data as Q2-11 Search UK monthly unemployment data and Official monthly Google Trend/Insight query For unemployment, linear AR model For unemployment forecasts, claimant
Shanbhogue economic indicators housing price growth from June '04-Jan unemployment data and indexes for the term "Job Seekers with query term, claimant count, count strongest followed by query term. For
'11 associated with Google Trend query housing price growth in the Allowance (JSA)" for and GfK consumer confid. as exp housing prices, the query term was much
indices for job seeking and UK from June 2004-Jan unemployment and "Estate vars; for housing price growth with stronger than HBF and RICS data.
unemployment 2011 Agents" for housing query term, Home Builders and
Royal Instit. of Chartered Surveyors
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL price growth balances as exp vars. 78
79. Authors
Examples
Title Date (M- Type Data Source Dependent Variables Explanatory Variables Model Results
Y)
Gruhl et al. The Predictive Power of Online Aug-05 Blogs Amazon Sales Rank for for best selling Amazon Sales Rank for Number of mentions of the Cross correlation of time series for While sales rank is a poor predictor of the
Chatter. books from Jul '04-Aug '04 and number 2340 bestselling books in 4book/author in over 300K blogs sales rank and mentions. change in sales rankings, a prior spike in
of mentions in blogs for same time month period (Jul 2004- whose postings that were mentions predicts quite well a future spike
period Aug 2004) and spikes in maintained by IBM's in sales rank.
these sales ranks WebFountain project (over 200K
postings/day)
Choi, Varian Predicting the Present with Apr-09 Search Monthly data for search from Google US Census Bureau Advance Google Trend/Insight query Google Trend indices for query Simple seasonal AR models and fixed-
Google Trends Trends associated with various retail Monthl Retail Sales indices for categories and subcategories related to (log values) effects models that includes relevant
sales and travel data from Jan '04-Aug (general and specific) and subcategories related to retail of overall monthly retail trade Google Trend variables tend to outperform
'08 Travel (Visitor arrival in sales (general and specifix) and (NAICS categories), automotive models that exclude these variables. In
Hong Kong) related to Travel sales, home sales and travel. some cases small gains, in other
substantial.
Suhoy Query Indices and a 2008 Jul-09 Search Monthly official economic growth data Monthly percent changes 30 Google Search Index Bayesian probabilities of downturn Six leading query categories including HR
Downturn: Israeli Data for the months and quarters from 2nd of various real values of categories related to consumption calculated by Hamilton's two-sate (recruiting and staffing), home appliances,
quarter 2004 to 2nd quarter 2009 along industrial production, retail and employment Markov switching AR(0) model. travel, real estate, food and drink and
with various Google search indices trade, revenue of trade Used to determine changes in query beauty and contain cyclical components
during the same period and services, consumer indices can predict changes in which conform with cycles of economic
imports, exports of official economic variables. growth. The strongest relationship was
services and the between HR and unemployment.
employment rate
Sadikov et al. Blogs as Predictors of Movie Aug-09 Blogs Weekly box movie sales, gross sales, Movie critic ranking, user Analysis of spinn3r.com blog data Linear regression for weekly Minimal correlation between rankings and
Success and critic and user rankings from Nov ranking, 2008 gross sales, set 11/07-11/08, counting movie rankings and sales data by blog references and sentiment. Strong
'07-Nov '08 and counts for moving weekly box office sales references and sentiment within references and sentiment. correlation between references and gross
references and sentiment measures for (weeks 1-5) specified time window before and sales but week with sentiment. Strongest
same time period. after movie release date. relationships with timing of references in
weeks after release.
Zhang & Skiena Improving Movie Gross Sep-09 Blogs & Online News stores & blogs 1960-2008 Gross receipts from Variety of IMDB variables (e.g. Linear regression of receipts for Number of news articles mentioning
Prediction Through News News along with movie receipt and IMDB movies movie genre), movie budget, various combinations of explanatory moving 1 week prior have highest
Analysis data. News stores analyzed for various number of first week theaters, variables. Also, K-NN nearest correlation (~.7)and predictive ability
pre-release time periods and number of stores and neighbor analysis determining
mentions of movie titles, director, factors associated with gross
top 3 & top 15 actors receipts
Wu & Brynjolfsson The Future of Prediction: How Dec-09 Search Quarterly Housing Sales and Housing Housing Sales and Housing Google Search Index for Real Linear autoregression between Strong predictive relationships between
Google Searches Foreshadow Price Index for 50 US states along with Price Index (HPI) Estate, Real Estate Agencies, and Housing Sales and prior sales, the Housing Sales and searches for Real Estate
Housing Prices and Sales the Google Search Index for Real Home Appliances HPI, and Search Indices for Real Agencies. Similarly relationships for HPI.
Estate, Real Estate Agencies, and Estate and Real Estate Agencies as
Home Appliances from 4th Quarter well as the same regression for the
2007 to 2nd Quarter 2007 HPI
Asur, Huberman Predicting the Future with Mar-10 Tweets 3 million tweets mentioning 24 movies Box-office revenues for Promotion tweets-retweets for a Regression of 1st weekend box Promotional tweets are weakly correlated
Social Media from Nov '09-Feb '10 along with (24) movies particular movie, tweet rates for office revenues by promotional 1st weekend revs. Tweet rates are very
associated box-office revenues particular movie per hour, ratio tweets-retweets, by tweet rates vs. strongly correlated (min .9) and a stronger
of positive to negative Hollywood Stock Exchange prices, predictor than HSX. Finally, tweet rates are
sentiments for the movie and 2nd weekend revenues by tweet strongly correlated with 2nd weekend
rates and the sentiment ratio. revenues and sentiments improve the
forecasts slightly.
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 79
Notas del editor
Requires structured data (numbers and categories well-defined)Transformed by data preparation or collected with an a prior design in mindTypically housed and organized in a relational database, data mart or data warehouse
First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studiedRange of possibilities: word documents, PDFs, emails, IM chat, Web pages, RSS Feeds, Blogs, Tweets, Open ended surveys, Transcripts of Helpline calls …Convert into organized set of texts – called a corpus – standardized and prepared for the purpose of knowledge discovery
“Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough.”