Mining and Analyzing Social Media: Part 1
                                   Dave King
                              January 7, 2013
Abstract

Overview of the data mining and analysis of social media, exploring the
application of various data mining, textual mining and analytical
techniques to social media data sources. The focus will be on the
practical application of these techniques for the purposes of:

   •   Monitoring of social media sources
   •   Analyzing content to identify leading issues and sentiment
   •   Analyzing and forecasting trends
   •   Identifying and profiling influential participants, subgroups and
       communities


                        Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
Agenda: Part 1

•   My Biography
•   Resources
•   Social Media Defined
•   Data Mining Example
•   Text Mining Processes
•   Using Text Mining for Prediction
•   Brief Look at Programming for Prediction

Agenda: Part 2

• Sentiment Analysis & Opinion Mining Defined
  – Business Interest & Software Packages
  – Levels of Analysis
  – Automated Classification
• Social Network Analysis
  –   Defined
  –   History
  –   Basic techniques and measures
  –   Ego and Social-Centric Analysis
Biography: Dave King

                      • EVP of Product Development and
                        Management at JDA Software
                      • 30 years in enterprise package
                        software business
                      • 15 years as university professor
                      • 15 years as Co-Chair of the Internet &
                        Digital Economy Track (HICSS)
                      • Long time interest in various aspects of
                        E-Commerce, Business Intelligence,
                        Analytics (including Text Analytics)



Personal Experiences with
Analytics
• Taught applied statistics and math modeling
• In software R&D
   – Optimization in the 80s
   – Natural Language Frontends
        • NLI Query & CMU Robotics Lab
   – EIS Competitive Analysis
        • Dow Jones and Reuters
        • Verity Topics
        • NewsAlert
   – InXight’s Hyperbolic Tree
   – Supply Chain Analytics
   – Sentiment Analysis for Retailers
• For many of these advanced techniques, the audiences have often been
  small, sometimes bewildered, and frequently fleeting.
Text Mining Resources




Social Networking Analysis Resources




What is Social Media?




Defined




Social media are online technologies and practices for social
interaction, enabling the sharing of opinions, insights,
experiences, perspectives and media itself.


Defined




          Social media is the media we use to
          be social. That’s it.
Social Media Types: Take Your Pick




Social Media is Still Huge!
Alexa Traffic Oct 6, 2012
 Rank   Website        Type
  1     Facebook       Social
  2     Google         Search
  3     YouTube        Social
  4     Yahoo!         Search
  5     Baidu.com      Search
  6     Wikipedia      Social
  7     Windows Live   Search
  8     Twitter        Social
  9     QQ.COM         Portal
  10    Amazon.com     E-Commerce
  11    Blogspot.com   Social
  12    LinkedIn       Social
  13    Taobao.com     E-Commerce
  14    Google India   Search
  15    Yahoo! Japan   Search


Social Media is Still Huge!
 Growth in Registered Users 2011 to 2012


Facebook: 750M – 1B

Twitter: 200M – 500M

LinkedIn: 100M – 175M

Social Media is Still Huge!
If Social Media sites were countries…

China: 1.4B
India: 1.2B
Facebook: 1.0B
Twitter: 500M
US: 310M

Social Media is Still Huge!
 Usage Per Day

Facebook: 3.2B Likes & Comments

Twitter: 340M Tweets

LinkedIn: 14M Searches

Analyzing Social Media:
Two Paths

         Media - Content




        Social - Network
Analyzing Social Media:
Two Paths
        An Example: Which Blogs are Similar?
        Term1   Term2   Term3    …    TermM
Blog1     1       0       0      …      1
Blog2     0       0       1      …      0
Blog3     0       1       0      …      1
…         …       …       …      …      …
BlogN     0       0       0      …      1

          Cluster Analysis (e.g. K-Means)

        Blog1   Blog2   Blog3    …    BlogN
Blog1     -       1       0      …      1
Blog2     0       -       1      …      0
Blog3     1       1       -      …      0
…         …       …       …      -      …
BlogN     1       0       1      …      -

          Social Network (Graph) Analysis

Social Media Formats

 • Articles                                               • Pictures
 • Comments                                               • Videos
 • Messages                                               • Music
 • Reviews                                                • Locations
 • Ratings                                                • Tags
 • Rankings                                               •…
Social Media Data: One Commonality




Data Mining: Defined


Discovering meaningful
patterns from large data
sets using pattern
recognition technologies.



Data Mining: CRISP-DM
Cross-Industry Standard Process for Data Mining

[Figure: CRISP-DM cycle]
Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment

[Figure: data preparation steps]
Real-World Data -> Data Consolidation -> Data Cleaning -> Data Transformation -> Data Reduction -> Well-Formed Data
Data Mining:
General Data Assumptions


                                                         Structured
                                                       Transformed
                                                       Well-Formed

Data Mining: Simple Example




       Market Basket
         Analysis


Market Basket Analysis:
Applications


•   Cross Selling
•   Product Placement
•   Affinity Promotion
•   Customer Segmentation Analysis

Market Basket Analysis:
 The Ole Beer and Diaper Legend…
Transaction list:

Transaction   Item
     1        Bread
     1        Milk
     2        Beer
     2        Bread
     2        Diapers
     2        Eggs
     3        Beer
     3        Cola
     3        Diapers
     3        Milk
     4        Beer
     4        Bread
     4        Diapers
     4        Milk
     5        Bread
     5        Cola
     5        Diapers
     5        Milk

Binary Representation:

Transaction   Beer   Bread   Cola   Diapers   Eggs   Milk
     1         0       1      0        0       0      1
     2         1       1      0        1       1      0
     3         1       0      1        1       0      1
     4         1       1      0        1       0      1
     5         0       1      1        1       0      1

Goal: Empirically determine those itemsets that occur frequently together in a
set of transactions, producing a set of Association Rules of the form LHS -> RHS
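
The step from the transaction list to the binary representation can be sketched in a few lines of Python. This is only an illustration using the standard library; the names `transactions`, `baskets`, `items` and `matrix` are assumptions made for the example and do not come from the slides.

```python
from collections import defaultdict

# (transaction id, item) pairs, copied from the transaction list above
transactions = [
    (1, "Bread"), (1, "Milk"),
    (2, "Beer"), (2, "Bread"), (2, "Diapers"), (2, "Eggs"),
    (3, "Beer"), (3, "Cola"), (3, "Diapers"), (3, "Milk"),
    (4, "Beer"), (4, "Bread"), (4, "Diapers"), (4, "Milk"),
    (5, "Bread"), (5, "Cola"), (5, "Diapers"), (5, "Milk"),
]

# Group the items of each transaction into a set (a "basket")
baskets = defaultdict(set)
for tid, item in transactions:
    baskets[tid].add(item)

# One column per distinct item, in alphabetical order
items = sorted({item for _, item in transactions})

# One row per transaction: 1 if the item is in the basket, else 0
matrix = {tid: [int(item in basket) for item in items]
          for tid, basket in sorted(baskets.items())}

print("Transaction", *items)
for tid, row in matrix.items():
    print(tid, *row)
```

Sets are used for the baskets because market basket analysis only asks whether an item is present, not how many units were bought.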



Market Basket Analysis:
The Ole Beer and Diaper Legend…
                  Transaction   Beer        Bread        Cola       Diapers       Eggs        Milk
                       1         0            1           0            0           0           1
                       2         1            1           0            1           1           0
                       3         1            0           1            1           0           1
                       4         1            1           0            1           0           1
                       5         0            1           1            1           0           1




    Concept            Definition                                                  Example
    Itemset            A specific collection of items in a transaction             {Diapers, Beer}
    Support Count      Number of transactions containing the itemset               Support Count {Diapers, Beer} = 3
    Transactions       Number of transactions = N                                  N = 5
    Association Rule   Implication rule of the form LHS -> RHS, where LHS and      {Diapers} -> {Beer}
                       RHS are itemsets
    Rule Support       Fraction of transactions containing both LHS and RHS        3/5 = .6
                       #tuples(LHS, RHS)/N
    Rule Confidence    Fraction of transactions with LHS that also contain RHS     3/4 = .75
                       #tuples(LHS, RHS)/#tuples(LHS)
    Rule Lift          Strength of association over random co-occurrence of        (3/5)/((4/5) x (3/5)) = 1.25
                       LHS and RHS
                       (#tuples(LHS, RHS)/N)/((#tuples(LHS)/N) x (#tuples(RHS)/N))
                       = Confidence(LHS -> RHS)/Support(RHS)
                       = Support(LHS, RHS)/(Support(LHS) x Support(RHS))
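
The three rule measures can be checked against the five example transactions with a short Python sketch. The function names `support`, `confidence` and `lift` are chosen for this illustration only; a real project would more likely use an association-rule library than hand-rolled functions.

```python
# Baskets from the binary table above (one set of items per transaction)
baskets = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diapers", "Eggs"},
    {"Beer", "Cola", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Bread", "Cola", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Support of LHS and RHS together, relative to the support of LHS."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Observed co-occurrence relative to what independence would predict."""
    return support(lhs | rhs) / (support(lhs) * support(rhs))

lhs, rhs = {"Diapers"}, {"Beer"}
print(round(support(lhs | rhs), 2))    # 0.6
print(round(confidence(lhs, rhs), 2))  # 0.75
print(round(lift(lhs, rhs), 2))        # 1.25
```

A lift above 1, as here, says that Diapers and Beer turn up together more often than they would if the two items were bought independently.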



Market Basket Analysis:
What if it were a text analysis problem?

 Joe went to the 7-11 to pick up some cigarettes.
 While he was there he also bought some dipers
 and beer.


 Sally was on her weekly shopping run at Wal-Mart.
 She had picked up some diapers and formula for
 her infant. She also thought about buying beer for
 her husband, but they were out of the brand he
 liked.



Market Basket Analysis:
What if it were a text analysis problem?
•   No specified format
•   Variable length
•   Variable spelling
•   Punctuation and non-alphanumeric characters
•   Contents are not predefined and no predefined set
    of values
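
A small sketch makes these problems concrete. It runs a naive keyword lookup over the two example passages; the `CATALOG` set and the `naive_basket` function are hypothetical helpers invented for this illustration.

```python
import re

# Hypothetical item catalog for this illustration
CATALOG = {"beer", "diapers", "cigarettes", "formula"}

joe = ("Joe went to the 7-11 to pick up some cigarettes. "
       "While he was there he also bought some dipers and beer.")
sally = ("Sally was on her weekly shopping run at Wal-Mart. "
         "She had picked up some diapers and formula for her infant. "
         "She also thought about buying beer for her husband, "
         "but they were out of the brand he liked.")

def naive_basket(text):
    """Lowercase the text, split it into alphabetic words, keep catalog hits."""
    words = re.findall(r"[a-z]+", text.lower())
    return CATALOG & set(words)

print(sorted(naive_basket(joe)))    # ['beer', 'cigarettes']
print(sorted(naive_basket(sally)))  # ['beer', 'diapers', 'formula']
```

Both results are wrong as baskets. Joe's diapers are lost to the spelling "dipers", and Sally is credited with beer she only thought about buying. Unlike the binary transaction table, free text offers neither a fixed vocabulary nor a guarantee that a mentioned item was purchased.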



Text Mining (aka Text Analytics):
Defined




  Using natural language processing
  & data mining to discover patterns
    in a collection of “documents”
Text Mining:
Document Collections
•   Word Documents
•   PDFs
•   Emails
•   IM Chat
•   Web Pages
•   Blogs
•   Tweets
•   Open-ended surveys
•   Transcripts of helpline calls

Text Mining:
CRISP-Like Processes
[Figure: CRISP-like cycle centered on Documents]
Business Understanding -> Document Understanding -> Document Preparation -> Modeling -> Evaluation -> Deployment

[Figure: document preparation steps]
Real-World Text Data -> Document Consolidation -> Establish the Corpus -> Corpus Refinement (Token, Stem, Stop…) -> Feature Selection & Weighting -> Doc-Term Matrix*
Text Mining:
Creating the corpus


A large and “structured”
or “organized” collection
of text

Text Mining Process:
Corpus Refinement
   Common representation of tokens within and between documents

    Tokenization -> Normalize -> Eliminate Stop Words -> Stemming

• Tokenization: Parse the text to generate terms. Sophisticated
  analyzers can also extract phrases from the text.
• Normalize: Convert the terms to lowercase.
• Eliminate stop words: Eliminate terms that appear very often
  (e.g. the, and, …).
• Stemming: Convert the terms into their stemmed form by removing
  plurals and different word forms (e.g. achieve, achieves, achieved
  -> achiev) [note: word about synonyms – WordNet Synset]
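
The four refinement steps can be chained as plain functions. The sketch below uses only the Python standard library. Its stop list is deliberately tiny, and `toy_stem` is a crude suffix stripper written for this example, not the Porter algorithm or any library stemmer.

```python
import re

# Tiny illustrative stop list; real stop lists are much longer
STOP_WORDS = {"the", "and", "a", "of", "to", "is", "what", "she", "he"}

# Suffixes tried in this order by the toy stemmer below
SUFFIXES = ("ing", "ed", "es", "s", "e")

def tokenize(text):
    """Split text into runs of letters (drops digits and punctuation)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def refine(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [toy_stem(t) for t in tokens]

print(refine("She achieves what he achieved: the goal is to achieve."))
# ['achiev', 'achiev', 'goal', 'achiev']
```

After refinement the three word forms collapse into the single term "achiev", which is what lets later counting steps treat them as one feature.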

Text Mining Processes:
Feature Extraction & Weighting

  Feature Extraction: “Bag of Words, Terms or Tokens”

  Vector Representation:
  Word, Term or Token/Doc Matrix

  “Bag of Words” (BOW) or Vector Space Model (VSM):
  Words or Tokens are attributes and documents are examples


Text Mining Processes:
Transforming Frequencies
• Binary Frequencies: 1 for tf > 0; otherwise 0
• Term Frequencies: tf(i,j), the count of term i in doc j, divided by
  the sum of the counts of all terms in doc j
• Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0
• Normalized Frequencies: Divide each frequency by
  SQRT of Sum of Squares of the frequencies within the
  vector (column)
• Term Frequency–Inverse Document Frequency
    – TF * IDF
    – Inverse Document Frequency: log(N/(1+D)) where N is total
      number of docs and D is number of docs containing the term
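
The transformations can be tried on a toy count matrix. The counts below are made up for this illustration, and the helper names (`binary`, `log_freq`, `relative`, `unit_length`, `idf`) are assumptions of the sketch rather than names from any library.

```python
import math

# Toy doc-term counts: rows are documents, columns are terms (assumed data)
terms = ["feel", "love", "sick"]
counts = [
    [2, 1, 0],   # doc 0
    [1, 0, 3],   # doc 1
    [1, 0, 0],   # doc 2
]
n_docs = len(counts)

def binary(tf):
    """1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0

def log_freq(tf):
    """1 + log(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def relative(row):
    """Each count divided by the total count of its document."""
    total = sum(row)
    return [tf / total for tf in row]

def unit_length(vec):
    """Divide by the square root of the sum of squares of the vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def idf(term_index):
    """log(N / (1 + D)), the smoothed form given above."""
    d = sum(1 for row in counts if row[term_index] > 0)
    return math.log(n_docs / (1 + d))

# TF * IDF for every cell of the matrix
tfidf = [[tf * idf(j) for j, tf in enumerate(row)] for row in counts]

for term, j in zip(terms, range(len(terms))):
    print(term, round(idf(j), 3))
```

With this smoothed form, a term that occurs in every document ("feel" here) gets a slightly negative IDF, so very common terms are pushed down rather than merely set to zero.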


Text Mining Processes:
Twitter Example – Problem Features
• Each tweet <= 140 characters (avg. 10-15 words/message)
• Heavy presence of non-alpha symbols, abbrevs, misspellings and slang
• Tweets often include retweets (original tweet repeated)

Text Mining Processes:
Twitter Example

One of the things I love and adore about Twitter …
is how its open API has lit a fierce fire of innovation
when it comes to analytics. Anyone and their
brother and ma-in-law can develop a tool, and they
have! Much to the benefit of the rest of us.




Text Mining Processes:
 Twitter Example – Twitter API
Get Search
• http://search.twitter.com/search.json?q=<query>
• search.twitter.com/search.json?q=Obama&rpp=100&page=5
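
Query strings like the ones above are easier to get right when a library builds them. The sketch below uses Python's standard `urllib.parse.urlencode`; the helper name `search_url` is an assumption of this example. This version of the Search API has since been retired, so the code only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.twitter.com/search.json"  # endpoint shown above

def search_url(query, rpp=100, page=1):
    """Build a search URL; `rpp` is results per page, `page` the page number."""
    return BASE_URL + "?" + urlencode({"q": query, "rpp": rpp, "page": page})

print(search_url("Obama", rpp=100, page=5))
# http://search.twitter.com/search.json?q=Obama&rpp=100&page=5
```

`urlencode` also percent-encodes characters that are not safe in a query string, which matters once the query contains emoticons or spaces.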




Text Mining Processes:
 Twitter Example – JSON
• JSON (JavaScript Object Notation)
• Lightweight, Text-Based, Data-Interchange Format
• Built on Two Structures:
  – A collection of name:value pairs. In various languages, this is
    realized as an object, record, struct, dictionary, hash table,
    keyed list, or associative array.
  – An ordered list of values. In most programming languages,
    this is realized as an array, vector, list, or sequence
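
In Python the two structures map onto `dict` and `list`. The payload in the following sketch is made up to resemble a search response; its field values are illustrative, not real API output.

```python
import json

# A made-up payload shaped like a search response (values are illustrative)
payload = '''
{
  "page": 5,
  "results": [
    {"iso_language_code": "en", "to_user_name": "Andrea", "text": "feeling good :)"},
    {"iso_language_code": "en", "to_user_name": null, "text": "feeling sick :("}
  ]
}
'''

data = json.loads(payload)   # name:value pairs become a dict
results = data["results"]    # an ordered list of values becomes a list

print(type(data).__name__, type(results).__name__)  # dict list
print([r["text"] for r in results])
```

JSON `null` arrives as Python `None`, so code that reads optional fields should be prepared for it.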


Text Mining Processes:
 Twitter Example – Establish Corpus
Query (via the API):
  search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=5
  search.twitter.com/search.json?q=%3A(+feel+feeling&rpp=100&page=5

Result:
  {…
  results:[
  {iso_language_code: en,
  to_user_name: Andrea,
  ...
  text: u"Love is everything in this world! Its a feeling like no other. I
  can't wait 2 feel that emotion again..but patience is key :-)"
  ...
  created_at: Thu, 18 Oct 2012 20:11:22 +0000,
  …}]
  ...}



Text Mining Processes:
Twitter Example – Simple Question

Are there any language differences between
“feeling” tweets containing :) and :( symbols?


Text Mining Processes:
Twitter Example – Establish Corpus

 Steps: take the text: value pairs in the JSON object, then remove RTs

               1.    Love is everything in this world! Its a feeling like no other. I
                     can't wait 2 feel that emotion again..but patience is key :-)
               2.    @mzxAmaZiiN Whats up ma. How ya feeling? Lemme make
                     that soul feel better. :)
               3.    "..I've got a good feeling about today :P Something makes
                     me feel i might make a sale or two :) *fingers crossed* #etsy
                     #shop #seller #cra...
               4.    @IWontForgetDemi Awww poor thing :( hate feeling sick!
                     Hope you feel better soon
               5.    @IzabelaLeafsfan no the worst feeling ever is when u feel
                     like total crap. cuz u think no one luvs u :(
               6.    …
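
Pulling the text values out of a parsed response and dropping retweets could look like the sketch below. The sample `results` list is invented, and treating any tweet that contains the standalone token "RT" as a retweet is an assumption of this example; the slides do not state the exact rule that was used.

```python
import re

# Stand-in for the "results" list of a parsed search response (assumed data)
results = [
    {"text": "How ya feeling? Lemme make that soul feel better. :)"},
    {"text": "RT @someone: hate feeling sick! Hope you feel better soon :("},
    {"text": "no the worst feeling ever is when u feel like total crap :("},
]

# Assumption: a retweet is any tweet containing the standalone token "RT"
RT_PATTERN = re.compile(r"\bRT\b")

def build_corpus(results):
    """Collect the text value of each result, skipping retweets."""
    texts = [r["text"] for r in results]
    return [t for t in texts if not RT_PATTERN.search(t)]

corpus = build_corpus(results)
print(len(corpus))  # 2
```

Removing retweets first keeps a single widely repeated message from dominating the word counts computed later.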


Text Mining Processes:
  Twitter Example – Doc Preparation
    Tokenization -> Normalize -> Eliminate Stop Words -> Stemming



Tweet:
Love is everything in this world! Its a feeling like no other. I can't wait 2 feel
that emotion again..but patience is key :-)

Words:
['Love', 'is', 'everything', 'in', 'this', 'world!', 'Its', 'a', 'feeling', 'like', 'no', 'other.',
 'I', "can't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':-)']
Tokens:
['Love', 'is', 'everything', 'in', 'this', 'world', '!', 'Its', 'a', 'feeling', 'like', 'no',
 'other.', 'I', 'ca', "n't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but',
 'patience', 'is', 'key', ':', '-', ')']


Text Mining Processes:
Twitter Example – Doc Preparation
    Tokenization -> Normalize -> Eliminate Stop Words -> Stemming


• Normalize
   – ['love', 'is', 'everything', 'in', 'this', 'world!', 'its', 'a', 'feeling', 'like', 'no', 'other.', 'i',
     "can't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':-)']
• Alpha
   – ['love', 'is', 'everything', 'in', 'this', 'its', 'feeling', 'like', 'no', 'wait', 'feel', 'that',
     'emotion', 'patience', 'is', 'key']
• Remove Stopwords
   – ['love', 'everything', 'feeling', 'like', 'wait', 'feel', 'emotion', 'patience', 'key']
• Stemming
   – ['love', 'everyth', 'feel', 'like', 'wait', 'feel', 'emot', 'patienc', 'key']
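
The first three lists can be reproduced for the example tweet with plain Python. Two details in the sketch are inferred rather than stated on the slides: one-letter tokens are dropped in the Alpha step (the Alpha list omits 'a' and 'i'), and the stop list contains just the words needed for this tweet. The stemming step is left out because the stems shown look like the output of a Porter-style stemmer, which a few lines of code cannot reproduce faithfully.

```python
tweet = ("Love is everything in this world! Its a feeling like no other. "
         "I can't wait 2 feel that emotion again..but patience is key :-)")

# Stop list limited to the words this example needs (an assumption;
# the slides do not say which stop list was used)
STOP_WORDS = {"is", "in", "this", "its", "no", "that"}

words = tweet.split()                       # whitespace split
normalized = [w.lower() for w in words]     # lowercase

# Keep purely alphabetic tokens; dropping one-letter tokens is inferred
# from the Alpha list above, which omits 'a' and 'i'
alpha = [w for w in normalized if w.isalpha() and len(w) > 1]

no_stops = [w for w in alpha if w not in STOP_WORDS]

print(no_stops)
# ['love', 'everything', 'feeling', 'like', 'wait', 'feel', 'emotion', 'patience', 'key']
```

The alphabetic filter throws away the emoticon ":-)" along with the punctuation, so a study that compares happy and sad tweets has to record the emoticon class before this step.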

Text Mining Processes:
Twitter Example – Analysis
  Item               Collection       List           Set        Lex Div Aver Len in Chars Aver No/Tweets(w/o)
  Tweets                 HF           498             -            -           108                  -
                         SF           499             -            -           100                  -
  Tweets w/o "RT"        HF           409             -            -           105                  -
                         SF           429             -            -            98                  -
  Words                  HF           8077          2346           3            4                  20
                         SF           8149          2073           4            4                  19
  Alpha (lower)          HF           5622          1197           5            4                  14
                         SF           5733          1041           6            4                  13
  Alpha w/o Stops        HF           3400          1092           3            5                   8
                         SF           3469          936            4            5                   8
  Stems                  HF           3400          978            3            4                   8
                         SF           3469          844            4            4                   8
  Stems w/o "feel"       HF           2619          977            3            4                   6
                         SF           2635          843            3            4                   6
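
The Lex Div column appears to be the List count (all tokens) divided by the Set count (distinct tokens), rounded to the nearest integer, and the last column appears to be the List count divided by the number of tweets without "RT". Both readings are inferred from the numbers in the table, not stated on the slide. A minimal Python sketch of the calculation, with `lexical_diversity` as an assumed helper name:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

print(lexical_diversity(["feel", "better", "feel", "good", "feel", "better"]))  # 2.0

# Re-deriving a few table entries from the List and Set counts
print(round(8077 / 2346))  # HF Words, Lex Div: 3
print(round(5733 / 1041))  # SF Alpha (lower), Lex Div: 6
print(round(8077 / 409))   # HF Words per tweet without "RT": 20
```

On this definition a higher value means more repetition, so a small corpus of short tweets scores low simply because few words have had the chance to repeat.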




Text Mining Processes:
Twitter Example – Analysis
        Corpus                  Avg Word Len  Avg Sent Len  Lexical Diversity
        HF Tweets                   4            20              3
        SF Tweets                   4            19              4
        austen-emma.txt             4            21             26
        austen-persuasion.txt       4            23             16
        austen-sense.txt            4            23             22
        bible-kjv.txt               4            33             79
        blake-poems.txt             4            18              5
        bryant-stories.txt          4            17             14
        burgess-busterbrown.txt     4            17             12
        carroll-alice.txt           4            16             12
        chesterton-ball.txt         4            17             11
        chesterton-brown.txt        4            19             11
        chesterton-thursday.txt     4            16             10
        edgeworth-parents.txt       4            17             24
        melville-moby_dick.txt      4            24             15
        milton-paradise.txt         4            52             10
        shakespeare-caesar.txt      4            11              8
        shakespeare-hamlet.txt      4            12              7
        shakespeare-macbeth.txt     4            12              6
        whitman-leaves.txt          4            35             12


Text Mining Processes:
Twitter Example – Analysis




Text Mining Processes:
Twitter Example – Doc-Term Matrix
3538    Total      Type   216 166 111 86         83   76 66      64 58 58          …    4 4     4     4      4      4       4     4      4       4
Total   Tweets     Face   like better make good hope know get   sick hate realli   …   ye rain quit chang stress happier longer cheer without everyth
  5     Tweet1      HF      0    1     0     0    1    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  0     Tweet2      HF      0    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  7     Tweet3      HF      0    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  3     Tweet4      HF      0    0     0     0    0    0   0      0   0     1      …    0 0     0     0      0      0       0     0      0       0
  2     Tweet5      HF      0    0     1     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  1     Tweet6      HF      0    1     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  5     Tweet7      HF      1    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  7     Tweet8      HF      0    0     0     0    1    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  5     Tweet9      HF      0    1     0     0    1    0   0      1   0     0      …    0 0     0     0      0      0       0     0      0       0
  2     Tweet10     HF      0    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
 …          …       …       …    …     …    …    …     …   …     … …        …      …   … … …          …     …       …      …      …      …       …
  5     Tweet829    SF      0    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  5     Tweet830    SF      1    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  3     Tweet831    SF      0    0     0     0    0    1   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  4     Tweet832    SF      0    0     1     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  3     Tweet833    SF      1    0     0     0    0    0   0      0   0     0      …    0 0     1     0      0      0       0     0      0       0
  3     Tweet834    SF      0    1     1     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  5     Tweet835    SF      1    0     0     0    0    1   0      0   1     0      …    0 0     0     0      0      0       0     0      0       0
  3     Tweet836    SF      0    0     0     0    0    0   0      1   1     0      …    0 0     0     0      0      0       0     0      0       0
  5     Tweet837    SF      0    1     0     0    0    0   1      0   0     0      …    0 0     0     0      0      0       0     0      0       0
  3     Tweet838    SF      1    0     0     0    0    0   0      0   0     0      …    0 0     0     0      0      0       0     0      0       0




Prediction
Information + Epidemiology =

                    Infodemiology
             Monitoring and analyzing
             queries from Internet search
             engines or people's status
             updates on microblogs for
             syndromic surveillance to
             predict disease outbreaks

Prediction
Syndromic Surveillance


Surveillance using health-related data
that precede diagnosis and signal a
sufficient probability of a case or an
outbreak to warrant further public
health response


Monitoring the Onset of the Flu Season
The Official Standard Way
   Sentinel                                                               ILI - influenza-like-illness
  Physician

                            Public Health
                             Authority


        ILI Report                                                            Public Reports



                                             Aggregated
    Costly                                      Data
1-2 Week Lag

Prediction
Infodemiology

What is the first thing some people do before
they see a doctor or take OTC medicines?

   search
   tweet



Prediction - What are the terms,
keywords, phrases…?

  What is the first thing people do before they
  see a doctor or take OTC medicines?

     search
     tweet

   •   Flu
   •   Flu symptoms
   •   H1N1
   •   Swine Flu
   •   Cold
   •   Fever
   •   Headache
   •   …

Prediction
               Infodemiology Studies
Each entry lists Authors, Title (Date M-Y, Type), then Data Source, Dependent Variables, Explanatory Variables, Model and Results.

Eysenbach, "Infodemiology: Tracking Flu-Related Searches on the Web for Syndromic Surveillance" (Nov-06, Search)
  Data Source:            Impressions and clicks from a Google Ad displaying the question: "Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough." Covers Oct. 2004-May 2005
  Dependent Variables:    Number of influenza lab tests, number of positive influenza lab tests and number of ILI reports from Sentinel GPs reported by PHA in Canada
  Explanatory Variables:  Number of clicks on Google Ad dealing with influenza
  Model:                  Linear correlation between clicks and 3 measures of publicly reported influenza
  Results:                Correlation of .81 with ILI data and .90 with lab tests and positive cases.

Hulth et al., "Web Queries as a Source for Syndromic Surveillance" (Feb-09, Search)
  Data Source:            Queries to Swedish health advice site (Varguiden) from June 2005 to June 2007
  Dependent Variables:    Weekly lab-diagnosed cases of influenza and % of ILI reports of influenza from GPs
  Explanatory Variables:  Weekly ratio of influenza-related Web queries to all queries at Varguiden site
  Model:                  Two linear regression models, one predicting lab-diagnosed cases and the other ILI reports, where the explanatory variable is based on a composite of the best predicting query terms
  Results:                Average R squared was .90 for the two years.

Ginsberg et al., "Detecting influenza epidemics using search engine query data" (Feb-09, Search)
  Data Source:            Google Search: Historical web logs of ILI-related Google searches 2003-2008
  Dependent Variables:    % US ILI visits in week reported to CDC
  Explanatory Variables:  ILI-related search queries
  Model:                  logit(P) = b0 + b1 * logit(Q) + e, where P is % visits and Q is normalized number of queries
  Results:                Mean correlation of .90 between P & Q for 9 CDC US healthcare regions.

Lampos & Cristianini, "Tracking the flu pandemic by monitoring the Social Web" (Jun-10, Tweets)
  Data Source:            Twitter: 24 weeks of Twitter corpus in UK from June '09 to Dec '09
  Dependent Variables:    Weekly ILI from UK Health Protection Agency, smoothed for daily values
  Explanatory Variables:  Average daily "flu-score" for all tweets. Flu-score for a single tweet is the proportion of all ILI "stem" markers (ngrams) that appear in the tweet.
  Model:                  Linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
  Results:                Average correlation for 5 UK health care regions was about .92.

Culotta, "Towards detecting influenza epidemics by analyzing Twitter messages" (Jul-10, Tweets)
  Data Source:            575K flu-related Twitter messages and % ILI reports from CDC for period Feb 2010 to Apr 2010.
  Dependent Variables:    % ILI weekly reports from CDC for specified period
  Explanatory Variables:  % of messages reporting an ILI or related symptom (based on detailed classification and statistical procedures to determine whether ILI or not)
  Model:                  Compares simple logit model with multiple regression model having different counts for separate Tweet keywords & phrases
  Results:                Aggregating keyword frequencies using separate keywords (multi-reg) works better than single aggregated (simple logit) predictor. Simple BOW classifier can be used to filter ILI messages. Achieved r=.78 for 5 weeks of validation data.
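The logit-on-logit model listed for Ginsberg et al. can be illustrated with a small sketch. This is not the authors' implementation: the weekly values below are made up and generated without noise from assumed coefficients (b0 = 1.0, b1 = 0.9), and the helper names `logit`, `inv_logit` and `fit_logit_line` are hypothetical. It only shows the mechanics of transforming both proportions to log-odds, fitting a straight line, and mapping a prediction back to a proportion.

```python
from math import exp, log

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: map log-odds back to a proportion."""
    return 1.0 / (1.0 + exp(-z))

def fit_logit_line(q, p):
    """Least-squares fit of logit(p) = b0 + b1*logit(q)."""
    x = [logit(v) for v in q]
    y = [logit(v) for v in p]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

# Made-up weekly values: q = ILI-related query fraction, p = ILI visit fraction.
q = [0.001, 0.002, 0.004, 0.008]
p = [inv_logit(1.0 + 0.9 * logit(v)) for v in q]  # noiseless by construction

b0, b1 = fit_logit_line(q, p)
predicted_p = inv_logit(b0 + b1 * logit(0.003))  # estimate for a new week
print(b0, b1, predicted_p)
```

Working in log-odds keeps predicted proportions between 0 and 1, which a straight line fitted to the raw percentages would not guarantee.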

Prediction
              Infodemiology Studies
Each entry lists Authors, Title (Date M-Y, Type), then Data Source, Dependent Variables, Explanatory Variables, Model and Results.

Acherekar, "Predicting Flu Trends using Twitter Data" (Apr-11, Tweets)
  Data Source:            ILI "influenza" visits to Sentinel physicians reported weekly by CDC from Oct 2009 to Oct 2010 and tweets mentioning "flu-like" symptoms during same time period
  Dependent Variables:    ILI influenza visits
  Explanatory Variables:  Previous weeks' ILI visits and aggregate number of tweets mentioning flu-like symptoms
  Model:                  Auto-regression model with: ILI(t) = a1*ILI(t-2) + b*Tweets(t) + e
  Results:                Analysis shows time-lagged ILI is not a strong predictor but the aggregate number of tweets with flu-like symptoms is very strong with an r=.9846.

Chan et al., "Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance" (May-11, Search)
  Data Source:            Google Queries in select countries from 2003-10 including Bolivia, Brazil, India, Indonesia and Singapore
  Dependent Variables:    Official weekly dengue case count from Ministry of Health or WHO
  Explanatory Variables:  Computed daily counts of queries referencing Dengue fever for separate countries
  Model:                  O = b0 + b1*Q + e, where O is official weekly count of cases in specified countries and Q is dengue-related search query volume in those countries
  Results:                Very strong correlation (~.9) except for Singapore (.82).

Signorini et al., "The Use of Twitter to Track Levels of Disease Activity and Public Concern in the US during the Influenza A H1N1 Pandemic" (May-11, Tweets)
  Data Source:            Several panels of tweets and ILI reports. Prediction panel includes weekly ILI %s and US tweets from Oct 2009 thru May 2010.
  Dependent Variables:    % ILI weekly reports from CDC for specified period. Reports include nationwide and 10 regional reports.
  Explanatory Variables:  Fraction of influenza-related tweets for total US and CDC health regions. Utilizes complex Support Vector Machine using term-frequency statistics.
  Model:                  Support Vector Regression utilizing SVM feature sets (collection of terms occurring 10X per week) to estimate weekly ILI.
  Results:                Strong SVR fit for both national and regional figures.

Paul, "You are what you tweet: Analyzing Twitter for Public Health" (Jul-11, Tweets)
  Data Source:            Covers a number of ailments including influenza. Based on 2 billion tweets from May 2009 to October 2010.
  Dependent Variables:    % ILI weekly reports from CDC for specified period
  Explanatory Variables:  Utilizes an Ailment Topic Aspect Model (ATAM) to determine the type of ailment associated with a tweet (in this case Flu)
  Model:                  Correlation between the weekly probability of the "flu ailment" designation and the ILI rate in the US.
  Results:                Two types of ATAM/ATAM+ models were compared. The correlations with ILI were .934 and .958.

Lampos et al., "Flu detector - Tracking epidemics on Twitter" (Sep-12, Tweets)
  Data Source:            Twitter: 40 weeks of Twitter corpus in UK from June '09 to Mar '10
  Dependent Variables:    Weekly ILI from UK Health Protection Agency, smoothed for daily values
  Explanatory Variables:  Sum across all tweets of daily "flu-score" for each tweet, based on number of ILI "stem" markers that appear in the tweet.
  Model:                  Linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
  Results:                Correlations around .90 for 3 UK health care regions.




Infodemiology
Example Prediction Analysis
 “Nowcasting Events from the Social Web with Statistical
 Learning,” Lampos and Cristianini, ACM IS&T, 9/11
                                                         N-Gram Stems
                                                          ILI Markers      Avg. Weekly
    Twitter      50M Tweets                                                 Wght. Flu-
     API        5 UK Regions                                                Score by
                  6’09-12’09                                                 Region
                                                                             (time t)



                                 Weekly
                             ILI Reports by                                 r ~.89-.96
                              health region
                                 (time t)


Infodemiology
Example Prediction Analysis
                                      50M Tweets
    Corpus
                                 5 Region UK, 6/09-4/10




    Corpus                                     Lower                       Stop
                        Tokens                                                     Stems
   Refinement                                  Case                        Words




    Feature                1-                   2-                    Hybrid
   Selection             Grams                Grams                   Grams



  Tweet-Hybrid Matrix          For each tweet, multiply the number of times each
                               hybrid occurs by that hybrid's weight. The sum of
                               these weighted values is the tweet's flu-score.
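The flu-score step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker weights are invented (in the study they come out of the feature-selection step), tokens are split on whitespace with no stemming or stopword removal, and the names `MARKER_WEIGHTS`, `ngrams` and `flu_score` are hypothetical.

```python
from collections import Counter

# Hypothetical marker weights (illustrative values only, not from the study).
MARKER_WEIGHTS = {
    ("flu",): 0.5,
    ("fever",): 0.3,
    ("sore", "throat"): 0.8,
}

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def flu_score(tweet, weights=MARKER_WEIGHTS):
    """Sum over markers of (occurrences in the tweet) * (marker weight)."""
    tokens = tweet.lower().split()
    # "Hybrid" feature set: 1-grams and 2-grams counted together.
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return sum(counts[marker] * w for marker, w in weights.items())

print(flu_score("sore throat and fever this flu is awful"))
print(flu_score("lovely weather today"))
```

A tweet with no markers scores zero, so it adds nothing to the weekly aggregate.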



Infodemiology
Example Prediction Analysis
 flu :(
 RT @OMGFacts: The US death toll from the 1918 flu epidemic was so high that it created a coffin
  shortage
 News_SwineFlu: #swineflu Delayed treatment for swine flu can lead to... http://t.co/ZSOjIvAW'
 The last thing I want right now is the flu
 Much better with that lovely message from u,thank u! (heavy flu since 4 days ;
 Supposedly only about 1% of people in the world present flu-like symptoms after flu shot.
  Unfortunately 66% of the people in my house do'
 Back on track...Fuck you flu!
 Sounds like either the flu or Killian walked into the room.
 I'll try haha, I got the flu!
 It costs an average company $135 a day for every employee sick at home with the flu. Encourage
  your employees to get vaccinated....
 i wouldn't recommend it, but a couple of days with violent stomach flu does make you appreciate
  the small things, nature's reset
 Is it possible to get the flu from the flu jab?
 Sedatives + cold/flu + ana = can't get up without having to hold a wall for 2 minutes",

Infodemiology
Example Prediction Analysis (3)
                     Bigram                                Frequency   Bigram                     Frequency
                     (u'flu',u'shot')                          11      (u'common',u'cold')            2
                     (u'flu',u'season')                        8       (u'man',u'flu')                2
                     (u'flu',u'symptom')                       5       (u'southeast',u'asia,')        2
                     (u'symptom',u'busters')                   4       (u'please',u'go')              2
                     (u'mtm',u'blog')                          4       (u'headache',u'since')         2
                     (u'sore',u'throat')                       4       (u'get',u'flu')                2
                     (u'immune',u'system')                     4       (u'every',u'single')           2
                     (u'season',u'\u2013')                    4       (u"can't",u'get')              2
                     (u'every',u'symptom')                     3       (u'gonna',u'get')              2
                     (u'since',u'last')                        3       (u'stuffy',u'head')            2
                     (u'flu',u'jab')                           3       (u'fever,',u'flu,')            2
                     (u'sore',u'throat,')                      3       (u'feeling',u'like')           2
                     (u'feel',u'like')                         3       (u'symptom',u'checker')        2
                     (u'jonas',u'brothers')                    3       (u'runny',u'nose')             2
                     (u'flu',u'shots')                         2       (u'riva',u'offering')          2
                     (u'+',u'sore')                            2       (u'flu',u'pills')              2
                     (u'like',u"i'm")                          2       (u'#nutrition',u'foods')       2
                     (u"didn't",u'go')                         2       (u'cure',u'#colds')            2
                     (u"don't",u'feel')                        2       (u'cold,',u'flu')              2
                     (u'offering',u'complimentary')            2       (u"i'm",u'gonna')              2
                     (u'flu-like',u'symptom')                  2       (u'ear',u'infection')          2
                     (u'asia,',u'millions')                    2       (u'flu,',u'sore')              2
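A bigram frequency table like the one above can be produced with the standard library alone. The sketch below assumes the tweets have already been tokenized (and, if desired, lowercased and stemmed); the three token lists are made up for illustration and `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter

def bigram_counts(token_lists):
    """Count adjacent token pairs across a list of tokenized tweets."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor within the same tweet only
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Made-up tokenized tweets.
tweets = [
    ["flu", "shot", "today"],
    ["got", "my", "flu", "shot"],
    ["sore", "throat", "and", "flu"],
]

counts = bigram_counts(tweets)
for bigram, freq in counts.most_common(3):
    print(bigram, freq)
```

Counting within each tweet separately avoids creating spurious bigrams that span the end of one tweet and the start of the next.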




Infodemiology
Example Prediction Analysis (3)



  1-Grams                                                                        2-Grams




                                                                       Hybrids


Infodemiology
Example Prediction Analysis

     Weekly ILI Reports by health region (time t)
        = b0 + b1 * Avg. Weekly Wght. Flu-Score by Region (time t) + e



                     Hybrid 1  Hybrid 2  Hybrid 3  …  Hybrid n
     Tweets    Week  wgt1      wgt2      wgt3      …  wgtn       Flu-Score  Indep Var
     Tweet1    1     wgt1*N11  wgt2*N12  wgt3*N13  …  wgtn*N1n   Sum1 W*N
     Tweet2    1     wgt1*N21  wgt2*N22  wgt3*N23  …  wgtn*N2n   Sum2 W*N
     Tweet3    1     wgt1*N31  wgt2*N32  wgt3*N33  …  wgtn*N3n   Sum3 W*N   Aver Wk-1
     …         …     …         …         …         …  …          …
     Tweet m   24    wgt1*Nm1  wgt2*Nm2  wgt3*Nm3  …  wgtn*Nmn   Summ W*N   Aver Wk-24

     Nij = number of times hybrid j occurs in tweet i
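The table and regression on this slide amount to two steps: average the per-tweet flu-scores within each week, then regress the weekly ILI rate on that average. A minimal sketch follows, using made-up scores and ILI rates; the function names `weekly_mean_scores` and `fit_line` are hypothetical and the plain least-squares fit stands in for whatever estimation procedure the study actually used.

```python
from collections import defaultdict
from math import sqrt

def weekly_mean_scores(scored_tweets):
    """Average per-tweet flu-scores within each week: {week: mean score}."""
    by_week = defaultdict(list)
    for week, score in scored_tweets:
        by_week[week].append(score)
    return {week: sum(s) / len(s) for week, s in by_week.items()}

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; also returns Pearson r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1, sxy / sqrt(sxx * syy)

# Made-up data: (week, flu-score) per tweet and the ILI rate per week.
scored = [(1, 0.2), (1, 0.4), (2, 0.6), (2, 0.8), (3, 1.0), (3, 1.4)]
ili_rate = {1: 10.0, 2: 20.0, 3: 35.0}

means = weekly_mean_scores(scored)
weeks = sorted(means)
b0, b1, r = fit_line([means[w] for w in weeks], [ili_rate[w] for w in weeks])
print(b0, b1, r)
```

The returned `r` is the same kind of correlation figure the studies report when comparing tweet-based scores with official ILI data.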



Infodemiology
Example Prediction Analysis




        http://geopatterns.enm.bris.ac.uk/epidemics/
Programming Text Mining for
Prediction with Python




Text Mining for Prediction:
 Programming with Python & NLTK
# Utilizes "nltk", a Python "natural language toolkit"
# Step 1: Initialize modules, stopwords and stemmer

import simplejson
import urllib
import re
import nltk

from nltk.corpus import stopwords
# Build the English stopword set once (requires the NLTK "stopwords" corpus)
english_stopwords = set(stopwords.words('english'))
porter = nltk.PorterStemmer()

def remove_stopwords(text):
  # text is a list of tokens; keep only those that are not stopwords
  content = [w for w in text if w.lower() not in english_stopwords]
  return content


Text Mining for Prediction:
Programming with Python & NLTK
# Step 2: Utilize Twitter Search API to collect "flu" tweets in JSON format.
# Then extract the "text" fields associated with each tweet.

itemsFlu = []

for i in range(0,14):
  # Request a different results page on each pass (pages 1 to 14)
  urlFlu = 'http://search.twitter.com/search.json?q=flu&rpp=100&page=' + str(i + 1)
  resultFlu = simplejson.load(urllib.urlopen(urlFlu))
  itemsFlu = itemsFlu + resultFlu['results']

tweetsFlu = [ item['text'] for item in itemsFlu]

# Step 2b: Eliminate retweets (tweets containing the standalone token "RT")

FluTweetsNoRTs = []

for tweetText in tweetsFlu:
  if not re.search(r'\bRT\b', tweetText):
     FluTweetsNoRTs.append(tweetText)

Text Mining for Prediction:
Programming with Python & NLTK
# Step 3: Preparing to do Text analysis
noTweets = 0; Flu_tmp_word = []; Flu_tot_word = []
Flu_tmp_low = []; Flu_tot_low = []; Flu_tmp_alpha = []; Flu_tot_alpha = []
Flu_tmp_stop = []; Flu_tot_stop = []; Flu_tmp_stem = []; Flu_tot_stem = []
Flu_stems_dict = {} # dictionary holding the list of stems for each tweet, keyed 1..N

# Step 4: Text processing. Produces lowercase, alpha stems devoid of stopwords
for tweet in FluTweetsNoRTs:
  noTweets = noTweets + 1
  Flu_tmp_word = [ w for w in tweet.split()]; Flu_tot_word = Flu_tot_word + Flu_tmp_word
  Flu_tmp_low = [w.lower() for w in Flu_tmp_word] ; Flu_tot_low = Flu_tot_low + Flu_tmp_low
  # keep only purely alphabetic tokens of two or more letters
  Flu_tmp_alpha = [cv for w in Flu_tmp_low for cv in re.findall(r'^[a-z]+[a-z]+$', w)]
  Flu_tot_alpha = Flu_tot_alpha + Flu_tmp_alpha
  Flu_tmp_stop = remove_stopwords(Flu_tmp_alpha); Flu_tot_stop = Flu_tot_stop + Flu_tmp_stop
  Flu_tmp_stem = [porter.stem(t) for t in Flu_tmp_stop]; Flu_tot_stem = Flu_tot_stem + Flu_tmp_stem
  Flu_stems_dict[noTweets] = Flu_tmp_stem


Text Mining for Prediction:
Programming with Python & NLTK
# Step 5: Analyzing frequency distributions

fdist_Flu_stems = nltk.FreqDist(Flu_tot_stem)
fdist_Flu_stems.plot(25)
fdist_Flu_stems.items()[0:25]

# Step 6. Produce word counts "wc" (a dictionary holding the total count
# associated with each word across all tweets).

wc = {}
# tweet keys run from 1 to N inclusive, so the range must end at N + 1
for dCnt in range(1,len(Flu_stems_dict)+1):
  wlist = Flu_stems_dict[dCnt]
  for word in wlist:
    wc.setdefault(word,0)
    wc[word] += 1




Text Mining for Prediction:
Programming with Python & NLTK
# Step 7. Eliminate all words that occur less than 3 times
apcount = {}
for word,wcnt in wc.items():
  if wcnt >= 3: apcount[word] = wcnt

# Step 8. Write out doc-term matrix to a file
out = open('flu-doc-matrix.txt','w'); out.write('Tweets')
# header row: one column per retained word
for word,count in apcount.items():
  out.write(","); out.write(word)
out.write('\n')
# one row per tweet: 1 if the word occurs in the tweet, else 0
for tweetno, twlist in Flu_stems_dict.items():
  tweetname = "Tweet" + str(tweetno); out.write(tweetname)
  for word, count in apcount.items():
     if word in twlist: out.write(","); out.write("1")
     else: out.write(","); out.write("0")
  out.write('\n')
out.close()
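For readers on Python 3, steps 6 to 8 can be condensed into a short stand-alone sketch. It is not a drop-in replacement for the slide code: the stems are made up, the output goes to an in-memory buffer rather than a file, the columns are sorted alphabetically, and `doc_term_rows` with its `min_count` parameter is a hypothetical helper.

```python
import csv
import io
from collections import Counter

def doc_term_rows(stems_by_tweet, min_count=3):
    """Build header + binary rows for terms occurring at least min_count times."""
    totals = Counter(s for stems in stems_by_tweet.values() for s in stems)
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    rows = [["Tweets"] + vocab]
    for tweet_no in sorted(stems_by_tweet):
        present = set(stems_by_tweet[tweet_no])
        rows.append(["Tweet%d" % tweet_no] + [int(t in present) for t in vocab])
    return rows

# Made-up stems for three tweets (keys are tweet numbers).
stems = {1: ["flu", "shot"], 2: ["flu", "fever", "flu"], 3: ["fever", "cold"]}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(doc_term_rows(stems, min_count=2))
print(buffer.getvalue())
```

Using the `csv` module instead of hand-written commas also handles any term that happens to contain a comma or quote.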


Text Mining for Prediction:
Programming with R, tm and RTextTools




Text Mining for Prediction:
   Programming with R, tm and RTextTools
# initialize libraries
library(twitteR)
library(tm)
library(RTextTools)
library(plyr)
library(stringr)

# utilize Twitter Search API to collect "flu" Tweets
# in JSON format
fluTweets = searchTwitter('flu', n=1500, lang='en')

# extract text from JSON objects
fluTweetText = laply(fluTweets, function(t) t$getText())

# remove retweets
indxFluRetweets <- grep('RT',fluTweetText)
indxFluTweetText <- c(1:length(fluTweetText))
indxFluTweetNonRTs <- !(indxFluTweetText %in% indxFluRetweets)
fluTweetTextNonRTs <- fluTweetText[indxFluTweetNonRTs]

# text preprocessing function
toLowerCaseAlpha <- function(x){
  removeHTTP <- gsub("http:[/a-zA-Z0-9._]+","",x)
  removeNonAlpha <- gsub("[^a-zA-Z]"," ",removeHTTP)
  removeMultipleSpaces <- gsub(" +"," ",removeNonAlpha)
  changeToLowerCase <- tolower(removeMultipleSpaces)
  return(changeToLowerCase) }

# clean tweet text, convert lower case, remove stop words
# create corpus for analysis
fluTweetTextCleaned <- toLowerCaseAlpha(fluTweetTextNonRTs)
corpusFluTweets <- Corpus(VectorSource(fluTweetTextCleaned))
# corpusStopwords is not defined on the slide; tm's English list is assumed here
corpusStopwords <- stopwords("english")
corpusFluText <- tm_map(corpusFluTweets, removeWords, corpusStopwords)
Text Mining for Prediction:
 Programming with R, tm and RTextTools

# function to convert tokens/words to stems
convertToStems <- function(x){
  tokens <- strsplit(x,' +')
  listToVect <- unlist(tokens)
  stems <- wordStem(listToVect)
  return(stems)}

# convert to stems and create new corpus
FluStems <- sapply(corpusFluText, convertToStems)
corpusFluStems <- Corpus(VectorSource(FluStems))

# create document-term matrix
corpusFluStemsDTM <- DocumentTermMatrix(corpusFluStems)

# create vector of individual stems and the associated
# frequencies with which they occur across all tweets
fluStemsDTMMat <- as.matrix(corpusFluStemsDTM)
fluStemsDTMMatSorted <- sort(colSums(fluStemsDTMMat))




Text Mining for Prediction:
   Programming with R, tm and RTextTools
# plot frequency distribution of stems

noOfPlotPnts = 50

numOfStems = length(fluStemsDTMMatSorted)

plotStart = numOfStems - noOfPlotPnts + 1

plotEnd = numOfStems

# the vector is sorted ascending, so the last 50 entries are the most frequent
top50FluStems = fluStemsDTMMatSorted[plotStart:plotEnd]

barplot(top50FluStems,horiz=TRUE,cex.names=0.5,
  space = .5,las=1, xlim=c(0,100),
  main="Freq of Terms")




Text Mining for Prediction:
   Programming with R, tm and RTextTools

# create wordcloud of stems based on frequency

library(wordcloud)
library(RColorBrewer)

fluStemNames <- names(fluStemsDTMMatSorted)

dfFluStemDTMMat <- data.frame(word=fluStemNames,
  freq=fluStemsDTMMatSorted)

pal2 <- brewer.pal(8,"Dark2")

wordcloud(dfFluStemDTMMat$word,
  dfFluStemDTMMat$freq, min.freq=10, colors=pal2)




Infodemiology is a form of




     Weather in the next 6 hours
Now+Forecasting


 Predicting the present by
 analyzing large volumes of data
 that can be used to "forecast"
 current events for which official
 analysis has not been released



Basic Regression Analysis


$$ Y_t = b_0 + b_1 Y_{t-n} + b_2 X_{t-n} + b_3 S_t + e $$

where
   Y_t      Dependent variable at time t (standard publicly available measure)
   Y_{t-n}  Dependent variable at time t - n (standard publicly available measure)
   X_{t-n}  Traditional, publicly available explanatory variable at time t - n
   S_t      Aggregate search index or social media frequency count at time t
   e        Error term
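A regression of this form can be fitted in R with `lm()`. The sketch below is only an illustration on simulated data: the series, the coefficient values (1, 0.5, 0.3, 2), the lag of one period, and all variable names are assumptions made for the example and do not come from any of the studies in this deck.

```r
set.seed(1)
n_obs <- 60

# Simulated inputs (illustrative only)
search_index <- rnorm(n_obs)   # search or social media index at time t
trad_var     <- rnorm(n_obs)   # traditional explanatory variable

# Simulate the dependent variable with a one-period lag (n = 1)
y <- numeric(n_obs)
for (i in 2:n_obs) {
  y[i] <- 1 + 0.5 * y[i - 1] + 0.3 * trad_var[i - 1] +
    2 * search_index[i] + rnorm(1, sd = 0.5)
}

# Align current values with their lagged counterparts
d <- data.frame(
  y     = y[2:n_obs],               # dependent variable at time t
  y_lag = y[1:(n_obs - 1)],         # dependent variable at time t - 1
  x_lag = trad_var[1:(n_obs - 1)],  # traditional variable at time t - 1
  s     = search_index[2:n_obs]     # search index at time t
)

# Estimate b0, b1, b2 and b3 by ordinary least squares
fit <- lm(y ~ y_lag + x_lag + s, data = d)
print(coef(fit))
```

The key design point is the timing: the lagged terms use information that is already published, while the search or social media term is contemporaneous, which is what lets the model estimate the present before the official figure is released.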




Examples

• Authors: Kholodilin et al.
    Title: Do Google Searches Help in Nowcasting Private Consumption? A Real-Time Evidence for the US
    Date (M-Y): Apr-10
    Type: Search
    Data Source: Federal Reserve data on US private consumption and related Google search terms from Jan '05-Dec '09.
    Dependent Variables: Year-on-year growth rate of monthly US real private consumption, from the ALFRED database of the Federal Reserve Bank of St. Louis
    Explanatory Variables: 220 Google Trend/Insights search terms related to private consumption, reduced to 10 principal components, for monthly periods from Jan 2005 to Dec 2009
    Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors: Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
    Results: Query term principal components outperform standard Sentiment and Financial indicators. A combination of two of the factors works best: those related to mobility and health care consumption.

• Authors: Sakaki et al.
    Title: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
    Date (M-Y): Apr-10
    Type: Tweets
    Data Source: Earthquake occurrences/intensity and tweets containing the words "earthquake" or "shaking" in Japan from Aug '09-Sep '09
    Dependent Variables: Occurrence, intensity and location of an earthquake in Japan
    Explanatory Variables: Tweets that contain the query words "earthquake" and/or "shaking", by location
    Model: Utilized a support vector machine (SVM) to determine whether a tweet report refers to an earthquake occurrence. The reports are then matched against actual occurrences at a particular location to see if each is detected within 1 min of occurrence.
    Results: Of those earthquakes occurring in the Aug-Sept time frame in Japan, 96% of 24 quakes above an intensity of 3 were reported in a tweet. Of the 24 quakes, 80% were reported within a minute of occurrence. This is much faster than the reports issued by the Japan Meteorological Agency.

• Authors: Carrière-Swallow & Labbé
    Title: Nowcasting With Google Trends in an Emerging Market
    Date (M-Y): Jul-10
    Type: Search
    Data Source: Auto sales and Google search indices for specific automobiles in Chile from 2005 thru 2010.
    Dependent Variables: % year-on-year change in auto sales in Chile
    Explanatory Variables: Google Search index of interest in automobile purchases in Chile
    Model: % change in y-on-y auto sales in Chile regressed against a composite auto Google Search index based on queries about 9 leading automobile manufacturers in Chile.
    Results: Despite relatively low rates of Internet usage in Chile, models incorporating the Google Trends Automotive Index outperform benchmark specifications in both in-sample and out-of-sample nowcasts of y-on-y % changes in auto sales while providing substantial gains in information delivery times.

• Authors: Ciulla
    Title: Beating the news using social media: the case study of American Idol
    Date (M-Y): May-12
    Type: Tweets
    Data Source: Tweets from US users that related to American Idol contestants and were tweeted during the voting time window for each episode during the 11th season from Jan '12 - May '12
    Dependent Variables: Contestants who were eliminated or won the final episode
    Explanatory Variables: Number of tweets with hashtags signifying contestants
    Model: Contestants with the fewest mentions predicted to be the candidate eliminated.
    Results: In general the simple tweet frequencies are a strong predictor of the contestant who will be eliminated.

• Authors: Chadwick & Sengul
    Title: Nowcasting Unemployment Rate in Turkey: Let's Ask Google
    Date (M-Y): Jun-12
    Type: Search
    Data Source: Monthly Turkey non-agricultural unemployment rate from Jan '05-Dec '11
    Dependent Variables: Monthly Turkey non-agricultural unemployment rate
    Explanatory Variables: Google Search Index in Turkey for terms directly ("looking for job") or indirectly (job announcements) related to unemployment.
    Model: Linear autoregression models and Bayesian Model Averaging procedure to investigate whether Google search query data can improve the nowcasts
    Results: Models with Google Search indicators perform better in nowcasting the 1 period, 2 periods and 3 periods ahead unemployment rate than the benchmark that uses only the lag values of the unemployment rate.

• Authors: Song, Pan, Ng
    Title: Forecasting hotel room demand using search engine data
    Date (M-Y): Sep-12
    Type: Search
    Data Source: Weekly data on hotel bookings in Charleston, SC and Google trend data for specific travel tourism search terms from Jan '08-Aug '09
    Dependent Variables: Weekly hotel bookings in Charleston, SC
    Explanatory Variables: Indexed search volumes from Google Trends/Insights, Jan 2008-Aug 2009
    Model: Log of Room Nights for Log of Search Volumes (search terms: Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism)
    Results: Tested various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.

• Authors: McLaren, Shanbhogue
    Title: Using internet search data as economic indicators
    Date (M-Y): Q2-11
    Type: Search
    Data Source: UK monthly unemployment data and housing price growth from June '04-Jan '11 associated with Google Trend query indices for job seeking and unemployment
    Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
    Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
    Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confidence as explanatory variables; for housing price growth, with query term and Home Builders Federation (HBF) and Royal Institution of Chartered Surveyors (RICS) price growth balances as explanatory variables.
    Results: For unemployment forecasts, claimant count was strongest, followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
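Several of the studies listed above judge a search or social media variable by whether it improves on a benchmark that uses only lagged values of the dependent variable. A minimal sketch of that comparison on simulated data follows; the data-generating process, the 59/12 train and holdout split, and every name in the code are assumptions made for illustration and do not reproduce any of the cited models.

```r
set.seed(42)
n_obs <- 72

# Simulated monthly series (illustrative only)
s <- rnorm(n_obs)                 # search index
y <- numeric(n_obs)               # official measure
for (i in 2:n_obs) {
  y[i] <- 0.5 * y[i - 1] + 2 * s[i] + rnorm(1, sd = 0.5)
}

# Pair each observation with its own lag and the current search index
d <- data.frame(y = y[-1], y_lag = y[-n_obs], s = s[-1])

# Hold out the last 12 periods for out-of-sample evaluation
train <- d[1:59, ]
test  <- d[60:71, ]

bench <- lm(y ~ y_lag, data = train)      # benchmark: lag only
aug   <- lm(y ~ y_lag + s, data = train)  # benchmark plus search index

# Root mean squared error of one-step-ahead predictions
rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata))^2))
}

rmse_bench <- rmse(bench, test)
rmse_aug   <- rmse(aug, test)
print(c(benchmark = rmse_bench, augmented = rmse_aug))
```

In this simulation the search index is built into the series by construction, so the augmented model is expected to win; with real data the size of the gain is an empirical question, which is what the Results entries report.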
Examples

• Authors: Gruhl et al.
    Title: The Predictive Power of Online Chatter.
    Date (M-Y): Aug-05
    Type: Blogs
    Data Source: Amazon Sales Rank for best selling books from Jul '04-Aug '04 and number of mentions in blogs for same time period
    Dependent Variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
    Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
    Model: Cross correlation of time series for sales rank and mentions.
    Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts quite well a future spike in sales rank.

• Authors: Choi, Varian
    Title: Predicting the Present with Google Trends
    Date (M-Y): Apr-09
    Type: Search
    Data Source: Monthly data for search from Google Trends associated with various retail sales and travel data from Jan '04-Aug '08
    Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (visitor arrivals in Hong Kong)
    Explanatory Variables: Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel
    Model: Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
    Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases the gains are small, in others substantial.

• Authors: Suhoy
    Title: Query Indices and a 2008 Downturn: Israeli Data
    Date (M-Y): Jul-09
    Type: Search
    Data Source: Monthly official economic growth data for the months and quarters from 2nd quarter 2004 to 2nd quarter 2009 along with various Google search indices during the same period
    Dependent Variables: Monthly percent changes of various real values of industrial production, retail trade, revenue of trade and services, consumer imports, exports of services and the employment rate
    Explanatory Variables: 30 Google Search Index categories related to consumption and employment
    Model: Bayesian probabilities of downturn calculated by Hamilton's two-state Markov switching AR(0) model. Used to determine whether changes in query indices can predict changes in official economic variables.
    Results: Six leading query categories, including HR (recruiting and staffing), home appliances, travel, real estate, food and drink, and beauty, contain cyclical components which conform with cycles of economic growth. The strongest relationship was between HR and unemployment.

• Authors: Sadikov et al.
    Title: Blogs as Predictors of Movie Success
    Date (M-Y): Aug-09
    Type: Blogs
    Data Source: Weekly box office sales, gross sales, and critic and user rankings from Nov '07-Nov '08 and counts for movie references and sentiment measures for same time period.
    Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
    Explanatory Variables: Analysis of spinn3r.com blog data set 11/07-11/08, counting movie references and sentiment within specified time window before and after movie release date.
    Model: Linear regression for weekly rankings and sales data by blog references and sentiment.
    Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales but weak with sentiment. Strongest relationships with timing of references in weeks after release.

• Authors: Zhang & Skiena
    Title: Improving Movie Gross Prediction Through News Analysis
    Date (M-Y): Sep-09
    Type: Blogs & News
    Data Source: Online news stories & blogs 1960-2008 along with movie receipt and IMDB data. News stories analyzed for various pre-release time periods
    Dependent Variables: Gross receipts from movies
    Explanatory Variables: Variety of IMDB variables (e.g. movie genre), movie budget, number of first week theaters, and number of stories and mentions of movie titles, director, top 3 & top 15 actors
    Model: Linear regression of receipts for various combinations of explanatory variables. Also, K-NN nearest neighbor analysis determining factors associated with gross receipts
    Results: Number of news articles mentioning the movie 1 week prior has the highest correlation (~.7) and predictive ability

• Authors: Wu & Brynjolfsson
    Title: The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales
    Date (M-Y): Dec-09
    Type: Search
    Data Source: Quarterly Housing Sales and Housing Price Index for 50 US states along with the Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances from 4th Quarter 2007 to 2nd Quarter 2007
    Dependent Variables: Housing Sales and Housing Price Index (HPI)
    Explanatory Variables: Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances
    Model: Linear autoregression between Housing Sales and prior sales, the HPI, and Search Indices for Real Estate and Real Estate Agencies as well as the same regression for the HPI
    Results: Strong predictive relationships between Housing Sales and searches for Real Estate Agencies. Similar relationships for HPI.

• Authors: Asur, Huberman
    Title: Predicting the Future with Social Media
    Date (M-Y): Mar-10
    Type: Tweets
    Data Source: 3 million tweets mentioning 24 movies from Nov '09-Feb '10 along with associated box-office revenues
    Dependent Variables: Box-office revenues for (24) movies
    Explanatory Variables: Promotion tweets-retweets for a particular movie, tweet rates for particular movie per hour, ratio of positive to negative sentiments for the movie
    Model: Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange (HSX) prices, and 2nd weekend revenues by tweet rates and the sentiment ratio.
    Results: Promotional tweets are weakly correlated with 1st weekend revenues. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues and sentiments improve the forecasts slightly.
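The cross-correlation approach listed for the blog-mentions study can be sketched with base R's `ccf()`. The example below is a simulation only: the two series, the three-period lead, and the noise level are invented for illustration and are not the Amazon or blog data. For `ccf(x, y)`, the value at lag k estimates the correlation between x[t+k] and y[t], so a peak at a negative lag means x leads y.

```r
set.seed(7)
n_obs <- 200
lead  <- 3   # assumed number of periods by which mentions lead sales

# One underlying signal, observed 3 periods earlier in "mentions"
signal   <- rnorm(n_obs + lead)
sales    <- signal[1:n_obs] + rnorm(n_obs, sd = 0.3)
mentions <- signal[(1 + lead):(n_obs + lead)]
# By construction, sales[t] tracks mentions[t - 3].

# Cross-correlation of the two series over lags -10..10
cc <- ccf(mentions, sales, lag.max = 10, plot = FALSE)

# Lag with the strongest correlation
best_lag <- cc$lag[which.max(cc$acf)]
print(best_lag)
```

A peak at lag -3 here says that mentions three periods earlier line up best with current sales, which is the kind of lead-lag evidence a prior spike in mentions would provide.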

Más contenido relacionado

La actualidad más candente

Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis reportSavio Aberneithie
 
Business Intelligence Module 5
Business Intelligence Module 5Business Intelligence Module 5
Business Intelligence Module 5Home
 
Machine Learning and Internet of Things
Machine Learning and Internet of ThingsMachine Learning and Internet of Things
Machine Learning and Internet of ThingsSofian Hadiwijaya
 
Intro to Machine Learning & AI
Intro to Machine Learning & AIIntro to Machine Learning & AI
Intro to Machine Learning & AIMostafa Elsheikh
 
Machine Learning for Finance Master Class
Machine Learning for Finance Master Class Machine Learning for Finance Master Class
Machine Learning for Finance Master Class QuantUniversity
 
Fuzzy rule based expert system for diagnosis of lung cancer
Fuzzy rule based expert system for diagnosis of lung cancerFuzzy rule based expert system for diagnosis of lung cancer
Fuzzy rule based expert system for diagnosis of lung cancerFarzad Vasheghani Farahani
 
IoT Standardization and Implementation Challenges
IoT Standardization and Implementation ChallengesIoT Standardization and Implementation Challenges
IoT Standardization and Implementation ChallengesAhmed Banafa
 
Social io t-sito s-iot
Social io t-sito s-iotSocial io t-sito s-iot
Social io t-sito s-iotLuigi Atzori
 
How artificial intelligence is revolutionizing learning and development pract...
How artificial intelligence is revolutionizing learning and development pract...How artificial intelligence is revolutionizing learning and development pract...
How artificial intelligence is revolutionizing learning and development pract...Charles Cotter, PhD
 
Business Intelligence Presentation 1 (15th March'16)
Business Intelligence Presentation 1 (15th March'16)Business Intelligence Presentation 1 (15th March'16)
Business Intelligence Presentation 1 (15th March'16)Muhammad Fahad
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Krishnaram Kenthapadi
 
Internet of things (IoT)
Internet of things (IoT)Internet of things (IoT)
Internet of things (IoT)Tarika Verma
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 

La actualidad más candente (20)

Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis report
 
Business Intelligence Module 5
Business Intelligence Module 5Business Intelligence Module 5
Business Intelligence Module 5
 
Machine Learning and Internet of Things
Machine Learning and Internet of ThingsMachine Learning and Internet of Things
Machine Learning and Internet of Things
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data
Big DataBig Data
Big Data
 
Overview of Big data(ppt)
Overview of Big data(ppt)Overview of Big data(ppt)
Overview of Big data(ppt)
 
Intro to Machine Learning & AI
Intro to Machine Learning & AIIntro to Machine Learning & AI
Intro to Machine Learning & AI
 
Machine Learning for Finance Master Class
Machine Learning for Finance Master Class Machine Learning for Finance Master Class
Machine Learning for Finance Master Class
 
Fuzzy rule based expert system for diagnosis of lung cancer
Fuzzy rule based expert system for diagnosis of lung cancerFuzzy rule based expert system for diagnosis of lung cancer
Fuzzy rule based expert system for diagnosis of lung cancer
 
Data science - An Introduction
Data science - An IntroductionData science - An Introduction
Data science - An Introduction
 
Big data
Big dataBig data
Big data
 
Presentation machine learning
Presentation machine learningPresentation machine learning
Presentation machine learning
 
IoT Standardization and Implementation Challenges
IoT Standardization and Implementation ChallengesIoT Standardization and Implementation Challenges
IoT Standardization and Implementation Challenges
 
Social io t-sito s-iot
Social io t-sito s-iotSocial io t-sito s-iot
Social io t-sito s-iot
 
How artificial intelligence is revolutionizing learning and development pract...
How artificial intelligence is revolutionizing learning and development pract...How artificial intelligence is revolutionizing learning and development pract...
How artificial intelligence is revolutionizing learning and development pract...
 
Business Intelligence Presentation 1 (15th March'16)
Business Intelligence Presentation 1 (15th March'16)Business Intelligence Presentation 1 (15th March'16)
Business Intelligence Presentation 1 (15th March'16)
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
The Nature of Data
The Nature of DataThe Nature of Data
The Nature of Data
 
Internet of things (IoT)
Internet of things (IoT)Internet of things (IoT)
Internet of things (IoT)
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 

Destacado

Ariadne: First Report on Data Mining
Ariadne: First Report on Data MiningAriadne: First Report on Data Mining
Ariadne: First Report on Data Miningariadnenetwork
 
Omnichannel: Lessons Learned from First-Mover Failures and Success
Omnichannel: Lessons Learned from First-Mover Failures and SuccessOmnichannel: Lessons Learned from First-Mover Failures and Success
Omnichannel: Lessons Learned from First-Mover Failures and SuccessAlexandra Frith
 
Market basket analysis
Market basket analysisMarket basket analysis
Market basket analysisVermaAkash32
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket AnalysisMahendra Gupta
 
Real-time Market Basket Analysis for Retail with Hadoop
Real-time Market Basket Analysis for Retail with HadoopReal-time Market Basket Analysis for Retail with Hadoop
Real-time Market Basket Analysis for Retail with HadoopDataWorks Summit
 
Importance Of E-Commerce Personalization in 20 Revealing Stats
Importance Of E-Commerce Personalization in 20 Revealing StatsImportance Of E-Commerce Personalization in 20 Revealing Stats
Importance Of E-Commerce Personalization in 20 Revealing StatsMojn
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
JDA Innovation Forum: Seamless Omnichannel Campaigns Revenue Model
JDA Innovation Forum: Seamless Omnichannel Campaigns Revenue ModelJDA Innovation Forum: Seamless Omnichannel Campaigns Revenue Model
JDA Innovation Forum: Seamless Omnichannel Campaigns Revenue ModelFederico Gasparotto
 
Ceo viewpoint 2017 the transformation of retail
Ceo viewpoint 2017 the transformation of retailCeo viewpoint 2017 the transformation of retail
Ceo viewpoint 2017 the transformation of retailClotilde Chenevoy
 
Setmana D’Activitats ExtraordinàRies
Setmana D’Activitats ExtraordinàRiesSetmana D’Activitats ExtraordinàRies
Setmana D’Activitats ExtraordinàRiesbertafv
 
структура эумк
структура эумкструктура эумк
структура эумкfarcrys
 
Week 4 5352
Week 4 5352Week 4 5352
Week 4 5352abby20
 
Pembekalan Ujian Kualifikasi Penerjemah (UKP) 2009
Pembekalan Ujian Kualifikasi Penerjemah (UKP) 2009Pembekalan Ujian Kualifikasi Penerjemah (UKP) 2009
Pembekalan Ujian Kualifikasi Penerjemah (UKP) 2009Bahtera
 
Power point adopció
Power point adopcióPower point adopció
Power point adopcióaida
 
Bhajan Bhaktvatsal Bhagwan
Bhajan   Bhaktvatsal BhagwanBhajan   Bhaktvatsal Bhagwan
Bhajan Bhaktvatsal BhagwanMool Chand
 

Social media mining hicss 46 part 1

  • 1. Mining and Analyzing Social Media: Part 1 Dave King January 7, 2013
  • 2. Abstract: Overview of the data mining and analysis of social media, exploring the application of various data mining, textual mining and analytical techniques to social media data sources. The focus will be on the practical application of these techniques for the purposes of:
      • Monitoring of social media sources
      • Analyzing content to identify leading issues and sentiment
      • Analyzing and forecasting trends
      • Identifying and profiling influential participants, subgroups and communities
    Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
  • 3. Agenda: Part 1
      • My Biography
      • Resources
      • Social Media Defined
      • Data Mining Example
      • Text Mining Processes
      • Using Text Mining for Prediction
      • Brief Look at Programming for Prediction
  • 4. Agenda: Part 2
      • Sentiment Analysis & Opinion Mining Defined
          – Business Interest & Software Packages
          – Levels of Analysis
          – Automated Classification
      • Social Network Analysis
          – Defined
          – History
          – Basic techniques and measures
          – Ego and Social-Centric Analysis
  • 5. Biography: Dave King
      • EVP of Product Development and Management at JDA Software
      • 30 years in enterprise package software business
      • 15 years as university professor
      • 15 years as Co-Chair of the Internet & Digital Economy Track (HICSS)
      • Long-time interest in various aspects of E-Commerce, Business Intelligence, Analytics (including Text Analytics)
  • 6. Personal Experiences with Analytics
      • Taught applied statistics and math modeling
      • In software R&D
          – Optimization in the 80s
          – Natural Language Frontends
              • NLI Query & CMU Robotics Lab
          – EIS Competitive Analysis
              • Dow Jones and Reuters
              • Verity Topics
              • NewsAlert
          – InXight’s Hyperbolic Tree
          – Supply Chain Analytics
          – Sentiment Analysis for Retailers
      • For many of these advanced techniques, the audiences have often been small, sometimes bewildered, and frequently fleeting.
  • 7. Text Mining Resources
  • 8. Social Networking Analysis Resources
  • 9. What is Social Media?
  • 10. Defined: Social media are online technologies and practices for social interaction that enable the sharing of opinions, insights, experiences, perspectives and media itself.
  • 11. Defined: Social media is the media we use to be social. That’s it.
  • 12. Social Media Types: Take Your Pick
  • 13. Social Media is Still Huge! Alexa Traffic, Oct 6, 2012
      Rank  Website        Type
      1     Facebook       Social
      2     Google         Search
      3     YouTube        Social
      4     Yahoo!         Search
      5     Baidu.com      Search
      6     Wikipedia      Social
      7     Windows Live   Search
      8     Twitter        Social
      9     QQ.COM         Portal
      10    Amazon.com     E-Commerce
      11    Blogspot.com   Social
      12    LinkedIn       Social
      13    Taobao.com     E-Commerce
      14    Google India   Search
      15    Yahoo! Japan   Search
  • 14. Social Media is Still Huge! Growth in Registered Users, 2011 to 2012
      • Facebook: 750M to 1B
      • Twitter: 200M to 500M
      • LinkedIn: 100M to 175M
  • 15. Social Media is Still Huge! If Social Media sites were countries…
      • China: 1.4B
      • India: 1.2B
      • Facebook: 1.0B
      • Twitter: 500M
      • US: 310M
  • 16. Social Media is Still Huge! Usage Per Day
      • Facebook: 3.2B Likes & Comments
      • Twitter: 340M Tweets
      • LinkedIn: 14M Searches
  • 17. Analyzing Social Media: Two Paths
      • Media: Content
      • Social: Network
  • 18. Analyzing Social Media: Two Paths. An Example: Which Blogs are Similar?
      Blog-by-term matrix, used for Cluster Analysis (e.g. K-Means):
             Term1  Term2  Term3  …  TermM
      Blog1  1      0      0      …  1
      Blog2  0      0      1      …  0
      Blog3  0      1      0      …  1
      …      …      …      …      …  …
      BlogN  0      0      0      …  1
      Blog-by-blog matrix, used for Social Network (Graph) Analysis:
             Blog1  Blog2  Blog3  …  BlogN
      Blog1  -      1      0      …  1
      Blog2  0      -      1      …  0
      Blog3  1      1      -      …  0
      …      …      …      …      -  …
      BlogN  1      0      1      …  -
  • 19. Social Media Formats
      • Articles
      • Comments
      • Messages
      • Reviews
      • Ratings
      • Rankings
      • Pictures
      • Videos
      • Music
      • Locations
      • Tags
      • …
  • 20. Social Media Data: One Commonality
  • 21. Data Mining: Defined. Discovering meaningful patterns from large data sets using pattern recognition technologies.
  • 22. Data Mining: CRISP-DM (Cross-Industry Standard Process for Data Mining)
      [Diagram] Process phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
      [Diagram] Data preparation flow: Real-World Data -> Data Consolidation -> Data Cleaning -> Data Transformation -> Data Reduction -> Well-Formed Data
  • 23. Data Mining: General Data Assumptions
      • Structured
      • Transformed
      • Well-Formed
  • 24. Data Mining: Simple Example. Market Basket Analysis
  • 25. Market Basket Analysis: Applications
      • Cross Selling
      • Product Placement
      • Affinity Promotion
      • Customer Segmentation Analysis
  • 26. Market Basket Analysis: The Ole Beer and Diaper Legend…
      Transaction items:
      Transaction  Items
      1            Bread, Milk
      2            Beer, Bread, Diapers, Eggs
      3            Beer, Cola, Diapers, Milk
      4            Beer, Bread, Diapers, Milk
      5            Bread, Cola, Diapers, Milk
      Binary representation:
      Transaction  Beer  Bread  Cola  Diapers  Eggs  Milk
      1            0     1      0     0        0     1
      2            1     1      0     1        1     0
      3            1     0      1     1        0     1
      4            1     1      0     1        0     1
      5            0     1      1     1        0     1
      Goal: Empirically determine those itemsets that occur frequently together in a set of transactions, producing a set of Association Rules of the form LHS -> RHS
  • 27. Market Basket Analysis: The Ole Beer and Diaper Legend…
      Transaction  Beer  Bread  Cola  Diapers  Eggs  Milk
      1            0     1      0     0        0     1
      2            1     1      0     1        1     0
      3            1     0      1     1        0     1
      4            1     1      0     1        0     1
      5            0     1      1     1        0     1
      Concepts (definition; example):
      • Itemset: A specific collection of items in a transaction; {Diapers, Beer}
      • Support Count: Number of transactions containing the itemset; Support {Diapers, Beer} = 3
      • Transactions: Number of transactions = N; N = 5
      • Association Rule: Implication rule of the form LHS -> RHS, where LHS and RHS are itemsets; {Diapers} -> {Beer}
      • Rule Support: Fraction of transactions in which the rule appears, #tuples(LHS, RHS)/N; 3/5 = .6
      • Rule Confidence: Fraction of transactions containing LHS that also contain RHS, #tuples(LHS, RHS)/#tuples(LHS); 3/4 = .75
      • Rule Lift: Strength of association over random co-occurrence of LHS and RHS, (#tuples(LHS, RHS)/N)/((#tuples(LHS)/N) x (#tuples(RHS)/N)) = Confidence(LHS -> RHS)/Support(RHS) = Support(LHS, RHS)/(Support(LHS) x Support(RHS)); (3/5)/((4/5) x (3/5)) = 1.25
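The rule measures defined on this slide can be checked directly against its five transactions. The following is a minimal sketch in plain Python; the function names `support_count` and `rule_metrics` are illustrative choices and do not come from the tutorial.

```python
# The five transactions from the slide, one set of items per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diapers", "Eggs"},
    {"Beer", "Cola", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Bread", "Cola", "Diapers", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    n = len(transactions)
    both = support_count(lhs | rhs, transactions)
    support = both / n                                   # #tuples(LHS, RHS) / N
    confidence = both / support_count(lhs, transactions)  # #tuples(LHS, RHS) / #tuples(LHS)
    lift = confidence / (support_count(rhs, transactions) / n)
    return support, confidence, lift


support, confidence, lift = rule_metrics({"Diapers"}, {"Beer"}, transactions)
print(round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift above 1 suggests that the two itemsets co-occur more often than they would if they were independent; a lift near 1 suggests no association.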
  • 28. Market Basket Analysis: What if it were a text analysis problem?
      Joe went to the 7-11 to pick up some cigarettes. While he was there he also bought some dipers and beer.
      Sally was on her weekly shopping run at Wal-Mart. She had picked up some diapers and formula for her infant. She also thought about buying beer for her husband, but they were out of the brand he liked.
  • 29. Market Basket Analysis: What if it were a text analysis problem?
      • No specified format
      • Variable length
      • Variable spelling
      • Punctuation and non-alphanumeric characters
      • Contents are not predefined and no predefined set of values
  • 30. Text Mining (aka Text Analytics): Defined. Using natural language processing & data mining to discover patterns in a collection of “documents”
  • 31. Text Mining: Document Collections
      • Word Documents
      • PDFs
      • Emails
      • IM Chat
      • Web Pages
      • Blogs
      • Tweets
      • Open-ended surveys
      • Transcripts of Helpline calls
  • 32. Text Mining: CRISP-Like Processes
      [Diagram] Process phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment
      [Diagram] Document preparation flow: Real-World Text Data -> Document Consolidation -> Establish the Corpus -> Documents -> Corpus Refinement (Token, Stem, Stop…) -> Feature Selection & Weighting -> Doc-Term Matrix*
  • 33. Text Mining: Creating the corpus. A large and “structured” or “organized” collection of text
  • 34. Text Mining Process: Corpus Refinement
      Common representation of tokens within and between documents
      Tokenization -> Normalize -> Eliminate Stop Words -> Stemming
      • Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
      • Normalize: Convert the terms to lowercase.
      • Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).
      • Stemming: Convert the terms into their stemmed form, removing plurals and different word forms (e.g. achieve, achieves, achieved -> achiev) [note: word about synonyms – WordNet Synset]
  • 35. Text Mining Processes: Feature Extraction & Weighting Feature Extraction “Bag of Words, Terms or Tokens” Vector Representation: Word, Term or Token/Doc Matrix “Bag of Words” (BOW) or Vector Space Model (VSM): Words or Tokens are attributes and documents are examples Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 35
  • 36. Text Mining Processes: Transforming Frequencies
      • Binary Frequencies: 1 for tf > 0; otherwise 0
      • Term Frequencies: tf(i,j) divided by the sum of tf(k,j) over all terms k in document j
      • Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0
      • Normalized Frequencies: Divide each frequency by SQRT of Sum of Squares of the frequencies within the vector (column)
      • Term Frequency–Inverse Document Frequency
        – TF * IDF
        – Inverse Document Frequency: log(N/(1+D)) where N is total number of docs and D is number with term
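A minimal sketch of these transformations, assuming each document is held as a dictionary of raw term counts. The toy documents and all function names are hypothetical; only the formulas come from the slide.

```python
import math

# Toy term-count vectors (hypothetical data): one dict per document.
docs = [
    {"flu": 2, "shot": 1},
    {"flu": 1, "cold": 3},
    {"cold": 1},
]
N = len(docs)

def binary(tf):
    return 1 if tf > 0 else 0

def relative_tf(term, doc):
    # Count of the term divided by the total count of all terms in the document.
    return doc.get(term, 0) / sum(doc.values())

def log_tf(tf):
    return 1 + math.log(tf) if tf > 0 else 0

def l2_normalize(doc):
    # Divide each count by the square root of the sum of squared counts.
    norm = math.sqrt(sum(v * v for v in doc.values()))
    return {t: v / norm for t, v in doc.items()}

def idf(term):
    # The slide's smoothed form: log(N / (1 + D)), D = number of docs with the term.
    d = sum(1 for doc in docs if term in doc)
    return math.log(N / (1 + d))

def tf_idf(term, doc):
    return doc.get(term, 0) * idf(term)

print(tf_idf("shot", docs[0]), tf_idf("flu", docs[0]))
```

One consequence of the `1 + D` smoothing is that a term found in all but one document gets an IDF of zero, and a term found in every document gets a negative IDF, so very common terms are pushed out of the weighting.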
  • 37. Text Mining Processes: Twitter Example – Problem Features
      • Each tweet <= 140 characters (avg. 10-15 words/message)
      • Heavy presence of non-alpha symbols, abbrevs, misspellings and slang
      • Tweets often include retweets (original tweet repeated)
  • 38. Text Mining Processes: Twitter Example One of the things I love and adore about Twitter … is how its open API has lit a fierce fire of innovation when it comes to analytics. Anyone and their brother and ma-in-law can develop a tool, and they have! Much to the benefit of the rest of us. Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 38
  • 39. Text Mining Processes: Twitter Example – Twitter API Get Search • http://search.twitter.com/search.json?q=<query> • search.twitter.com/search.json?q=Obama&rpp=100&page=5 Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 39
  • 40. Text Mining Processes: Twitter Example – JSON • JSON (JavaScript Object Notation) • Lightweight, Text-Based, Data-Interchange Format • Built on Two Structures: – A collection of name:value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array. – An ordered list of values. In most programming languages, this is realized as an array, vector, list, or sequence Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 40
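The two JSON structures map directly onto Python dictionaries and lists. The snippet below is a sketch using the standard `json` module; the response text is invented for illustration and only mimics the `results`/`text` layout that the deck's later code reads.

```python
import json

# Hypothetical response text shaped like a search result: an object (name:value
# pairs) holding an array (ordered list) of tweet objects.
raw = (
    '{"results": ['
    '{"iso_language_code": "en", "text": "I feel great :)"}, '
    '{"iso_language_code": "en", "text": "feeling sick :("}'
    ']}'
)

payload = json.loads(raw)                     # JSON object -> dict, JSON array -> list
texts = [item["text"] for item in payload["results"]]
print(texts)
```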
  • 41. Text Mining Processes: Twitter Example – Establish Corpus
      Query:
        search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=5
        search.twitter.com/search.json?q=%3A(+feel+feeling&rpp=100&page=5
      API Result:
        {…
         results:[
          {iso_language_code: en,
           to_user_name: Andrea,
           ...
           text: u"Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)"
           ...
           created_at: Thu, 18 Oct 2012 20:11:22 +0000,
           …}
         ...}
  • 42. Text Mining Processes: Twitter Example – Simple Question
      Are there any language differences between “feeling” tweets containing :) and :( symbols?
  • 43. Text Mining Processes: Twitter Example – Establish Corpus
      text: value pairs in JSON object; Remove RTs
      1. Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)
      2. @mzxAmaZiiN Whats up ma. How ya feeling? Lemme make that soul feel better. :)
      3. "..I've got a good feeling about today :P Something makes me feel i might make a sale or two :) *fingers crossed* #etsy #shop #seller #cra...
      4. @IWontForgetDemi Awww poor thing :( hate feeling sick! Hope you feel better soon
      5. @IzabelaLeafsfan no the worst feeling ever is when u feel like total crap. cuz u think no one luvs u :(
      6. …
  • 44. Text Mining Processes: Twitter Example – Doc Preparation
      Tokenization -> Normalize -> Eliminate Stop Words -> Stemming
      Tweet: Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)
      Words: ['Love', 'is', 'everything', 'in', 'this', 'world!', 'Its', 'a', 'feeling', 'like', 'no', 'other.', 'I', "can't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':-)']
      Tokens: ['Love', 'is', 'everything', 'in', 'this', 'world', '!', 'Its', 'a', 'feeling', 'like', 'no', 'other.', 'I', 'ca', "n't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':', '-', ')']
  • 45. Text Mining Processes: Twitter Example – Doc Preparation
      Tokenization -> Normalize -> Eliminate Stop Words -> Stemming
      • Normalize – ['love', 'is', 'everything', 'in', 'this', 'world!', 'its', 'a', 'feeling', 'like', 'no', 'other.', 'i', "can't", 'wait', '2', 'feel', 'that', 'emotion', 'again..but', 'patience', 'is', 'key', ':-)']
      • Alpha – ['love', 'is', 'everything', 'in', 'this', 'its', 'feeling', 'like', 'no', 'wait', 'feel', 'that', 'emotion', 'patience', 'is', 'key']
      • Remove Stopwords – ['love', 'everything', 'feeling', 'like', 'wait', 'feel', 'emotion', 'patience', 'key']
      • Stemming – ['love', 'everyth', 'feel', 'like', 'wait', 'feel', 'emot', 'patienc', 'key']
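The first three steps can be reproduced without any third-party library. In this sketch the stop-word set is a small hand-picked subset chosen for this one tweet (an assumption; a full English stop-word list is far longer), the alphabetic filter uses the same regular expression as the deck's Python code, and stemming is left out because it needs a real stemmer such as Porter's.

```python
import re

tweet = ("Love is everything in this world! Its a feeling like no other. "
         "I can't wait 2 feel that emotion again..but patience is key :-)")

# Illustrative stop-word subset (assumption), not a complete English list.
STOP = {"is", "in", "this", "its", "no", "that", "a", "i"}

words = tweet.split()                                         # whitespace split
lowered = [w.lower() for w in words]                          # normalize case
alpha = [w for w in lowered if re.match(r"^[a-z]+[a-z]+$", w)]  # letters only, length >= 2
no_stops = [w for w in alpha if w not in STOP]                # drop stop words

print(alpha)
print(no_stops)
```

The pattern requires at least two letters, which is why single-letter words such as "a" and "i" vanish at the Alpha step even before stop words are removed.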
  • 46. Text Mining Processes: Twitter Example – Analysis
      Item               Collection  List  Set   Lex Div  Aver Len in Chars  Aver No/Tweets(w/o)
      Tweets             HF          498   -     -        108                -
                         SF          499   -     -        100                -
      Tweets w/o "RT"    HF          409   -     -        105                -
                         SF          429   -     -        98                 -
      Words              HF          8077  2346  3        4                  20
                         SF          8149  2073  4        4                  19
      Alpha (lower)      HF          5622  1197  5        4                  14
                         SF          5733  1041  6        4                  13
      Alpha w/o Stops    HF          3400  1092  3        5                  8
                         SF          3469  936   4        5                  8
      Stems              HF          3400  978   3        4                  8
                         SF          3469  844   4        4                  8
      Stems w/o "feel"   HF          2619  977   3        4                  6
                         SF          2635  843   3        4                  6
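The Lex Div column appears to be the List count divided by the Set count, rounded to a whole number; that reading is an inference from the figures, not something the slide states. A short sketch under that assumption:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

# Hypothetical token list: 6 tokens, 3 distinct types.
sample = ["feel", "better", "feel", "sick", "feel", "better"]
sample_diversity = lexical_diversity(sample)

# (List, Set) counts copied from the table; the ratio is rounded for comparison.
table_counts = {
    "Words HF": (8077, 2346),
    "Words SF": (8149, 2073),
    "Alpha (lower) HF": (5622, 1197),
    "Alpha (lower) SF": (5733, 1041),
}
rounded = {name: round(n_list / n_set) for name, (n_list, n_set) in table_counts.items()}
print(sample_diversity, rounded)
```

Under this reading a higher value means more repetition, so the SF tweets reuse their vocabulary slightly more than the HF tweets.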
  • 47. Text Mining Processes: Twitter Example – Analysis
      Corpus                    Av Word Len  Aver Sent Ln  Lexical Diversity
      HF Tweets                 4            20            3
      SF Tweets                 4            19            4
      austen-emma.txt           4            21            26
      austen-persuasion.txt     4            23            16
      austen-sense.txt          4            23            22
      bible-kjv.txt             4            33            79
      blake-poems.txt           4            18            5
      bryant-stories.txt        4            17            14
      burgess-busterbrown.txt   4            17            12
      carroll-alice.txt         4            16            12
      chesterton-ball.txt       4            17            11
      chesterton-brown.txt      4            19            11
      chesterton-thursday.txt   4            16            10
      edgeworth-parents.txt     4            17            24
      melville-moby_dick.txt    4            24            15
      milton-paradise.txt       4            52            10
      shakespeare-caesar.txt    4            11            8
      shakespeare-hamlet.txt    4            12            7
      shakespeare-macbeth.txt   4            12            6
      whitman-leaves.txt        4            35            12
  • 48. Text Mining Processes: Twitter Example – Analysis Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 48
  • 49. Text Mining Processes: Twitter Example – Doc-Term Matrix
      3538  Total    Type 216  166    111  86   83   76   66  64   58   58     … 4  4    4    4     4      4       4      4     4       4
      Total Tweets   Face like better make good hope know get sick hate realli … ye rain quit chang stress happier longer cheer without everyth
      5     Tweet1   HF   0    1      0    0    1    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      0     Tweet2   HF   0    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      7     Tweet3   HF   0    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      3     Tweet4   HF   0    0      0    0    0    0    0   0    0    1      … 0  0    0    0     0      0       0      0     0       0
      2     Tweet5   HF   0    0      1    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      1     Tweet6   HF   0    1      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      5     Tweet7   HF   1    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      7     Tweet8   HF   0    0      0    0    1    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      5     Tweet9   HF   0    1      0    0    1    0    0   1    0    0      … 0  0    0    0     0      0       0      0     0       0
      2     Tweet10  HF   0    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      …     …        …    …    …      …    …    …    …    …   …    …    …      … …  …    …    …     …      …       …      …     …       …
      5     Tweet829 SF   0    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      5     Tweet830 SF   1    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      3     Tweet831 SF   0    0      0    0    0    1    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      4     Tweet832 SF   0    0      1    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      3     Tweet833 SF   1    0      0    0    0    0    0   0    0    0      … 0  0    1    0     0      0       0      0     0       0
      3     Tweet834 SF   0    1      1    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      5     Tweet835 SF   1    0      0    0    0    1    0   0    1    0      … 0  0    0    0     0      0       0      0     0       0
      3     Tweet836 SF   0    0      0    0    0    0    0   1    1    0      … 0  0    0    0     0      0       0      0     0       0
      5     Tweet837 SF   0    1      0    0    0    0    1   0    0    0      … 0  0    0    0     0      0       0      0     0       0
      3     Tweet838 SF   1    0      0    0    0    0    0   0    0    0      … 0  0    0    0     0      0       0      0     0       0
  • 50. Prediction
      Information + Epidemiology = Infodemiology
      Monitoring and analyzing queries from Internet search engines or people's status updates on microblogs for syndromic surveillance to predict disease outbreaks
  • 51. Prediction
      Syndromic Surveillance
      Surveillance using health-related data that precede diagnosis and signal a sufficient probability of a case or an outbreak to warrant further public health response
  • 52. Monitoring the Onset of the Flu Season
      The Official Standard Way
      [Diagram; labels only] Sentinel Physician; ILI Report; Public Health Authority; Aggregated Data; Public Reports; Costly; 1-2 Week Lag
      ILI - influenza-like-illness
  • 53. Prediction Infodemiology
      What is the first thing some people do before they see a doctor or take OTC medicines?
      search / tweet
  • 54. Prediction - What are the terms, keywords, phrases…?
      What is the first thing people do before they see a doctor or take OTC medicines?
      search / tweet
      • Flu
      • Flu symptoms
      • H1N1
      • Swine Flu
      • Cold
      • Fever
      • Headache
      • …
  • 55. Prediction Infodemiology Studies
      Eysenbach. Infodemiology: Tracking Flu-Related Searches on the Web for Syndromic Surveillance. Date (M-Y): Nov-06. Type: Search
        Data Source: Impressions and clicks from a Google Ad displaying the question: “Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough.” Covers Oct. 2004-May 2005
        Dependent Variables: Number of influenza lab tests, number of positive influenza lab tests and number of ILI reports from Sentinel GPs reported by PHA in Canada
        Explanatory Variables: Number of clicks on Google Ad dealing with influenza
        Model: Linear correlation between clicks and 3 measures of publicly reported influenza
        Results: Correlation of .81 with ILI data and .90 with lab tests and positive cases.
      Hulth et al. Web Queries as a Source for Syndromic Surveillance. Date (M-Y): Feb-09. Type: Search
        Data Source: Queries to Swedish health advice site (Varguiden) from June 2005 to June 2007
        Dependent Variables: Weekly lab diagnosed cases of influenza and % of ILI reports of influenza from GPs
        Explanatory Variables: Weekly ratio of influenza related Web queries to all queries at Varguiden site
        Model: Two linear regression models, one predicting lab diagnosed cases and the other ILI reports, where the explanatory variable is based on a composite of the best predicting query terms
        Results: Average R squared was .90 for the two years.
      Ginsberg et al. Detecting influenza epidemics using search engine query data. Date (M-Y): Feb-09. Type: Search
        Data Source: Google Search: Historical web logs of ILI related Google searches 2003-2008
        Dependent Variables: % US ILI Visits in week reported to CDC
        Explanatory Variables: ILI-related search queries
        Model: logit(P) = b0 + b1 x logit(Q) + e where P is % visits and Q is normalized number of queries
        Results: Mean correlation of .90 between P & Q for 9 CDC US healthcare regions.
      Lampos & Cristianini. Tracking the flu pandemic by monitoring the Social Web. Date (M-Y): Jun-10. Type: Tweets
        Data Source: Twitter: 24 weeks of Twitter corpus in UK from June '09 to Dec '09
        Dependent Variables: Weekly ILI UK Health Protection Agency smoothed for daily values
        Explanatory Variables: Average daily "flu-score" for all tweets. Flu score for single tweet is proportion of all ILI "stem" markers (ngrams) that appear in the tweet.
        Model: linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
        Results: Average correlation for 5 UK health care regions was about .92.
      Culotta. Towards detecting influenza epidemics by analyzing Twitter messages. Date (M-Y): Jul-10. Type: Tweets
        Data Source: 575K flu-related Twitter messages and % ILI reports from CDC for period Feb 2010 to Apr 2010.
        Dependent Variables: % ILI weekly reports from CDC for specified period
        Explanatory Variables: % of messages reporting an ILI or related symptom (based on detailed classification and statistical procedures to determine whether ILI or not)
        Model: Compares simple logit model with multiple regression model having different counts for separate Tweet keywords & phrases
        Results: Aggregating keyword frequencies using separate keywords (multi-reg) works better than single aggregated (simple logit) predictor. Simple BOW classifier can be used to filter ILI messages. Achieved r=.78 for 5 weeks of validation data.
  • 56. Prediction Infodemiology Studies
      Acherekar. Predicting Flu Trends using Twitter Data. Date (M-Y): Apr-11. Type: Tweets
        Data Source: ILI "influenza" visits to Sentinel physicians reported weekly by CDC from Oct 2009 to Oct 2010 and tweets mentioning "flu-like" symptoms during same time period
        Dependent Variables: ILI influenza visits
        Explanatory Variables: Previous weeks ILI visits and aggregate number of tweets mentioning flu-like symptoms
        Model: Auto-regression model with: ILI(t) = a1*ILI(t-2) + b*Tweets(t) + e
        Results: Analysis shows time-lagged ILI is not a strong predictor but the aggregate number of tweets with flu-like symptoms is very strong with an r=.9846.
      Chan et al. Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance. Date (M-Y): May-11. Type: Search
        Data Source: Google Queries in select countries from 2003-10 including Bolivia, Brazil, India, Indonesia and Singapore
        Dependent Variables: Official weekly dengue case count from Ministry of Health or WHO
        Explanatory Variables: Computed daily counts of queries referencing Dengue fever for separate countries
        Model: O = b0 + b1*Q + e where O is official weekly count of cases in specified countries and Q is dengue related search query in those countries
        Results: Very strong correlation (~.9) except for Singapore (.82).
      Signorini et al. The Use of Twitter to Track Levels of Disease Activity and Public Concern in the US during the Influenza A H1N1 Pandemic. Date (M-Y): May-11. Type: Tweets
        Data Source: Several panels of tweets and ILI reports. Prediction panel includes weekly ILI %s and US tweets from Oct 2009 thru May 2010.
        Dependent Variables: % ILI weekly reports from CDC for specified period. Reports include nationwide and 10 regional reports.
        Explanatory Variables: Fraction of influenza related tweets for total US and CDC health regions. Utilizes complex Support Vector Machine using term-frequency statistics.
        Model: Support Vector Regression utilizing SVM feature sets (collection of terms occurring 10X per week) to estimate weekly ILI.
        Results: Strong SVR fit for both national and regional figures.
      Paul. You are what you tweet: Analyzing Twitter for Public Health. Date (M-Y): Jul-11. Type: Tweets
        Data Source: Covers a number of ailments including influenza. Based on 2 billion tweets from May 2009 to October 2010.
        Dependent Variables: % ILI weekly reports from CDC for specified period
        Explanatory Variables: Utilizes an Ailment Topic Aspect Model (ATAM) to determine the type of ailment associated with a tweet (in this case Flu)
        Model: Correlation between the probability of the "flu ailment" designation for each week and the ILI rate in the US.
        Results: Two types of ATAM/ATAM+ models were compared. The correlations with ILI were .934 and .958.
      Lampos et al. Flu detector - Tracking epidemics on Twitter. Date (M-Y): Sep-12. Type: Tweets
        Data Source: Twitter: 40 weeks of Twitter corpus in UK from June '09 to Mar '10
        Dependent Variables: Weekly ILI UK Health Protection Agency smoothed for daily values
        Explanatory Variables: Sum across all tweets of daily "flu-score" for each tweet based on number of ILI "stem" markers that appear in the tweet.
        Model: linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
        Results: Correlations around .90 for 3 UK health care regions.
  • 57. Infodemiology Example Prediction Analysis
      “Nowcasting Events from the Social Web with Statistical Learning,” Lampos and Cristianini, ACM IS&T, 9/11
      [Diagram] Twitter API -> 50M Tweets, 5 UK Regions, 6’09-12’09 -> N-Gram Stems ILI Markers -> Avg. Weekly Wght. Flu-Score by Region (time t)
      Compared with Weekly ILI Reports by health region (time t): r ~.89-.96
  • 58. Infodemiology Example Prediction Analysis
      Corpus: 50M Tweets, 5 Region UK, 6/09-4/10
      Corpus Refinement: Lower Case; Stop Words; Tokens; Stems
      Feature Selection: 1-Grams; 2-Grams; Hybrid Grams
      Tweet-Hybrid Matrix: For each tweet, the number of times each hybrid occurs times the hybrid's weight. The sum of the weighted values equals the flu-score for the tweet.
  • 59. Infodemiology Example Prediction Analysis
      • flu :(
      • RT @OMGFacts: The US death toll from the 1918 flu epidemic was so high that it created a coffin shortage
      • News_SwineFlu: #swineflu Delayed treatment for swine flu can lead to... http://t.co/ZSOjIvAW
      • The last thing I want right now is the flu
      • Much better with that lovely message from u,thank u! (heavy flu since 4 days ;
      • Supposedly only about 1% of people in the world present flu-like symptoms after flu shot. Unfortunately 66% of the people in my house do
      • Back on track...Fuck you flu!
      • Sounds like either the flu or Killian walked into the room.
      • I'll try haha, I got the flu!
      • It costs an average company $135 a day for every employee sick at home with the flu. Encourage your employees to get vaccinated....
      • i wouldn't recommend it, but a couple of days with violent stomach flu does make you appreciate the small things, nature's reset
      • Is it possible to get the flu from the flu jab?
      • Sedatives + cold/flu + ana = can't get up without having to hold a wall for 2 minutes
  • 60. Infodemiology Example Prediction Analysis (3)
      Bigram                            Frequency   Bigram                       Frequency
      (u'flu',u'shot')                  11          (u'common',u'cold')          2
      (u'flu',u'season')                8           (u'man',u'flu')              2
      (u'flu',u'symptom')               5           (u'southeast',u'asia,')      2
      (u'symptom',u'busters')           4           (u'please',u'go')            2
      (u'mtm',u'blog')                  4           (u'headache',u'since')       2
      (u'sore',u'throat')               4           (u'get',u'flu')              2
      (u'immune',u'system')             4           (u'every',u'single')         2
      (u'season',u'u2013')              4           (u"can't",u'get')            2
      (u'every',u'symptom')             3           (u'gonna',u'get')            2
      (u'since',u'last')                3           (u'stuffy',u'head')          2
      (u'flu',u'jab')                   3           (u'fever,',u'flu,')          2
      (u'sore',u'throat,')              3           (u'feeling',u'like')         2
      (u'feel',u'like')                 3           (u'symptom',u'checker')      2
      (u'jonas',u'brothers')            3           (u'runny',u'nose')           2
      (u'flu',u'shots')                 2           (u'riva',u'offering')        2
      (u'+',u'sore')                    2           (u'flu',u'pills')            2
      (u'like',u"i'm")                  2           (u'#nutrition',u'foods')     2
      (u"didn't",u'go')                 2           (u'cure',u'#colds')          2
      (u"don't",u'feel')                2           (u'cold,',u'flu')            2
      (u'offering',u'complimentary')    2           (u"i'm",u'gonna')            2
      (u'flu-like',u'symptom')          2           (u'ear',u'infection')        2
      (u'asia,',u'millions')            2           (u'flu,',u'sore')            2
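A bigram frequency table like this one can be built by pairing each token with its successor and counting the pairs. The sketch below uses only the standard library; the token lists are hypothetical stand-ins for cleaned tweets, and how the slide's own counts were produced is not shown in the deck.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one list per tweet).
tweets = [
    ["get", "flu", "shot"],
    ["flu", "shot", "today"],
    ["sore", "throat", "flu", "season"],
]

bigram_counts = Counter()
for tokens in tweets:
    # zip pairs each token with the next one, so bigrams never span two tweets
    bigram_counts.update(zip(tokens, tokens[1:]))

top = bigram_counts.most_common(1)
print(top)
```

Entries such as ('sore', 'throat') and ('sore', 'throat,') appear separately in the slide's table, which suggests punctuation was not stripped before counting; stripping it first would merge such pairs.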
  • 61. Infodemiology Example Prediction Analysis (3) 1-Grams 2-Grams Hybrids Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 61
  • 62. Infodemiology Example Prediction Analysis
      Weekly ILI Reports by health region (time t) = b0 + b1 x Avg. Weekly Wght. Flu-Score by Region (time t) + e

      Tweets   Week  Hybrid 1  Hybrid 2  Hybrid 3  …  Hybrid n  Flu-Score  Indep Var
                     wgt1      wgt2      wgt3      …  wgtn
      Tweet1   1     wgt1*N11  wgt2*N12  wgt3*N13  …  wgtn*N1n  Sum1 W*N
      Tweet2   1     wgt1*N21  wgt2*N22  wgt3*N23  …  wgtn*N2n  Sum2 W*N
      Tweet3   1     wgt1*N31  wgt2*N32  wgt3*N33  …  wgtn*N3n  Sum3 W*N   Aver Wk-1
      …        …     …         …         …         …  …         …
      Tweet m  24    wgt1*Nm1  wgt2*Nm2  wgt3*Nm3  …  wgtn*Nmn  Summ W*N   Aver Wk-24
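The two steps on this slide, scoring each tweet and then regressing the official ILI rate on the weekly average score, can be sketched in plain Python. Everything numeric here is invented for illustration: the marker weights, the weekly scores and the ILI rates are made-up values, and `flu_score` and `fit_line` are hypothetical helper names rather than the authors' code.

```python
# Hypothetical marker weights; in the study these come from feature selection.
weights = {"flu": 0.5, "sore throat": 0.3, "fever": 0.2}

def flu_score(tokens):
    """Sum of weight x occurrence count over the 1-gram and 2-gram markers in a tweet."""
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return sum(weights.get(g, 0.0) for g in grams)

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

example_score = flu_score(["sore", "throat", "flu"])   # matches "flu" and "sore throat"

weekly_scores = [0.1, 0.2, 0.4, 0.3]   # made-up weekly average flu-scores
weekly_ili = [1.2, 1.4, 1.8, 1.6]      # made-up ILI rates, built as 1.0 + 2 * score
b0, b1 = fit_line(weekly_scores, weekly_ili)
print(example_score, b0, b1)
```

Because the toy ILI series is an exact linear function of the scores, the fit recovers an intercept near 1 and a slope near 2; real data would leave a residual term e.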
  • 63. Infodemiology Example Prediction Analysis http://geopatterns.enm.bris.ac.uk/epidemics/ Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 63
  • 64. Programming Text Mining for Prediction with Python 64 Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
  • 65. Text Mining for Prediction: Programming with Python & NLTK
      # Utilizes "nltk", a Python "natural language toolkit"
      # Step 1: Initialize modules, stopwords and stemmer
      import simplejson
      import urllib
      import re
      import nltk
      from nltk.corpus import stopwords

      stopwords = stopwords.words('english')
      porter = nltk.PorterStemmer()

      def remove_stopwords(text):
          # keep only the words that are not in NLTK's English stop-word list
          stopwords = nltk.corpus.stopwords.words('english')
          content = [w for w in text if w.lower() not in stopwords]
          return content
  • 66. Text Mining for Prediction: Programming with Python & NLTK
      # Step 2: Utilize Twitter Search API to collect "flu" tweets in JSON format.
      # Then extract the "text" fields associated with each tweet.
      itemsFlu = []
      for i in range(0, 14):
          # request result pages 1 through 14, 100 tweets per page
          urlFlu = 'http://search.twitter.com/search.json?q=flu&rpp=100&page=%d' % (i + 1)
          resultFlu = simplejson.load(urllib.urlopen(urlFlu))
          itemsFlu = itemsFlu + resultFlu['results']
      tweetsFlu = [item['text'] for item in itemsFlu]

      # Step 3: Eliminate retweets
      FluTweetsNoRTs = []
      for tweetText in tweetsFlu:
          if not re.search('RT ', tweetText):
              FluTweetsNoRTs.append(tweetText)
  • 67. Text Mining for Prediction: Programming with Python & NLTK
      # Step 3b: Preparing to do text analysis
      noTweets = 0
      Flu_tmp_word = []; Flu_tot_word = []
      Flu_tmp_low = []; Flu_tot_low = []
      Flu_tmp_alpha = []; Flu_tot_alpha = []
      Flu_tmp_stop = []; Flu_tot_stop = []
      Flu_tmp_stem = []; Flu_tot_stem = []
      Flu_stems_dict = {}  # dictionary holding "text" broken into words for 1..N tweets

      # Step 4: Text processing. Produces lowercase, alpha stems devoid of stopwords
      for tweet in FluTweetsNoRTs:
          noTweets = noTweets + 1
          Flu_tmp_word = [w for w in tweet.split()]
          Flu_tot_word = Flu_tot_word + Flu_tmp_word
          Flu_tmp_low = [w.lower() for w in Flu_tmp_word]
          Flu_tot_low = Flu_tot_low + Flu_tmp_low
          # keep purely alphabetic words of two or more letters
          Flu_tmp_alpha = [cv for w in Flu_tmp_low for cv in re.findall(r'^[a-z]+[a-z]+$', w)]
          Flu_tot_alpha = Flu_tot_alpha + Flu_tmp_alpha
          Flu_tmp_stop = remove_stopwords(Flu_tmp_alpha)
          Flu_tot_stop = Flu_tot_stop + Flu_tmp_stop
          Flu_tmp_stem = [porter.stem(t) for t in Flu_tmp_stop]
          Flu_tot_stem = Flu_tot_stem + Flu_tmp_stem
          Flu_stems_dict[noTweets] = Flu_tmp_stem
  • 68. Text Mining for Prediction: Programming with Python & NLTK
      # Step 5: Analyzing frequency distributions
      fdist_Flu_stems = nltk.FreqDist(Flu_tot_stem)
      fdist_Flu_stems.plot(25)
      fdist_Flu_stems.items()[0:25]

      # Step 6: Produce "wc", a dictionary holding the count associated
      # with each word across all tweets.
      wc = {}
      for dCnt in range(1, len(Flu_stems_dict) + 1):  # tweet keys run from 1 to N
          wlist = Flu_stems_dict[dCnt]
          for word in wlist:
              wc.setdefault(word, 0)
              wc[word] += 1
  • 69. Text Mining for Prediction: Programming with Python & NLTK
      # Step 7: Eliminate all words that occur less than 3 times
      apcount = {}
      for word, wcnt in wc.items():
          if wcnt >= 3:
              apcount[word] = wcnt

      # Step 8: Write out doc-term matrix to a file
      out = open('flu-doc-matrix.txt', 'w')
      out.write('Tweets')
      # header row: one column per retained word
      for word, count in apcount.items():
          out.write(','); out.write(word)
      out.write('\n')
      # one row per tweet: 1 if the word occurs in the tweet, otherwise 0
      for tweetno, twlist in Flu_stems_dict.items():
          tweetname = "Tweet" + str(tweetno)
          out.write(tweetname)
          for word, count in apcount.items():
              if word in twlist:
                  out.write(','); out.write('1')
              else:
                  out.write(','); out.write('0')
          out.write('\n')
      out.close()
  • 70. Text Mining for Prediction: Programming with R, tm and RTextTools 70 Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
  • 71. Text Mining for Prediction: Programming with R, tm and RTextTools
      # initialize libraries
      library(twitteR)
      library(tm)
      library(RTextTools)
      library(plyr)
      library(stringr)

      # text preprocessing function
      toLowerCaseAlpha <- function(x){
        removeHTTP <- gsub("http:[/a-zA-Z0-9._]+", "", x)
        removeNonAlpha <- gsub("[^a-zA-Z]", " ", removeHTTP)
        removeMultipleSpaces <- gsub(" +", " ", removeNonAlpha)
        changeToLowerCase <- tolower(removeMultipleSpaces)
        return(changeToLowerCase)
      }

      # utilize Twitter Search API to collect "flu" Tweets in JSON format
      fluTweets = searchTwitter('flu', n=1500, lang='en')

      # extract text from JSON objects
      fluTweetText = laply(fluTweets, function(t) t$getText())

      # remove retweets
      indxFluRetweets <- grep('RT', fluTweetText)
      indxFluTweetText <- c(1:length(fluTweetText))
      indxFluTweetNonRTs <- !(indxFluTweetText %in% indxFluRetweets)
      fluTweetTextNonRTs <- fluTweetText[indxFluTweetNonRTs]

      # clean tweet text, convert lower case, remove stop words
      # create corpus for analysis
      fluTweetTextCleaned <- toLowerCaseAlpha(fluTweetTextNonRTs)
      corpusFluTweets <- Corpus(VectorSource(fluTweetTextCleaned))
      # assumption: corpusStopwords is not defined on the slide; tm's English list is used here
      corpusStopwords <- stopwords('english')
      corpusFluText <- tm_map(corpusFluTweets, removeWords, corpusStopwords)
  • 72. Text Mining for Prediction: Programming with R, tm and RTextTools
      # function to convert tokens/words to stems
      convertToStems <- function(x){
        tokens <- strsplit(x, ' +')
        listToVect <- unlist(tokens)
        stems <- wordStem(listToVect)
        return(stems)
      }

      # convert to stems and create new corpus
      FluStems <- sapply(corpusFluText, convertToStems)
      corpusFluStems <- Corpus(VectorSource(FluStems))

      # create document-term matrix
      corpusFluStemsDTM <- DocumentTermMatrix(corpusFluStems)

      # create vector of individual stems and the associated
      # frequencies with which they occur across all tweets
      fluStemsDTMMat <- as.matrix(corpusFluStemsDTM)
      fluStemsDTMMatSorted <- sort(colSums(fluStemsDTMMat))
  • 73. Text Mining for Prediction: Programming with R, tm and RTextTools
      # plot frequency distribution of stems
      noOfPlotPnts = 50
      numOfStems = length(fluStemsDTMMatSorted)
      plotStart = numOfStems - noOfPlotPnts + 1
      plotEnd = numOfStems
      # the vector is sorted ascending, so the last 50 entries are the most frequent stems
      top50FluStems = fluStemsDTMMatSorted[plotStart:plotEnd]
      barplot(top50FluStems, horiz=TRUE, cex.names=0.5, space=.5, las=1,
              xlim=c(0,100), main="Freq of Terms")
  • 74. Text Mining for Prediction: Programming with R, tm and RTextTools
      # create wordcloud of stems based on frequency
      library(wordcloud)
      library(RColorBrewer)
      fluStemNames <- names(fluStemsDTMMatSorted)
      dfFluStemDTMMat <- data.frame(word=fluStemNames, freq=fluStemsDTMMatSorted)
      pal2 <- brewer.pal(8, "Dark2")
      wordcloud(dfFluStemDTMMat$word, dfFluStemDTMMat$freq, min.freq=10, colors=pal2)
  • 75. Infodemiology is a form of Weather in the next 6 hours Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 75
  • 76. Now+Forecasting Predicting the present by analyzing large volumes of data that can be used to "forecast" current events for which official analysis has not been released Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL 76
  • 77. Basic Regression Analysis
      Dependent Variable at Time t (Standard Publicly Available Measure)
        = b0
        + b1 x Dependent Variable at Time t - n (Standard Publicly Available Measure)
        + b2 x Traditional, Publicly Available Explanatory Variable at Time t - n
        + b3 x Aggregate Search Index or Social Media Freq. Count at Time t
        + e
  • 78. Examples
      Kholodilin et al. Do Google Searches Help in Nowcasting Private Consumption? A Real-Time Evidence for the US. Date (M-Y): Apr-10. Type: Search
        Data Source: Fed reserve data on US private consumption and related Google search terms from Jan '05-Dec '09.
        Dependent Variables: Year-on-Year Growth Rate of Monthly US Real Private Consumption, ALFRED db of Fed Rsrv of St. Louis
        Explanatory Variables: 220 Google Trend/Insights Search terms related to Priv Consumption reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009
        Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
        Results: Query term principal components outperform standard Sentiment and Financial Indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.
      Sakaki et al. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. Date (M-Y): Apr-10. Type: Tweets
        Data Source: Earthquake occurrences/intensity and tweets containing the words "earthquake" or "shaking" in Japan from Aug '09-Sep '09
        Dependent Variables: Occurrence, intensity and location of an earthquake in Japan
        Explanatory Variables: Tweets that contain the query words "earthquake" and/or "shaking" by location
        Model: Utilized a support vector machine (SVM) to determine whether a tweet report refers to an earthquake occurrence. The reports are then matched against actual occurrence at a particular location to see if it is detected within 1 min of occurrence.
        Results: Of those earthquakes occurring in the Aug-Sept time frame in Japan, 96% of 24 quakes above an intensity of 3 were reported in a tweet. Of the 24 quakes, 80% were reported within a minute of occurrence. This is much faster than the reports issued by the Japan Meteorological Agency.
      Carrière-Swallow & Labbé. Nowcasting With Google Trends in an Emerging Market. Date (M-Y): Jul-10. Type: Search
        Data Source: Auto sales and Google search indices for specific automobiles in Chile from 2005 thru 2010.
        Dependent Variables: % year-on-year change in auto sales in Chile
        Explanatory Variables: Google Search index of interest in automobile purchases in Chile
        Model: % change in y-on-y autosales in Chile regressed against a composite auto Google Search index based on queries about 9 leading automobile manufacturers in Chile.
        Results: Despite relatively low rates of Internet usage in Chile, models incorporating Google Trends Automotive Index outperform benchmark specifications in both in-sample and out-of-sample nowcasts of y-on-y % changes in autosales while providing substantial gains in information delivery times
      Ciulla. Beating the news using social media: the case study of American Idol. Date (M-Y): May-12. Type: Tweets
        Data Source: Tweets from US users that related to American Idol contestants and tweeted during the voting time window for each episode during 11th season from Jan '12 - May '12
        Dependent Variables: Contestants who were eliminated or won final episode
        Explanatory Variables: Number of Tweets with hashtags signifying contestants
        Model: Contestants with the fewest mentions predicted to be the candidate eliminated.
        Results: In general the simple tweet frequencies are a strong predictor of the contestant who will be eliminated.
      Chadwick & Sengul. Nowcasting Unemployment Rate in Turkey: Let's Ask Google. Date (M-Y): Jun-12. Type: Search
        Data Source: Monthly Turkey non-agricultural unemployment rate from Jan '05-Dec '11
        Dependent Variables: Monthly Turkey non-agricultural unemployment rate
        Explanatory Variables: Google Search Index in Turkey for terms directly ("looking for job") or indirectly (job announcements) related to unemployment.
        Model: Linear auto regression models and Bayesian Model Averaging procedure to investigate whether Google search query data can improve
        Results: Models with Google Search Indicators perform better in nowcasting the 1 period, 2 periods and 3 periods ahead unemployment rate than the benchmark where we use only the lag values of the unemployment rate.
      Song, Pan, Ng. Forecasting hotel room demand using search engine data. Date (M-Y): Sep-12. Type: Search
        Data Source: Weekly data on hotel bookings in Charleston, SC and Google trend data for specific travel tourism search terms from Jan '08-Aug '09
        Dependent Variables: Weekly Hotel Bookings in Charleston, SC
        Explanatory Variables: Indexed Search Volumes from Google Trends/Insights Jan 2008-Aug 2009
        Model: Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
        Results: Test various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.
      McLaren, Shanbhogue. Using internet search data as economic indicators. Date (M-Y): Q2-11. Type: Search
        Data Source: UK monthly unemployment data and housing price growth from June '04-Jan '11 associated with Google Trend query indices for job seeking and unemployment
        Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
        Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
        Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth with query term, Home Builders and Royal Instit. of Chartered Surveyors price growth balances as exp vars.
        Results: For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
  • 79. Examples
    – Authors: Gruhl et al.
      Title: The Predictive Power of Online Chatter
      Date (M-Y): Aug-05
      Type: Blogs
      Data Source: Amazon Sales Rank for best-selling books from Jul '04-Aug '04 and number of mentions in blogs for the same time period.
      Dependent Variables: Amazon Sales Rank for 2340 bestselling books in a 4-month period (Jul 2004-Aug 2004) and spikes in these sales ranks.
      Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day).
      Model: Cross-correlation of time series for sales rank and mentions.
      Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts a future spike in sales rank quite well.
    – Authors: Choi, Varian
      Title: Predicting the Present with Google Trends
      Date (M-Y): Apr-09
      Type: Search
      Data Source: Monthly search data from Google Trends associated with various retail sales and travel data from Jan '04-Aug '08.
      Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (visitor arrivals in Hong Kong).
      Explanatory Variables: Google Trends/Insights query indices for categories and subcategories related to retail sales (general and specific) and to travel.
      Model: Google Trends indices for query subcategories related to (log values of) overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
      Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trends variables tend to outperform models that exclude these variables. In some cases the gains are small, in others substantial.
    – Authors: Suhoy
      Title: Query Indices and a 2008 Downturn: Israeli Data
      Date (M-Y): Jul-09
      Type: Search
      Data Source: Monthly official economic growth data for the months and quarters from 2nd quarter 2004 to 2nd quarter 2009, along with various Google search indices during the same period.
      Dependent Variables: Monthly percent changes of various real values of industrial production, retail trade, revenue of trade and services, consumer imports, exports of services and the employment rate.
      Explanatory Variables: 30 Google Search Index categories related to consumption and employment.
      Model: Bayesian probabilities of downturn calculated by Hamilton's two-state Markov switching AR(0) model. Used to determine whether changes in query indices can predict changes in official economic variables.
      Results: Six leading query categories, including HR (recruiting and staffing), home appliances, travel, real estate, food and drink, and beauty, contain cyclical components which conform with cycles of economic growth. The strongest relationship was between HR and unemployment.
    – Authors: Sadikov et al.
      Title: Blogs as Predictors of Movie Success
      Date (M-Y): Aug-09
      Type: Blogs
      Data Source: Weekly box office sales, gross sales, and critic and user rankings from Nov '07-Nov '08, and counts of movie references and sentiment measures for the same time period.
      Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5).
      Explanatory Variables: Analysis of the spinn3r.com blog data set 11/07-11/08, counting movie references and sentiment within a specified time window before and after the movie release date.
      Model: Linear regression of weekly rankings and sales data on blog references and sentiment.
      Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales, but weak with sentiment. Strongest relationships with timing of references in weeks after release.
    – Authors: Zhang & Skiena
      Title: Improving Movie Gross Prediction Through News Analysis
      Date (M-Y): Sep-09
      Type: Blogs & News
      Data Source: Online news stories and blogs 1960-2008, along with movie receipt and IMDB data. News stories analyzed for various pre-release time periods.
      Dependent Variables: Gross receipts from movies.
      Explanatory Variables: Variety of IMDB variables (e.g. movie genre), movie budget, number of first-week theaters, and number of stories and mentions of movie titles, director, top 3 and top 15 actors.
      Model: Linear regression of receipts on various combinations of explanatory variables. Also, k-nearest-neighbor (K-NN) analysis determining factors associated with gross receipts.
      Results: The number of news articles mentioning the movie 1 week prior has the highest correlation (~.7) and predictive ability.
    – Authors: Wu & Brynjolfsson
      Title: The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales
      Date (M-Y): Dec-09
      Type: Search
      Data Source: Quarterly Housing Sales and Housing Price Index for 50 US states, along with the Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances from 4th Quarter 2007 to 2nd Quarter 2007.
      Dependent Variables: Housing Sales and Housing Price Index (HPI).
      Explanatory Variables: Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances.
      Model: Linear autoregression between Housing Sales and prior sales, the HPI, and Search Indices for Real Estate and Real Estate Agencies, as well as the same regression for the HPI.
      Results: Strong predictive relationships between Housing Sales and searches for Real Estate Agencies. Similar relationships for HPI.
    – Authors: Asur, Huberman
      Title: Predicting the Future with Social Media
      Date (M-Y): Mar-10
      Type: Tweets
      Data Source: 3 million tweets mentioning 24 movies from Nov '09-Feb '10, along with associated box-office revenues.
      Dependent Variables: Box-office revenues for (24) movies.
      Explanatory Variables: Promotional tweets-retweets for a particular movie, tweet rates for a particular movie per hour, ratio of positive to negative sentiments for the movie.
      Model: Regression of 1st weekend box office revenues on promotional tweets-retweets and on tweet rates vs. Hollywood Stock Exchange (HSX) prices, and of 2nd weekend revenues on tweet rates and the sentiment ratio.
      Results: Promotional tweets are weakly correlated with 1st weekend revenues. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues, and sentiments improve the forecasts slightly.
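The Gruhl et al. row describes a cross-correlation of two time series, used to see whether spikes in blog mentions lead spikes in sales rank. A rough Python sketch of a lagged cross-correlation follows. The series are simulated, the three-day lead is built in by assumption, and `pearson`, `lagged_correlations` and `simulate` are hypothetical names. The sketch also uses a generic sales series rather than Amazon's rank, where a lower number means better sales.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def lagged_correlations(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        result[lag] = pearson(leader[:len(leader) - lag], follower[lag:])
    return result


def simulate(n=90, true_lag=3, seed=1):
    """Simulated data (an assumption): sales echo mentions true_lag days later."""
    rng = random.Random(seed)
    mentions = [rng.randint(0, 20) for _ in range(n)]
    sales = [
        mentions[t - true_lag] + rng.gauss(0, 1) if t >= true_lag else rng.gauss(10, 1)
        for t in range(n)
    ]
    return mentions, sales


if __name__ == "__main__":
    mentions, sales = simulate()
    corr = lagged_correlations(mentions, sales, max_lag=7)
    for lag, value in corr.items():
        print(f"lag {lag}: r = {value:+.2f}")
    print("lag with the strongest correlation:", max(corr, key=corr.get))
```

A peak at a positive lag suggests that mentions lead sales by that many periods. It does not by itself establish that chatter causes the sales change.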

Editor's notes

  1. Requires structured data (numbers and categories well-defined). Transformed by data preparation or collected with an a priori design in mind. Typically housed and organized in a relational database, data mart or data warehouse.
  2. First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied. Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls … Convert into an organized set of texts – called a corpus – standardized and prepared for the purpose of knowledge discovery.
  3.  “Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough.”
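The corpus-preparation step in note 2 (collect documents, then standardize them into a corpus) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the cleaning rules are deliberately crude, and `standardize`, `build_corpus` and `document_frequency` are hypothetical names rather than functions from any text-mining package.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in",
              "i", "it", "my", "have", "do", "you"}


def standardize(text):
    """Turn one raw document into a list of lowercase content tokens."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_corpus(raw_documents):
    """Map each document id to its standardized token list."""
    return {doc_id: standardize(text) for doc_id, text in raw_documents.items()}


def document_frequency(corpus):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for tokens in corpus.values():
        counts.update(set(tokens))
    return counts


if __name__ == "__main__":
    raw = {
        "tweet-1": "<p>I have the FLU and a cough!</p> http://example.com/x",
        "email-1": "Fever, aches and a cough since Monday.",
    }
    corpus = build_corpus(raw)
    print(corpus)
    print(document_frequency(corpus).most_common(3))
```

Real pipelines usually add steps such as stemming, handling of negation and language detection. They are left out here to keep the sketch short.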