Citizen Sensor Data Mining, Social Media Analytics and Development Centric Web Applications

Tutorial at WWW2011, Hyderabad, India, March 28, 2011

Outline
• Citizen Sensing: Overview, Social Signals, Enablers
• Role of Social Media: Activism, Journalism, Business Intelligence, Global Development
• Development-Centric Platforms: Beginnings, Architectures and Possibilities
• Systematic Study of Social Media: Spatio-Temporal-Thematic + People-Content-Network Analysis
• Trustworthiness in Social Media
• Mobile Social Computing
• Citizen Sensing @ Real-time
• Research Application: Twitris
• Conclusion & Future Work

Selvam Velmurugan (Kiirti, eMoksha NGOs); Meena Nagarajan (Content Analysis); Hemant Purohit (People & Network Analysis); Amit Sheth (Semantic Web); Ashutosh Jadhav (Event Analysis); Lu Chen (Sentiment Analysis); Pramod Anantharam (Social & Sensor Web); Pavan Kapanipathi (Real Time Web)

Preliminaries: Tutorial description: http://www2011india.com/tutorialstr27.html and http://knoesis.org/library/resource.php?id=1030. Lots of breadth (many examples) and some depth (a few algorithms), mainly to convey insights. Twitter > Myspace/Facebook > SMS; each has different reach, focus and importance. Given the time, only parts will be covered today! Citations and further reading are at the bottom of slides and at the end. Images belong to their copyright holders; copyright information for images, where available, is at the end.

Aim: What are the research opportunities and technical challenges in gaining insights from, and making use of, social media content (esp. citizen sensing)? Provide a structure to a vast array of issues. Breadth, not depth.

Citizen Sensing: Ordinary people (citizens of the Internet) are able to use Web 2.0 and social networks. Citizen sensing is the human-centric activity** of observing, reporting and disseminating information (facts, opinions, views) via text, audio, video and built-in device sensors (and smart devices). ** direct/indirect, collective/individual. Human-in-the-loop (participatory) sensing + Web 2.0 + Mobile computing = Emergence of Citizen-Sensor networks. Image: http://bit.ly/hmZe428 A. Sheth, 'Citizen Sensing, Social Signals, and Enriching Human Experience', IEEE Internet Computing, July/August 2009, pp. 80-85.

Social Signals: Understanding meaningful citizen sensor observations. Social Signal Processing: Aggregation, Enhancement, Analysis, Visualization, and Interpretation. Citizen-Sensor network: immense potential to disseminate social signals quickly and in real time. A. Sheth, 'Citizen Sensing, Social Signals, and Enriching Human Experience', IEEE Internet Computing, July/August 2009, pp. 80-85. Image: http://bit.ly/gWHSjD

Enablers: Mobile Devices & Ubiquitous Connectivity
• 1B+ Internet-connected mobile devices (2010)
• Smartphones > Notebooks + Netbooks (2010E)
• 500K+ mobile phone applications
• 74% of mobile phone users (2.4B) worldwide used SMS (2007)
The mobile device might qualify as humankind's primary tool; it redefines the way we engage with people, information, etc. Mobile is global: ubiquity, 24x7. Built-in sensors: environmental, biometric/biomedical, ...

Enablers: Web 2.0 & Social Media. 500M+ Facebook users; 100M+ Twitter users, 85M+ tweets/day; Internet users: 1.8 billion. A large variety of social media and traditional media interact, creating a potent mixture. Types of UGC:

Ping (social network for music). Image: http://bit.ly/euLETT

Outline
• Citizen Sensing: Overview, Social Signals, Enablers
• Role of Social Media (important classes of applications): Activism, Journalism, Business Intelligence, Global Development
• Development-Centric Platforms: Beginnings, Architectures and Possibilities
• Systematic Study of Social Media: Spatio-Temporal-Thematic + People-Content-Network Analysis
• Trustworthiness in Social Media
• Mobile Social Computing
• Citizen Sensing @ Real-time
• Research Application: Twitris
• Conclusion & Future Work

Citizen Sensors in Action: Mumbai Terror Attack; Iran Election 2009; Haiti Earthquake 2010; US Healthcare Debate 2009. Image: http://huff.to/hp0OhA

Revolution 2.0: Political/Social Activism. Ghonim, who has been a figurehead for the movement against the Egyptian government, told Blitzer, “If you want to liberate a government, give them the internet.” To a follow-up question, Ghonim replied succinctly, “Ask Facebook.” http://cnn.com/video/?/video/world/2011/02/13/nr.social.media.revolution.cnn http://cnn.com/video/?/video/tech/2011/02/11/barnett.egypt.social.media.cnn Image caption: An Egyptian anti-government demonstrator sleeps on the pavement under spray paint that reads 'Al-Jazeera' and 'Facebook' at Cairo's Tahrir Square on February 7, 2011. http://www.cbsnews.com/stories/2011/02/15/eveningnews/main20032118.shtml

Citizen Journalism, Twitter Journalism. Images: http://bit.ly/9GVfPQ, http://bit.ly/hmrTYV

News is increasingly Social: Social News. Social media and global media are intertwined.

Business Intelligence: Trend Spotting, Forecasting, Brand Tracking, Targeted Advertising. Sysomos (http://www.sysomos.com/) - Business intelligence by engaging, measuring and understanding activities in social media. Trendspotting (http://trendspotting.com) - Detecting, analyzing and evaluating trends for business. Simplify (http://simplify360.com/) - A collaborative platform to monitor, measure and engage customers using social media. Shoutlet (http://www.shoutlet.com/) - Managing social media marketing communication using a single platform. Reputation.com (http://www.reputationdefender.com/) - Preserves privacy and defends reputation by protecting against attacks on personal information. Image: http://bit.ly/eAebBb

Social Development (Education, Health, eGov). LiveMocha (http://www.livemocha.com/) - Online language learning tool with social engagement, bridging the gap!! Soliya (http://www.soliya.net/) - Dialogue between students from diverse backgrounds across the globe using the latest multimedia technologies. ProjectEinstein (http://digital-democracy.org/what-we-do/programs/) - A photography-based digital penpal program connecting youths in refugee camps to the world. PatientsLikeMe (http://mashable.com/2010/07/13/social-media-health-trends/) - Facilitates sharing health profiles, finding patients with similar ailments, and learning from discussions. TrialX (http://trialx.com/) - Finding clinical trials of new treatments and connecting with clinical trial investigators. Image: http://bit.ly/ayyjlU

Collaboration: We “simply do not have enough genes to program the brain fully in advance,” so we must work together, extending and supporting our own intelligence with “social prosthetic” systems that make up for our missing cognitive and emotional capacities: “Evolution has allowed our brains to be configured during development so that we are ‘plug compatible’ with other humans, so that others can help us extend ourselves.” - Harvard "Group Brain Project"

Beginnings. Open Source: Linux, Apache. Social Networks: Facebook, Twitter, MySpace. Crowdsourcing: Wikipedia, Kiva, Ushahidi, Kiirti, SwiftRiver, Sahana. Collaborative Governance: Peer-to-Patent, e-Democracia.

Popular Initiatives. Facebook + Twitter: Iran post-election protests; Tunisia and Egypt uprisings. Ushahidi: Kenyan post-election violence; India, Lebanon, Afghanistan, and Sudan elections; Haiti Earthquake; Pakistan Floods. Kiirti: BBMP election monitoring; Bangalore AutoWatch.

FixOurCity - Chennai. Built on top of the FixMyCity open-source codebase. Stage I: report by area/ward and street; integration with Google Maps; displays ward member name and contact details; select category of issue, description and severity; confirmation through email to avoid misuse. Stage II/III: normalize incoming reports to official wards and categories; integration with the Corporation website to allow auto-forwarding and updating of reports.

Ushahidi. Information Collection: SMS (FrontlineSMS, Clickatell), Email, Web. Visualization/Interactive Mapping: Timeline, Category, Geo-spatial. Alerts: Geo-spatial. Admin: User Management, Report Moderation/Creation, Site Statistics.

SwiftRiver: Filtering and verification of real-time data from channels like Twitter, SMS, email and RSS feeds. Offers organizations an easy way to apply semantic analysis and verification algorithms to different sources of information. Goals: speed up the process of managing real-time data streams (email, web, SMS, Twitter); add elusive context (location, historical data) and history (reputation of sources) to online research; offer a dashboard for monitoring multiple channels of information; offer advanced aggregation and analytic tools, online or offline; give the user control over advanced curation tools and filters.

SwiftRiver Architecture - I

SwiftRiver Architecture - II

Sahana: A free and open source disaster management system. A web-based collaboration tool that addresses the common coordination problems during a disaster between the government, civil society (NGOs) and the victims themselves.

Sahana modules:
• Mapping - Situation awareness & geospatial analysis.
• Messaging - Sends & receives alerts via email & SMS.
• Document Library - A library of digital resources, such as photos & office documents.
• Missing Persons Registry - Report and search for missing persons.
• Disaster Victim Identification
• Requests Management - Tracks requests for aid and matches them against donors who have pledged aid.
• Shelter Registry - Tracks the location, distribution, capacity and breakdown of victims in shelters.
• Hospital Management System - Hospitals can share information on resources & needs.
• Organization Registry - "Who is doing What & Where". Allows relief agencies to coordinate their activities.
• Ticketing - Master message log to process incoming reports & requests.
• Delphi Decision Maker - Supports the decision making of large groups of experts.

Peer to Patent: Peer to Patent opens the patent examination process to public participation for the first time. It is an online system that aims to improve the quality of issued patents by enabling the public to supply the USPTO with information relevant to assessing the claims of pending patent applications.

Peer to Patent - Video: http://www.peertopatent.org/video/p2p640/VideoPlayer.html

Kiirti: Allows you to set up your own instance of the Ushahidi Platform without having to install it on your own web server. Provides pre-integrated voice and SMS reporting capabilities within India.

Kiirti – User Interaction Flow

Kiirti - Flywheel of Engagement

Future Possibilities: Online Dispute Resolution (30M+ pending cases in India's courts); Public Policy Reviews; Crisis Management; Effective Local Governance

Challenges: Information overload; processing and de-duping messages; accessibility (e.g. network congestion, access points, …); incorrect or partial data; trustworthiness of source (e.g. influence, reputation, …); metadata extraction (e.g. geo data, named entities, sentiment/opinion, …); collaboration; policy discussions; structure or hierarchy

Dimensions of Systematic Study of Social Media: Spatio-Temporal-Thematic + People-Content-Network

Social Information Processing: "Who says what, to whom, why, to what extent and with what effect?" [Lasswell] Network: social structure emerges from the aggregate of relationships (ties). People: poster identities, the active effort of accomplishing interaction. Content: studying the content of communication.

Studying Online Human Social Dynamics: How does the (semantics or style of) content fit into the observations made about the network? Often, the three-dimensional dynamic of people, content and link structure is what shapes the social dynamic. Example: how do the topic of discussion, the emotional charge of a conversation, the presence of an expert and the connections between participants together explain information propagation in a social network? Image: http://bit.ly/dFzjU2

Why People-Content-Network + Spatial-Temporal-Thematic metadata? (Example of Understanding Crisis Data)

Metadata/Annotations. Metadata: an organized way to study. Topics: types; creation/extraction and storage; use. Image: http://www.biowisdom.com/tag/metadata/

Metadata Infrastructure: Example for Tweet Annotation (mapped out tweet). Image: http://rww.to/9zyoQa

http://www.readwriteweb.com/archives/what_twitter_annotations_mean.php

People Metadata: Variety of Self-expression Modes on Multiple Social Media Platforms. Explicit information from user profiles: user names, pictures, videos, links, demographic information, group memberships... Implicit information from user attention metadata: page views, Facebook 'Likes', comments; Twitter 'Follows', retweets, replies...

People Metadata: Various Types: Identification, Interests, Activity, Network

People Metadata: Continued. Web Presence: - User affiliations - KLOUT Score: influence measure (www.klout.com)

Content Metadata
1. Content-independent metadata • date, location, author, etc.
2. Content-dependent metadata
 a. Direct content-based metadata
  i. Explicit/mentioned content metadata • named entities in content
  ii. Implicit/inferred content metadata • related named entities from knowledge sources
 b. Indirect content-based metadata (external metadata) • context inferred from URLs in content (images, links to articles, Foursquare check-ins, etc.)
V. Kashyap and A. Sheth, 'Semantic Heterogeneity in Global Information Systems: The Role of Metadata, Context and Ontologies,’ in Cooperative Information Systems: Current Trends and Directions, M. Papazoglou and G. Schlageter (Eds.), Academic Press, 1998, pp. 139-178.

Content Metadata: Content Independent. For tweets: published date and time; location (where the tweet was generated); tweet posting method (smartphone, twitter.com, Twitter clients); author information. For SMS: publish date and time; location (where the SMS was generated); receiver (NGO, government organization); carrier information (available on request).

Content Metadata: Content Dependent (Tweet). Direct content-based metadata; indirect content-based metadata (external metadata)

Content Metadata: Content Dependent (SMS). Direct content-based metadata

Network Metadata: Connections/relationships matter! (foundation for the network)

Metadata: Creation, Extraction and Storage

Metadata Creation & Extraction. Extracted metadata: directly visible information from the user profile, tweet content and community structure. Created metadata: obtained after processing information in the user profile, content and/or network structure.

An Example. Length: 109 characters. General topic: Egypt protest. This poor {sentiment_expression: {target: “Lara Logan”, polarity: “negative”}} woman! RT @THR CBS News’ {entity: {type: “News Agency”}} Lara Logan {entity: {type: “Person”}} Released From Hospital {entity: {type: “Hospital”}} After Egypt {entity: {type: “Country”}} Assault {topic} http://bit.ly/dKWTY0 {external_URL}
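One way to hold such annotations is as stand-off records that point back into the unmodified tweet text. The sketch below is only an illustration: the record layout, the helper `annotate`, and the assumption that the tweet reads "RT @THR CBS News' ..." with ordinary spacing are ours, not part of the slides.

```python
# Stand-off annotation sketch; the field names below are hypothetical.
tweet = ("This poor woman! RT @THR CBS News' Lara Logan Released "
         "From Hospital After Egypt Assault http://bit.ly/dKWTY0")


def annotate(text, phrase, label, **attrs):
    """Return one annotation record locating `phrase` inside `text`."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    record = {"start": start, "end": start + len(phrase),
              "surface": phrase, "label": label}
    record.update(attrs)
    return record


annotations = [
    annotate(tweet, "This poor", "sentiment_expression",
             target="Lara Logan", polarity="negative"),
    annotate(tweet, "CBS News", "entity", type="News Agency"),
    annotate(tweet, "Lara Logan", "entity", type="Person"),
    annotate(tweet, "Hospital", "entity", type="Hospital"),
    annotate(tweet, "Egypt", "entity", type="Country"),
    annotate(tweet, "Assault", "topic"),
    annotate(tweet, "http://bit.ly/dKWTY0", "external_URL"),
]

# Content-level metadata plus the per-span annotations.
metadata = {"length": len(tweet), "general_topic": "Egypt protest",
            "annotations": annotations}
```

Keeping the original text untouched and storing offsets lets several annotators (entity, sentiment, topic) add records independently.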

Why is the Semantic Web a Standard for Social Metadata? Rich Snippets, Open Graph: RDFa-based Semantic Web standards for social data. Relationships/connections play a central role (not just hyperlinks as in Web data), so treating relationships as first-class objects is important. Semantic Web technologies and standards provide better techniques to capture and represent metadata and relationships.

Semantic Web in One Slide. Representing Semantic Web data: RDF, with relationships as first-class objects: <subject, predicate, object>. Representing knowledge and agreements: nomenclature, taxonomy, folksonomy, ontology: OWL. Annotation: RDFa, XLink, model reference. Web of Data: Linked Open Data. Querying: SPARQL. Rules: SWRL, RIF.
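To make the <subject, predicate, object> idea concrete, here is a minimal sketch in plain Python rather than a real RDF library. The `ex:` names and the `match` helper are hypothetical; `match` only imitates a single SPARQL-style triple pattern, with terms starting with "?" acting as variables.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# All "ex:" identifiers are made up for illustration.
TRIPLES = {
    ("ex:tweet1", "ex:mentions", "ex:LaraLogan"),
    ("ex:tweet1", "ex:postedBy", "ex:user42"),
    ("ex:tweet1", "ex:aboutTopic", "ex:EgyptProtest"),
    ("ex:LaraLogan", "rdf:type", "ex:Person"),
}


def match(pattern, store):
    """Return variable bindings for one triple pattern against the store."""
    results = []
    for triple in sorted(store):
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A repeated variable must bind to the same term each time.
                if binding.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            results.append(binding)
    return results


# "Who is mentioned in tweet1?"
who = match(("ex:tweet1", "ex:mentions", "?who"), TRIPLES)
```

Because the relationship (`ex:mentions`) is itself a named term, it can be queried, typed and linked like any other resource, which is the "first-class object" point above.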

How to Save and Use Metadata? Store metadata as data and use standard database techniques. Use filtering and clustering, summarization, statistics: implicit semantics.

Richer representation, support for relationships, context

Supports use of background knowledge

Better integration, powerful analysis

Semantics: the implicit, the formal and the powerful

Social metadata on the Web [H. Dacquin]

Metadata Extraction from Informal Text. Meena Nagarajan, ‘Understanding User-Generated Content on Social Media,’ Ph.D. Dissertation, Wright State University, 2010

Characteristics of Text on Social Media

Content Analysis: Typical Sub-tasks
• What opinions are people conveying via the content?
• What can we infer about the author from the content they post?
• Context (external to content) extraction: URL extraction, analyzing external content
• Recognize key entities mentioned in content: Information Extraction (entity recognition, anaphora resolution, entity classification, ...)
• Discovery of semantic associations between entities
• Topic classification, aboutness of content: What is the content about?
• Intention analysis: Why did they share this content?

Research Efforts, Contributions in this space: examining the usefulness of multiple context cues for text mining algorithms; compensating for informal, highly variable language and lack of context; using context cues: document corpus, syntactic and structural cues, social medium, external domain knowledge… In this talk, we highlight sample metadata creation tasks: NER, Key Phrase Extraction, Intention, Sentiment/Opinion Mining.

Part 1: NER, Key Phrase Extraction. Named Entity Recognition: I loved <movie> the hangover </movie>! Key Phrase Extraction.

Multiple Context Cues Utilized for NER in Blogs and MySpace Forums. Meena Nagarajan, ‘Understanding User-Generated Content on Social Media,’ Ph.D. Dissertation, Wright State University, 2010

Multiple Context Cues Utilized for Keyphrase Extraction from Twitter, Facebook and MySpace. Meena Nagarajan, ‘Understanding User-Generated Content on Social Media,’ Ph.D. Dissertation, Wright State University, 2010

Focus, Impact: We focus on techniques that exploit content and context aspects on social media platforms. Our methods highlight a combination of top-down and bottom-up analysis for informal text: statistical NLP and ML algorithms over large corpora (bottom-up); models and rich knowledge bases in a domain (top-down).

Named Entity Recognition. Task of NER: identifying and classifying tokens. “I loved your music Yesterday!”: Yesterday is an album. “It was THE HANGOVER of the year..lasted forever..”: The Hangover is not a movie here. “So I went to the movies..bad choice picking “GI Jane” worse now”: GI Jane is a movie.

NER in prior work vs. NER for Informal Text

Cultural Named Entities. NER focus in this work: cultural named entities, i.e. artifacts of culture: names of books, music albums, films, video games, etc. These are often common words in a language: The Lord of the Rings, Lips, Crash, Up, Wanted, Today, Twilight, Dark Knight…

What makes cultural entity extraction challenging: Varied senses, several poorly documented. Star Trek: movies, TV series, media franchise.. and cuisines!! Changing contexts with recent events: The Dark Knight is a movie; it is also a reference to Obama and the health care policy. Comprehensive sense definitions, enumeration of contexts and labeled corpora for all senses are unrealistic expectations when building a NER system. NER: relaxing the closed-world sense assumptions.


A Spot and Disambiguate Paradigm. NER is generally a sequential prediction problem; an NER system achieves a 90.8 F1 score on the CoNLL-2003 NER shared task (PER, LOC, ORGN entities).
• Starting off with a dictionary or list of entities we want to spot
• Spot, then disambiguate in context (natural language, domain knowledge cues)
• Is this mention of “the hangover” in a sentence referring to a movie?
CoNLL 2003: http://www.cnts.ua.ac.be/conll2003/ner/


(a) Multiple Senses in the Same Domain

Algorithm Preliminaries. Problem definition: Cultural entity identification for music albums and tracks, e.g. Smile (Lily Allen), Celebration (Madonna) • Corpus: MySpace comments – context-poor utterances, e.g. “Happy 25th Lilly, Alfie is funny” • Goal: Semantic annotation of music named entities (w.r.t. MusicBrainz). Figure: MusicBrainz schema

Using a Knowledge Resource for NER is not straightforward

Approach Overview. Which ‘Merry Christmas’? ‘So Good’ is also a song! Scoped relationship graphs – using context cues from the content, webpage title, URL…, e.g. new Merry Christmas tune – reduce potential entity spot size, e.g. new albums/songs • Generate candidate entities • Spot and disambiguate

Sample Real-world Constraints. Which ‘Merry Christmas’? ‘So Good’ is also a song! Career restrictions - “release your third album already..” Recent album restrictions - “I loved your new album..” Artist age restrictions - “happy 25th rihanna, loved alfie btw..” etc.

Scoping via Real-world Restrictions

Scoped Entity Lists. User comments are on MySpace artist pages – Contextual restriction: artist name – Assumption: no other artist/work mention. The naive spotter has the advantage of spotting all possible mentions (modulo spelling errors) – but it generates several false positives, e.g. “this is bad news, ill miss you MJ”
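A naive spotter scoped to one artist page can be sketched as a case-insensitive dictionary lookup. This is an assumption-laden illustration: `KNOWLEDGE_BASE` is a tiny hand-made stand-in for the MusicBrainz-derived entity lists, and `spot` is a hypothetical helper, not the authors' implementation.

```python
import re

# Hand-made stand-in for a scoped entity list (artist -> album/track titles).
KNOWLEDGE_BASE = {
    "Michael Jackson": ["Bad", "Thriller"],
    "Lily Allen": ["Smile", "Alfie"],
}


def spot(comment, artist, kb=KNOWLEDGE_BASE):
    """Return (offset, surface form, title) for every title of `artist` found."""
    spots = []
    for title in kb.get(artist, []):
        pattern = r"\b" + re.escape(title) + r"\b"
        for m in re.finditer(pattern, comment, flags=re.IGNORECASE):
            spots.append((m.start(), m.group(0), title))
    return sorted(spots)


# The slide's example: "bad" is spotted although it is not the album "Bad".
mj_spots = spot("this is bad news, ill miss you MJ", "Michael Jackson")
```

Scoping keeps the candidate list short, but as the example shows, every spot still needs a disambiguation step before it counts as an entity mention.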

But there are also non-music mentions. Challenge 1: Several senses in the same domain. Scoping relationship graphs narrows the possible senses, which solves the named entity identification problem only partially. Challenge 2: Non-music mentions: “Got your new album Smile. Loved it!” vs. “Keep your SMILE on!”

Using Language Features to eliminate incorrect mentions. Syntactic features: POS tags, typed dependencies, ... Word-level features: capitalization, quotes. Domain-level features.

Hand-labeling - Fairly Subjective. 1800+ spots in MySpace user comments from artist pages. Manual annotations for a post such as “Keep your <track>SMILE</track> on!”: valid album/track named entity (good spot), invalid named entity (bad spot), or hard to tell (inconclusive). 4-way annotator agreement shows that agreeing on the accuracy of a spot is hard even for domain experts – Madonna 90% agreement – Rihanna 84% agreement – Lily Allen 53% agreement (many named entities of ambiguous nature and usage)

Combining a Dictionary Spotter + NLP Analytics. Daniel Gruhl, Meena Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, ‘Context and Domain Knowledge Enhanced Entity Spotting in Informal Text,’ The 8th International Semantic Web Conference, 2009: 260-276

Lessons Learned - NER on Social Media Text using a Knowledge Base. Intelligent pruning of a knowledge base goes a long way in improving precision. Two-stage approach: chaining NL learners over the results of domain-model-based spotters improves accuracy by up to a further 50% and allows the more time-intensive NLP analytics to run on less than the full set of input data.

Music NER application: BBC SoundIndex (IBM Almaden), Pulse of the Online Music Populace. Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth: ‘Multimodal Social Intelligence in a Real-Time Dashboard System,’ special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media", 2010. Project: http://www.almaden.ibm.com/cs/projects/iis/sound/

The Vision: http://www.almaden.ibm.com/cs/projects/iis/sound/

Several Insights: Trending popularity of artists. Trending topics in artist pages. Only 4% negative sentiments; perhaps ignore the Sentiment Annotator on this data source? Ignoring spam can change the ordering of popular artists.

Predictive Power of Data: Billboard's Top 50 Singles chart during the week of Sept 22-28 ’07 vs. MySpace popularity charts. A user study indicated a 2:1, and up to 7:1 (younger age groups), preference for the MySpace list. Challenging traditional polling methods!

Key Phrase Extraction - Example: Key phrases extracted from prominent discussions on Twitter around the 2009 Health Care Reform debate and the 2008 Mumbai Terror Attack on one day

Key Phrase Extraction from Social Media Text. Different from Information Extraction: key phrase extraction does not concern itself with classification into a type. Extracting vs. assigning key phrases; focus: key phrase extraction. Prior work focus: extracting phrases that summarize a single document (a news article, a web page, a journal article, a book, ...). Focus here: summarizing multiple documents (UGC) around the same event/topic of interest.

Key Phrase Extraction on Social Media Content has some differences. 1. Need to preserve/isolate the social behind the social data in summarizing key phrases: what is said in Egypt vs. the USA should be viewed in isolation. 2. Need to account for redundancy, variability and off-topic content: “Met up with mom for lunch, she looks lovely as ever, good genes .. Thanks Nike, I love my new Gladiators ..smooth as a feather. I burnt all the calories of Italian joy in one run.. if you are looking for good Italian food on Main, Buca is the place to go.”

Where is the Social and Cultural Logic in UGC? Thematic components: similar messages convey similar ideas. Space, time metadata: role of community and geography in communication. Poster attributes: age, gender, socio-economic status reflect similar perceptions. ‘Social applies to data as well as metadata’

Features used in social Key Phrase extraction (common to prior efforts). Focus: n-grams, spatio-temporal metadata (social components). Syntactic cues: in quotes, italics, bold; in document headers; phrases collocated with acronyms. Document and structural cues: two-word phrases, appearing in the beginning of a document, frequency, presence in multiple similar documents, etc. Linguistic cues: stemmed form of a phrase, phrases that are simple and compound nouns in sentences, etc.

Key Phrase Extraction Overview. “President Obama in trying to regain control of the health-care debate will likely shift his pitch in September” 1-grams: President, Obama, in, trying, to, regain, ... 2-grams: “President Obama”, “Obama in”, “in trying”, “trying to”, ... 3-grams: “President Obama in”, “Obama in trying”, “in trying to”, ...
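The n-gram candidates listed above can be generated with a few lines of code. This is a minimal sketch that assumes plain whitespace tokenization; `ngrams` is a hypothetical helper name.

```python
def ngrams(text, n):
    """Return all contiguous n-word phrases in `text`, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ("President Obama in trying to regain control of the "
            "health-care debate will likely shift his pitch in September")

unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

Each n-gram is then a candidate descriptor to be weighted; most of them (for example "in trying") are expected to score low.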

A descriptor is an n-gram weighted by:
• Thematic Importance
 – Redundancy: statistically discriminatory in nature
 – Variability: contextually important
• Spatial Importance (local vs. global popularity)
• Temporal Importance (always popular vs. currently trending)
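The slide names the three weighting dimensions but not their formulas. The sketch below shows one plausible way to combine them; every formula here (the local share, the smoothed trend ratio, the product) is an assumption made for illustration and is not taken from the cited paper.

```python
def spatial_importance(local_count, global_count):
    """Share of a phrase's mentions that come from one region (1.0 = purely local)."""
    return local_count / global_count if global_count else 0.0


def temporal_importance(today_count, past_daily_counts):
    """Today's count divided by the smoothed average of earlier days (>1 = trending)."""
    if past_daily_counts:
        baseline = sum(past_daily_counts) / len(past_daily_counts)
    else:
        baseline = 0.0
    return today_count / (baseline + 1.0)  # +1 avoids division by zero


def descriptor_score(thematic, spatial, temporal):
    """Combine the three importance values; a plain product is assumed here."""
    return thematic * spatial * temporal


# Hypothetical numbers: a phrase with thematic weight 2.0, 30 of 40 mentions
# local, and 20 mentions today against 4 per day previously.
score = descriptor_score(2.0,
                         spatial_importance(30, 40),
                         temporal_importance(20, [4, 4, 4]))
```

A product makes a phrase rank high only when it is thematically relevant, locally concentrated and currently trending at once; a weighted sum would be a softer alternative.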

TF-IDF vs. spatio-temporal-thematic scores rank phrases differently: ‘foreign relations’ surfaces higher. M. Nagarajan et al., Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data - Challenges and Experiences, Tenth International Conference on Web Information Systems Engineering, Oct 5-7, 2009: 539-553

Next task: Eliminating Off-topic Content. Frequency-based heuristics will not eliminate off-topic content that is ALSO POPULAR. Popular key phrases “single” and “Jesus” are unrelated to Madonna’s music. M. Nagarajan et al., Monetizing User Activity on Social Networks - Challenges and Experiences, 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep 15-18 2009: 92-99

Eliminating off-topic content: Example • “Yeah i know this a bit off topic but the other electronics forum is dead right now. im looking for a good camcorder, somethin not to large that can record in full HD only ones so far that ive seen are sonys” • “Canon HV20. Great little cameras under $1000.” Possible relevant phrases are: ['camcorder', 'canon hv20', 'little camera', 'hd', 'cameras', 'canon']

Eliminating off-topic content: Approach Overview • Assume one or more seed words (from a domain knowledge base): C1 = ['camcorder'] • Extracted key words/phrases: C2 = ['electronics forum', 'hd', 'camcorder', 'somethin', 'ive', 'canon', 'little camera', 'canon hv20', 'cameras', 'offtopic'] • Gradually expand C1 by adding phrases from C2 that are strongly associated with C1 • Mutual-information-based algorithm [WISE2009]
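The gradual expansion of C1 can be sketched as follows. This is not the WISE 2009 algorithm itself: pointwise mutual information over post-level co-occurrence is swapped in as the association measure, and the toy `posts`, the threshold of 0.5 and the helper names are assumptions made for illustration.

```python
import math
from itertools import combinations


def build_pmi(posts):
    """posts: list of sets of phrases. Returns a function pmi(a, b)."""
    n = len(posts)
    df, co = {}, {}
    for phrases in posts:
        for p in phrases:
            df[p] = df.get(p, 0) + 1
        for pair in combinations(sorted(phrases), 2):
            co[pair] = co.get(pair, 0) + 1

    def pmi(a, b):
        joint = co.get(tuple(sorted((a, b))), 0)
        if joint == 0:
            return float("-inf")  # never seen together
        return math.log2(joint * n / (df[a] * df[b]))

    return pmi


def expand(seeds, candidates, pmi, threshold=0.5):
    """Grow the topical set with candidates strongly associated with any member."""
    topical = list(seeds)
    remaining = [c for c in candidates if c not in topical]
    changed = True
    while changed:
        changed = False
        for c in list(remaining):
            if any(pmi(c, t) > threshold for t in topical):
                topical.append(c)
                remaining.remove(c)
                changed = True
    return topical


# Toy corpus: each post is reduced to the set of phrases it contains.
posts = [
    {"camcorder", "hd"},
    {"camcorder", "canon hv20", "canon"},
    {"canon hv20", "little camera", "cameras"},
    {"electronics forum", "offtopic"},
    {"somethin", "ive"},
    {"camcorder", "electronics forum"},
]
C1 = ["camcorder"]
C2 = ["electronics forum", "hd", "camcorder", "somethin", "ive", "canon",
      "little camera", "canon hv20", "cameras", "offtopic"]
topical = expand(C1, C2, build_pmi(posts))
```

On this toy data the expansion reaches 'canon hv20' only through 'canon', which illustrates why the set is grown gradually rather than scored against the seed alone.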

Key Phrases & Aboutness - Evaluations. Are the key phrases we extracted topical and good indicators of what the content is about? If they are, they should act as effective index/search phrases and return relevant content. Evaluation application: Targeted Content Delivery

Targeted Content Delivery - Evaluations. We took 12K posts from MySpace and Facebook Electronics forums; extracted baseline phrases using Yahoo Term Extractor; extracted phrases using the key phrase extraction and elimination algorithm described earlier; generated targeted content from Google AdSense; asked users if the delivered content matched the posts.

Targeted Content for all content vs. extracted key phrases

Social Key Phrase Extraction: Impact, Contributions. TF-IDF + social contextual cues yield more useful phrases that preserve social perceptions. Corpus + seeds from a domain knowledge base eliminate off-topic phrases effectively.

Why do people share? Outside of the psychological incentives, broadly, people share to seek information OR share information. If we understand the intent behind a post, we can build systems that respond to it better. Focus of our work: understand intent to deliver targeted content. Use case: online content-targeted advertisements on social media platforms.

Circa 2009 - Content-based Ads

Today – Content-based Ads on Profiles

What is going on here..

But interests on profiles do not translate to purchase intents – Interests are often outdated.. – Intents are rarely stated on a profile.. • Some profile data does seem to work – Example: new store openings, sales targeted at location information in a profile

But Monetizable Intents are Elsewhere, away from their profiles..

Showing clear intents on MySpace posts but no relevant ads..

Targeted Content-based Advertising – Non-trivial – Non-policed content • Brand image, unfavorable sentiments – People are there to network • User attention to ads is not guaranteed – Informal, casual nature of content • People are sharing experiences and events – Main message overloaded with off-topic content [1]. Example: “I NEED HELP WITH SONY VEGAS PRO 8!! Ugh and i have a video project due tomorrow for merrill lynch :(( all i need to do is simple: Extract several scenes from a clip, insert captions, transitions and thats it. really. omgg i cant figure out anything!! help!! and i got food poisoning from eggs. its not fun. Pleasssse, help? :(” [1] Learning from Multi-topic Web Documents for Contextual Advertisement, Zhang, Y., Surendran, A. C., Platt, J. C., and Narasimhan, M., KDD 2008

Focus: Discuss Methodology, Preliminary Results in… • Identifying intents behind user posts on social networks – Identify content with monetization potential • Identifying keywords for advertising in user-generated content – Considering interpersonal communication & off-topic chatter. M. Nagarajan et al., ‘Monetizing User Activity on Social Networks - Challenges and Experiences,’ 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep 15-18 2009: 92-99

Investigations. User studies – Hard to compare activity-based ads to the state of the art – Impressions to clickthroughs – How well are we able to identify monetizable posts? – How targeted are ads generated using our keywords vs. the entire user-generated content? M. Nagarajan et al., ‘Monetizing User Activity on Social Networks - Challenges and Experiences,’ 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep 15-18 2009: 92-99

Identifying Intents on SM is different from that on the Web. Scribe intent is not the same as Web search intent [1]. People write sentences, not keywords or phrases. Presence of a keyword does not imply navigational/transactional intents – ‘am thinking of getting X’ (transactional) – ‘I like my new X’ (information sharing) – ‘what do you think about X’ (information seeking). Useful here would be to identify transactional and information-seeking intents. [1] B. J. Jansen, D. L. Booth, and A. Spink, “Determining the informational, navigational, and transactional intent of web queries,” Inf. Process. Manage., vol. 44, no. 3, 2008.

Not focusing on the entity but on the action patterns surrounding the entity. “where can I find a chotto psp cam” – The user post also has an entity, which is a plus but not the main target of intent identification. The goal is to study how questions are asked, not the topic words that indicate what the question is about.

Conceptual Overview: Bootstrapping to learn IS (Information Seeking) patterns
– Take a set of user posts from SNSs, not annotated for presence or absence of any intent
– Generate a universal set of n-gram patterns with freq > f: S = set of all 4-grams with freq > 3
– Generate a set of candidate patterns from seed words (why, when, where, how, what): Sc = all 4-grams in S that extract seed words
– User picks 10 seed patterns from Sc: Sis = ‘does anyone know how’, ‘where do I find’, ‘someone tell me where’…
– Gradually expand Sis by adding Information Seeking patterns from Sc
– For every pis in Sis, generate a set of filler patterns: ‘.* anyone know how’, ‘does .* know how’, ‘does anyone .* how’, ‘does anyone know .*’
– Look for patterns in Sc: functional compatibility of filler, empirical support for filler
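The first steps of the bootstrapping procedure (building S and Sc) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: whitespace tokenization, the function names, and reading "extract seed words" as "contain at least one seed word" are all assumptions.

```python
from collections import Counter

# Seed words listed on the slide.
SEED_WORDS = {"why", "when", "where", "how", "what"}

def four_grams(post):
    """Return the word 4-grams of a post (assumption: lower-cased, split on whitespace)."""
    words = post.lower().split()
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)]

def universal_set(posts, min_freq=3):
    """S: all 4-grams whose corpus frequency is greater than min_freq."""
    counts = Counter(g for post in posts for g in four_grams(post))
    return {g for g, c in counts.items() if c > min_freq}

def candidate_set(S):
    """Sc: 4-grams in S that contain at least one seed word (assumed reading)."""
    return {g for g in S if SEED_WORDS & set(g)}
```

A user would then hand-pick the seed patterns Sis from `candidate_set(universal_set(posts))`; that manual step is not modeled here.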

Expanding the Pattern Pool: Functional properties / communicative functions of words, from a subset of LIWC [1] – cognitive mechanisms (e.g., if, whether, wondering, find) • ‘I am thinking about getting X’ – adverbs (e.g., how, somehow, where) – impersonal pronouns (e.g., someone, anybody, whichever) • ‘Someone tell me where can I find X’ [1] Linguistic Inquiry and Word Count, LIWC, http://liwc.net

Example - Acquiring New Intent Patterns: Sc = {‘does anyone know how’, ‘where do I find’, ‘someone tell me where’} • pis = ‘does anyone know how’ • Filler pattern ‘does * know how’ – ‘does someone know how’ • Functional Compatibility: impersonal pronouns • Empirical Support: 1/3 – ‘does somebody know how’ • Functional Compatibility: impersonal pronouns • Empirical Support: 0 • Pattern retained – ‘does john know how’ • Pattern discarded
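The single-word substitution and functional-compatibility check from this example can be sketched as follows. This is a minimal sketch under stated assumptions: the small category table is a hypothetical stand-in for the LIWC dictionary, the function names are illustrative, and the empirical-support check is omitted because the slides do not define how it is computed.

```python
import re

# Hypothetical stand-in for LIWC categories; the real dictionary is not reproduced here.
FUNCTIONAL_CATEGORIES = {
    "impersonal_pronoun": {"anyone", "someone", "somebody", "anybody"},
    "adverb": {"how", "where", "somehow"},
}

def category_of(word):
    """Return the functional category of a word, or None if it has none."""
    for name, words in FUNCTIONAL_CATEGORIES.items():
        if word in words:
            return name
    return None

def filler_patterns(pattern):
    """Replace each word of the pattern, in turn, by a single-word wildcard."""
    words = [re.escape(w) for w in pattern.split()]
    return [" ".join(words[:i] + [r"(\S+)"] + words[i + 1:])
            for i in range(len(words))]

def compatible_candidates(seed_pattern, candidates):
    """Candidates differing from the seed in one word of the same functional category."""
    seed_words = seed_pattern.split()
    kept = []
    for i, regex in enumerate(filler_patterns(seed_pattern)):
        for cand in candidates:
            match = re.fullmatch(regex, cand)
            if not match or cand == seed_pattern:
                continue
            cat = category_of(seed_words[i])
            if cat is not None and cat == category_of(match.group(1)):
                kept.append(cand)
    return kept
```

With the seed ‘does anyone know how’, the candidates ‘does someone know how’ and ‘does somebody know how’ pass the check, while ‘does john know how’ does not, mirroring the slide.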

Finer Details of the Approach are in the paper.. Iterative algorithm, single-word substitutions, functional usage and empirical support conservatively expand the intent-seeking pool of patterns.. Infusing new patterns and seed words Stopping conditions 138 M. Nagarajan et al., Monetizing User Activity on Social Networks - Challenges and Experiences, 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep 15-18 2009: 92-99

Identifying Monetizable Posts: Information Seeking patterns just described are generated offline. Finding the Information Seeking intent score of a post – Extract and compare patterns in posts with extracted information seeking patterns

Using LIWC ‘Money’ dictionary: 173 words and word forms indicative of transactions, e.g., trade, deal, buy, sell, worth, price, etc.
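The slides do not give the scoring formulas, so the following is only a sketch of the idea: count how many learned information-seeking patterns occur in a post, and how many words come from a money-related word list. The six-word list is the sample quoted on the slide, not the full 173-entry LIWC ‘Money’ dictionary, and both function names are hypothetical.

```python
PUNCT = ".,!?:;()'\""

# Sample words quoted on the slide; the full LIWC 'Money' dictionary is not reproduced.
MONEY_WORDS = {"trade", "deal", "buy", "sell", "worth", "price"}

def tokens(post):
    """Lower-case, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(PUNCT) for w in post.lower().split())
    return [w for w in stripped if w]

def information_seeking_score(post, is_patterns):
    """Number of learned IS patterns that occur as whole-word phrases in the post."""
    text = " " + " ".join(tokens(post)) + " "
    return sum(1 for p in is_patterns if f" {p} " in text)

def transactional_score(post):
    """Number of tokens found in the money word list."""
    return sum(1 for w in tokens(post) if w in MONEY_WORDS)
```

A post could then be flagged as monetizable when either score exceeds a threshold chosen on held-out data; that thresholding is an assumption and is left to the caller.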

Benchmarking with FB Marketplace: Training corpus of 8000 user posts from MySpace Computers, Electronics, Gadgets forum

309 unique new patterns, 263 unambiguous • Testing patterns for recall using the ‘To buy’ Facebook Marketplace, where all posts are information seeking – extracted patterns average 81% recall

Next task: Identifying Keywords for Advertizing. Identifying keywords in monetizable posts – Plethora of work in this space. Off-topic noise removal is our focus. Example post: "I NEED HELP WITH SONY VEGAS PRO 8!! Ugh and i have a video project due tomorrow for merrill lynch :(( all i need to do is simple: Extract several scenes from a clip, insert captions, transitions and thats it. really. omgg i cant figure out anything!! help!! and i got food poisoning from eggs. its not fun. Pleasssse, help? :("

Conceptual Overview (also see slides in Key Phrase elimination section) – C1 - ['camcorder'] – C2 - ['electronics forum', 'hd', 'camcorder', 'somethin', 'ive', 'canon', 'little camera', 'canon hv20', 'cameras', 'offtopic'] – Relatedness determined using information gain – Using the Web as a corpus, domain independent

Example: Off-topic Chatter Elimination • C1 -['camcorder'] • C2 -['electronics forum', 'hd', 'camcorder', 'somethin', 'ive', 'canon', 'little camera', 'canon hv20', 'cameras', 'offtopic'] • Informative words ['camcorder', 'canon hv20', 'little camera', 'hd', 'cameras', 'canon'] 144
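The elimination step above can be illustrated with a toy information-gain computation. This is a sketch under stated assumptions: a small hand-made document collection stands in for the Web corpus, each document is a set of key phrases, relatedness is the information gain between the presence of a candidate and the presence of the anchor keyword from C1, and the 0.1 threshold is arbitrary.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs, anchor, term):
    """H(anchor present) - H(anchor present | term present), over a list of phrase sets."""
    n = len(docs)
    has_anchor = [anchor in d for d in docs]
    has_term = [term in d for d in docs]
    h_cond = 0.0
    for value in (True, False):
        subset = [a for a, t in zip(has_anchor, has_term) if t == value]
        if subset:
            h_cond += len(subset) / n * entropy(sum(subset) / len(subset))
    return entropy(sum(has_anchor) / n) - h_cond

def informative(docs, anchor, candidates, threshold=0.1):
    """Keep candidates whose information gain with the anchor exceeds the threshold."""
    return [c for c in candidates if information_gain(docs, anchor, c) > threshold]

# Toy corpus (hypothetical): chatter words occur with and without the anchor alike.
DOCS = [
    {"camcorder", "canon", "hd", "ive"},
    {"camcorder", "canon", "somethin"},
    {"camcorder", "hd", "ive"},
    {"camcorder", "canon", "hd", "somethin"},
    {"pizza", "ive"},
    {"pizza", "somethin"},
    {"weather", "ive"},
    {"weather", "somethin"},
]
```

Note that information gain is symmetric: a word that systematically avoids the anchor would also score high, so a practical version would likely add a check that the association is positive.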

Evaluations- User Study Keywords from 60 monetizable user posts – Monetizable intent, at least 3 keywords in content – 45 MySpace Forums, 15 Facebook Marketplace, 30 graduate students – 10 sets of 6 posts each – Each set evaluated by 3 randomly selected users • Monetizable intents? – All 60 posts voted as unambiguously information seeking in intent 145

Effectiveness of using topical keywords • Google AdSense ads for user post vs. extracted topical keywords

Instructions –User Study 147

Result -2X Relevant Impressions Users picked ads relevant to the post – At least 50% inter-evaluator agreement For the 60 posts – Total of 144 ad impressions – 17% of ads picked as relevant For the topical keywords – Total of 162 ad impressions – 40% of ads picked as relevant 148

Evaluations: Profile Ads vs. Activity Ads • Are ads generated from activity more interesting than those generated from user profiles? Gather user’s profile information – Interests, hobbies, TV shows.. (non-demographic information) • Ask them to submit a post (simulating their social media entry) – Looking to buy and why (induce off-topic content) • Generate ads from profiles, from post (keywords) 149

Result - 8X more interest for non-profile ads.. • Using profile ads – Total of 56 ad impressions – 7% of ads generated interest • Using authored posts – Total of 56 ad impressions – 43% of ads generated interest • Using topical keywords from authored posts – Total of 59 ad impressions – 59% of ads generated interest 150

To note… – Monetization potential in user activity – Improvement for Ad programs in terms of relevant impressions – Verbose content – Status updates, notes, community and event memberships… – One size may not fit all

To note… A world between relevant impressions and click throughs – Objectionable content, vocabulary impedance, Ad placement, network behavior – In a pipeline of other community efforts – Cannot custom send information to Google AdSense

SENTIMENT / OPINION MINING 153

Content Analysis: Sentiment Analysis/Opinion Mining. Two main types of information we can learn from user-generated content: fact vs. opinion. Much of social media text (e.g., blogs, Twitter, Facebook) is a mix of facts and opinions. For example: "Latest news: Mobile web services not working in #Bahrain and Internet is extremely slow #feb14 {fact} ... looks like they "learned" from #Egypt {opinion}"

Sentiment Analysis: Motivation Why do people oppose health care reform? What do customers complain about? Which movie should I see? Image: http://bit.ly/eZtKBF

Sentiment Analysis: Tasks Example: “How awful that many #Egyptian artifacts are in danger of being Destroyed. What Zahi Hawass must be thinking #jan25” Classification: Overall sentiment polarity [Pang et al. 2002], [Turney 2002], etc.: the overall polarity is positive, neutral or negative (on the document/sentence/word level). For the example: overall polarity is negative. Target-specific sentiment polarity [Yi et al. 2003], [Hu et al. 2004], etc.: the polarity toward the given target is positive, neutral or negative. For the example: polarity is "negative" for the target "egyptian artifacts"; polarity is "neutral" for the target "Zahi Hawass"

Sentiment Analysis: Tasks Example: “How awful that many #Egyptian artifacts are in danger of being Destroyed. What Zahi Hawass must be thinking #jan25” Identification & Extraction: opinion [Dave et al. 2003], etc.; opinion holder [Bethard et al. 2004], etc.; opinion target [Hu et al. 2004], etc. For the example: opinion="awful", opinion holder="the author", target="egyptian artifacts are in danger"; opinion="must be thinking", opinion holder="the author", target="Zahi Hawass"

Sentiment Analysis: Classification Supervised [Pang et al. 2002], etc. Labeled training data: e.g., product reviews, movie reviews, etc. Features: e.g., term-based, part-of-speech, syntactic relations, etc. Learning strategies: e.g., SVMs, Naive Bayes, etc. Unsupervised [Turney 2002], etc. Lexicon-based approach [Hu et al. 2004], [Ding et al. 2008], etc.: uses a sentiment lexicon of positive/negative sentiment words. Bootstrapping [Thelen et al. 2002], etc.: iteratively trains and evaluates a classifier, starting from an unannotated corpus and a few predefined seed words.

Sentiment Analysis: Identification & Extraction The task of extracting the opinion/holder/target is similar to the traditional IE task. Key distinction: the relations between opinion and opinion target are considered important.

Proximity [Hu et al. 2004], etc.: extract the nearby adjectives modifying the target topic as opinion clues

Syntactic dependency [Popescu et al. 2005], etc.: employ a language parser to compute syntactic dependencies and extract the opinion clues for a given target topic

Co-occurrence [Choi et al. 2009], etc.: heuristic: the more frequently a candidate opinion target co-occurs with any opinion clues, the more likely it is the real opinion target

Prepared patterns/rules [Kobayashi et al. 2004], etc.: use a set of predefined extraction patterns/rules

Highlight the potential of text streams as a substitute and supplement for traditional polling. Connect public opinion measured from polls with sentiment measured from tweets. Lexicon-based approach for sentiment analysis of tweets: within topic tweets, count messages containing positive and negative words defined by the sentiment lexicon
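The counting step described above is simple enough to sketch directly. The two word sets below are a tiny hypothetical lexicon for illustration only, and counting a message under both polarities when it contains both kinds of words is an assumption.

```python
PUNCT = ".,!?:;()'\"#"

# Tiny illustrative lexicon (hypothetical); a real study would use a published one.
POSITIVE = {"good", "great", "love", "hope"}
NEGATIVE = {"bad", "awful", "hate", "fear"}

def sentiment_counts(messages):
    """Return (messages with a positive word, messages with a negative word)."""
    pos = neg = 0
    for msg in messages:
        words = {w.strip(PUNCT) for w in msg.lower().split()}
        pos += bool(words & POSITIVE)
        neg += bool(words & NEGATIVE)
    return pos, neg
```

Tracking these two counts (or their ratio) per day for the tweets on one topic gives a sentiment time series that can be set next to poll results.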

Sentiment Analysis: Predicting the Future With Social Media [Asur et al. 2010] Use tweets to forecast box-office revenues for movies. Train a language model classifier for sentiment classification of tweets. Findings: The prediction model using the rate at which tweets are created about a movie outperforms the market-based methods. The sentiments present in tweets can be used to improve the prediction.

Sentiment Analysis: Target-specific Opinion Identification & Classification of Tweets-Unsupervised Approach [kno.e.sis ongoing work] Simple lexicon-based method doesn't work well. Target of “sexy” is “Helena” Target of “terrific” is “reviews” “free” is not opinionated in movie domain. Target of “loving” is “telling” “well” in “as well” is not opinionated Observations: The opinion clues may not be toward the given target (1,2,3,6) The opinion clues are domain and context dependent (5,7) Single words are not enough (4,7,8) 162

Sentiment Analysis: Target-specific Opinion Identification & Classification of Tweets - Unsupervised Approach [kno.e.sis ongoing work] Domain and context-aware sentiment lexicon generation (here taking the movie domain as an example). General subjective lexicon: commonly used subjective lexicon + polar slang learned from dictionary. Select candidate opinion clues from the domain-specific corpus based on the general lexicon word + surrounding context, e.g., {“free”, “free movie”, “free movie streaming”...}, {“must”, “must see”, “a must see”, “must see movie”…}, {“well”, “as well”, “well done”…}. Identify the opinion clues and their polarity: utilize information from multiple sources, including the corpus, domain knowledge (e.g., freebase, imdb), general lexicon, etc. Bootstrapping + statistical model, e.g., <“must”, “must see”, positive>; <“well”, “well done”, positive>
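Once such a context-aware lexicon exists, applying it amounts to preferring the longest matching phrase over the single word inside it. The sketch below assumes a greedy longest-match scan and a hand-written toy lexicon in which `None` marks a phrase as not opinionated; neither is described in the slides, and the lexicon generation itself (bootstrapping plus a statistical model) is not shown.

```python
PUNCT = ".,!?:;()'\""

# Toy movie-domain lexicon (hypothetical). None means "matched, but not opinionated".
LEXICON = {
    "must see": "positive",
    "well done": "positive",
    "well": "positive",
    "as well": None,
    "free": None,
    "awful": "negative",
}

def opinion_clues(text, lexicon, max_len=3):
    """Scan left to right, always taking the longest lexicon phrase at each position."""
    words = [w.strip(PUNCT) for w in text.lower().split()]
    words = [w for w in words if w]
    clues, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lexicon:
                if lexicon[phrase] is not None:
                    clues.append((phrase, lexicon[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return clues
```

Because “as well” is matched before “well” can be, the single-word entry never fires inside it, which is the behavior the examples on the slide call for.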

Sentiment Analysis: Target-specific Opinion Identification & Classification of Tweets - Unsupervised Approach [kno.e.sis ongoing work]

When generating the domain and context-aware sentiment lexicon, use a set of predefined rules to select toward-target candidate opinion clues

When using the generated lexicon to extract target-specific opinions, for each pair of <target, opinion clue> in one tweet, determine whether the opinion clue is toward the target based on their syntactic dependency.

E.g., "Loved the King's Speech. Funny, moving...Colin Firth is so amazing. I know, you already knew that." (“amazing” won’t be extracted for the movie, since nsubj(amazing, Firth) attaches it to Colin Firth)

We also use predefined rules and proximity as a complement

People Analysis showing use of Merger approach (Content+Network) and derived metadata

Finding User Types & Affiliation

Measuring Social Engagement 165

People Analysis: Extracting People Metadata 166

People Analysis: Using Content to Derive People Metadata Personality Signals Extrovert, agreeable, open etc Blogs, Style of Writing Loose and periodic sentence, connotation etc. Psychometric analysis of content Knowledge, abilities, attitude etc. Sample study: Gendered writing styles online [Ellison et al. 2006, Nagarajan et al. 2009, ICWSM etc.] Self-expression tends towards attempting homophily in online dating profiles, given the tendency to 'imitate and impress' in courtship 167 Image: http://bit.ly/JZ6eF Read: ‘How’ people write @Kno.e.sis

People Analysis: Using Network to Derive People Metadata Interesting questions to ask: Who are the most popular people* in the network? Who are the most influential people in the network? What are the types of people in the network? Who are the most active people in the communities? Who are the bridges between communities in the network? etc. (*People may also refer to an organization) Metadata from Network: e.g., an influential node in the network will be a function of time and of the interests of its audience.

People Analysis: Influence, Adding Flavor of Context Analysis

For individuals to become influential they must not only obtain attention and thus be popular, but also overcome user passivity. [Romero et al. 2010]

Homophily causing Reciprocity on Twitter [TwitterRank, Weng et al. 2010]

Klout Score - True Reach, Amplification [http://klout.com] By Link Analysis Algorithms: HITS [Kleinberg 1999] & variants, PageRank [Brin et al. 1998] & variants, etc. Links not sufficient! Audience size doesn’t prove influence on Twitter [Million Follower Fallacy, Cha et al. 2010] Image: http://bit.ly/9pfTO4
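As a reference point for the link-analysis family, here is a minimal power-iteration sketch of PageRank on a follower graph. The graph encoding (an edge u → v means u follows v), the damping value 0.85, the fixed iteration count, and the handling of dangling nodes are assumptions of this sketch, not details from the cited papers' use on Twitter.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to; every target must be a key."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no out-links spreads its rank evenly over all nodes.
            targets = graph[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Hypothetical follower graph: a, b and d follow c; c follows a.
FOLLOWS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
```

Such a score reflects link structure only, which is exactly the limitation the slide points out: a large audience does not by itself demonstrate influence.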

People Analysis: User types & Affiliation Blogger, Scientist, Journalist, Artist, Trustee, Company X in Domain Y.. - Multiple types and affiliations! User interest mining: Key Phrase Extraction followed by semantic association on user bio, tweets, lists, favorite posts. Twitter Study [Banerjee et al. 2009]

Web Presence: Use of Web & Knowledge bases (Wikipedia, Blogs) to build context for user types

Entity Spotting & Extraction, followed by Semantic Association and Similarity with user-type context. Image: kahunainstitute.com *Read Semantics driven Social Media Analysis @ Kno.e.sis

People Analysis: Social Engagement Imagine a crisis scenario such as the Haiti (2010) or Japan (2011) earthquake:

How effectively can the community of people talking about this event online grow to reach potential donors and people in need of resources (food, water, first aid, etc.)?

What are the best possible ways to communicate between resource providers and people in need of resources?

How can teams coordinate well, from volunteers at a victim site to managers in the organizational structure sitting in offices?

NETWORK ANALYSIS - Deriving Network Metadata Interesting questions Network Analysis – Methods Models Metrics Network Analysis – Algorithms Graph Partitioning, Traversal Community Discovery, Evolution Social Network Analysis Diffusion Homophily Study of 3-D Dynamics (People-Content-Network) - Analysis & Visualization tools 173

Network Analysis “To Discover How A, Who is in Touch with B and C, Is Affected by the Relation Between B & C” - John Barnes Interesting questions to ask: How communities form around topics - growth & evolution. What are the effects of the presence of influential participants in the communities? What are the effects of the nature of content (or sentiment, opinions) flowing in the network on community life? What is the community structure: degree of separation and sub-communities? Foundation of network:

Connections/Relationships Image: http://www.onasurveys.com/

Network Analysis: Methods Network Modeling Approaches. Random graph model (Erdős–Rényi model): start with n vertices and add edges between them at random. Small-world model: most nodes are not neighbors of one another, but every node can be reached from every other in a small number of hops or steps (Small World Phenomenon). Scale-free model: the degree distribution follows a power law, i.e., the fraction of nodes with degree k varies as a power of k. Image: http://www.kudosdynamics.com/ Important Literature: [Wasserman et al. 1992, Watts et al. 1998, Albert et al. 2002, Newman et al. 2006, Marin et al. 2010, Easley et al. 2010]
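The random graph model is the easiest of the three to write down. The sketch below uses the G(n, p) variant, in which each possible edge is included independently with probability p; choosing this variant, the adjacency-set representation, and the function name are assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Undirected G(n, p) graph as {node: set of neighbors}, nodes 0..n-1."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        # Each of the n*(n-1)/2 possible edges is added independently.
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj
```

For example, `erdos_renyi(100, 0.05, seed=1)` gives a sparse graph whose expected average degree is p * (n - 1) = 4.95.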

Network Analysis: Methods Network Structure metrics: Centrality, Connected Component, Avg. Degree, Clustering Coefficient, Avg. Path Length, Bridge, Cohesion, Prestige, Reciprocity etc. Social Network Analysis methods:

Clusters (Cliques and extensions, Communities) Image: http://www.kudosdynamics.com/
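Two of the structure metrics listed above, average degree and the local clustering coefficient, can be computed directly from an adjacency-set representation. This is a minimal sketch for undirected graphs; the representation and function names are assumptions.

```python
from itertools import combinations

def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj)

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical example: a triangle a-b-c with a pendant node d attached to c.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

In this example node a sits in a closed triangle (coefficient 1.0), while c also connects to d, which lowers its coefficient to 1/3.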

Network Analysis: Algorithms Graph Partitioning & Traversal Goal: Best time-complexity & reachability Generally follows Greedy paths e.g., K-way multilevel Partitioning, Bron-Kerbosch, K-plex, K-core or N-cliques, DFS, BFS, MST Community Discovery, growth, evolution Based on relationship types (e.g., signed network), geography/location based, interest based etc. Generally follow cluster analysis e.g., Hierarchical clustering algorithms – Top-down, bottom-up Further Reading: Modularity Maximization [Newman et al. 2006] Algorithms comparison survey [Balakrishnan et al. 2006] Online Communities [Preece 2001] 177 "We dream in Graph and We analyze in Matrix” - Barry Wellman, INSNA
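Of the traversal algorithms named above, breadth-first search (BFS) is the one that answers the "degree of separation" question directly. A minimal sketch, assuming an unweighted graph stored as adjacency sets and an illustrative function name:

```python
from collections import deque

def degrees_of_separation(adj, source):
    """Hop distance from source to every reachable node, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first visit is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Nodes missing from the returned dictionary are unreachable from the source, i.e., they lie in a different connected component.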

Social Network Analysis: Diffusion & Homophily Social Network Analysis (interested in information flow): Can we predict user actions? Understanding dynamics is challenging! Why study diffusion? Maximizing Spread (Opinion, Innovation, Recommendation), Outbreak Detection (e.g., disease). Diffusion Behavior: Power Law distribution [Leskovec et al. 2007]. Factors impacting Diffusion: User Homophily – similar behavior tendency [McPherson et al. 2001], sampling strategy [Choudhury et al. 2010], content nature [Nagarajan et al. 2010], etc. Image: http://bit.ly/fGkIBK

Study of 3-D Dynamics - People, Content, Network Intra-community activity and connectivity: How well connected are individual nodes? (People) What keeps them strongly connected over time? (Relationship types - Knowledge of Content) Will the two communities coordinate well during an event - crisis or disaster? - Interplay between all three dimensions – P, C, N

Any bridges to connect to the other community? (People)

Any similarity in actions with the other community? (Can Content help?) Image: http://themelis-cuiper.com

Study of 3-D Dynamics - People, Content, Network Metadata is a powerful tool to explore these dynamics* Studies in this direction: A Qualitative Examination of Topical Tweet and Retweet Practices [Nagarajan et al. 2010]: how content dictates the network flow. User-Community Engagement by Multi-faceted Features: A Case Study on Twitter [Purohit et al. 2011] [TO BE PRESENTED TOMORROW IN SoME'11]: what factors impact user engagement in topic discussion. *Read People-Content-Network Analysis @ Kno.e.sis

Graphs showing sparse (A) and dense (B) RT networks and their corresponding follower graphs for 'call for action' and 'information sharing' type of tweets M. Nagarajan, H. Purohit, and A. Sheth, ’A Qualitative Examination of Topical Tweet and Retweet Practices,’ 4th Int'l AAAI Conference on Weblogs and Social Media, ICWSM 2010 181

Analysis & Visualization Tools Network WorkBench (NWB) Truthy Graph-tool Orange Pajek Tulip …. Many tools!! Resource: http://en.wikipedia.org/wiki/Social_network_analysis_software 182 Image:http://truthy.indiana.edu/

Trustworthiness in Social Media Why?

In disaster scenarios (e.g., Haiti earthquake, Gulf oil spill)

For political revolution (e.g., Egypt political crisis)

In political and social policies (e.g., health care reforms) What?

(How?) Detect spam and misleading content.

Assess data quality of on-topic content
