CSE509 Lecture 5

CSE509: Introduction to Web Science and Technology
Lecture 5: Social Network Analysis
Arjumand Younus
Web Science Research Group, Institute of Business Administration (IBA)

Last Time…
Web Data Explosion
Part I
  MapReduce Basics
  MapReduce Example and Details
  MapReduce Case-Study: Web Crawler based on MapReduce Architecture
Part II
  Large-Scale File Systems
  Google File System Case-Study
August 06, 2011

Today
Transition from Web 1.0 to Web 2.0
Social Media Characteristics
Part I: Theoretical Aspects
  Social Networks as a Graph
  Properties of Social Networks
Part II: Getting Hands-On Experience on Social Media Analytics
  Twitter Data Hacks
Part III: Example Research

Quick Survey
Do you have a Facebook, MySpace, Twitter, or LinkedIn account?
Do you own a blog?
Do you read blogs?
Have you ever searched for something on Wikipedia?
Have you ever submitted content to a social network?

Web 1.0 vs. Web 2.0
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

What is so Different about Web 2.0?
User Generated Content
Collaborative Environment: Participatory Web, Citizen Journalism
User is the Driving Factor
A Paradigm Shift rather than a Technology Shift

Top 20 Most Visited Web Sites
Internet traffic report by Alexa on July 29th 2008
Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission

Various Forms of Social Media
Blog: WordPress, Blogspot, LiveJournal
Forum: Yahoo! Answers, Epinions
Media Sharing: Flickr, YouTube, Scribd
Microblogging: Twitter, Foursquare
Social Networking: Facebook, LinkedIn, Orkut
Social Bookmarking: Del.icio.us, Diigo
Wikis: Wikipedia, Scholarpedia, AskDrWiki

Characteristics of Social Media
“Consumers” become “Producers”
Rich User Interaction
User-Generated Content
Collaborative Environment
Collective Wisdom
Long Tail
Broadcast Media: Filter, then Publish
Social Media: Publish, then Filter

PART I: Theoretical Aspects

Networks and Representation
Social Network: A social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships such as friendship or kinship.

Matrix Representation
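
The slide's own matrix did not survive extraction, so the following is only a minimal sketch of the idea: an undirected network stored as a symmetric 0/1 adjacency matrix. The node names, the edge list, and the helper `adjacency_matrix` are illustrative assumptions, not material from the lecture.

```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # undirected: mirror the entry
    return matrix

# Hypothetical four-person friendship network
nodes = ["Ali", "Sara", "Omar", "Hina"]
edges = [("Ali", "Sara"), ("Ali", "Omar"), ("Sara", "Omar"), ("Omar", "Hina")]
A = adjacency_matrix(nodes, edges)
for row in A:
    print(row)

# Each row sum is the degree of the corresponding node
degrees = [sum(row) for row in A]
```

Row i and column i describe the same node, so the matrix of an undirected network is symmetric; a directed network (such as "who retweets whom") would drop the mirrored assignment.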

Scale-Free Distributions
Degree distribution in large-scale networks often follows a power law.
A.k.a. long tail distribution, scale-free distribution
[Figure: plot with axes labeled Degrees and Nodes]
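
As a rough illustration of what a degree distribution is, the sketch below counts how many nodes have each degree. The edge list is a made-up hub-and-spoke toy network; it is far too small to exhibit a real power law and only shows the shape of the computation (a few high-degree nodes, many low-degree ones).

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

# Hypothetical network: one hub (node 0) linked to many low-degree nodes
edges = [(0, i) for i in range(1, 9)] + [(1, 2), (3, 4)]
dist = degree_distribution(edges)
for d in sorted(dist):
    print(d, dist[d])
```

On real data, plotting these counts on log-log axes is a common first check for a long tail.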

Small-World Effect
“Six Degrees of Separation”
A famous experiment conducted by Travers and Milgram (1969)
  Subjects were asked to send a chain letter to an acquaintance in order to reach a target person
  The average path length is around 5.5
Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008)
  The average path length is 6.6
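
The "average path length" quoted above can be sketched with breadth-first search: find the shortest hop count between every pair of reachable nodes and take the mean. The four-person chain below is a hypothetical example, and `average_path_length` is an illustrative helper, not code from the cited studies.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / float(pairs)

# Hypothetical chain of acquaintances: a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
apl = average_path_length(adj)
print(apl)
```

All-pairs BFS costs roughly (nodes × edges) steps, so studies on networks with millions of users typically estimate the value from sampled sources rather than computing it exactly.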

Small World Facebook Experiment by Yahoo! Labs
Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend.
Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.
http://smallworld.sandbox.yahoo.com/

Community Structure
Community: People in a group interact with each other more frequently than with those outside the group
Friends of a friend are likely to be friends as well
Measured by clustering coefficient: density of connections among one’s friends
ki = number of edges among node Ni’s neighbors

Clustering Coefficient

k6=4 as e(4,5), e(5,7), e(5,8), e(7,8)

Average clustering coefficient: C = (C1 + C2 + … + Cn)/n

In a random graph, the expected coefficient is 14/(9*8/2) ≈ 0.39.
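
The slide's formula for an individual node was lost in extraction. The sketch below assumes the standard definition, Ci = ki / (di(di - 1)/2), where di is the number of neighbors of node Ni and ki is the number of edges among them. The edge list is reconstructed only from the four edges named for k6, plus the assumption that node 6's neighbors are exactly 4, 5, 7 and 8; it is not the full example network.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """k / (d*(d-1)/2): fraction of neighbor pairs that are themselves linked."""
    neighbors = adj[node]
    d = len(neighbors)
    if d < 2:
        return 0.0
    k = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return k / (d * (d - 1) / 2.0)

# Assumed neighborhood of node 6 (partial graph, see the note above)
edges = [(6, 4), (6, 5), (6, 7), (6, 8), (4, 5), (5, 7), (5, 8), (7, 8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

c6 = clustering_coefficient(adj, 6)
print(round(c6, 2))
```

Under these assumptions node 6 has four neighbors, hence six possible neighbor pairs, of which four are linked. Averaging the same quantity over all nodes gives the network-level coefficient C.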

Social Computing Tasks
Social Computing: a young and vibrant field
Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom, etc.
Tasks
  Centrality Analysis and Influence Modeling
  Community Detection
  Classification and Recommendation
  Privacy, Spam and Security

Centrality Analysis and Influence Modeling
Centrality Analysis: Identify the most important actors or edges
  E.g. PageRank in Google
  Various other criteria
Influence Modeling: How is information diffused? How do people influence one another?
Related Problems
  Viral marketing: word-of-mouth effect
  Influence maximization
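
To make the PageRank mention concrete, here is a minimal textbook-style power iteration. It is a sketch, not Google's production algorithm: the damping factor of 0.85 is a conventional choice, the four-node graph is hypothetical, and the helper assumes every node has at least one outgoing link.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node must have at least one out-link."""
    nodes = list(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share (random jump)
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in out_links.items():
            # A node splits its damped rank evenly among the nodes it links to
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: an edge u -> v means u points to (endorses) v
out_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank(out_links)
top = max(ranks, key=ranks.get)
print(top)
```

Node "c" collects endorsements from three nodes and ends up most central, while "d" scores well mainly because the important node "c" links to it; that dependence on the importance of one's endorsers is what separates PageRank from a plain in-degree count.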

Community Detection
A community is a set of nodes among which interactions are (relatively) frequent
A.k.a. group, cluster, cohesive subgroup, module
Applications: recommendation based on communities, network compression, visualization of a huge network
New lines of research in social media
  Community Detection in Heterogeneous Networks
  Community Evolution in Dynamic Networks
  Scalable Community Detection in Large-Scale Networks
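
The definition above ("interactions are relatively frequent") can be checked for a candidate group by comparing edges inside the group with edges crossing its boundary. This sketch only evaluates a given group; it is not a detection algorithm, and the network and the helper `edge_counts` are illustrative assumptions.

```python
def edge_counts(edges, group):
    """Count edges inside a candidate group and edges crossing its boundary."""
    group = set(group)
    internal = sum(1 for u, v in edges if u in group and v in group)
    boundary = sum(1 for u, v in edges if (u in group) != (v in group))
    return internal, boundary

# Hypothetical network: two tight triangles joined by a single bridge edge
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
internal, boundary = edge_counts(edges, {1, 2, 3})
print(internal, boundary)
```

A group with many internal edges and few boundary edges matches the slide's notion of a community; detection methods search for groupings that score well on measures of this kind.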

Classification and Recommendation
Common in social media applications
Tag suggestion, Product/Friend/Group Recommendation
Link prediction
Network-Based Classification
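
The slides do not name a specific link prediction method. One simple heuristic, shown here as an assumption, scores each unlinked pair by the number of neighbors the two nodes share, which also mirrors friend recommendation. The friendship graph and names are hypothetical.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score each unlinked pair by how many neighbors the two nodes share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# Hypothetical friendship graph (node -> set of friends)
adj = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}
scores = common_neighbor_scores(adj)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Here "ann" and "dan" are not yet linked but share two friends, so they are the strongest candidate for a recommended connection.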

Privacy, Spam and Security
Privacy is a big concern in social media
  Facebook and Google Buzz often appear in debates about privacy
  Netflix Prize sequel cancelled due to privacy concerns
  Simple anonymization does not necessarily protect privacy
Spam blogs (splogs), spam comments, fake identities, etc. all require new techniques
As private information is involved, a secure and trustworthy system is critical
Need to achieve a balance between sharing and privacy

PART II: Practical SNA with Twittersphere Mining

Pre-Requisites
Python is expected to be installed, and you should have some hands-on experience with it
Dependencies
  easy_install
  networkx
  twitter (Twitter API for Python)
For Windows users
  Install ActivePython: comes bundled with easy_install
  easy_install networkx
  easy_install twitter
For Linux users
  sh setuptools-0.6c11-py2.6.egg
  sudo easy_install networkx
  sudo easy_install twitter

Getting Tweets from Twitter Search API
    import json
    import twitter
    twitter_search = twitter.Twitter(domain="search.twitter.com")
    # Fetch five pages of up to 100 results each for the query "pakistan"
    search_results = []
    for page in range(1, 6):
        search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))
    print(json.dumps(search_results, sort_keys=True, indent=1))
    # Keep only the text of every tweet across all pages
    tweets = [r['text'] for result in search_results for r in result['results']]
    print(tweets)

Lexical Diversity for Tweets
    # Collect every whitespace-separated token across all tweets
    words = []
    for t in tweets:
        words += t.split()
    # Ratio of unique tokens to total tokens
    lexical_diversity = 1.0 * len(set(words)) / len(words)
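
A small worked example of the same ratio, using three made-up tweets in place of live search results (the sample texts and the wrapper function are assumptions for illustration):

```python
def lexical_diversity(texts):
    """Ratio of unique whitespace-separated tokens to total tokens."""
    words = [w for t in texts for w in t.split()]
    return len(set(words)) / float(len(words))

# Hypothetical sample tweets standing in for search results
sample_tweets = [
    "floods hit the north",
    "the north needs aid",
    "aid reaches the north",
]
diversity = lexical_diversity(sample_tweets)
print(diversity)
```

The twelve tokens contain seven distinct words, so the ratio is 7/12. Values near 1 mean almost every token is new; values near 0 mean heavy repetition, as in a stream dominated by retweets.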

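What People are Tweeting: Frequency Analysis
    import nltk
    freq_dist = nltk.FreqDist(words)
    # most_common() replaces the frequency-sorted keys() of older NLTK releases
    freq_dist.most_common(50)       # 50 most frequent tokens
    freq_dist.most_common()[-50:]   # 50 least frequent tokens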
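
If NLTK is not installed, the same kind of frequency count can be sketched with the standard library's `collections.Counter`. The word list below is a made-up stand-in for the `words` list built from the tweets.

```python
from collections import Counter

# Hypothetical word list standing in for `words` built from the tweets
words = "the north needs aid and the north gets aid from the south".split()
freq = Counter(words)

top = freq.most_common(1)                      # the single most frequent token
rare = [w for w, c in freq.items() if c == 1]  # tokens seen only once
print(top)
print(sorted(rare))
```

The most frequent tokens are usually stop words such as "the", so in practice a stop-word filter is a common step before reading anything into the top of the list.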

Extracting Relationships from Tweets (1/3)
Step 1: Extracting Graph Data
    import re
    import networkx as nx
    import twitter
    g = nx.DiGraph()
    twitter_search = twitter.Twitter(domain="search.twitter.com")
    search_results = []
    for page in range(1, 6):
        search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))
    all_tweets = [tweet for page in search_results for tweet in page["results"]]
    def get_rt_sources(tweet):
        # Match "RT" or "via" followed by one or more @-mentions
        rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
        # The second group of each match holds the credited @-mentions
        return [groups[1].strip() for groups in rt_patterns.findall(tweet)]
    for tweet in all_tweets:
        rt_sources = get_rt_sources(tweet["text"])
        if not rt_sources:
            continue
        for rt_source in rt_sources:
            # Edge direction: original author -> user who retweeted
            g.add_edge(rt_source, tweet["from_user"], tweet_id=tweet["id"])
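
The retweet-source extraction can be tried offline on a few made-up tweets. This sketch assumes the pattern `(RT|via)((?:\b\W*@\w+)+)`, that is, an "RT" or "via" marker followed by one or more @-mentions; the tweet texts and user names are hypothetical.

```python
import re

# Assumed pattern: "RT" or "via", then one or more @-mentions
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(text):
    """Return the @-mentions credited by an "RT" or "via" marker."""
    return [groups[1].strip() for groups in RT_PATTERN.findall(text)]

# Hypothetical tweets; the user names are made up
print(get_rt_sources("RT @alice: relief update"))
print(get_rt_sources("Great photos (via @bob)"))
print(get_rt_sources("no credit given here"))
```

Each extracted source becomes the tail of a directed edge pointing at the user who posted the retweet, which is how the graph in Step 1 is populated.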

Extracting Relationships from Tweets (2/3)
Step 2: Generating DOT File
    import io
    OUT = "pakistan_search_results.dot"
    # One DOT edge statement per retweet relationship
    dot = [u'"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
           for n1, n2 in g.edges()]
    # Write the graph as UTF-8 so non-ASCII screen names survive
    with io.open(OUT, 'w', encoding='utf-8') as f:
        f.write(u'strict digraph {%s}' % (u';'.join(dot),))
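
To see what the generated DOT text looks like without running a search, the sketch below renders two made-up retweet edges. The helper `to_dot` and the edge data are illustrative assumptions; the output format follows the same template as Step 2.

```python
def to_dot(edges):
    """Render (source, target, tweet_id) triples as a DOT digraph string."""
    statements = ['"%s" -> "%s" [tweet_id=%s]' % edge for edge in edges]
    return "strict digraph {%s}" % ";".join(statements)

# Hypothetical retweet edges: (original author, retweeter, tweet id)
edges = [("@alice", "bob", 101), ("@alice", "carol", 102)]
dot_text = to_dot(edges)
print(dot_text)
```

Node names are quoted because screen names containing "@" are not valid bare DOT identifiers.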

Extracting Relationships from Tweets (3/3)
Step 3: Visualizing the Retweet Data in Graphical Form
For Windows users
For Linux users
    circo -Tpng -Osnl_search_results pakistan_search_results.dot

PART III: Example Research

Million Follower Fallacy (New York Times)

Twitter: More a News Medium than a Social Network (PC World)

Twitter for World Peace (Business Week)

SocialFlow: Social Media Optimization
Social Media Optimization Platform
Works in Domains of Viral and Word-of-Mouth Marketing
Provides Services to Major Media Outlets
Recent study: how different audiences consumed and rebroadcast the messages that news organizations were sending out (Al Jazeera English, BBC News, CNN, The Economist, Fox News and New York Times)

Twitter as a Real-Time News Analysis Service

Studying Ins and Outs of News
Using Twitter to study hot news items people are heavily tweeting about

Algorithm for Identification of Popular News
