mining the social web
Aris Gionis Michael Mathioudakis
Mon, Feb 2 — lecture #2
structure and dynamics of social networks
T-61.6020: Mining the social web — lecture #2
class web page on Piazza
https://piazza.com/aalto.fi/spring2015/t616020/home
share resources and also use as a discussion forum
sensible posts :
looking for a project mate
looking for a project mate on idea X
does anyone know how to access data Y?
has anyone seen some analysis on data Z?
… or just anything else
today’s themes
analysis of the structure and dynamics of social networks
what do social networks look like?
how do social networks evolve over time?
how do people in social networks behave and interact?
how does information spread in social networks and social media?
who is influential?
what is the interplay between structure and content?
objectives in today’s presentation
focus on one particular topic
review some “classic” papers in the literature
ideas for projects
assess the presented papers
what is the main idea?
what is the novelty?
why did they have impact?
criteria to evaluate the
research projects
originality (has it been done before?)
potential impact (how interesting is it, and why?)
rigorousness and technical novelty
reproducibility
presentation
structure of social networks
social networks and social-media data can be
represented as graphs (or networks)
what do these graphs look like?
what is their structure?
data contain additional information
(actions, interactions, dynamics, attributes,…)
mining this additional information as part of
the network structure
contrast against
random graphs
random graph model by Erdős-Rényi
edges independently drawn with probability p
real-world networks do not look like random graphs
also, random graphs are static
network type | degree distribution | hubs | triangle coefficient | clusters | diameter | giant component
random graphs | binomial | no | no | no | small | yes
real-world networks | power law | yes | yes | yes | small | yes
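The random-graph side of this comparison can be checked with a small experiment. The following is a minimal sketch, not part of the lecture: it samples an Erdős-Rényi graph using only the Python standard library and measures the mean degree, the maximum degree, and the average clustering coefficient. The function names and the parameters n = 500, p = 0.02 are illustrative assumptions.

```python
import random

def erdos_renyi(n, p, seed=0):
    """Sample G(n, p): each of the n*(n-1)/2 possible edges appears independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def average_clustering(adj):
    """Average over all nodes of the fraction of neighbor pairs that are themselves connected."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Illustrative parameters (assumptions, not from the slides).
adj = erdos_renyi(n=500, p=0.02, seed=1)
degrees = [len(nbrs) for nbrs in adj.values()]
mean_degree = sum(degrees) / len(degrees)   # expected value is (n - 1) * p
max_degree = max(degrees)                   # stays close to the mean: no hubs
clustering = average_clustering(adj)        # expected value is about p
print(mean_degree, max_degree, clustering)
```

In such a sample the degrees concentrate around (n - 1)p and the clustering coefficient stays near p, which matches the "binomial, no hubs, no triangles" row of the table.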
graph generation models
a large number of graph-generation models have been proposed
preferential-attachment model
copy model
Watts-Strogatz model
typically trying to capture some property of the data
beyond the scope of this class and the project
arXiv:0810.1355v1 [cs.DS] 8 Oct 2008
Community Structure in Large Networks: Natural Cluster Sizes
and the Absence of Large Well-Defined Clusters
Jure Leskovec
Kevin J. Lang
Anirban Dasgupta
Michael W. Mahoney
Abstract
A large body of work has been devoted to defining and identifying clusters or communities in social
and information networks, i.e., in graphs in which the nodes represent underlying social entities and
the edges represent some sort of interaction between pairs of nodes. Most such research begins with
the premise that a community or a cluster should be thought of as a set of nodes that has more
and/or better connections between its members than to the remainder of the network. In this paper,
we explore from a novel perspective several questions related to identifying meaningful communities
in large social and information networks, and we come to several striking conclusions.
Rather than defining a procedure to extract sets of nodes from a graph and then attempt to
interpret these sets as “real” communities, we employ approximation algorithms for the graph
partitioning problem to characterize as a function of size the statistical and structural properties of
partitions of graphs that could plausibly be interpreted as communities. In particular, we define the
network community profile plot, which characterizes the “best” possible community—according to the
conductance measure—over a wide range of size scales. We study over 100 large real-world networks,
ranging from traditional and on-line social networks, to technological and information networks and
web graphs, and ranging in size from thousands up to tens of millions of nodes.
Our results suggest a significantly more refined picture of community structure in large networks
than has been appreciated previously. Our observations agree with previous work on small networks,
but we show that large networks have a very different structure. In particular, we observe tight
communities that are barely connected to the rest of the network at very small size scales (up to
≈ 100 nodes); and communities of size scale beyond ≈ 100 nodes gradually “blend into” the expander-
like core of the network and thus become less “community-like,” with a roughly inverse relationship
between community size and optimal community quality. This observation agrees well with the
so-called Dunbar number which gives a limit to the size of a well-functioning community.
However, this behavior is not explained, even at a qualitative level, by any of the commonly-used
network generation models. Moreover, it is exactly the opposite of what one would expect based
on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social
networks that have served as testbeds of community detection algorithms. The relatively gradual
increase of the network community profile plot as a function of increasing community size depends in
a subtle manner on the way in which local clustering information is propagated from smaller to larger
size scales in the network. We have found that a generative graph model, in which new edges are
added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network
community profile plot similar to what we observe in our network datasets.
community structure in social networks
hypothesis : social networks have well-formed communities
loose definition of community: a set of vertices densely
connected to each other and sparsely connected to the rest of
the graph
artificial communities:
http://projects.skewed.de/graph-tool/
community structure in social networks
study community structure in an extensive collection of real-world networks
authors introduce the network community profile (NCP) plot
it characterizes the best possible community over a range of size scales
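To make the NCP idea concrete, here is a minimal sketch. It assumes the usual definition of conductance (edges leaving a set, divided by the smaller of the set's volume and the volume of its complement) and finds the best set of each size by brute force on a toy graph of two triangles joined by one edge. The paper uses approximation algorithms for graph partitioning, not brute force; the function names and the toy graph are assumptions for illustration only.

```python
from itertools import combinations

def conductance(adj, subset):
    """Number of edges leaving `subset`, divided by the smaller of the two volumes (sums of degrees)."""
    inside = set(subset)
    cut = sum(1 for u in inside for w in adj[u] if w not in inside)
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")

def ncp_brute_force(adj, max_size):
    """For each size k, the lowest conductance over all node sets of size k (tiny graphs only)."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, s) for s in combinations(nodes, k))
        for k in range(1, max_size + 1)
    }

# Toy graph (an assumption for illustration): two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, w in edges:
    adj[u].add(w)
    adj[w].add(u)

ncp = ncp_brute_force(adj, max_size=3)
print(ncp)
```

Lower conductance means a better community. On this toy graph the best set of size 3 is one of the triangles, which is connected to the rest by a single edge.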
community structure in social networks
dolphins network and its NCP
(source [Leskovec et al., 2009])
Frieze, Gionis, Tsourakakis: Algorithmic Techniques for Modeling and Mining Large Graphs
community structure in social networks
NCP on DBLP co-authorship
do large-scale real-world networks have such nice artificial structure?
NO!
NCP of a DBLP graph (source [Leskovec et al., 2009])
Network N E Nb/N Eb/E d̄ d̃ C̄ D D̄ Description
Social networks
Delicious 147,567 301,921 0.40 0.65 4.09 48.44 0.30 24 6.28 del.icio.us collaborative tagging social network
Epinions 75,877 405,739 0.48 0.90 10.69 183.88 0.26 15 4.27 Who-trusts-whom network from epinions.com [142]
Flickr 404,733 2,110,078 0.33 0.86 10.43 442.75 0.40 18 5.42 Flickr photo sharing social network [101]
LinkedIn 6,946,668 30,507,070 0.47 0.88 8.78 351.66 0.23 23 5.43 Social network of professional contacts
LiveJournal01 3,766,521 30,629,297 0.78 0.97 16.26 111.24 0.36 23 5.55 Friendship network of a blogging community [20]
LiveJournal11 4,145,160 34,469,135 0.77 0.97 16.63 122.44 0.36 23 5.61 Friendship network of a blogging community [20]
LiveJournal12 4,843,953 42,845,684 0.76 0.97 17.69 170.66 0.35 20 5.53 Friendship network of a blogging community [20]
Messenger 1,878,736 4,079,161 0.53 0.78 4.34 15.40 0.09 26 7.42 Instant messenger social network
Email-All 234,352 383,111 0.18 0.50 3.27 576.87 0.50 14 4.07 Research organization email network (all addresses) [113]
Email-InOut 37,803 114,199 0.47 0.82 6.04 165.73 0.58 8 3.74 (all addresses but email has to be sent both ways) [113]
Email-Inside 986 16,064 0.90 0.99 32.58 74.66 0.45 7 2.60 (only emails inside the research organization) [113]
Email-Enron 33,696 180,811 0.61 0.90 10.73 142.36 0.71 13 3.99 Enron email dataset [100]
Answers 488,484 1,240,189 0.45 0.78 5.08 251.78 0.11 22 5.72 Yahoo Answers social network
Answers-1 26,971 91,812 0.56 0.87 6.81 59.17 0.08 16 4.49 Cluster 1 from Yahoo Answers
Answers-2 25,431 65,551 0.48 0.80 5.16 56.57 0.10 15 4.76 Cluster 2 from Yahoo Answers
Answers-3 45,122 165,648 0.53 0.87 7.34 417.83 0.21 15 3.94 Cluster 3 from Yahoo Answers
Answers-4 93,971 266,199 0.49 0.82 5.67 94.48 0.08 16 4.91 Cluster 4 from Yahoo Answers
Answers-5 5,313 11,528 0.41 0.73 4.34 29.55 0.12 14 4.75 Cluster 5 from Yahoo Answers
Answers-6 290,351 613,237 0.40 0.71 4.22 57.16 0.09 22 5.92 Cluster 6 from Yahoo Answers
Information (citation) networks
Cit-Patents 3,764,105 16,511,682 0.82 0.96 8.77 21.34 0.09 26 8.15 Citation network of all US patents [112]
Cit-hep-ph 34,401 420,784 0.96 1.00 24.46 63.50 0.30 14 4.33 Citations between physics (arxiv hep-ph) papers [78]
Cit-hep-th 27,400 352,021 0.94 0.99 25.69 106.40 0.33 15 4.20 Citations between physics (arxiv hep-th) papers [78]
Blog-nat05-6m 29,150 182,212 0.74 0.96 12.50 342.51 0.24 10 3.40 Blog citation network (6 months of data) [116]
Blog-nat06all 32,384 315,713 0.87 0.99 19.50 153.08 0.20 18 3.94 Blog citation network (1 year of data) [116]
Post-nat05-6m 238,305 297,338 0.21 0.34 2.50 39.51 0.13 45 10.34 Blog post citation network (6 months) [116]
Post-nat06all 437,305 565,072 0.22 0.38 2.58 35.54 0.11 54 10.48 Blog post citation network (1 year) [116]
Collaboration networks
AtA-IMDB 883,963 27,473,042 0.87 0.99 62.16 517.40 0.79 15 3.48 IMDB actor collaboration network from Dec 2007
CA-astro-ph 17,903 196,972 0.89 0.98 22.00 65.70 0.67 14 4.21 Co-authorship in astro-ph of arxiv.org [112]
CA-cond-mat 21,363 91,286 0.81 0.93 8.55 22.47 0.70 15 5.36 Co-authorship in cond-mat category [112]
CA-gr-qc 4,158 13,422 0.64 0.78 6.46 17.98 0.66 17 6.10 Co-authorship in gr-qc category [112]
CA-hep-ph 11,204 117,619 0.81 0.97 21.00 130.88 0.69 13 4.71 Co-authorship in hep-ph category [112]
CA-hep-th 8,638 24,806 0.68 0.85 5.74 12.99 0.58 18 5.96 Co-authorship in hep-th category [112]
CA-DBLP 317,080 1,049,866 0.67 0.84 6.62 21.75 0.73 23 6.75 DBLP co-authorship network [20]
Table 1: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction of nodes not in whiskers
(size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree d̄ = 2E/N; second-order average
degree d̃; average clustering coefficient C̄; diameter D; and average path length D̄.
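The average-degree column can be re-derived from the N and E columns with the formula in the caption, 2E/N. The short check below uses three rows copied from Table 1; the dictionary layout and the helper name are illustrative assumptions.

```python
# (N, E) pairs copied from Table 1 for three of the listed networks.
rows = {
    "Delicious": (147_567, 301_921),
    "Epinions": (75_877, 405_739),
    "CA-DBLP": (317_080, 1_049_866),
}

def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: every edge contributes to two node degrees."""
    return 2 * num_edges / num_nodes

avg = {name: round(average_degree(n, e), 2) for name, (n, e) in rows.items()}
print(avg)  # matches the table's average-degree column: 4.09, 10.69, 6.62
```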
Network N E Nb/N Eb/E d̄ d̃ C̄ D D̄ Description
Web graphs
Web-BerkStan 319,717 1,542,940 0.57 0.88 9.65 1,067.55 0.32 35 5.66 Web graph of Stanford and UC Berkeley [98]
Web-Google 855,802 4,291,352 0.75 0.92 10.03 170.35 0.62 24 6.27 Web graph Google released in 2002 [3]
Web-Notredame 325,729 1,090,108 0.41 0.76 6.69 280.68 0.47 46 7.22 Web graph of University of Notre Dame [11]
Web-Trec 1,458,316 6,225,033 0.59 0.78 8.54 682.89 0.68 112 8.58 Web graph of TREC WT10G web corpus [2]
Internet networks
As-RouteViews 6,474 12,572 0.62 0.80 3.88 164.81 0.40 9 3.72 AS from Oregon Exchange BGP Route View [112]
As-Caida 26,389 52,861 0.61 0.81 4.01 281.93 0.33 17 3.86 CAIDA AS Relationships Dataset
As-Skitter 1,719,037 12,814,089 0.99 1.00 14.91 9,934.01 0.17 5 3.44 AS from traceroutes run daily in 2005 by Skitter
As-Newman 22,963 48,436 0.65 0.83 4.22 261.46 0.35 11 3.83 AS graph from Newman [5]
As-Oregon 13,579 37,448 0.72 0.90 5.52 235.97 0.46 9 3.58 Autonomous systems [1]
Gnutella-25 22,663 54,693 0.59 0.83 4.83 10.75 0.01 11 5.57 Gnutella network on March 25 2000 [143]
Gnutella-30 36,646 88,303 0.55 0.81 4.82 11.46 0.01 11 5.75 Gnutella P2P network on March 30 2000 [143]
Gnutella-31 62,561 147,878 0.54 0.81 4.73 11.60 0.01 11 5.94 Gnutella network on March 31 2000 [143]
eDonkey 5,792,297 147,829,887 0.93 1.00 51.04 6,139.99 0.08 5 3.66 P2P eDonkey graph for a period of 47 hours in 2004
Bi-partite networks
IpTraffic 2,250,498 21,643,497 1.00 1.00 19.23 94,889.05 0.00 5 2.53 IP traffic graph a single router for 24 hours
AtP-astro-ph 54,498 131,123 0.70 0.87 4.81 16.67 0.00 28 7.78 Authors-to-papers network of astro-ph [116]
AtP-cond-mat 57,552 104,179 0.65 0.79 3.62 10.54 0.00 31 9.96 Authors-to-papers network of cond-mat [116]
AtP-gr-qc 14,832 22,266 0.47 0.60 3.00 9.72 0.00 35 11.08 Authors-to-papers network of gr-qc [116]
AtP-hep-ph 47,832 86,434 0.60 0.76 3.61 16.80 0.00 27 8.55 Authors-to-papers network of hep-ph [116]
AtP-hep-th 39,986 64,154 0.53 0.68 3.21 13.07 0.00 36 10.74 Authors-to-papers network of hep-th [116]
AtP-DBLP 615,678 944,456 0.49 0.64 3.07 13.61 0.00 48 12.69 DBLP authors-to-papers bipartite network
Spending 1,831,540 2,918,920 0.34 0.58 3.19 1,536.35 0.00 26 5.62 Users-to-keywords they bid
Hw7 653,260 2,278,448 0.99 0.99 6.98 346.85 0.00 24 6.26 Downsampled advertiser-query bid graph
Netflix 497,959 100,480,507 1.00 1.00 403.57 28,432.89 0.00 5 2.31 Users-to-movies they rated. From Netflix prize [4]
QueryTerms 13,805,808 17,498,668 0.28 0.41 2.53 14.92 0.00 86 19.81 Users-to-queries they submit to a search engine
Clickstream 199,308 951,649 0.39 0.87 9.55 430.74 0.00 7 3.83 Users-to-URLs they visited [126]
Biological networks
Bio-Proteins 4,626 14,801 0.72 0.91 6.40 24.25 0.12 12 4.24 Yeast protein interaction network [51]
Bio-Yeast 1,458 1,948 0.37 0.51 2.67 7.13 0.14 19 6.89 Yeast protein interaction network data [92]
Bio-YeastP0.001 353 1,517 0.73 0.93 8.59 20.18 0.57 11 4.33 Yeast protein-protein interaction map [135]
Bio-YeastP0.01 1,266 8,511 0.79 0.97 13.45 47.73 0.44 12 3.87 Yeast protein-protein interaction map [135]
Table 2: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction of nodes not in whiskers
(size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree d̄ = 2E/N; second-order average
degree d̃; average clustering coefficient C̄; diameter D; and average path length D̄.
Network N E Nb/N Eb/E d̄ d̃ C̄ D D̄ Description
Nearly low-dimensional networks
Road-CA 1,957,027 2,760,388 0.80 0.85 2.82 3.17 0.06 865 310.97 California road network
Road-USA 126,146 161,950 0.97 0.98 2.57 2.81 0.03 617 218.55 USA road network (only main roads)
Road-PA 1,087,562 1,541,514 0.79 0.85 2.83 3.20 0.06 794 306.89 Pennsylvania road network
Road-TX 1,351,137 1,879,201 0.78 0.84 2.78 3.15 0.06 1,064 418.73 Texas road network
PowerGrid 4,941 6,594 0.62 0.69 2.67 3.87 0.11 46 19.07 Power grid of Western States Power Grid [156]
Mani-faces7k 696 6,979 0.98 0.99 20.05 37.99 0.56 16 5.52 Faces (64x64 grayscale images) (connect 7k closest pairs)
Mani-faces4k 663 3,465 0.90 0.97 10.45 20.20 0.56 29 8.96 Faces (connect 4k closest pairs)
Mani-faces2k 551 1,981 0.84 0.94 7.19 12.77 0.54 32 11.07 Faces (connect 2k closest pairs)
Mani-facesK10 698 6,935 1.00 1.00 19.87 25.32 0.51 6 3.25 Faces (connect every node to 10 nearest neighbors)
Mani-facesK3 698 2,091 1.00 1.00 5.99 7.98 0.45 9 4.89 Faces (connect every node to 3 nearest neighbors)
Mani-facesK5 698 3,480 1.00 1.00 9.97 12.91 0.48 7 4.03 Faces (connect every node to 5 nearest neighbors)
Mani-swiss200k 20,000 200,000 1.00 1.00 20.00 21.08 0.59 103 37.21 Swiss-roll (connect 200k nearest pairs of nodes)
Mani-swiss100k 19,990 99,979 1.00 1.00 10.00 11.02 0.59 162 58.32 Swiss-roll (connect 100k nearest pairs of nodes)
Mani-swiss60k 19,042 57,747 0.93 0.96 6.07 7.03 0.59 243 89.15 Swiss-roll (connect 60k nearest pairs of nodes)
Mani-swissK10 20,000 199,955 1.00 1.00 20.00 25.38 0.56 10 5.47 Swiss-roll (every node connects to 10 nearest neighbors)
Mani-swissK5 20,000 99,990 1.00 1.00 10.00 12.89 0.54 13 8.34 Swiss-roll (every node connects to 5 nearest neighbors)
Mani-swissK3 20,000 59,997 1.00 1.00 6.00 7.88 0.50 17 6.89 Swiss-roll (every node connects to 3 nearest neighbors)
IMDB Actor-to-Movie graphs
AtM-IMDB 2,076,978 5,847,693 0.49 0.82 5.63 65.41 0.00 32 6.82 Actors-to-movies graph from IMDB (imdb.com)
Imdb-top30 198,430 566,756 0.99 1.00 5.71 18.19 0.00 26 8.32 Actors-to-movies graph heavily preprocessed
Imdb-raw07 601,481 1,320,616 0.54 0.79 4.39 20.94 0.00 32 8.55 Country clusters were extracted from this graph
Imdb-France 35,827 74,201 0.51 0.76 4.14 14.62 0.00 20 6.57 Cluster of French movies
Imdb-Germany 21,258 42,197 0.56 0.78 3.97 13.69 0.00 34 7.47 German movies (to actors that played in them)
datasets! publicly available in SNAP
community structure in social networks
main findings
1. up to a certain size k (k ∼ 100 vertices) there exist good cuts
as the size increases so does the quality of the community
2. at the size k we observe the best possible community
such communities are typically connected to the remainder
with a single edge
3. above the size k the community quality decreases
this is because larger communities blend into the rest of the network and gradually disappear
summary
Community structure in large networks: natural cluster
sizes and the absence of large well-defined clusters
Leskovec, Lang, Dasgupta, Mahoney
hypothesis : well-formed and interesting, assumed true
data : very extensive collection
methodology : introduce a new metric (NCP)
impact / interestingness : challenged the starting hypothesis
reproducibility : datasets and code publicly available
Community structure in large networks: natural cluster
sizes and the absence of large well-defined clusters
Leskovec, Lang, Dasgupta, Mahoney
originality: 1 (low) 2 3 4 5 (high)
impact: 1 (low) 2 3 4 5 (high)
rigorousness / technical novelty: 1 (low) 2 3 4 5 (high)
reproducibility: 1 (low) 2 3 4 5 (high)
arXiv:physics/0603229v3 [physics.soc-ph] 28 Jan 2007
Graph Evolution:
Densification and Shrinking Diameters
Jure Leskovec
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Jon Kleinberg
Department of Computer Science, Cornell University, Ithaca, NY
Christos Faloutsos
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
February 2, 2008
Abstract
How do real graphs evolve over time? What are “normal” growth patterns in
social, technological, and information networks? Many studies have discovered
patterns in static graphs, identifying properties in a single snapshot of a large
network, or in a very small number of snapshots; these include heavy tails
for in- and out-degree distributions, communities, small-world phenomena, and
others. However, given the lack of information about network evolution over
long periods, it has been hard to convert these findings into statements about
trends over time.
Here we study a wide range of real graphs, and we observe some surprising
phenomena. First, most of these graphs densify over time, with the number
of edges growing super-linearly in the number of nodes. Second, the average
distance between nodes often shrinks over time, in contrast to the conventional
wisdom that such distance parameters should increase slowly as a function of
the number of nodes (like O(log n) or O(log(log n))).
Existing graph generation models do not exhibit these types of behavior,
even at a qualitative level. We provide a new graph generator, based on a
“forest fire” spreading process, that has a simple, intuitive justification, requires
very few parameters (like the “flammability” of nodes), and produces graphs
exhibiting the full range of properties observed both in prior work and in the
graph evolution and shrinking diameters
networks evolve over time
typically new vertices/edges are added (not many deletions)
how do network distances change over time?
constant average degree and vertex addition…
… implies diameter = O(log n), slowly increasing
according to random-graph model
also according to other more “realistic” models
e.g., preferential attachment
graph evolution and shrinking diameters
empirical observation :
as networks evolve, distances shrink (e.g., the diameter shrinks)
why?
the number of edges grows faster than the number of vertices
the graph becomes denser (graph densification)
Time-evolving networks
J. Leskovec, J. Kleinberg, C. Faloutsos
[Leskovec et al., 2005b]
• densification power law:
$|E_t| \propto |V_t|^{\alpha}$, $1 \le \alpha \le 2$
• shrinking diameters: diameter is shrinking over time.
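The densification exponent can be estimated by a least-squares line fit in log-log space, since a power law relating edges to nodes becomes a straight line whose slope is the exponent. The sketch below is an illustration on synthetic snapshot sizes, not the authors' code; the function name and the synthetic constants 0.5 and 1.2 are assumptions.

```python
import math

def fit_power_law(nodes, edges):
    """Least-squares fit of log(edges) = log(c) + alpha * log(nodes); returns (alpha, c)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return alpha, c

# Synthetic snapshots (assumption): edge counts follow 0.5 * n^1.2 exactly.
nodes = [1000, 2000, 4000, 8000, 16000]
edges = [0.5 * n ** 1.2 for n in nodes]

alpha, c = fit_power_law(nodes, edges)
print(alpha, c)
```

An exponent of 1 would mean a constant average degree, and an exponent of 2 would mean a constant fraction of all possible edges is present; values strictly between the two correspond to densification.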
graph evolution and shrinking diameters
average degree
[Figure: average out-degree over time; panels: (a) arXiv (x-axis: year of publication), (b) Patents (x-axis: year granted), (c) Autonomous Systems (x-axis: time in days), (d) Affiliation network (x-axis: year of publication)]
Figure 1: The average node out-degree over time. Notice that it increases, in all 4 datasets.
That is, all graphs are densifying.
graph evolution and shrinking diameters
number of edges
[Figure: number of edges versus number of nodes, log-log scale, with fitted power laws; panels: (a) arXiv: Edges = 0.0113 x^1.69, R^2 = 1.0; (b) Patents: Edges = 0.0002 x^1.66, R^2 = 0.99; (c) Autonomous Systems: Edges = 0.87 x^1.18, R^2 = 1.00; (d) Affiliation network: Edges = 0.4255 x^1.15, R^2 = 1.0; (e) Email network: Edges = 1 x^1.12, R^2 = 1.00; (f) IMDB actors to movies network: Edges = 0.9 x^1.11, R^2 = 0.98]
Figure 2: Number of edges e(t) versus number of nodes n(t), in log-log scales, for several
graphs. All 4 graphs obey the Densification Power Law, with a consistently good fit. Slopes:
a = 1.68, 1.66, 1.18, 1.15, 1.12, and 1.11 respectively.
graph evolution and shrinking diameters
effective diameter
[Figure: effective diameter over time; panels: (a) arXiv citation graph, (b) Affiliation network, (c) Patents citation graph, (d) Autonomous Systems (x-axis: size of the graph in number of nodes, with linear fit), (e) Email network, (f) IMDB actors to movies network; curves: full graph, post-cutoff subgraph, post-cutoff subgraph with no past]
Figure 3: The effective diameter over time for 6 different datasets. Notice consistent decrease
of the diameter over time.
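The slides do not define the effective diameter. A common definition, assumed here, is the smallest number of hops within which 90% of all connected node pairs can reach each other. The sketch below computes an integer version with breadth-first search on an undirected graph; a smoothed, interpolated variant is often used in practice, and the function names and toy graphs are assumptions.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def effective_diameter(adj, q=0.9):
    """Smallest d such that at least a fraction q of connected ordered pairs are within d hops."""
    hops = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                hops.append(d)
    if not hops:
        return 0
    hops.sort()
    return hops[math.ceil(q * len(hops)) - 1]

# Toy graphs (assumptions): a path on 4 nodes and a complete graph on 4 nodes.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}

d_path = effective_diameter(path4)
d_complete = effective_diameter(k4)
print(d_path, d_complete)
```

All-pairs BFS is quadratic in the number of nodes, so for graphs of the size studied here one would sample source nodes or use an approximate neighborhood-counting method instead.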
graph evolution and shrinking diameters
theoretical justification :
proposed a graph-evolution model that explains the
empirical findings
(graph densification and shrinking diameters)
forest fire model (FF)
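A heavily simplified sketch of a forest-fire style generator follows. It is an assumption-laden illustration, not the authors' model: the graph is undirected, only one burning probability p is used, and the number of neighbors burned at each step is drawn from a geometric distribution. Each new node picks a random ambassador, the fire spreads recursively from it, and the new node links to every burned node.

```python
import random

def forest_fire_graph(n, p, seed=0):
    """Simplified undirected forest-fire growth with a single burning probability p (0 <= p < 1)."""
    rng = random.Random(seed)
    adj = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)      # uniformly chosen existing node
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            candidates = sorted(u for u in adj[w] if u not in burned)
            rng.shuffle(candidates)
            x = 0                          # geometric number of neighbors to burn, mean p / (1 - p)
            while rng.random() < p:
                x += 1
            for u in candidates[:x]:
                burned.add(u)
                frontier.append(u)
        adj[v] = set()
        for u in burned:                   # the new node links to everything the fire reached
            adj[v].add(u)
            adj[u].add(v)
    return adj

# Illustrative parameters (assumptions, not from the slides).
graph = forest_fire_graph(n=200, p=0.35, seed=3)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), num_edges)
```

Because later nodes can burn through larger neighborhoods, the number of edges per new node tends to grow as the graph grows, which is the intuition behind densification in this family of models.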
summary
Graph evolution: densification and shrinking diameters
Leskovec, Kleinberg, Faloutsos
hypothesis : well-formed, assumed true
as the graph evolves, distances increase
data : extensive collection (how to collect evolving networks?)
methodology : simple statistics, but never done before
impact/interestingness : challenged the hypothesis, interesting findings
reproducibility : datasets and code publicly available
Graph evolution: densification and shrinking diameters
Leskovec, Kleinberg, Faloutsos
originality: 1 (low) 2 3 4 5 (high)
impact: 1 (low) 2 3 4 5 (high)
rigorousness / technical novelty: 1 (low) 2 3 4 5 (high)
reproducibility: 1 (low) 2 3 4 5 (high)
Feedback Effects between Similarity and Social Influence
in Online Communities
David Crandall
Dept. of Computer Science
Cornell University
Ithaca, NY 14853
crandall@cs.cornell.edu
Dan Cosley
Dept. of Communication
Cornell University
Ithaca, NY 14853
drc44@cornell.edu
Daniel Huttenlocher
Dept. of Computer Science
Cornell University
Ithaca, NY 14853
dph@cs.cornell.edu
Jon Kleinberg
Dept. of Computer Science
Cornell University
Ithaca, NY 14853
kleinber@cs.cornell.edu
Siddharth Suri
Dept. of Computer Science
Cornell University
Ithaca, NY 14853
ssuri@cs.cornell.edu
ABSTRACT
A fundamental open question in the analysis of social net-
works is to understand the interplay between similarity and
social ties. People are similar to their neighbors in a social
network for two distinct reasons: first, they grow to resemble
their current friends due to social influence; and second, they
tend to form new links to others who are already like them,
a process often termed selection by sociologists. While both
factors are present in everyday social processes, they are in
tension: social influence can push systems toward unifor-
mity of behavior, while selection can lead to fragmentation.
As such, it is important to understand the relative effects
of these forces, and this has been a challenge due to the
difficulty of isolating and quantifying them in real settings.
We develop techniques for identifying and modeling the in-
teractions between social influence and selection, using data
from online communities where both social interaction and
changes in behavior over time can be measured. We find
clear feedback effects between the two factors, with rising
similarity between two individuals serving, in aggregate, as
an indicator of future interaction — but with similarity then
continuing to increase steadily, although at a slower rate, for
the current activities of their friends, or of the people most
similar to them?
Categories and Subject Descriptors: H.2.8 Database
Management: Database Applications – Data Mining
General Terms: Measurement, Theory
Keywords: social networks, online communities, social in-
fluence
1. INTRODUCTION
Social influence and selection. A fundamental property
of social networks is that people tend to have attributes
similar to those of their friends. There are two underlying
reasons for this. First, the process of social influence [7] leads
people to adopt behaviors exhibited by those they interact
with; this effect is at work in many settings where new ideas
diffuse by word-of-mouth or imitation through a network of
people [19, 22]. A second, distinct reason is that people tend
to form relationships with others who are already similar to
them. This phenomenon, which is often termed selection,
has a long history of study in sociology [13, 16].
The two forces of social influence and selection are both
seen in a wide range of social settings: people decide to adopt
similarity and social influence
observation : people are similar to their friends
selection or influence?
questions :
how does social interaction affect interests, and vice versa?
can we use social similarity and interaction to predict
future behavior?
user interests and
similarity between users
focus on Wikipedia editors
who edits which page?
edits up to time t form a vector expressing user interests
up to that time point
similarity between users
consider the similarity of two users who “meet”
one posts in the discussion page of the other
use one of the more common measures, the cosine metric,
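$$\mathrm{Cosine}(\vec{u}, \vec{v}) = \cos\theta_{\vec{u},\vec{v}} = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|_2 \, \|\vec{v}\|_2}, \qquad (1)$$
where $\|\vec{v}\|_2$ denotes the Euclidean norm of $\vec{v}$.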
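A minimal sketch of the cosine measure on edit-count vectors follows. Each user is represented as a dictionary from page title to number of edits; the user names, page titles, and counts are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {page: edit_count} dictionaries."""
    dot = sum(count * v.get(page, 0) for page, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for users with no edits
    return dot / (norm_u * norm_v)

# Hypothetical users: bob edits the same pages as alice, only ten times as often.
alice = {"Graph theory": 3, "Helsinki": 1}
bob = {"Graph theory": 30, "Helsinki": 10}
carol = {"Baking": 5}

sim_ab = cosine(alice, bob)    # same direction, different editing rate
sim_ac = cosine(alice, carol)  # no pages in common
print(sim_ab, sim_ac)
```

Scaling a vector does not change the result, which illustrates why the measure does not depend on how actively a user edits.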
While a comparison of similarity measures is not the fo-
cus of our current work, we have evaluated a wide range of
measures for our purpose. We use the cosine metric here be-
cause it is independent of the rate at which people are edit-
user interests and
similarity between users
main finding :
Figure 1: Average cosine similarity of user pairs as
a function of the number of edits from time of first
interaction, for Wikipedia.
user interests and
similarity between users
possible explanation :
feedback loop between social influence and selection
similarity leads to interaction, which leads to further similarity
proposed a theoretical model to explain the findings
(neighbors may affect actions and interactions)
predicting future behavior based on
user similarity and user interaction
(a) Wikipedia (b) LiveJournal
Figure 4: (a) Probability of joining a community based on k exposure via social ties versus similarity ties
for (a) Wikipedia and (b) LiveJournal. The solid black curves corresponds to social ties and the dashed red
curves to similarity ties. The error bars represent ±2 standard errors.
solid black curves are drawn using neighbors in the social
influence graph for each community, while the dashed red
months apart. Many first edits close to t1 would suggest
effects based on short-term processes, such as immediate
summary
Feedback effects between similarity and social influence in
online communities
Crandall et al.
question to study : interplay between influence and selection
data : Wikipedia edits (creative but somewhat limited)
methodology : simple statistics, theoretical model, prediction model
impact/interestingness : some interesting findings
reproducibility : datasets publicly available
Feedback effects between similarity and social influence in
online communities
Crandall et al.
originality: 1 (low) 2 3 4 5 (high)
impact: 1 (low) 2 3 4 5 (high)
rigorousness / technical novelty: 1 (low) 2 3 4 5 (high)
reproducibility: 1 (low) 2 3 4 5 (high)
Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec∗† Lars Backstrom∗ Jon Kleinberg∗
∗Cornell University †Stanford University
jure@cs.stanford.edu lars@cs.cornell.edu kleinber@cs.cornell.edu
ABSTRACT
Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days, the time scale at which we perceive news and events.
We develop a framework for tracking short, distinctive phrases that travel relatively intact through on-line text; developing scalable algorithms for clustering textual variants of such phrases, we identify a broad class of memes that exhibit wide spread and rich variation on a daily basis. As our principal domain of study, we show how such a meme-tracking approach can provide a coherent representation of the news cycle: the daily rhythms in the news media that have long been the subject of qualitative interpretation but have never been captured accurately enough to permit actual quantitative analysis. We tracked 1.6 million mainstream media sites and blogs over a period of three months with the total of 90 million articles and we find a set of novel and persistent temporal patterns in the news cycle. In particular, we observe a typical lag of 2.5 hours between the peaks of attention to a phrase in the news media and in blogs respectively, with divergent behavior around the overall peak and a “heartbeat”-like pattern in the handoff between news and blogs. We also develop and analyze a mathematical model for the kinds of temporal variation that the system exhibits.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications, Data mining
General Terms: Algorithms; Experimentation.
Keywords: Meme-tracking, Blogs, News media, News cycle, Information cascades, Information diffusion, Social networks
[…] probabilistic term mixtures have been successful at identifying long-range trends in general topics over time [5, 7, 16, 17, 30, 31]. At the other extreme, identifying hyperlinks between blogs and extracting rare named entities has been used to track short information cascades through the blogosphere [3, 14, 20, 23]. However, between these two extremes lies much of the temporal and textual range over which propagation on the web and between people typically occurs, through the continuous interaction of news, blogs, and websites on a daily basis. Intuitively, short units of text, short phrases, and “memes” that act as signatures of topics and events propagate and diffuse over the web, from mainstream media to blogs, and vice versa. This is exactly the focus of our study here.
Moreover, it is at this intermediate temporal and textual granularity of memes and phrases that people experience news and current events. A succession of story lines that evolve and compete for attention within a relatively stable set of broader topics collectively produces an effect that commentators refer to as the news cycle. Tracking dynamic information at this temporal and topical resolution has proved difficult, since the continuous appearance, growth, and decay of new story lines takes place without significant shifts in the overall vocabulary; in general, this process can also not be closely aligned with the appearance and disappearance of specific named entities (or hyperlinks) in the text. As a result, while the dynamics of the news cycle has been a subject of intense interest to researchers in media and the political process, the focus has been mainly qualitative, with a corresponding lack of techniques for undertaking quantitative analysis of the news cycle as a whole.
Our approach to meme-tracking, with applications to the news cycle. Here we develop a method for tracking units of information as they spread over the web. Our approach is the first to scalably identify short distinctive phrases that travel relatively intact through […]
meme tracking
understand the dynamics of reported news
focus on 24-hour news cycles
questions :
do such news cycles exist?
can we detect them in the data?
can we measure their properties?
meme tracking
dataset :
90 million news articles from the 2008 US presidential election
how to identify news cycles :
urls, topics, named entities, bag-of-words…?
approach taken : quotes (memes)
easy to manage at large scale
travel relatively unchanged across many articles
[Figure 1: A small portion of the full set of variants of Sarah Palin’s quote, “Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he’s palling around with terrorists who would target their own country.” The arrows indicate the (approximate) inclusion of one variant in another, as part of the methodology developed in Section 2.]
example variants shown in the figure :
palling around with terrorists who would target their own country
pal around with terrorists who targeted their own country
our opponent is someone who sees america as imperfect enough to pal around with terrorists who target their own country
phrases with this property are exclusively produced by spammers.
(We use ε = .25, L = 4, and M = 10 in our implementation.)
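The Figure 1 caption speaks of the "(approximate) inclusion of one variant in another". A minimal sketch of one way to test this, assuming a word-level edit distance between the shorter phrase and the best-matching stretch of the longer one. The function name and the distance choice are illustrative assumptions, not the paper's exact procedure, and the sketch does not use the ε, L, and M parameters quoted above.

```python
def inclusion_distance(short_phrase, long_phrase):
    """Smallest word-level edit distance between short_phrase and any
    contiguous run of words in long_phrase (fuzzy substring matching).
    Hypothetical helper for illustration only."""
    s = short_phrase.split()
    t = long_phrase.split()
    # prev[j]: cost of matching the first i words of s against a run of
    # words in t that ends at position j; row 0 is all zeros because the
    # match may start anywhere in t at no cost.
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            substitution = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, substitution)
        prev = cur
    return min(prev)


exact = inclusion_distance(
    "palling around with terrorists",
    "that he s palling around with terrorists who would target their own country",
)
near = inclusion_distance(
    "a force for good in the world",
    "we see america as a force for good in this world",
)
print(exact, near)
```

A phrase graph could then draw an edge from a shorter variant to a longer one whenever this distance is small relative to the length of the shorter variant; the threshold is a design choice left open here.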
meme tracking
interesting optimization problem
identify single-rooted propagations
Figure 2: Phrase graph. Each phrase is a node and we want to delete the fewest edges so that each resulting connected component has a single root node/phrase, a node with zero out-edges. By deleting the indicated edges we obtain the optimal solution.
To begin, we define some terminology. We will refer to each news article or blog post as an item, and refer to a quoted string […]
meme tracking
volume distributions
[excerpt cropped on the left; gaps are marked with […]]
[…] 5 in Fig. 2). So, the phrase cluster should be a […] all paths terminate in a single root node. […] To identify phrase clusters, we would like delete […] weight from the phrase graph so it falls apart […] with the property that each piece “feeds into” […] that can serve as the exemplar for the phrase […] we define a directed acyclic graph to be […] contains exactly one root node. (Note that […] one root.) We now define the following DAG […]: Given a directed acyclic graph with […] delete a set of edges of minimum total […] that each of the resulting components is […] 2 shows a DAG with all edge weights equal to […] edges forms the unique optimal solution. […] DAG Partitioning is computationally intractable […] We then discuss the heuristic we use for the […] which we find to work well in practice. […] DAG Partitioning is NP-hard.
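Since DAG Partitioning is NP-hard, a heuristic is needed in practice. The sketch below shows one plausible greedy heuristic, assuming edges point from a phrase toward the phrases that include it (so roots have zero out-edges, as in Figure 2) and that each phrase joins the cluster to which it has the most edge weight. The function and variable names are hypothetical, and this is not claimed to be the authors' exact heuristic.

```python
from collections import defaultdict, deque


def partition_phrase_dag(nodes, edges):
    """Greedy single-rooted partitioning of a phrase DAG.

    nodes: iterable of comparable node ids.
    edges: dict mapping (u, v) -> weight, meaning u points to v.
    Returns a dict node -> root id of the cluster it was assigned to.
    """
    nodes = list(nodes)
    out_edges = defaultdict(list)
    in_nodes = defaultdict(list)
    for (u, v), weight in edges.items():
        out_edges[u].append((v, weight))
        in_nodes[v].append(u)

    # Process a node only after all nodes it points to have a cluster.
    pending = {n: len(out_edges[n]) for n in nodes}
    queue = deque(n for n in nodes if pending[n] == 0)
    cluster = {}
    while queue:
        n = queue.popleft()
        if not out_edges[n]:
            cluster[n] = n  # a root starts its own cluster
        else:
            votes = defaultdict(float)
            for v, weight in out_edges[n]:
                votes[cluster[v]] += weight
            # Ties are broken toward the smallest root id.
            cluster[n] = max(sorted(votes), key=votes.get)
        for u in in_nodes[n]:
            pending[u] -= 1
            if pending[u] == 0:
                queue.append(u)
    return cluster


# Toy graph (made-up): roots are 3 and 4.
toy_edges = {(1, 3): 1, (2, 3): 2, (2, 4): 1, (5, 4): 1,
             (6, 2): 1, (6, 5): 1, (6, 4): 1}
toy_clusters = partition_phrase_dag(range(1, 7), toy_edges)
deleted = [e for e in toy_edges if toy_clusters[e[0]] != toy_clusters[e[1]]]
print(toy_clusters, deleted)
```

Edges whose endpoints land in different clusters are the ones the heuristic deletes; every non-root node keeps at least one edge into its own cluster, so each cluster has exactly one root.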
[Figure 3 plot; axes: No. of items with volume ≥ x vs. Volume, x (both logarithmic); fitted tails: Phrases ∝ x^-1.8, Clusters ∝ x^-2.1, Lipstick ∝ x^-0.85]
Figure 3: Phrase volume distribution. We consider the volume of individual phrases, phrase-clusters, and the phrases that compose the “Lipstick on a pig” cluster. Notice phrases and phrase-clusters have similar power-law distribution while the “Lipstick on a pig” cluster has much fatter tail, which means that popular phrases have unexpectedly high popularity.
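The quantity plotted in Figure 3, the number of items with volume at least x, can be computed directly from a list of volumes. The sketch below also estimates a tail exponent with a least-squares line on log-log axes. This is a rough illustration under stated assumptions: the function names are hypothetical, and a least-squares fit on log-log data is a crude estimator that is used here only for simplicity.

```python
import math
from collections import Counter


def count_at_least(volumes):
    """Return sorted (x, number of items with volume >= x) pairs."""
    counts = Counter(volumes)
    remaining = len(volumes)
    points = []
    for x in sorted(counts):
        points.append((x, remaining))
        remaining -= counts[x]
    return points


def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up volumes, for illustration only.
example_points = count_at_least([1, 1, 2, 3])
# Synthetic points lying exactly on y = 1000 / x.
example_slope = loglog_slope([(1, 1000), (10, 100), (100, 10)])
print(example_points, example_slope)
```

A slope near -1.8 on real phrase volumes would correspond to the "Phrases ∝ x^-1.8" label in the figure.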
[…] to the cluster to which it has the most edges. For example, in Fig. 2 […]
Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
thread dynamics
question to study : identify news cycles, study their dynamics
data : news articles
methodology : interesting computational problems in
managing memes
impact/interestingness : interesting methods
interesting findings
reproducibility : datasets publicly available
summary
Meme-tracking and the dynamics of the news cycle
Leskovec, Backstrom, Kleinberg
Meme-tracking and the dynamics of the news cycle
Leskovec, Backstrom, Kleinberg
originality : 1 2 3 4 5 (1 = low, 5 = high)
impact : 1 2 3 4 5 (1 = low, 5 = high)
rigorousness / technical novelty : 1 2 3 4 5 (1 = low, 5 = high)
reproducibility : 1 2 3 4 5 (1 = low, 5 = high)
Everyone’s an Influencer:
Quantifying Influence on Twitter
Eytan Bakshy (University of Michigan, USA), ebakshy@umich.edu
Jake M. Hofman (Yahoo! Research, NY, USA), hofman@yahoo-inc.com
Winter A. Mason (Yahoo! Research, NY, USA), winteram@yahoo-inc.com
Duncan J. Watts (Yahoo! Research, NY, USA), djw@yahoo-inc.com
ABSTRACT
In this paper we investigate the attributes and relative influence of 1.6M Twitter users by tracking 74 million diffusion events that took place on the Twitter follower graph over a two month interval in 2009. Unsurprisingly, we find that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. We also find that URLs that were rated more interesting and/or elicited more positive feelings by workers on Mechanical Turk were more likely to spread. In spite of these intuitive results, however, we find that predictions of which particular user or URL will generate large cascades are relatively unreliable. We conclude, therefore, that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Finally, we consider a family of hypothetical marketing strategies, defined by the relative cost of identifying versus compensating potential “influencers.” We find that although under some circumstances, the most influential users are also the most cost-effective, under a wide range of plausible assumptions the most cost-effective performance can be realized using “ordinary influencers”: individuals who exert average or even less-than-average influence.
Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology
Keywords
Communication networks, Twitter, diffusion, influence, word of mouth marketing.
1. INTRODUCTION
Word-of-mouth diffusion has long been regarded as an important mechanism by which information can reach large populations, possibly influencing public opinion [14], adoption of innovations [26], new product market share [4], or brand awareness [15]. In recent years, interest among researchers and marketers alike has increasingly focused on whether or not diffusion can be maximized by seeding a piece of information or a new product with certain special individuals, often called “influentials” [34, 15] or simply “influencers,” who exhibit some combination of desirable attributes (whether personal attributes like credibility, expertise, or enthusiasm, or network attributes such as connectivity or centrality) that allows them to influence a disproportionately large number of others [10], possibly indirectly via a cascade of influence [31, 16].
Although appealing, the claim that word-of-mouth diffusion is driven disproportionately by a small number of key influencers necessarily makes certain assumptions about the underlying influence process that are not based directly on empirical evidence. Empirical studies of diffusion are therefore highly desirable, but historically have suffered from two major difficulties. First, the network over which word-of-mouth influence spreads is generally unobservable, hence […]
who is influential in twitter?
questions :
who is influential, and for which content?
(celebrity vs. expert on a topic vs. trusted friend…)
can we predict who is influential?
who is influential in twitter?
dataset :
track 1.6 million users
74 million diffusion events (cascades of shortened urls)
two-month period in 2009
definition of influential :
someone who posts urls that many users retweet
(deliberately narrow, for the purpose of the study)
the dataset
[Figure 1: Probability density of number of bit.ly URLs posted per user; axes: URLs posted vs. density, both logarithmic]
[…] “leaders,” not on prediction.) Second, whereas the focus of previous studies has been largely descriptive (e.g. comparing the most influential users), we are interested explicitly in predicting influence; thus we consider all users, not merely the most influential. Third, in addition to predicting diffusion as a function of the attributes of individual seeds, we also study the effects of content. We believe these differences bring the understanding of diffusion on Twitter closer to practical applications, although as we describe later, experimental studies are still required.
DATA
To study diffusion on Twitter, we combined two separate but related sources of data. First, over the two-month period of September 13 2009 - November 15 2009 we recorded 1.03B public tweets broadcast on Twitter, excluding October 14-16 during which there were intermittent outages in the Twitter API. Of these, we extracted 87M tweets that included bit.ly URLs and which corresponded to distinct diffusion “events,” where each event comprised a single initiator, or “seed,” followed by some number of repostings of the same URL by the seed’s followers, their followers, and so on. Finally, we identified a subset of 74M diffusion events that were initiated by seed users who were active in both the first and second months of the observation period; thus enabling us to train our regression model on first month […]
[…] the same two-month period. We did this by querying the Twitter API to find the followers of every user who posted a bit.ly URL. Subsequently, we placed those followers in a queue to be crawled, thereby identifying their followers, who were then also placed in the queue, and so on. In this way, we obtained a large fraction of the Twitter follower graph comprising all active bit.ly posters and anyone connected to these users via one-way directed chains of followers. Specifically, the subgraph comprised approximately 56M users and 1.7B edges.
Consistent with previous work [7, 18, 35], both the in-degree (“followers”) and out-degree (“friends”) distributions are highly skewed, but the former much more so (whereas the maximum # of followers was nearly 4M, the maximum # of friends was only about 760K), reflecting the passive and one-way nature of the “follow” action on Twitter (i.e. A can follow B without any action required from B). We emphasize, moreover, that because the crawled graph was seeded exclusively with active users, it is almost certainly not representative of the entire follower graph. In particular, active users are likely to have more followers than average, in which case we would expect that the average in-degree will exceed the average out-degree for our sample, as indeed we observe. Table 1 presents some basic statistics of the distributions of the number of friends, followers and number of URLs posted per user.
Table 1: Statistics of the Twitter follower graph and seed activity
          # Followers      # Friends    # Seeds Posted
Median          85.00          82.00             11.00
Mean           557.10         294.10             46.33
Max.     3,984,000.00     759,700.00            54,890
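The queue-based crawl described in the excerpt (start from the users who posted a bit.ly URL, fetch their followers, enqueue them, and repeat) is a breadth-first traversal. A minimal sketch, assuming a caller-supplied `get_followers` function; that name is a hypothetical stand-in and not a real Twitter API call.

```python
from collections import deque


def crawl_follower_graph(seed_users, get_followers):
    """Breadth-first crawl along follower links.

    seed_users: iterable of starting user ids.
    get_followers: hypothetical callable, user id -> iterable of follower ids.
    Returns (set of users reached, set of (follower, followed) edges).
    """
    seen = set(seed_users)
    queue = deque(seen)
    edges = set()
    while queue:
        user = queue.popleft()
        for follower in get_followers(user):
            edges.add((follower, user))
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return seen, edges


# Toy follower lists (made-up): b and c follow a, c follows b.
toy_followers = {"a": ["b", "c"], "b": ["c"], "c": []}
toy_users, toy_graph = crawl_follower_graph(
    ["a"], lambda user: toy_followers.get(user, []))
print(sorted(toy_users), sorted(toy_graph))
```

Because the traversal starts only from active posters, the resulting subgraph inherits the sampling bias that the excerpt points out.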
4. COMPUTING INFLUENCE ON TWITTER
To calculate the influence score for a given URL post,
we tracked the diffusion of the URL from its origin at a
particular “seed” node through a series of reposts—by that
user’s followers, those users’ followers, and so on—until the
diffusion event, or cascade, terminated. To do this, we used
the time each URL was posted: if person B is following
person A, and person A posted the URL before B and was
the only of B’s friends to post the URL, we say person A
influenced person B to post the URL. As Figure 2 shows,
if B has more than one friend who has previously posted
the same URL, we have three choices for how to assign the
corresponding influence: first, we can assign full credit to the
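The attribution rule above (B is influenced by a friend A who posted the same URL earlier) can be sketched in code. The excerpt is cut off before it lists the three choices for the multi-friend case, so the three policies below ("first", "last", "split") are illustrative assumptions rather than a restatement of the paper, and all function names are hypothetical.

```python
def assign_parents(posts, friends, policy="last"):
    """Attribute each repost of one URL to earlier-posting friends.

    posts: list of (time, user) pairs for a single URL.
    friends: dict user -> set of users that user follows.
    policy: "first", "last", or "split" (assumed options, see lead-in).
    Returns dict user -> {friend: credit weight}, in posting order.
    """
    posted_at = {}
    parents = {}
    for time, user in sorted(posts):
        if user in posted_at:
            continue  # only the first post of each user counts
        earlier = [f for f in friends.get(user, ())
                   if f in posted_at and posted_at[f] < time]
        if not earlier:
            chosen = []  # no earlier-posting friend: user is a seed
        elif policy == "first":
            chosen = [min(earlier, key=posted_at.get)]
        elif policy == "last":
            chosen = [max(earlier, key=posted_at.get)]
        elif policy == "split":
            chosen = earlier
        else:
            raise ValueError("unknown policy: " + policy)
        parents[user] = {f: 1.0 / len(chosen) for f in chosen}
        posted_at[user] = time
    return parents


def cascade_influence(parents):
    """Total (possibly fractional) number of downstream reposts per user."""
    influence = {user: 0.0 for user in parents}
    for user in reversed(list(parents)):  # latest posters first
        for friend, weight in parents[user].items():
            influence[friend] += weight * (1.0 + influence[user])
    return influence


# Toy example (made-up): B follows A; C follows both A and B.
toy_friends = {"B": {"A"}, "C": {"A", "B"}}
toy_posts = [(1, "A"), (2, "B"), (3, "C")]
results = {p: cascade_influence(assign_parents(toy_posts, toy_friends, p))
           for p in ("first", "last", "split")}
print(results)
```

In the toy example the seed A is credited with both reposts under every policy; the policies differ only in how much of C's repost is also credited to B.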
the urls the follower graph
cascades
Figure 3: Examples of information cascades on Twitter.
[…] there are many reasons why individuals may choose to pass along information other than the number and identity of the individuals from whom they received it, in particular the nature of the content itself. Moreover, influencing another individual to pass along a piece of information does not […]
[Figure 4 plots; panel (a) Cascade Sizes, axes: size vs. density, both logarithmic; panel (b) Cascade Depths, axes: depth vs. frequency, logarithmic frequency]
Figure 4: (a). Frequency distribution of cascade sizes. (b). Distribution of cascade depths.
[…] we study size or depth, therefore, the implication is that most events do not spread at all, and even moderately sized cascades are extremely rare.
To identify consistently influential individuals, we aggregated all URL posts by user and computed individual-level influence as the logarithm of the average size of all cascades for which that user was a seed. We then fit a regression tree model [6], in which a greedy optimization process recursively partitions the feature space, resulting in a piecewise-constant function where the value in each partition is fit to the mean of the corresponding training data. An important […]
prediction task
build a model to predict influence
model features :
user attributes
# followers
# friends
# tweets
date of joining
past influence of seed users
average, minimum, and maximum total influence
average, minimum, and maximum local influence
(repeat study with additional content features)
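The model behind this prediction task is described in the paper excerpt as a regression tree: greedy recursive partitioning of the feature space with a constant (the training mean) in each partition. A minimal sketch of that idea follows. The feature rows, target values, and function names are made-up illustrations; this is not the authors' implementation or their data.

```python
def _sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(rows, targets, max_depth=3, min_leaf=1):
    """Greedy recursive partitioning; returns a nested dict."""
    leaf = {"value": sum(targets) / len(targets)}
    if max_depth == 0 or len(rows) < 2 * min_leaf:
        return leaf
    best = None  # (squared error, feature index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, targets) if row[f] < threshold]
            right = [y for row, y in zip(rows, targets) if row[f] >= threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            error = _sse(left) + _sse(right)
            if best is None or error < best[0]:
                best = (error, f, threshold)
    if best is None or best[0] >= _sse(targets):
        return leaf  # no split reduces the error
    _, f, threshold = best
    left_idx = [i for i, row in enumerate(rows) if row[f] < threshold]
    right_idx = [i for i, row in enumerate(rows) if row[f] >= threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": fit_tree([rows[i] for i in left_idx],
                         [targets[i] for i in left_idx],
                         max_depth - 1, min_leaf),
        "right": fit_tree([rows[i] for i in right_idx],
                          [targets[i] for i in right_idx],
                          max_depth - 1, min_leaf),
    }


def predict(tree, row):
    """Follow the left child when the node condition (feature < threshold) holds."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Made-up rows: [log10(# followers), past local influence]; made-up targets.
toy_rows = [[1.0, 0.0], [1.5, 0.1], [3.0, 0.8], [3.5, 0.9]]
toy_targets = [0.0, 0.0, 1.0, 1.0]
toy_tree = fit_tree(toy_rows, toy_targets)
print(toy_tree["feature"], toy_tree["threshold"],
      predict(toy_tree, [1.2, 0.05]), predict(toy_tree, [4.0, 0.5]))
```

Each leaf predicts the mean of its training targets, which is why such a model can be right on average while still being unreliable for any individual user.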
prediction task — results
# followers and past influence are important features
individuals who have been influential in the past and who have
many followers are more likely to be influential in the future
however, this is correct only on average
predictor features are necessary but not sufficient
cannot really predict who will initiate a cascade
advertisers need a diverse portfolio of users to target
prediction task — results
[Figure 5 plot: regression tree whose visible split conditions are thresholds on log10(pastLocalInfluence + 1); leaf values range from 0.0124 to 0.9854]
Figure 5: Regression tree fit for one of the five cross-validation folds. Leaf nodes give the predicted influence for the corresponding partition, where the left (right) child is followed if the node condition is satisfied (violated).
[Figure 6 plots; panel (a) All users, panel (b) Top 25 users; axes: Past Local Influence vs. Followers, both logarithmic; labelled accounts in panel (b) include BarackObama, cnnbrk, stephenfry, britneyspears, nprnews]
Figure 6: Influence as a function of past local influence and number of followers for (a) all users and (b) users with the top 25 actual influence. Each circle represents a single seed user, where the size of the circle represents that user’s actual average influence.
[…] than others (e.g. news articles of specialized interest), or […] First, we filtered URLs that we knew to be spam or in a lan[…]
[Figure 7 plot; axes: Predicted Influence vs. Actual Influence]
Figure 7: Actual vs. predicted influence for regression tree. The model assigns each seed user to a leaf in the regression tree. Points representing the average actual influence values are placed at the pre[…]
Duncan Watts’ youtube video
question to study : can we identify influential users in twitter?
data : large twitter dataset over two months (proprietary)
methodology : prediction and analysis of a regression task
impact/interestingness : interesting question, potentially very high impact for advertisers
reproducibility : dataset not publicly available
summary
Everyone is an influencer: quantifying influence on twitter
Bakshy, Hofman, Mason, Watts
Everyone is an influencer: quantifying influence on twitter
Bakshy, Hofman, Mason, Watts
originality : 1 2 3 4 5 (1 = low, 5 = high)
impact : 1 2 3 4 5 (1 = low, 5 = high)
rigorousness / technical novelty : 1 2 3 4 5 (1 = low, 5 = high)
reproducibility : 1 2 3 4 5 (1 = low, 5 = high)
Coevolution of Network Structure and Content
Chun-Yuen Teng (School of Information, University of Michigan, Ann Arbor, MI 48109), chunyuen@umich.edu
Liuling Gong (School of Information, University of Michigan, Ann Arbor, MI 48109), llgong@umich.edu
Avishay Livne (EECS, University of Michigan, Ann Arbor, MI 48109), avishay@umich.edu
Celso Brunetti (Carey Business School, Johns Hopkins, Baltimore, MD 21202), celsob@jhu.edu
Lada Adamic (School of Information, University of Michigan, Ann Arbor, MI 48109), ladamic@umich.edu
ABSTRACT
As individuals communicate, their exchanges form a dynamic network. We demonstrate, using time series analysis of communication in three online settings, that network structure alone can be highly revealing of the diversity and novelty of the information being communicated. Our approach uses both standard and novel network metrics to characterize how unexpected a network configuration is, and to capture a network’s ability to conduct information. We find that networks with a higher conductance in link structure exhibit higher information entropy, while unexpected network configurations can be tied to information novelty. We use a simulation model to explain the observed correspondence between the evolution of a network’s structure and the information it carries.
Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral Sciences; H.2.8 [Database Applications]: Data Mining
General Terms
Measurement, Human Factors
Keywords
social media, information networks, network evolution
[…] adoption of ideas and behavior [28, 6, 3], convergence of opinion [5], or the speed and extent of innovation [14].
In practice, networks are rarely static, unless one considers only the strongest and most stable ties [7] or experimentally dictates the network topology to be fixed [6]. However, even stable ties transfer information at different rates [25, 13, 21], and a portion of information flow occurs outside of established social ties [4]. New ties are also induced by information flow, e.g. a Pakistani Twitter user who inadvertently live-tweeted the Bin Laden assassination quickly gained tens of thousands of new followers on Twitter. This points to a need to approach the relationship between network structure and information content in a substantively different way.
In this paper, rather than treating the network structure as static, we specifically use its dynamic nature to infer two properties of the information being communicated through the network. The first is the diversity of the information; whether everyone is talking about the same topic or whether one is observing many disparate conversation topics being discussed. The second is the novelty of the information; whether individuals in the network are continuing to talk about the same topic they talked about in the previous time period, or whether new topics have arisen that are different from what has been discussed before. For example, one could imagine oneself at a dinner party, where most conversations are out of earshot, but one can easily observe who is conversing with whom. While individuals are milling about […]
arXiv:1107.5543v2 [cs.SI] 21 May 2012
content vs. structure
questions :
understand the interplay between content and structure
what is said in the network vs. how the information spreads
more concretely :
can the network structure tell what people talk about?
are they talking about the same thing, or do they gossip?
is what people talk about novel?
content vs. structure
What’s different here
We look at network dynamics at relatively short time scales and construct time series
A range of network metrics, instead of just community structure
Information novelty and diversity as opposed to tracking single events / pieces of information
big news! / virus epidemic / weather is horrible today
content vs. structure — methodology
extract features that capture network structure
# vertices, # edges, avg degree, degree correlations, …
conductance (does information flow along many paths?)
expectedness of conversation (have I seen this edge before?)
extract features that characterize content diversity and novelty
correlation analysis between structure and content features
analysis on 3 datasets : twitter, virtual game, enron email network
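Two of the features above can be illustrated with small functions. The slides do not give the paper's exact definitions, so this is a sketch under assumptions: content diversity is taken to be the Shannon entropy of the tokens seen in a time window, and expectedness is taken to be the fraction of a window's edges already seen in earlier windows. Both function names are hypothetical.

```python
import math
from collections import Counter


def content_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution in one window."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def edge_expectedness(windows):
    """For each time window, the fraction of its edges seen in earlier windows.

    windows: list of iterables of (sender, receiver) pairs, in time order.
    """
    seen = set()
    fractions = []
    for window in windows:
        edges = set(window)
        fractions.append(len(edges & seen) / len(edges) if edges else 0.0)
        seen |= edges
    return fractions


# Made-up toy data for illustration.
toy_entropy = content_entropy(["news", "news", "weather", "weather"])
toy_expectedness = edge_expectedness([
    [("a", "b")],
    [("a", "b"), ("b", "c")],
    [("b", "c")],
])
print(toy_entropy, toy_expectedness)
```

Computing one value per time window for each feature yields the time series that the correlation analysis then compares.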
content vs. structure — findings
“simple” structure features are not correlated with content
diversity and novelty
conductance correlates with content diversity
expectedness correlates with content novelty
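A statement such as "conductance correlates with content diversity" compares two time series, one structural and one content-based. A minimal sketch of such a check using the Pearson correlation coefficient; the choice of Pearson is an assumption, since the slides only say "correlation analysis", and the numbers below are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Made-up per-window values: a structure feature and a content feature.
structure_series = [0.2, 0.4, 0.6, 0.8]
content_series = [1.0, 2.0, 3.0, 4.0]
r_same = pearson(structure_series, content_series)
r_opposite = pearson(structure_series, list(reversed(content_series)))
print(r_same, r_opposite)
```

A value near 1 or -1 indicates a strong linear relationship between the two series; a value near 0 would match the finding that "simple" structure features are not correlated with content.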
question to study : interplay between structure and content
data : three datasets
methodology : feature extraction and correlation analysis
impact/interestingness : interesting question, potentially very
high impact
reproducibility : some datasets publicly available
summary
Coevolution of network structure and content
Teng, Gong, Livne, Brunetti, and Adamic
Coevolution of network structure and content
Teng, Gong, Livne, Brunetti, and Adamic
originality : 1 2 3 4 5 (1 = low, 5 = high)
impact : 1 2 3 4 5 (1 = low, 5 = high)
rigorousness / technical novelty : 1 2 3 4 5 (1 = low, 5 = high)
reproducibility : 1 2 3 4 5 (1 = low, 5 = high)
what is next?
continue literature review (next week, Michael)
meanwhile…
keep thinking about project ideas
browse papers
the ones in Noppa
main conferences : ICWSM, WSDM, WWW
talk to your colleagues
talk to your instructors

Más contenido relacionado

La actualidad más candente

Making More Sense Out of Social Data
Making More Sense Out of Social DataMaking More Sense Out of Social Data
Making More Sense Out of Social DataThe Open University
 
2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media snaMarc Smith
 
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Xiaohan Zeng
 
Social network analysis intro part I
Mining the Social Web - Lecture 2 - T61.6020

  • 1. mining the social web. Aris Gionis, Michael Mathioudakis. Mon, Feb 2, lecture #2: structure and dynamics of social networks
  • 2. T-61.6020: Mining the social web, lecture #2. class web page in Piazza: https://piazza.com/aalto.fi/spring2015/t616020/home
       share resources and also use it as a discussion forum
       sensible posts:
         looking for a project mate
         looking for a project mate on idea X
         does anyone know how to access data Y?
         has anyone seen some analysis on data Z?
         … or just anything else
  • 3. today’s themes: analysis of the structure and dynamics of social networks
       what do social networks look like?
       how do social networks evolve over time?
       how do people in social networks behave and interact?
       how does information spread in social networks and social media?
       who is influential?
       what is the interplay between structure and content?
  • 4. objectives in today’s presentation
       focus on one particular topic
       review some “classic” papers in the literature
       ideas for projects
       assess the presented papers:
         what is the main idea?
         what is the novelty?
         why did they have impact?
  • 5. criteria to evaluate the research projects
       originality (has it been done before?)
       potential impact (how interesting is it, and why?)
       rigor and technical novelty
       reproducibility
       presentation
  • 6. structure of social networks
       social networks and social-media data can be represented as graphs (or networks)
       what do these graphs look like? what is their structure?
       the data contain additional information (actions, interactions, dynamics, attributes, …)
       mining this additional information as part of the network structure
  • 7. contrast against random graphs
       random graph model by Erdős-Rényi: edges independently drawn with probability p
       real-world networks do not look like random graphs
       also, random graphs are static

                           | degree distribution | hubs | triangle coefficient | clusters | diameter | giant component
       random graphs       | binomial            | no   | no                   | no       | small    | yes
       real-world networks | power law           | yes  | yes                  | yes      | small    | yes
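The "edges independently drawn with probability p" rule on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names `erdos_renyi` and `mean_degree`, and the parameter values n = 500 and p = 0.02, are assumptions chosen for the example.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return an adjacency dict for G(n, p): each of the n*(n-1)/2
    possible edges is included independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def mean_degree(adj):
    """Average number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


# Hypothetical parameters, chosen only for illustration.
graph = erdos_renyi(n=500, p=0.02, seed=1)

# Each degree is Binomial(n - 1, p), so the expected degree is (n - 1) * p = 9.98.
# The maximum degree stays close to the mean: there are no hubs.
print(mean_degree(graph), max(len(nbrs) for nbrs in graph.values()))
```

Because every degree follows the same binomial distribution, degrees concentrate around the mean, which matches the "binomial / no hubs" row of the table above.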
  • 8. graph generation models
       a large number of graph-generation models have been proposed:
         preferential-attachment model
         copy model
         Watts-Strogatz model
       each typically tries to capture some property of the data
       beyond the scope of this class and the project
  • 9. arXiv:0810.1355v1 [cs.DS] 8 Oct 2008. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, Michael W. Mahoney.

       Abstract. A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community or a cluster should be thought of as a set of nodes that has more and/or better connections between its members than to the remainder of the network. In this paper, we explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. Rather than defining a procedure to extract sets of nodes from a graph and then attempting to interpret these sets as “real” communities, we employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the “best” possible community, according to the conductance measure, over a wide range of size scales. We study over 100 large real-world networks, ranging from traditional and on-line social networks, to technological and information networks and web graphs, and ranging in size from thousands up to tens of millions of nodes. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. Our observations agree with previous work on small networks, but we show that large networks have a very different structure.

       In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales (up to ≈ 100 nodes); and communities of size scale beyond ≈ 100 nodes gradually “blend into” the expander-like core of the network and thus become less “community-like,” with a roughly inverse relationship between community size and optimal community quality. This observation agrees well with the so-called Dunbar number, which gives a limit to the size of a well-functioning community. However, this behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as testbeds of community detection algorithms. The relatively gradual increase of the network community profile plot as a function of increasing community size depends in a subtle manner on the way in which local clustering information is propagated from smaller to larger size scales in the network. We have found that a generative graph model, in which new edges are added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.
  • 10. community structure in social networks
       hypothesis: social networks have well-formed communities
       loose definition of community: a set of vertices densely connected to each other and sparsely connected to the rest of the graph
       artificial communities: http://projects.skewed.de/graph-tool/
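One common way to make the loose definition above precise is the conductance measure named in the abstract on slide 9. The notation below is an assumption made for this sketch: a graph G = (V, E), a candidate community S ⊂ V, and d(u) for the degree of node u.

```latex
\phi(S) \;=\; \frac{\bigl|\{(u,v) \in E : u \in S,\ v \notin S\}\bigr|}
                   {\min\bigl(\mathrm{vol}(S),\ \mathrm{vol}(V \setminus S)\bigr)},
\qquad
\mathrm{vol}(S) \;=\; \sum_{u \in S} d(u)
```

A low value of φ(S) means few edges leave S relative to the total degree inside it, i.e. S is "densely connected to each other and sparsely connected to the rest of the graph".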
  • 11. community structure in social networks
       study community structure in an extensive collection of real-world networks
       the authors introduce the network community profile (NCP) plot
       the NCP plot characterizes the best possible community over a range of scales
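To make the NCP idea concrete, here is a minimal brute-force sketch on a toy graph. It is not the authors' method: the paper uses approximation algorithms for graph partitioning, whereas this example enumerates every subset, which is only feasible for a handful of nodes. The function names, the toy graph (two triangles joined by one edge), and the restriction to sizes k ≤ n/2 are assumptions made for illustration.

```python
from itertools import combinations


def conductance(adj, nodes):
    """Cut edges leaving `nodes`, divided by the smaller of the two volumes.
    Assumes the graph has no isolated nodes, so the denominator is non-zero."""
    s = set(nodes)
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    return cut / min(vol_s, vol_rest)


def ncp(adj):
    """Map each size k (1 <= k <= n/2) to the lowest conductance of any k-node set.
    Exhaustive search: exponential in n, for toy graphs only."""
    n = len(adj)
    return {
        k: min(conductance(adj, subset) for subset in combinations(adj, k))
        for k in range(1, n // 2 + 1)
    }


# Toy graph (an assumption for illustration): two triangles joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

for k, phi in ncp(adj).items():
    print(k, round(phi, 3))
# prints:
# 1 1.0
# 2 0.5
# 3 0.143
```

The profile dips sharply at k = 3, the size of one triangle, which is the kind of signature a well-separated community leaves in an NCP plot.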
  • 12. community structure in social networks: the dolphins network and its NCP (source: [Leskovec et al., 2009]; slide from Frieze, Gionis, Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs)
  • 13. community structure in social networks: NCP of a DBLP co-authorship graph (source: [Leskovec et al., 2009])
       do large-scale real-world networks have such nice artificial structure? NO!
  • 14. datasets: network datasets analyzed in Leskovec, Lang, Dasgupta, and Mahoney, Community structure in large networks; publicly available in SNAP
    Table 1: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction of nodes not in whiskers (size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree d̄ = 2E/N; second order average degree d̃; average clustering coefficient C̄; diameter D; and average path length D̄.
    Columns: Network, N, E, Nb/N, Eb/E, d̄, d̃, C̄, D, D̄, Description
    Social networks
    Delicious 147,567 301,921 0.40 0.65 4.09 48.44 0.30 24 6.28 del.icio.us collaborative tagging social network
    Epinions 75,877 405,739 0.48 0.90 10.69 183.88 0.26 15 4.27 Who-trusts-whom network from epinions.com [142]
    Flickr 404,733 2,110,078 0.33 0.86 10.43 442.75 0.40 18 5.42 Flickr photo sharing social network [101]
    LinkedIn 6,946,668 30,507,070 0.47 0.88 8.78 351.66 0.23 23 5.43 Social network of professional contacts
    LiveJournal01 3,766,521 30,629,297 0.78 0.97 16.26 111.24 0.36 23 5.55 Friendship network of a blogging community [20]
    LiveJournal11 4,145,160 34,469,135 0.77 0.97 16.63 122.44 0.36 23 5.61 Friendship network of a blogging community [20]
    LiveJournal12 4,843,953 42,845,684 0.76 0.97 17.69 170.66 0.35 20 5.53 Friendship network of a blogging community [20]
    Messenger 1,878,736 4,079,161 0.53 0.78 4.34 15.40 0.09 26 7.42 Instant messenger social network
    Email-All 234,352 383,111 0.18 0.50 3.27 576.87 0.50 14 4.07 Research organization email network (all addresses) [113]
    Email-InOut 37,803 114,199 0.47 0.82 6.04 165.73 0.58 8 3.74 (all addresses but email has to be sent both ways) [113]
    Email-Inside 986 16,064 0.90 0.99 32.58 74.66 0.45 7 2.60 (only emails inside the research organization) [113]
    Email-Enron 33,696 180,811 0.61 0.90 10.73 142.36 0.71 13 3.99 Enron email dataset [100]
    Answers 488,484 1,240,189 0.45 0.78 5.08 251.78 0.11 22 5.72 Yahoo Answers social network
    Answers-1 26,971 91,812 0.56 0.87 6.81 59.17 0.08 16 4.49 Cluster 1 from Yahoo Answers
    Answers-2 25,431 65,551 0.48 0.80 5.16 56.57 0.10 15 4.76 Cluster 2 from Yahoo Answers
    Answers-3 45,122 165,648 0.53 0.87 7.34 417.83 0.21 15 3.94 Cluster 3 from Yahoo Answers
    Answers-4 93,971 266,199 0.49 0.82 5.67 94.48 0.08 16 4.91 Cluster 4 from Yahoo Answers
    Answers-5 5,313 11,528 0.41 0.73 4.34 29.55 0.12 14 4.75 Cluster 5 from Yahoo Answers
    Answers-6 290,351 613,237 0.40 0.71 4.22 57.16 0.09 22 5.92 Cluster 6 from Yahoo Answers
    Information (citation) networks
    Cit-Patents 3,764,105 16,511,682 0.82 0.96 8.77 21.34 0.09 26 8.15 Citation network of all US patents [112]
    Cit-hep-ph 34,401 420,784 0.96 1.00 24.46 63.50 0.30 14 4.33 Citations between physics (arxiv hep-ph) papers [78]
    Cit-hep-th 27,400 352,021 0.94 0.99 25.69 106.40 0.33 15 4.20 Citations between physics (arxiv hep-th) papers [78]
    Blog-nat05-6m 29,150 182,212 0.74 0.96 12.50 342.51 0.24 10 3.40 Blog citation network (6 months of data) [116]
    Blog-nat06all 32,384 315,713 0.87 0.99 19.50 153.08 0.20 18 3.94 Blog citation network (1 year of data) [116]
    Post-nat05-6m 238,305 297,338 0.21 0.34 2.50 39.51 0.13 45 10.34 Blog post citation network (6 months) [116]
    Post-nat06all 437,305 565,072 0.22 0.38 2.58 35.54 0.11 54 10.48 Blog post citation network (1 year) [116]
    Collaboration networks
    AtA-IMDB 883,963 27,473,042 0.87 0.99 62.16 517.40 0.79 15 3.48 IMDB actor collaboration network from Dec 2007
    CA-astro-ph 17,903 196,972 0.89 0.98 22.00 65.70 0.67 14 4.21 Co-authorship in astro-ph of arxiv.org [112]
    CA-cond-mat 21,363 91,286 0.81 0.93 8.55 22.47 0.70 15 5.36 Co-authorship in cond-mat category [112]
    CA-gr-qc 4,158 13,422 0.64 0.78 6.46 17.98 0.66 17 6.10 Co-authorship in gr-qc category [112]
    CA-hep-ph 11,204 117,619 0.81 0.97 21.00 130.88 0.69 13 4.71 Co-authorship in hep-ph category [112]
    CA-hep-th 8,638 24,806 0.68 0.85 5.74 12.99 0.58 18 5.96 Co-authorship in hep-th category [112]
    CA-DBLP 317,080 1,049,866 0.67 0.84 6.62 21.75 0.73 23 6.75 DBLP co-authorship network [20]
    Table 2: Network datasets we analyzed (same statistics and columns as Table 1).
    Web graphs
    Web-BerkStan 319,717 1,542,940 0.57 0.88 9.65 1,067.55 0.32 35 5.66 Web graph of Stanford and UC Berkeley [98]
    Web-Google 855,802 4,291,352 0.75 0.92 10.03 170.35 0.62 24 6.27 Web graph Google released in 2002 [3]
    Web-Notredame 325,729 1,090,108 0.41 0.76 6.69 280.68 0.47 46 7.22 Web graph of University of Notre Dame [11]
    Web-Trec 1,458,316 6,225,033 0.59 0.78 8.54 682.89 0.68 112 8.58 Web graph of TREC WT10G web corpus [2]
    Internet networks
    As-RouteViews 6,474 12,572 0.62 0.80 3.88 164.81 0.40 9 3.72 AS from Oregon Exchange BGP Route View [112]
    As-Caida 26,389 52,861 0.61 0.81 4.01 281.93 0.33 17 3.86 CAIDA AS Relationships Dataset
    As-Skitter 1,719,037 12,814,089 0.99 1.00 14.91 9,934.01 0.17 5 3.44 AS from traceroutes run daily in 2005 by Skitter
    As-Newman 22,963 48,436 0.65 0.83 4.22 261.46 0.35 11 3.83 AS graph from Newman [5]
    As-Oregon 13,579 37,448 0.72 0.90 5.52 235.97 0.46 9 3.58 Autonomous systems [1]
    Gnutella-25 22,663 54,693 0.59 0.83 4.83 10.75 0.01 11 5.57 Gnutella network on March 25 2000 [143]
    Gnutella-30 36,646 88,303 0.55 0.81 4.82 11.46 0.01 11 5.75 Gnutella P2P network on March 30 2000 [143]
    Gnutella-31 62,561 147,878 0.54 0.81 4.73 11.60 0.01 11 5.94 Gnutella network on March 31 2000 [143]
    eDonkey 5,792,297 147,829,887 0.93 1.00 51.04 6,139.99 0.08 5 3.66 P2P eDonkey graph for a period of 47 hours in 2004
    Bi-partite networks
    IpTraffic 2,250,498 21,643,497 1.00 1.00 19.23 94,889.05 0.00 5 2.53 IP traffic graph a single router for 24 hours
    AtP-astro-ph 54,498 131,123 0.70 0.87 4.81 16.67 0.00 28 7.78 Authors-to-papers network of astro-ph [116]
    AtP-cond-mat 57,552 104,179 0.65 0.79 3.62 10.54 0.00 31 9.96 Authors-to-papers network of cond-mat [116]
    AtP-gr-qc 14,832 22,266 0.47 0.60 3.00 9.72 0.00 35 11.08 Authors-to-papers network of gr-qc [116]
    AtP-hep-ph 47,832 86,434 0.60 0.76 3.61 16.80 0.00 27 8.55 Authors-to-papers network of hep-ph [116]
    AtP-hep-th 39,986 64,154 0.53 0.68 3.21 13.07 0.00 36 10.74 Authors-to-papers network of hep-th [116]
    AtP-DBLP 615,678 944,456 0.49 0.64 3.07 13.61 0.00 48 12.69 DBLP authors-to-papers bipartite network
    Spending 1,831,540 2,918,920 0.34 0.58 3.19 1,536.35 0.00 26 5.62 Users-to-keywords they bid
    Hw7 653,260 2,278,448 0.99 0.99 6.98 346.85 0.00 24 6.26 Downsampled advertiser-query bid graph
    Netflix 497,959 100,480,507 1.00 1.00 403.57 28,432.89 0.00 5 2.31 Users-to-movies they rated. From Netflix prize [4]
    QueryTerms 13,805,808 17,498,668 0.28 0.41 2.53 14.92 0.00 86 19.81 Users-to-queries they submit to a search engine
    Clickstream 199,308 951,649 0.39 0.87 9.55 430.74 0.00 7 3.83 Users-to-URLs they visited [126]
    Biological networks
    Bio-Proteins 4,626 14,801 0.72 0.91 6.40 24.25 0.12 12 4.24 Yeast protein interaction network [51]
    Bio-Yeast 1,458 1,948 0.37 0.51 2.67 7.13 0.14 19 6.89 Yeast protein interaction network data [92]
    Bio-YeastP0.001 353 1,517 0.73 0.93 8.59 20.18 0.57 11 4.33 Yeast protein-protein interaction map [135]
    Bio-YeastP0.01 1,266 8,511 0.79 0.97 13.45 47.73 0.44 12 3.87 Yeast protein-protein interaction map [135]
    Third table (same columns)
    Nearly low-dimensional networks
    Road-CA 1,957,027 2,760,388 0.80 0.85 2.82 3.17 0.06 865 310.97 California road network
    Road-USA 126,146 161,950 0.97 0.98 2.57 2.81 0.03 617 218.55 USA road network (only main roads)
    Road-PA 1,087,562 1,541,514 0.79 0.85 2.83 3.20 0.06 794 306.89 Pennsylvania road network
    Road-TX 1,351,137 1,879,201 0.78 0.84 2.78 3.15 0.06 1,064 418.73 Texas road network
    PowerGrid 4,941 6,594 0.62 0.69 2.67 3.87 0.11 46 19.07 Power grid of Western States Power Grid [156]
    Mani-faces7k 696 6,979 0.98 0.99 20.05 37.99 0.56 16 5.52 Faces (64x64 grayscale images) (connect 7k closest pairs)
    Mani-faces4k 663 3,465 0.90 0.97 10.45 20.20 0.56 29 8.96 Faces (connect 4k closest pairs)
    Mani-faces2k 551 1,981 0.84 0.94 7.19 12.77 0.54 32 11.07 Faces (connect 2k closest pairs)
    Mani-facesK10 698 6,935 1.00 1.00 19.87 25.32 0.51 6 3.25 Faces (connect every node to 10 nearest neighbors)
    Mani-facesK3 698 2,091 1.00 1.00 5.99 7.98 0.45 9 4.89 Faces (connect every node to 3 nearest neighbors)
    Mani-facesK5 698 3,480 1.00 1.00 9.97 12.91 0.48 7 4.03 Faces (connect every node to 5 nearest neighbors)
    Mani-swiss200k 20,000 200,000 1.00 1.00 20.00 21.08 0.59 103 37.21 Swiss-roll (connect 200k nearest pairs of nodes)
    Mani-swiss100k 19,990 99,979 1.00 1.00 10.00 11.02 0.59 162 58.32 Swiss-roll (connect 100k nearest pairs of nodes)
    Mani-swiss60k 19,042 57,747 0.93 0.96 6.07 7.03 0.59 243 89.15 Swiss-roll (connect 60k nearest pairs of nodes)
    Mani-swissK10 20,000 199,955 1.00 1.00 20.00 25.38 0.56 10 5.47 Swiss-roll (every node connects to 10 nearest neighbors)
    Mani-swissK5 20,000 99,990 1.00 1.00 10.00 12.89 0.54 13 8.34 Swiss-roll (every node connects to 5 nearest neighbors)
    Mani-swissK3 20,000 59,997 1.00 1.00 6.00 7.88 0.50 17 6.89 Swiss-roll (every node connects to 3 nearest neighbors)
    IMDB Actor-to-Movie graphs
    AtM-IMDB 2,076,978 5,847,693 0.49 0.82 5.63 65.41 0.00 32 6.82 Actors-to-movies graph from IMDB (imdb.com)
    Imdb-top30 198,430 566,756 0.99 1.00 5.71 18.19 0.00 26 8.32 Actors-to-movies graph heavily preprocessed
    Imdb-raw07 601,481 1,320,616 0.54 0.79 4.39 20.94 0.00 32 8.55 Country clusters were extracted from this graph
    Imdb-France 35,827 74,201 0.51 0.76 4.14 14.62 0.00 20 6.57 Cluster of French movies
    Imdb-Germany 21,258 42,197 0.56 0.78 3.97 13.69 0.00 34 7.47 German movies (to actors that played in them)
  • 15. community structure in social networks: main findings. 1. up to a certain size k (k ∼ 100 vertices) there exist good cuts; as the size increases, so does the quality of the community. 2. at size k we observe the best possible community; such communities are typically connected to the remainder of the graph by a single edge. 3. above size k the community quality decreases, because the communities blend in and gradually disappear
  • 16. summary: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters (Leskovec, Lang, Dasgupta, Mahoney). hypothesis: well-formed and interesting, assumed true. data: very extensive collection. methodology: introduces a new metric (NCP). impact / interestingness: challenged the starting hypothesis. reproducibility: datasets and code publicly available
  • 17. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters (Leskovec, Lang, Dasgupta, Mahoney). rate each criterion from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 18. arXiv:physics/0603229v3 [physics.soc-ph] 28 Jan 2007. Graph Evolution: Densification and Shrinking Diameters. Jure Leskovec, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA; Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca, NY; Christos Faloutsos, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. February 2, 2008. Abstract: How do real graphs evolve over time? What are “normal” growth patterns in social, technological, and information networks? Many studies have discovered patterns in static graphs, identifying properties in a single snapshot of a large network, or in a very small number of snapshots; these include heavy tails for in- and out-degree distributions, communities, small-world phenomena, and others. However, given the lack of information about network evolution over long periods, it has been hard to convert these findings into statements about trends over time. Here we study a wide range of real graphs, and we observe some surprising phenomena. First, most of these graphs densify over time, with the number of edges growing super-linearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes (like O(log n) or O(log(log n))). Existing graph generation models do not exhibit these types of behavior, even at a qualitative level. We provide a new graph generator, based on a “forest fire” spreading process, that has a simple, intuitive justification, requires very few parameters (like the “flammability” of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the […]
  • 19. graph evolution and shrinking diameters. networks evolve over time: typically new vertices and edges are added (not many deletions). how do network distances change over time? constant average degree plus vertex addition implies diameter = O(log n), i.e., slowly increasing, according to the random-graph model and also according to other more “realistic” models, e.g., preferential attachment
  • 20. graph evolution and shrinking diameters. empirical observation: as networks evolve, distances shrink (e.g., the diameter shrinks). why? the number of edges grows faster than the number of vertices, so the graph becomes denser (graph densification). time-evolving networks, J. Leskovec, J. Kleinberg, C. Faloutsos [Leskovec et al., 2005b]: densification power law: $|E_t| \propto |V_t|^{\alpha}$ with $1 \le \alpha \le 2$; shrinking diameters: the diameter shrinks over time
  • 21. graph evolution and shrinking diameters: average degree. [Figure 1: The average node out-degree over time. Notice that it increases, in all 4 datasets. That is, all graphs are densifying. Panels: (a) arXiv, (b) Patents, (c) Autonomous Systems, (d) Affiliation network. y-axis: average out-degree; x-axis: year of publication (a, d), year granted (b), time in days (c).]
  • 22. graph evolution and shrinking diameters: number of edges. [Figure 2: Number of edges e(t) versus number of nodes n(t), in log-log scales, for several graphs. All 4 graphs obey the Densification Power Law, with a consistently good fit. Slopes: a = 1.68, 1.66, 1.18, 1.15, 1.12, and 1.11 respectively. Fits printed in the panels, with x the number of nodes: (a) arXiv: Edges = 0.0113 x^1.69, R² = 1.0; (b) Patents: Edges = 0.0002 x^1.66, R² = 0.99; (c) Autonomous Systems: Edges = 0.87 x^1.18, R² = 1.00; (d) Affiliation network: Edges = 0.4255 x^1.15, R² = 1.0; (e) Email network: Edges = 1 x^1.12, R² = 1.00; (f) IMDB actors to movies network: Edges = 0.9 x^1.11, R² = 0.98.]
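A densification exponent of the kind reported in these fits can be estimated by ordinary least squares on log-transformed counts. The sketch below is one plausible way to do it, not necessarily the authors' procedure; the snapshot sizes are synthetic and made up purely for illustration.

```python
import math


def fit_power_law(nodes, edges):
    """Least-squares fit of log(edges) = log(c) + alpha * log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope in log-log space
    c = math.exp(mean_y - alpha * mean_x)  # multiplicative constant
    return alpha, c


# Synthetic snapshots (not real data), generated from edges = 0.5 * nodes^1.5.
nodes = [100, 400, 1600, 6400]
edges = [0.5 * n ** 1.5 for n in nodes]
alpha, c = fit_power_law(nodes, edges)
print(round(alpha, 3), round(c, 3))
```

An exponent of 1 would mean constant average degree; an exponent above 1 means the average degree grows as the network grows, which is what densification refers to.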
  • 23. graph evolution and shrinking diameters: effective diameter. [Figure 3: The effective diameter over time for 6 different datasets. Notice consistent decrease of the diameter over time. Panels: (a) arXiv citation graph, (b) Affiliation network, (c) Patents citation graph, (d) Autonomous Systems (effective diameter versus number of nodes, with a linear fit), (e) Email network, (f) IMDB actors to movies network. Curves, where shown: full graph; post-cutoff subgraph; post-cutoff subgraph, no past.]
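The slides do not define "effective diameter". A common reading, assumed here, is the smallest distance d such that at least 90% of connected node pairs are within distance d. The sketch below computes an integer version of that quantity with breadth-first search from every node; any smoothing or interpolation the authors may apply is left out, and the two toy graphs are made up.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, q=0.9):
    """Smallest d such that a fraction q of connected ordered pairs is within d."""
    lengths = []
    for s in adj:
        lengths.extend(d for t, d in bfs_distances(adj, s).items() if t != s)
    if not lengths:
        return 0
    lengths.sort()
    return lengths[math.ceil(q * len(lengths)) - 1]


# Toy example: a path on 5 nodes, then the same path closed into a cycle.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
cycle = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
print(effective_diameter(path), effective_diameter(cycle))
```

Adding a single edge to the path lowers the value, a small-scale picture of how extra edges (densification) can shrink distances even when no nodes are removed.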
  • 24. graph evolution and shrinking diameters. theoretical justification: the authors propose a graph-evolution model, the forest fire model (FF), that explains the empirical findings (graph densification and shrinking diameters)
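The slides only name the forest fire model, so the following is a loose, simplified sketch rather than the authors' generator. It assumes an undirected graph and a single burning probability `p` (the published model is, as far as I recall, directed, with separate forward and backward burning parameters). Each new node picks a random existing "ambassador", links to it, and then recursively links to a geometrically distributed number of that node's neighbors.

```python
import random


def forest_fire_graph(n, p, seed=0):
    """Grow an undirected graph on n nodes with a simplified forest-fire rule.

    p must satisfy 0 <= p < 1; larger p means fires spread further.
    """
    rng = random.Random(seed)
    adj = {0: set()}
    for v in range(1, n):
        adj[v] = set()
        ambassador = rng.randrange(v)      # uniform over existing nodes
        visited = {v, ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            adj[v].add(w)                  # the new node links to every burned node
            adj[w].add(v)
            burn = 0                       # geometric count with mean p / (1 - p)
            while rng.random() < p:
                burn += 1
            candidates = sorted(u for u in adj[w] if u not in visited)
            rng.shuffle(candidates)
            for u in candidates[:burn]:
                visited.add(u)
                frontier.append(u)
    return adj


graph = forest_fire_graph(200, 0.35, seed=1)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), num_edges)
```

With `p = 0` every new node attaches only to its ambassador and the result is a tree; raising `p` makes each arrival add more edges. Whether this simplified variant reproduces the densification and shrinking-diameter patterns would have to be checked empirically.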
  • 25. summary: Graph evolution: densification and shrinking diameters (Leskovec, Kleinberg, Faloutsos). hypothesis: well-formed, assumed true (as a graph evolves, distances increase). data: extensive collection (how to collect evolving networks?). methodology: simple statistics, but never done before. impact / interestingness: challenged the hypothesis, interesting findings. reproducibility: datasets and code publicly available
  • 26. Graph evolution: densification and shrinking diameters (Leskovec, Kleinberg, Faloutsos). rate each criterion from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 27. Feedback Effects between Similarity and Social Influence in Online Communities. David Crandall, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, crandall@cs.cornell.edu; Dan Cosley, Dept. of Communication, Cornell University, Ithaca, NY 14853, drc44@cornell.edu; Daniel Huttenlocher, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, dph@cs.cornell.edu; Jon Kleinberg, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, kleinber@cs.cornell.edu; Siddharth Suri, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, ssuri@cs.cornell.edu. ABSTRACT: A fundamental open question in the analysis of social networks is to understand the interplay between similarity and social ties. People are similar to their neighbors in a social network for two distinct reasons: first, they grow to resemble their current friends due to social influence; and second, they tend to form new links to others who are already like them, a process often termed selection by sociologists. While both factors are present in everyday social processes, they are in tension: social influence can push systems toward uniformity of behavior, while selection can lead to fragmentation. As such, it is important to understand the relative effects of these forces, and this has been a challenge due to the difficulty of isolating and quantifying them in real settings. We develop techniques for identifying and modeling the interactions between social influence and selection, using data from online communities where both social interaction and changes in behavior over time can be measured. We find clear feedback effects between the two factors, with rising similarity between two individuals serving, in aggregate, as an indicator of future interaction, but with similarity then continuing to increase steadily, although at a slower rate, for […] the current activities of their friends, or of the people most similar to them?
Categories and Subject Descriptors: H.2.8 Database Management: Database Applications – Data Mining. General Terms: Measurement, Theory. Keywords: social networks, online communities, social influence. 1. INTRODUCTION. Social influence and selection. A fundamental property of social networks is that people tend to have attributes similar to those of their friends. There are two underlying reasons for this. First, the process of social influence [7] leads people to adopt behaviors exhibited by those they interact with; this effect is at work in many settings where new ideas diffuse by word-of-mouth or imitation through a network of people [19, 22]. A second, distinct reason is that people tend to form relationships with others who are already similar to them. This phenomenon, which is often termed selection, has a long history of study in sociology [13, 16]. The two forces of social influence and selection are both seen in a wide range of social settings: people decide to adopt […]
  • 28. similarity and social influence. observation: people are similar to their friends. selection or influence? questions: how does social interaction affect interests, and vice versa? can we use social similarity and interaction to predict future behavior?
  • 29. user interests and similarity between users. focus on wikipedia editors: who edits which page? a user's edits up to time t form a vector expressing the user's interests up to that time point. similarity between users: consider the similarity of two users who “meet” (one posts on the discussion page of the other). excerpt from the paper: […] use one of the more common measures, the cosine metric, $\mathrm{Cosine}(\vec{u}, \vec{v}) = \cos \theta_{\vec{u},\vec{v}} = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|_2 \, \|\vec{v}\|_2}$ (1), where $\|\vec{v}\|_2$ denotes the Euclidean norm of $\vec{v}$. While a comparison of similarity measures is not the focus of our current work, we have evaluated a wide range of measures for our purpose. We use the cosine metric here because it is independent of the rate at which people are editing […]
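Equation (1) is straightforward to compute on sparse edit-count vectors. The sketch below stores each user's vector as a mapping from page title to number of edits; the users and page titles are invented for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (page -> number of edits)."""
    dot = sum(count * v[page] for page, count in u.items() if page in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for users with no edits
    return dot / (norm_u * norm_v)


# Hypothetical editors: bob edits the same pages as alice, ten times as often.
alice = Counter({"Graph theory": 3, "Helsinki": 1})
bob = Counter({"Graph theory": 30, "Helsinki": 10})
carol = Counter({"Football": 5})

print(cosine(alice, bob), cosine(alice, carol))
```

Alice and Bob come out as (numerically) fully similar even though Bob edits ten times as much, which illustrates the rate-independence mentioned in the excerpt; Alice and Carol share no pages and get similarity 0.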
  • 30. user interests and similarity between users. main finding: [Figure 1: Average cosine similarity of user pairs as a function of the number of edits from time of first interaction, for Wikipedia.]
  • 31. user interests and similarity between users. possible explanation: a feedback loop between social influence and selection; similarity leads to interaction, which leads to further similarity. the authors propose a theoretical model to explain the findings (neighbors may affect actions and interactions)
  • 32. predicting future behavior based on user similarity and user interaction. [Figure 4: Probability of joining a community based on k exposure via social ties versus similarity ties for (a) Wikipedia and (b) LiveJournal. The solid black curves correspond to social ties and the dashed red curves to similarity ties. The error bars represent ±2 standard errors.] cropped text from the paper: solid black curves are drawn using neighbors in the social influence graph for each community, while the dashed red […] months apart. Many first edits close to t1 would suggest effects based on short-term processes, such as immediate […]
  • 33. summary: Feedback effects between similarity and social influence in online communities (Crandall et al.). question to study: interplay between influence and selection. data: wikipedia edits (creative but somewhat limited). methodology: simple statistics, theoretical model, prediction model. impact / interestingness: some interesting findings. reproducibility: datasets publicly available
  • 34. Feedback effects between similarity and social influence in online communities (Crandall et al.). rate each criterion from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 35. Meme-tracking and the Dynamics of the News Cycle. Jure Leskovec (Cornell University and Stanford University, jure@cs.stanford.edu), Lars Backstrom (Cornell University, lars@cs.cornell.edu), Jon Kleinberg (Cornell University, kleinber@cs.cornell.edu). ABSTRACT: Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days, the time scale at which we perceive news and events. We develop a framework for tracking short, distinctive phrases that travel relatively intact through on-line text; developing scalable algorithms for clustering textual variants of such phrases, we identify a broad class of memes that exhibit wide spread and rich variation on a daily basis. As our principal domain of study, we show how such a meme-tracking approach can provide a coherent representation of the news cycle: the daily rhythms in the news media that have long been the subject of qualitative interpretation but have never been captured accurately enough to permit actual quantitative analysis. We tracked 1.6 million mainstream media sites and blogs over a period of three months with the total of 90 million articles and we find a set of novel and persistent temporal patterns in the news cycle. In particular, we observe a typical lag of 2.5 hours between the peaks of attention to a phrase in the news media and in blogs respectively, with divergent behavior around the overall peak and a “heartbeat”-like pattern in the handoff between news and blogs. We also develop and analyze a mathematical model for the kinds of temporal variation that the system exhibits.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications, Data mining. General Terms: Algorithms; Experimentation. Keywords: Meme-tracking, Blogs, News media, News cycle, Information cascades, Information diffusion, Social networks. […] probabilistic term mixtures have been successful at identifying long-range trends in general topics over time [5, 7, 16, 17, 30, 31]. At the other extreme, identifying hyperlinks between blogs and extracting rare named entities has been used to track short information cascades through the blogosphere [3, 14, 20, 23]. However, between these two extremes lies much of the temporal and textual range over which propagation on the web and between people typically occurs, through the continuous interaction of news, blogs, and websites on a daily basis. Intuitively, short units of text, short phrases, and “memes” that act as signatures of topics and events propagate and diffuse over the web, from mainstream media to blogs, and vice versa. This is exactly the focus of our study here. Moreover, it is at this intermediate temporal and textual granularity of memes and phrases that people experience news and current events. A succession of story lines that evolve and compete for attention within a relatively stable set of broader topics collectively produces an effect that commentators refer to as the news cycle. Tracking dynamic information at this temporal and topical resolution has proved difficult, since the continuous appearance, growth, and decay of new story lines takes place without significant shifts in the overall vocabulary; in general, this process can also not be closely aligned with the appearance and disappearance of specific named entities (or hyperlinks) in the text.
As a result, while the dynamics of the news cycle has been a subject of intense interest to researchers in media and the political process, the focus has been mainly qualitative, with a corresponding lack of techniques for undertaking quantitative analysis of the news cycle as a whole. Our approach to meme-tracking, with applications to the news cycle. Here we develop a method for tracking units of information as they spread over the web. Our approach is the first to scalably identify short distinctive phrases that travel relatively intact through […]
  • 36. meme tracking. understand the dynamics of reported news; focus on 24-hour news cycles. questions: do such news cycles exist? can we detect them in the data? can we measure their properties?
  • 37. meme tracking. dataset: 90 million news articles from the 2008 US presidential elections. how to identify news cycles: urls, topics, named entities, bag-of-words…? approach taken: quotes (memes), which are easy to manage at large scale and travel relatively unchanged through many articles
  • 38. [Figure 1: A small portion of the full set of variants of Sarah Palin’s quote, “Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he’s palling around with terrorists who would target their own country.” The arrows indicate the (approximate) inclusion of one variant in another, as part of the methodology developed in Section 2. Nodes in the figure are textual variants of the quote, e.g., “palling around with terrorists who would target their own country” and “we see america as a force for good in this world”.] cropped text from the paper: […] phrases with this property are exclusively produced by spammers. (We use ε = .25, L = 4, and M = 10 in our implementation.)
  • 39. meme tracking. interesting optimization problem: identify single-rooted propagations. [Figure 2: Phrase graph. Each phrase is a node and we want to delete the least edges so that each resulting connected component has a single root node/phrase, a node with zero out-edges. By deleting the indicated edges we obtain the optimal solution.] cropped text from the paper: To begin, we define some terminology. We will refer to each news article or blog post as an item, and refer to a quoted string […]
  • 40. meme tracking: volume distributions. Paper excerpt (left margin clipped; bracketed words are restored): […] in Fig. 2). So, the phrase cluster should be a […] all paths terminate in a single root node. To identify phrase clusters, we would like [to] delete […] weight from the phrase graph so it falls apart […] with the property that each piece “feeds into” […] that can serve as the exemplar for the phrase […]. […] we define a directed acyclic graph to be [single-rooted if it] contains exactly one root node. (Note that ev[…] one root.) We now define the following DAG Partitioning: Given a directed acyclic graph with [edge weights], delete a set of edges of minimum total [weight so] that each of the resulting components is [single-rooted]. […] 2 shows a DAG with all edge weights equal to […] edges forms the unique optimal solution. […] DAG Partitioning is computationally intractable […]. We then discuss the heuristic we use for the […] which we find to work well in practice. […] DAG Partitioning is NP-hard. […] to the cluster to which it has the most edges. For example, in Fig. 2 […]
    Figure 3 (log-log plot; x-axis: volume x; y-axis: number of items with volume ≥ x; fitted tails: phrases ∝ x^-1.8, clusters ∝ x^-2.1, “Lipstick on a pig” ∝ x^-0.85): Phrase volume distribution. We consider the volume of individual phrases, phrase-clusters, and the phrases that compose the “Lipstick on a pig” cluster. Notice that phrases and phrase-clusters have similar power-law distributions, while the “Lipstick on a pig” cluster has a much fatter tail, which means that popular phrases have unexpectedly high popularity.
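The quantity plotted in Figure 3, the number of items with volume ≥ x, and a rough estimate of the power-law exponent can be sketched as follows. This is a minimal illustration on synthetic data: the function names are made up for the example, and the least-squares fit on log-log points is a simple stand-in (it is known to be a crude estimator), not the fitting procedure used in the paper.

```python
import math
from collections import Counter


def ccdf_counts(volumes):
    """Return sorted (x, number of items with volume >= x) pairs.

    volumes must be positive numbers so that logarithms are defined.
    """
    counts = Counter(volumes)
    pairs, remaining = [], len(volumes)
    for x in sorted(counts):
        pairs.append((x, remaining))
        remaining -= counts[x]
    return pairs


def loglog_slope(pairs):
    """Least-squares slope of log10(count) against log10(x)."""
    xs = [math.log10(x) for x, _ in pairs]
    ys = [math.log10(n) for _, n in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic volumes built so that count(volume >= x) = 10000 * x**-2.
synthetic = [1] * 9900 + [10] * 99 + [100]
slope = loglog_slope(ccdf_counts(synthetic))
```

For this synthetic input the slope comes out at -2, i.e. the cumulative count falls off as x^-2. A fatter tail, like the one reported for the “Lipstick on a pig” cluster, corresponds to a slope closer to zero.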
  • 41. threads dynamics. Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
  • 42. summary: Meme-tracking and the dynamics of the news cycle (Leskovec, Backstrom, Kleinberg). question to study: identify news cycles and study their dynamics; data: news articles; methodology: interesting computational problems in managing memes; impact/interestingness: interesting methods, interesting findings; reproducibility: datasets publicly available
  • 43. Meme-tracking and the dynamics of the news cycle (Leskovec, Backstrom, Kleinberg). assessment, each on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 44. Everyone’s an Influencer: Quantifying Influence on Twitter. Eytan Bakshy (University of Michigan, USA, ebakshy@umich.edu), Jake M. Hofman (Yahoo! Research, NY, USA, hofman@yahoo-inc.com), Winter A. Mason (Yahoo! Research, NY, USA, winteram@yahoo-inc.com), Duncan J. Watts (Yahoo! Research, NY, USA, djw@yahoo-inc.com).
    ABSTRACT: In this paper we investigate the attributes and relative influence of 1.6M Twitter users by tracking 74 million diffusion events that took place on the Twitter follower graph over a two month interval in 2009. Unsurprisingly, we find that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. We also find that URLs that were rated more interesting and/or elicited more positive feelings by workers on Mechanical Turk were more likely to spread. In spite of these intuitive results, however, we find that predictions of which particular user or URL will generate large cascades are relatively unreliable. We conclude, therefore, that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Finally, we consider a family of hypothetical marketing strategies, defined by the relative cost of identifying versus compensating potential “influencers.” We find that although under some circumstances, the most influential users are also the most cost-effective, under a wide range of plausible assumptions the most cost-effective performance can be realized using “ordinary influencers”: individuals who exert average or even less-than-average influence.
    Categories and Subject Descriptors: H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology.
    Keywords: Communication networks, Twitter, diffusion, influence, word of mouth marketing.
    1. INTRODUCTION: Word-of-mouth diffusion has long been regarded as an important mechanism by which information can reach large populations, possibly influencing public opinion [14], adoption of innovations [26], new product market share [4], or brand awareness [15]. In recent years, interest among researchers and marketers alike has increasingly focused on whether or not diffusion can be maximized by seeding a piece of information or a new product with certain special individuals, often called “influentials” [34, 15] or simply “influencers,” who exhibit some combination of desirable attributes (whether personal attributes like credibility, expertise, or enthusiasm, or network attributes such as connectivity or centrality) that allows them to influence a disproportionately large number of others [10], possibly indirectly via a cascade of influence [31, 16]. Although appealing, the claim that word-of-mouth diffusion is driven disproportionately by a small number of key influencers necessarily makes certain assumptions about the underlying influence process that are not based directly on empirical evidence. Empirical studies of diffusion are therefore highly desirable, but historically have suffered from two major difficulties. First, the network over which word-of-mouth influence spreads is generally unobservable, hence
  • 45. who is influential in twitter? questions: who is influential, and in which content? (celebrity vs. expert on a topic vs. trusted friend…); can we predict who is influential?
  • 46. who is influential in twitter? dataset: 1.6M users tracked; 74M diffusion events (cascades of shortened URLs); two-month period in 2009. definition of influential: someone who posts URLs that many others retweet (deliberately narrow, for the purpose of the study)
  • 47. the dataset: the urls, the follower graph. Paper excerpts (margins clipped):
    Figure 1: Probability density of number of bit.ly URLs posted per user (log-log plot; x-axis: URLs posted; y-axis: density).
    […] “leaders,” not on prediction.) Second, whereas the focus of previous studies has been largely descriptive (e.g. comparing the most influential users), we are interested explicitly in predicting influence; thus we consider all users, not merely the most influential. Third, in addition to predicting diffusion as a function of the attributes of individual seeds, we also study the effects of content. We believe these differences bring the understanding of diffusion on Twitter closer to practical applications, although as we describe later, experimental studies are still required.
    DATA: To study diffusion on Twitter, we combined two separate but related sources of data. First, over the two-month period of September 13 2009 - November 15 2009 we recorded 1.03B public tweets broadcast on Twitter, excluding October 14-16 during which there were intermittent outages in the Twitter API. Of these, we extracted 87M tweets that included bit.ly URLs and which corresponded to distinct diffusion “events,” where each event comprised a single initiator, or “seed,” followed by some number of repostings of the same URL by the seed’s followers, their followers, and so on. Finally, we identified a subset of 74M diffusion events that were initiated by seed users who were active in both the first and second months of the observation period; thus enabling us to train our regression model on first month […]
    […] in the same two-month period. We did this by querying the Twitter API to find the followers of every user who posted a bit.ly URL. Subsequently, we placed those followers in a queue to be crawled, thereby identifying their followers, who were then also placed in the queue, and so on. In this way, we obtained a large fraction of the Twitter follower graph comprising all active bit.ly posters and anyone connected to these users via one-way directed chains of followers. Specifically, the subgraph comprised approximately 56M users and 1.7B edges. Consistent with previous work [7, 18, 35], both the in-degree (“followers”) and out-degree (“friends”) distributions are highly skewed, but the former much more so: whereas the maximum # of followers was nearly 4M, the maximum # of friends was only about 760K, reflecting the passive and one-way nature of the “follow” action on Twitter (i.e. A can follow B without any action required from B). We emphasize, moreover, that because the crawled graph was seeded exclusively with active users, it is almost certainly not representative of the entire follower graph. In particular, active users are likely to have more followers than average, in which case we would expect that the average in-degree will exceed the average out-degree for our sample, as indeed we observe. Table 1 presents some basic statistics of the distributions of the number of friends, followers and number of URLs posted per user.
    Table 1: Statistics of the Twitter follower graph and seed activity
               # Followers     # Friends     # Seeds Posted
      Median   85.00           82.00         11.00
      Mean     557.10          294.10        46.33
      Max.     3,984,000.00    759,700.00    54,890
    4. COMPUTING INFLUENCE ON TWITTER: To calculate the influence score for a given URL post, we tracked the diffusion of the URL from its origin at a particular “seed” node through a series of reposts (by that user’s followers, those users’ followers, and so on) until the diffusion event, or cascade, terminated. To do this, we used the time each URL was posted: if person B is following person A, and person A posted the URL before B and was the only one of B’s friends to post the URL, we say person A influenced person B to post the URL. As Figure 2 shows, if B has more than one friend who has previously posted the same URL, we have three choices for how to assign the corresponding influence: first, we can assign full credit to the
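The timestamp-based attribution rule described in the excerpt can be sketched in a few lines. The unambiguous case (exactly one friend posted earlier) follows the text directly. The excerpt is cut off before it lists the three options for the ambiguous case, so the `policy` values below (`"first"`, `"last"`, `"split"`) are illustrative assumptions rather than a transcription of the paper's definitions; the function and variable names are likewise made up for the example.

```python
from collections import defaultdict


def assign_influence(post_time, friends, policy="first"):
    """Credit each repost of one URL to earlier-posting friends.

    post_time: dict user -> time at which the user posted the URL.
    friends:   dict user -> set of users that user follows.
    policy:    how to share credit when several friends posted earlier
               (assumed options: "first", "last", "split").
    Returns dict influencer -> number of direct reposts credited.
    """
    credit = defaultdict(float)
    for user, t in post_time.items():
        earlier = [f for f in friends.get(user, ())
                   if f in post_time and post_time[f] < t]
        if not earlier:
            continue  # nobody to credit: this user is a seed
        earlier.sort(key=lambda f: (post_time[f], f))
        if policy == "first":
            credit[earlier[0]] += 1.0
        elif policy == "last":
            credit[earlier[-1]] += 1.0
        elif policy == "split":
            for f in earlier:
                credit[f] += 1.0 / len(earlier)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return dict(credit)


# B follows A and C, who both posted before B; D follows only B.
times = {"A": 1, "C": 2, "B": 3, "D": 4}
follows = {"B": {"A", "C"}, "D": {"B"}}
by_first = assign_influence(times, follows, "first")
by_split = assign_influence(times, follows, "split")
```

With the toy data, `"first"` credits A and B with one repost each, while `"split"` shares B's repost equally between A and C. Summing credit along the chain of reposts would give cascade sizes, which this sketch leaves out.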
  • 48. cascades. Paper excerpts (margins clipped): Figure 3: Examples of information cascades on Twitter. […] there are many reasons why individuals may choose to pass along information other than the number and identity of the individuals from whom they received it, in particular, the nature of the content itself. Moreover, influencing another individual to pass along a piece of information does not […]
    Figure 4: (a) Frequency distribution of cascade sizes (log-log plot; x-axis: size; y-axis: density). (b) Distribution of cascade depths (x-axis: depth; y-axis: frequency, log scale).
    […] we study size or depth, therefore, the implication is that most events do not spread at all, and even moderately sized cascades are extremely rare. To identify consistently influential individuals, we aggregated all URL posts by user and computed individual-level influence as the logarithm of the average size of all cascades for which that user was a seed. We then fit a regression tree model [6], in which a greedy optimization process recursively partitions the feature space, resulting in a piecewise-constant function where the value in each partition is fit to the mean of the corresponding training data.
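The two steps in this excerpt, an influence score per seed and a greedily fitted piecewise-constant model, can be sketched in plain Python. This is a toy version under stated assumptions: the logarithm is taken base 10, the tree splits on a single numeric feature, and the names `influence_scores`, `fit_tree` and `predict` are hypothetical. It is not the regression tree implementation cited as [6].

```python
import math


def influence_scores(cascades):
    """cascades: iterable of (seed_user, cascade_size) with size >= 1.

    Returns {seed_user: log10(average cascade size)}; base 10 is an
    assumption made for this sketch.
    """
    sizes = {}
    for seed, size in cascades:
        sizes.setdefault(seed, []).append(size)
    return {s: math.log10(sum(v) / len(v)) for s, v in sizes.items()}


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(xs, ys, max_depth=2, min_leaf=1):
    """Greedy regression tree on one feature; returns a nested dict."""
    best = None
    if max_depth > 0:
        pairs = sorted(zip(xs, ys))
        for i in range(min_leaf, len(pairs) - min_leaf + 1):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # cannot split between equal feature values
            left = [y for _, y in pairs[:i]]
            right = [y for _, y in pairs[i:]]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
                best = (cost, threshold, i, pairs)
    if best is None:
        return {"value": mean(ys)}  # leaf: mean of the training targets
    _, threshold, i, pairs = best
    lx, ly = zip(*pairs[:i])
    rx, ry = zip(*pairs[i:])
    return {
        "threshold": threshold,
        "left": fit_tree(lx, ly, max_depth - 1, min_leaf),
        "right": fit_tree(rx, ry, max_depth - 1, min_leaf),
    }


def predict(tree, x):
    """Follow the left child while x < threshold, as in a standard tree."""
    while "value" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]


scores = influence_scores([("a", 1), ("a", 199), ("b", 10)])
stump = fit_tree([1, 2, 10, 11], [0.0, 0.2, 1.0, 1.2], max_depth=1)
```

In the toy fit, the best single split falls between feature values 2 and 10, so every input below 6.0 is predicted as the mean of the two low targets and every other input as the mean of the two high ones. That is the piecewise-constant behaviour the excerpt describes.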
  • 49. prediction task: build a model to predict influence. model features: user attributes (# followers, # friends, # tweets, date of joining); past influence of seed users (average, minimum, and maximum total influence; average, minimum, and maximum local influence). (the study is repeated with additional content features)
  • 50. prediction task: results. # followers and past influence are important features: individuals who have been influential in the past and who have many followers are more likely to be influential in the future. however, this is correct only on average: the predictor features are necessary but not sufficient; we cannot really predict who will initiate a cascade; advertisers need a diverse portfolio of users to target
  • 51. prediction task: results. Figure 5: Regression tree fit for one of the five cross-validation folds. Leaf nodes give the predicted influence for the corresponding partition, where the left (right) child is followed if the node condition is satisfied (violated). Split conditions recovered from the figure: log10(pastLocalInfluence + 1) < 0.09791, < 0.3028, < 0.3027, < 0.856; leaf values: 0.0124, 0.03631, 0.05991, 0.09241, 0.1452, 0.1229, 0.1929, 0.3045, 0.275, 0.4118, 0.6034, 0.9854.
    Figure 6: Influence as a function of past local influence and number of followers for (a) all users and (b) users with the top 25 actual influence. Each circle represents a single seed user, where the size of the circle represents that user’s actual average influence. Accounts labelled in the figure: TreySongz, Orbitz, stephenfry, marissamayer, disneypolls, MrEdLover, BarackObama, pigeonPOLL, iphone_dev, geohot, mslayel, cnnbrk, TreysAngels, OFA_TX, britneyspears, riskybusinessmb, nprnews, wealthtv, garagemkorova, michelebachmann, billprady.
    Figure 7: Actual vs. predicted influence for regression tree (x-axis: predicted influence; y-axis: actual influence). The model assigns each seed user to a leaf in the regression tree. Points representing the average actual influence values are placed at the pre[…]
  • 53. summary: Everyone is an influencer: quantifying influence on twitter (Bakshy, Hofman, Mason, Watts). question to study: can we identify influential users in twitter?; data: large twitter dataset over two months (proprietary); methodology: prediction and analysis of a regression task; impact/interestingness: interesting question, potentially very high impact for advertisers; reproducibility: dataset not publicly available
  • 54. Everyone is an influencer: quantifying influence on twitter (Bakshy, Hofman, Mason, Watts). assessment, each on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 55. Coevolution of Network Structure and Content. Chun-Yuen Teng (School of Information, University of Michigan, Ann Arbor, MI 48109, chunyuen@umich.edu), Liuling Gong (School of Information, University of Michigan, Ann Arbor, MI 48109, llgong@umich.edu), Avishay Livne (EECS, University of Michigan, Ann Arbor, MI 48109, avishay@umich.edu), Celso Brunetti (Carey Business School, Johns Hopkins, Baltimore, MD 21202, celsob@jhu.edu), Lada Adamic (School of Information, University of Michigan, Ann Arbor, MI 48109, ladamic@umich.edu). arXiv:1107.5543v2 [cs.SI] 21 May 2012.
    ABSTRACT: As individuals communicate, their exchanges form a dynamic network. We demonstrate, using time series analysis of communication in three online settings, that network structure alone can be highly revealing of the diversity and novelty of the information being communicated. Our approach uses both standard and novel network metrics to characterize how unexpected a network configuration is, and to capture a network’s ability to conduct information. We find that networks with a higher conductance in link structure exhibit higher information entropy, while unexpected network configurations can be tied to information novelty. We use a simulation model to explain the observed correspondence between the evolution of a network’s structure and the information it carries.
    Categories and Subject Descriptors: J.4 [Computer Applications]: Social and Behavioral Sciences; H.2.8 [Database Applications]: Data Mining. General Terms: Measurement, Human Factors. Keywords: social media, information networks, network evolution.
    […] adoption of ideas and behavior [28, 6, 3], convergence of opinion [5], or the speed and extent of innovation [14]. In practice, networks are rarely static, unless one considers only the strongest and most stable ties [7] or experimentally dictates the network topology to be fixed [6]. However, even stable ties transfer information at different rates [25, 13, 21], and a portion of information flow occurs outside of established social ties [4]. New ties are also induced by information flow, e.g. a Pakistani Twitter user who inadvertently live-tweeted the Bin Laden assassination quickly gained tens of thousands of new followers on Twitter. This points to a need to approach the relationship between network structure and information content in a substantively different way. In this paper, rather than treating the network structure as static, we specifically use its dynamic nature to infer two properties of the information being communicated through the network. The first is the diversity of the information; whether everyone is talking about the same topic or whether one is observing many disparate conversation topics being discussed. The second is the novelty of the information; whether individuals in the network are continuing to talk about the same topic they talked about in the previous time period, or whether new topics have arisen that are different from what has been discussed before. For example, one could imagine oneself at a dinner party, where most conversations are out of earshot, but one can easily observe who is conversing with whom. While individuals are milling about
  • 56. content vs. structure. questions: understand the interplay between content and structure (what is said in the network vs. how the information spreads). more concretely: can the network structure tell us what people talk about? are they all talking about the same thing, or do they gossip? is what people talk about novel?
  • 57. content vs. structure: what’s different here. we look at network dynamics at relatively short time scales and construct time series; a range of network metrics, instead of just community structure; information novelty and diversity, as opposed to tracking single events / pieces of information. (example topics shown on the slide: “big news!”, “virus epidemic”, “weather is horrible today”)
  • 58. content vs. structure: methodology. extract features that capture network structure: # vertices, # edges, avg degree, degree correlations, …; conductance (does information flow along many paths?); expectedness of conversation (have I seen this edge before?). extract features that characterize content diversity and novelty. correlation analysis between structure and content features. analysis on 3 datasets: twitter, virtual game, enron email network
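The shape of this methodology, one structural time series, one content time series, and a correlation between them, can be sketched as follows. Every definition here is a simplified stand-in chosen for illustration: content diversity is taken as the Shannon entropy of the words used in a time window, and "expectedness" as the fraction of a window's edges already seen in earlier windows. The paper's own metrics, including conductance, are not reproduced, and all function names are hypothetical.

```python
import math
from collections import Counter


def word_entropy(words):
    """Shannon entropy (bits) of the word distribution in one window."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def edge_expectedness(windows):
    """windows: list of sets of edges, one set per time window.

    Returns, per window, the fraction of its edges that already
    appeared in any earlier window (0.0 for an empty window).
    """
    seen, fractions = set(), []
    for edges in windows:
        fractions.append(len(edges & seen) / len(edges) if edges else 0.0)
        seen |= edges
    return fractions


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Three toy time windows of "who talks to whom" and what was said.
edge_windows = [{(1, 2)}, {(1, 2), (2, 3)}, {(2, 3)}]
word_windows = [["storm", "news", "storm", "news"], ["storm"] * 3, ["news"]]
expectedness = edge_expectedness(edge_windows)
diversity = [word_entropy(w) for w in word_windows]
```

In a real analysis one would compute such series over many windows per dataset and then call `pearson` on each structure/content pair; three windows are only enough to show the data flow.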
  • 59. content vs. structure: findings. “simple” structure features are not correlated with content diversity and novelty; conductance correlates with content diversity; expectedness correlates with content novelty
  • 60. summary: Coevolution of network structure and content (Teng, Gong, Livne, Brunetti, and Adamic). question to study: interplay between structure and content; data: three datasets; methodology: feature extraction and correlation analysis; impact/interestingness: interesting question, potentially very high impact; reproducibility: some datasets publicly available
  • 61. Coevolution of network structure and content (Teng, Gong, Livne, Brunetti, and Adamic). assessment, each on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  • 62. what is next? continue the literature review (next week, Michael). meanwhile: keep thinking about project ideas; browse papers (the ones in Noppa; main conferences: ICWSM, WSDM, WWW); talk to your colleagues; talk to your instructors