In this study, we analyze the network of Twitter users and the mentions between them. Starting with a very large, poorly structured dataset, we used the Unix terminal (sed) and regular expressions to efficiently filter and transform it into a lighter dataset. Then, using Python, we transformed the dataset from a linear (line-by-line) to a tabular (column-based) format, in order to load the data into igraph. Using igraph, we created a weighted directed graph and performed various tasks to explore the network:
- Identifying basic properties of the network, such as the Number of vertices, Number of edges, Diameter of the graph, Average in-degree and Average out-degree.
- Visualising the 5-day evolution of these metrics and commenting on observed fluctuations.
- Identifying the important nodes of the graph, based on In-degree, Out-degree and PageRank.
- Performing community detection on the mention graphs, by applying fast greedy, infomap, and Louvain clustering on the undirected versions of the 5 mention graphs.
- Visualising the different communities in the mention graph.
Twitter Mention Graph
SOCIAL NETWORK ANALYSIS
Sotiris Baratsas
MSc in Business Analytics
TASK 1: Twitter Mention Graph
Our first task is to create a weighted directed graph with igraph, using raw data from
Twitter. To do that, we will first clean the data and bring it as close as possible to the
usable format we want to import into R. Probably the most efficient way to do that is
to use the native commands of the Unix terminal (bash), which can process the data
much faster than the alternatives.
Step 1: Extract only the dates we want
The first thing we can do, in order to work faster with the dataset, is to extract only the
dates we want. To do that, we will use the "grep" command and keep only the rows that
start with "T 2009-07-01", as well as the next 2 rows after every match (requested through
the -A 2 parameter). In this way, we keep a total of 3 lines for every match: the date,
the user and the tweet.
time grep -A 2 "^T.2009-07-01" tweets2009-07.txt > tweets1.txt
# real 1m45.001s
# user 1m31.585s
# sys 0m4.816s
grep -A 2 "^T.2009-07-02" tweets2009-07.txt > tweets2.txt
grep -A 2 "^T.2009-07-03" tweets2009-07.txt > tweets3.txt
grep -A 2 "^T.2009-07-04" tweets2009-07.txt > tweets4.txt
grep -A 2 "^T.2009-07-05" tweets2009-07.txt > tweets5.txt
As we can see, the time spent to create each file is about 1.5 minutes, which is quite
good.
Step 2: Clean the data
After extracting only the dates we want, we could load the data into python and start
the data cleaning, however, it would be far more efficient to do some part of the data
cleaning inside the terminal, using the sed command.
Sed allows us to chain multiple substitution commands in a single invocation, so we
will combine the following sed commands:
This sed command will match the rows with the Timestamp of the tweets and keep
only the date.
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g' tweets1.txt
This sed command will match the rows containing the link to the user's profile and keep
only the actual Twitter handle, removing "http://twitter.com/" and also the U at the
beginning of the line.
sed -i '' 's/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g' tweets1.txt
This sed command will match the tweets that include one or more mentions, remove
the words that are not mentions, and make it much faster for us to later iterate
through each word in a tweet.
sed -i '' 's/W.[^@]*\(@[^ :,.]*\)*/\1 /g' tweets1.txt
This sed command will delete the "--" separator lines that grep -A inserted between
matches in Step 1.
sed -i '' '/--/d' tweets1.txt
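As a cross-check, the mention pattern can also be expressed with Python's re module. This is illustrative only (the actual cleaning in this study is done with sed, as shown above); the function name and sample line are made up:

```python
import re

def extract_mentions(tweet_line):
    """Return the @mentions in a W-line of the raw dump,
    mimicking what the sed substitution keeps."""
    # Mentions start with '@' and stop at a space or punctuation,
    # matching the sed character class [^ :,.]
    return re.findall(r"@[^ :,.]+", tweet_line)

line = "W @JezebellXOXO check this out, thanks @someuser."
print(extract_mentions(line))
```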
Next, we combine the previous commands into a single sed invocation and execute it
for each of the 5 files.
- the -i argument indicates that we want to edit the file in place, overwriting it with the results
- the empty quotes after -i set an empty backup suffix; this is required on macOS (BSD sed), while on GNU/Linux sed it should be omitted
time sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets1.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets2.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets3.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets4.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets5.txt
The resulting files have the following format:
For tweets that did not include a mention, the content has been cleared.
For tweets with mentions, the content before and between each mention has been
cleared. Each record takes 3 rows (date, user, tweet).
2009-07-04
dailynascar
2009-07-04
dcompanyau
@JezebellXOXO Please Come and See my Lasted pics http://short.to/h0r7
2009-07-04
donnamurrutia
Step 3: Put the data into tabular format
Next, we are ready to load the data into Python, put it into tabular format and
generate the needed CSV files.
To do that, we follow the process described below:
(I describe the process in detail inside the .ipynb file)
1. We read the file line-by-line
2. We put the data into tabular format, by iterating through every 3 lines and
putting the content in the appropriate column (i.e. Date, from, to).
3. We create a function that looks for every word that starts with @ (mention) and
splits multiple mentions into different rows, keeping the Date and User who
made the mention the same between multiple mentions in the same tweet.
4. We group the data by “from” – “to” pairs and get the size() to find the frequency
(weight) of mentions for each pair.
5. We extract the resulting data frame as CSV
6. We run this process for every file and end up with 5 CSV files, one for each day
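The steps above can be sketched in Python roughly as follows. This is a simplified stand-in for the actual notebook code (which lives in the .ipynb file); the function name, variable names and sample records are illustrative:

```python
from collections import Counter

def build_edges(lines):
    """Turn cleaned 3-line records (date, user, mentions) into
    weighted (from, to) pairs. Names here are illustrative."""
    weights = Counter()
    # Steps 1-2: iterate through every 3 lines (date, from-user, tweet)
    for i in range(0, len(lines) - len(lines) % 3, 3):
        date, user, tweet = lines[i], lines[i + 1], lines[i + 2]
        # Step 3: every word starting with '@' is a separate mention,
        # keeping the same date and from-user for each one
        for word in tweet.split():
            if word.startswith("@"):
                # Step 4: count (from, to) pairs to get the weight
                weights[(user, word.lstrip("@"))] += 1
    return weights

lines = [
    "2009-07-04", "dcompanyau",
    "@JezebellXOXO @JezebellXOXO ",
    "2009-07-04", "donnamurrutia", "",
]
print(build_edges(lines))
```

Step 5 then amounts to writing these (from, to, weight) triples out with the csv module, once per daily file.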
The result is 5 CSV files with the following format.
"from","to","weight"
"suddenlyjamie","dmscott",1
"aruanpc","danilogentili",1
"gloriahansen","janedavila",2
"uluvsheena","PreciousSoHot",1
"adreamon","jlovely69",1
"cin7415","sdriven1",1
Finally, after having our CSV files ready, we load them into R and create the directed
igraph objects.
# Reading the CSV files we have created
tweets1 = read.csv(file="tweets1.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets2 = read.csv(file="tweets2.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets3 = read.csv(file="tweets3.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets4 = read.csv(file="tweets4.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets5 = read.csv(file="tweets5.csv", header=T, sep=",", fileEncoding = "utf-8")
# Checking the structure of the data
str(tweets1)
DUPLICATE NODES
By taking a look at the dataset, we observe that some of the records are duplicates
caused by case sensitivity (e.g. 'OfficialTila' and 'officialtila' are treated as separate
users, while they are the same). On Twitter, usernames are case-insensitive: a handle
refers to the same account whether it is written in lowercase, uppercase or any mixed case.
To correct the problem, we will make all the characters lowercase and then merge the
duplicate records that have been created. We need to be careful to sum up the weights
of each record, so that they are represented correctly after merging the different cases.
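As a sanity check, the lowercase-and-merge logic can be sketched in Python (illustrative only; the actual merging in this study is done in R, as shown below):

```python
from collections import defaultdict

def merge_case_duplicates(edges):
    """Lowercase both endpoints and sum the weights of rows
    that collapse onto the same (from, to) pair."""
    merged = defaultdict(int)
    for frm, to, weight in edges:
        merged[(frm.lower(), to.lower())] += weight
    return dict(merged)

edges = [("OfficialTila", "mileycyrus", 3),
         ("officialtila", "MileyCyrus", 2)]
print(merge_case_duplicates(edges))
```

Summing the weights during the merge is the important part: dropping duplicates instead would silently discard mentions.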
# To correct this problem, first we will make all characters lowercase
tweets1[,1:2] <- sapply(tweets1[,1:2], tolower)
tweets2[,1:2] <- sapply(tweets2[,1:2], tolower)
tweets3[,1:2] <- sapply(tweets3[,1:2], tolower)
tweets4[,1:2] <- sapply(tweets4[,1:2], tolower)
tweets5[,1:2] <- sapply(tweets5[,1:2], tolower)
# Then, we will use ddply to merge duplicate rows and sum their weights, so that we don't lose any values
library(plyr)
tweets1<- ddply(tweets1,~from + to,summarise,weight=sum(weight))
tweets2<- ddply(tweets2,~from + to,summarise,weight=sum(weight))
tweets3<- ddply(tweets3,~from + to,summarise,weight=sum(weight))
tweets4<- ddply(tweets4,~from + to,summarise,weight=sum(weight))
tweets5<- ddply(tweets5,~from + to,summarise,weight=sum(weight))
# CREATING THE iGRAPH OBJECTS (READ FROM DATA FRAME)
library(igraph)
g1 <- graph_from_data_frame(tweets1, directed = TRUE, vertices = NULL)
g2 <- graph_from_data_frame(tweets2, directed = TRUE, vertices = NULL)
g3 <- graph_from_data_frame(tweets3, directed = TRUE, vertices = NULL)
g4 <- graph_from_data_frame(tweets4, directed = TRUE, vertices = NULL)
g5 <- graph_from_data_frame(tweets5, directed = TRUE, vertices = NULL)
TASK 2: Average Degree over time
Our next task is to create plots that visualize the 5-day evolution of some important
metrics of the network, such as number of vertices, number of edges, diameter and
average degrees.
To do that, we create a loop that takes each network, computes the needed metrics
and writes them into a data frame. The resulting table is the following:
To get a better representation, we can also plot the 7 metrics, using ggplot:
(In any directed graph, the sum of in-degrees equals the sum of out-degrees, since every
edge contributes one to each, so we plot the average in-degree and out-degree on the
same plot. Moreover, we calculate the average weighted in/out degree.)
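As an illustration only, the bookkeeping behind these metrics can be sketched in Python on a plain edge list (the actual computation in this study is done in R with igraph; this sketch also omits the diameter, which requires a shortest-path search):

```python
def basic_metrics(edges):
    """edges: list of (from, to, weight) tuples for one day's graph."""
    vertices = {v for frm, to, _ in edges for v in (frm, to)}
    n_edges = len(edges)
    # Every edge adds 1 to one node's in-degree and 1 to another's
    # out-degree, so the two averages are always equal: edges / vertices.
    avg_degree = n_edges / len(vertices)
    # The weighted version divides total weight by the vertex count.
    total_weight = sum(w for _, _, w in edges)
    return {"vertices": len(vertices),
            "edges": n_edges,
            "avg_in_out_degree": avg_degree,
            "avg_weighted_degree": total_weight / len(vertices)}

edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 1)]
metrics = basic_metrics(edges)
print(metrics)
```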
There are a few things we can observe in the above graphs:
- The number of users involved in mentioning others or being mentioned by others is at its peak on Wednesday 01/07/2009 and steadily decreases every day until Sunday 05/07/2009.
- As expected, the number of mentions also decreases, following roughly the same curve.
- The diameter of the network starts at 71 on Wed 01/07, steadily increases to 75 until Friday, but drops to its lowest point on Saturday. On Sunday, however, it reaches its highest point (85). What is fascinating about these numbers is that, even though we only took direct mentions into account, the diameter is still low for a global social network. Of course, this might have something to do with the data being from 2009, when Twitter was mostly known and used in the USA.
- The Average In-Degree and the Average Out-Degree follow the opposite direction, starting just below 1.19 on Wednesday, increasing to over 1.45 on Friday and decreasing to 1.23 on Sunday. This means that the average user mentions ~1 person and is also mentioned about once.
- If we take into account the weights of the network, we can calculate the weighted average degree, which gives results very similar to the unweighted degree graph, but a little higher.
TASK 3: Important nodes
Next, we will identify the important nodes of the network per day. We will select the
nodes that rank highest for each day in 3 key metrics:
- In-degree
- Out-degree
- Page-Rank
In-Degree: Shows us which users are mentioned the most for each day
The process we follow is to calculate the Top 10 for each day and then bind them into
a single data frame:
intop1 <- head(sort(degree(g1, mode="in"), decreasing=TRUE), 10)
intop2 <- head(sort(degree(g2, mode="in"), decreasing=TRUE), 10)
intop3 <- head(sort(degree(g3, mode="in"), decreasing=TRUE), 10)
intop4 <- head(sort(degree(g4, mode="in"), decreasing=TRUE), 10)
intop5 <- head(sort(degree(g5, mode="in"), decreasing=TRUE), 10)
intop <- data.frame(cbind(names(intop1), names(intop2), names(intop3), names(intop4), names(intop5)))
colnames(intop) <- c("01-07-2009", "02-07-2009", "03-07-2009", "04-07-2009", "05-07-2009")
intop
In-degree Top 10 per day
As we can see, the users who are mentioned the most are pretty much the same every
day. These are accounts that post memes or news, such as tweetmeme, mashable, addthis,
cnn, cnnbrk, breakingnews, and celebrities, such as mileycyrus, ddlovato, adamlambert,
souljaboytellem, officialtila. There are some exceptions, which might be due to
something significant or newsworthy happening on that day that caused an
account to receive more mentions.
Since we have a weighted network, it might make more sense to calculate the Top 10,
using the weighted in-degree.
Weighted in-degree Top 10 per day
If we do that, we can see some variations in the positions of each user inside the Top
10, but the accounts are mostly the same as the previous results.
Out-Degree: Shows us which users mention other users the most for each day
We follow the same process to calculate the Top 10 using the Out-degree
Out-degree Top 10 per day & Weighted out-degree Top 10 per day
In contrast, when we identify the Top 10 users for each day using the Out-Degree
and Weighted Out-Degree, we see a lot of variation, with only a few users being
exceptions. The explanation is that it takes a certain level of popularity to be
the receiver of a lot of mentions (in-degree), but any user, on any day, can mention as
many users as they want, as long as they don't violate the limits imposed by Twitter.
Page-Rank: Shows us the users who accumulated the most PageRank value every day. It
takes into account whether a user was mentioned by many users who were in turn
mentioned by other users (e.g. influencers).
# PAGERANK
pgrnk1 <- page_rank(g1, algo="prpack" , directed=FALSE)$vector
pgrnk2 <- page_rank(g2, algo="prpack" , directed=FALSE)$vector
pgrnk3 <- page_rank(g3, algo="prpack" , directed=FALSE)$vector
pgrnk4 <- page_rank(g4, algo="prpack" , directed=FALSE)$vector
pgrnk5 <- page_rank(g5, algo="prpack" , directed=FALSE)$vector
# Rank all users in descending order, based on their PageRank value, and keep the Top 10
ranked1 <- head(sort(pgrnk1, decreasing=TRUE), 10)
ranked2 <- head(sort(pgrnk2, decreasing=TRUE), 10)
ranked3 <- head(sort(pgrnk3, decreasing=TRUE), 10)
ranked4 <- head(sort(pgrnk4, decreasing=TRUE), 10)
ranked5 <- head(sort(pgrnk5, decreasing=TRUE), 10)
ranked <- data.frame(cbind(names(ranked1), names(ranked2), names(ranked3), names(ranked4), names(ranked5)))
colnames(ranked) <- c("01-07-2009", "02-07-2009", "03-07-2009", "04-07-2009", "05-07-2009")
ranked
Top 10 users per day, based on Page-Rank
As far as page-rank is concerned, we can see that most of the Top 10 users are the
same as the Top 10 users based on in-degree. This makes sense, since these users (e.g.
celebrities) receive a lot of mentions and retweets from other users, but they only
mention a few other users themselves. This way, they concentrate a lot of page-rank
value.
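This intuition can be illustrated with a bare-bones power-iteration PageRank in Python, independent of the far more robust prpack implementation that igraph uses. In this toy example (all usernames made up), three fan accounts mention one celebrity, and the celebrity ends up with the most rank:

```python
def pagerank(edges, damping=0.85, iters=100):
    """Minimal unweighted PageRank by power iteration.
    edges: list of (from, to) pairs of a directed graph."""
    nodes = sorted({v for e in edges for v in e})
    out_links = {v: [] for v in nodes}
    for frm, to in edges:
        out_links[frm].append(to)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        # Every node keeps a base (1 - d) / N, plus shares from in-links
        new = {v: (1 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank over all nodes
                for t in nodes:
                    new[t] += damping * rank[v] / len(nodes)
        rank = new
    return rank

# Three fans mention one celebrity; the celebrity mentions no one
edges = [("fan1", "celeb"), ("fan2", "celeb"), ("fan3", "celeb")]
r = pagerank(edges)
print(max(r, key=r.get))
```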
TASK 4: Communities
Our final task is to identify different communities, by applying fast greedy clustering,
infomap clustering, and louvain clustering on the undirected versions of the 5 mention
graphs.
# Making the graphs undirected
ug1 <- as.undirected(g1)
ug2 <- as.undirected(g2)
ug3 <- as.undirected(g3)
ug4 <- as.undirected(g4)
ug5 <- as.undirected(g5)
# Finding communities with fast greedy clustering
communities_fast_greedy1 <- cluster_fast_greedy(ug1)
# Finding communities with infomap clustering
communities_infomap1 <- cluster_infomap(ug1)
# Finding communities with louvain clustering
communities_louvain1 <- cluster_louvain(ug1)
communities_louvain2 <- cluster_louvain(ug2)
communities_louvain3 <- cluster_louvain(ug3)
communities_louvain4 <- cluster_louvain(ug4)
communities_louvain5 <- cluster_louvain(ug5)
However, as it turns out, fast greedy clustering takes too long to execute (I got results
after about 45-50 minutes). As a matter of fact, infomap clustering takes even longer.
The only method able to produce results in a matter of seconds is the Louvain
community detection algorithm: although it is based on greedy optimization, it
includes an additional aggregation step that makes it scale to very large
networks.
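Fast greedy and Louvain both work by (approximately) maximizing modularity, while infomap optimizes the map equation instead. As a rough illustration, here is a minimal Python sketch of the modularity score the first two methods maximize, for an unweighted, undirected graph (the real mention graphs are weighted, which this sketch ignores):

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.
    edges: list of (u, v); community: dict node -> community id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Q = sum over communities c of (e_c / m - (d_c / 2m)^2),
    # where e_c is the number of internal edges and d_c the total degree
    q = 0.0
    for c in set(community.values()):
        members = {v for v in community if community[v] == c}
        e_c = sum(1 for u, v in edges if u in members and v in members)
        d_c = sum(degree[v] for v in members)
        q += e_c / m - (d_c / (2 * m)) ** 2
    return q

# Two disconnected triangles: a clean two-community split
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, community), 3))
```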
compare(communities_fast_greedy1, communities_infomap1)
compare(communities_fast_greedy1, communities_louvain1)
compare(communities_infomap1, communities_louvain1)
Comparing different clustering methods
We can compare the resulting community structures using the compare() function (by
default it returns the variation of information, where lower values mean more similar
partitions). It seems that Louvain is closest to the fast greedy method.
EVOLUTION OF COMMUNITY MEMBERSHIP
Then, using the Louvain method, we will try to trace the evolution of the communities
the user "KimKardashian" belongs to.
To do that, first we identify the community in which Kim Kardashian belongs in each
graph (= each day) and then find the intersections of these communities.
# Detecting the evolution of the communities to which user "KimKardashian" belongs
c1<-communities_louvain1[membership(communities_louvain1)["kimkardashian"]]
c2<-communities_louvain2[membership(communities_louvain2)["kimkardashian"]]
c3<-communities_louvain3[membership(communities_louvain3)["kimkardashian"]]
c4<-communities_louvain4[membership(communities_louvain4)["kimkardashian"]]
c5<-communities_louvain5[membership(communities_louvain5)["kimkardashian"]]
# Finding common members between each pair of daily communities
intersect(c1$`54008`, c2$`41188`)
intersect(c1$`54008`, c3$`22013`)
intersect(c1$`54008`, c4$`8036`)
intersect(c1$`54008`, c5$`21162`)
intersect(c2$`41188`, c3$`22013`)
intersect(c2$`41188`, c4$`8036`)
intersect(c2$`41188`, c5$`21162`)
intersect(c3$`22013`, c4$`8036`)
intersect(c3$`22013`, c5$`21162`)
As we can see from the results, the communities with the most common members are
those of Day 3 and Day 5, along with the community from Day 2.
On the other hand, the community from Day 4 is very small and has no common
members with the other communities.
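The day-to-day comparison above boils down to set operations on community memberships. As a small sketch (with made-up usernames), such overlaps could also be quantified with the Jaccard index, complementing the raw intersect() output:

```python
def jaccard(a, b):
    """Overlap between two community membership sets:
    |intersection| / |union|, ranging from 0 to 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

day3 = {"kimkardashian", "user_a", "user_b", "user_c"}
day5 = {"kimkardashian", "user_a", "user_b", "user_d"}
day4 = {"kimkardashian", "user_x"}
print(jaccard(day3, day5))  # high overlap
print(jaccard(day3, day4))  # low overlap
```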
VISUALIZING THE COMMUNITIES
In order to visualize the communities (we will use the 1st day's graph as an example),
first we need to:
- set a color for the different communities (represented as levels of a factor)
- check the sizes of each community to select our filtering parameters
- filter to keep only some mid-sized communities
- induce a subgraph using this filter, to keep only the nodes that belong to these communities
- plot the subgraph and adjust the parameters to get a good visual result
# Setting colors for the different communities
V(g1)$color <- factor(membership(communities_louvain1))
#Get the sizes of each community of Graph1 (g1)
community_size <- sizes(communities_louvain1)
head(sort(community_size, decreasing=TRUE), 20)
head(sort(community_size, decreasing=FALSE), 20)
mean(community_size)
length(community_size)
# Keep only some mid-sized communities with more than 50 and fewer than 90 members
in_mid_community1 <- unlist(communities_louvain1[community_size > 50 & community_size < 90])
# Induce a subgraph of graph 1 using in_mid_community1
sub_g1 <- induced.subgraph(g1, in_mid_community1)
# Plot those mid-sized communities
plot(sub_g1, vertex.label = NA, edge.arrow.width = 0.8, edge.arrow.size = 0.2,
     coords = layout_with_fr(sub_g1), margin = 0, vertex.size = 3)
Visualization of some mid-sized communities for each day (1 to 5). Each community is depicted in a
different color.