Twitter Mention Graph - Analytics Project
Twitter Mention Graph
SOCIAL NETWORK ANALYSIS
Sotiris Baratsas, MSc in Business Analytics

TASK 1: Twitter Mention Graph

Our first task is to create a weighted, directed graph with igraph, using raw data from Twitter. To do that, we first clean the data and bring it as close as possible to the format we want to import into R. Probably the most efficient way to do this is with the native commands of the Unix terminal (bash), which can process the data much faster than the alternatives.

Step 1: Extract only the dates we want

The first thing we can do, in order to work faster with the dataset, is to extract only the dates we want. To do that, we use the grep command and keep only the rows that start with "T 2009-07-01", as well as the next 2 rows after every match (requested with the -A 2 parameter). In this way, we keep a total of 3 lines for every match: the date, the user and the tweet.

time grep -A 2 "^T.2009-07-01" tweets2009-07.txt > tweets1.txt
# real 1m45.001s
# user 1m31.585s
# sys  0m4.816s
grep -A 2 "^T.2009-07-02" tweets2009-07.txt > tweets2.txt
grep -A 2 "^T.2009-07-03" tweets2009-07.txt > tweets3.txt
grep -A 2 "^T.2009-07-04" tweets2009-07.txt > tweets4.txt
grep -A 2 "^T.2009-07-05" tweets2009-07.txt > tweets5.txt

As we can see, the time needed to create each file is about 1.5 minutes, which is quite good.

Step 2: Clean the data

After extracting only the dates we want, we could load the data into Python and start the data cleaning; however, it is far more efficient to do part of the cleaning inside the terminal, using the sed command.
sed allows us to chain multiple commands in a single invocation, separated by semicolons, so we will combine the following sed commands.

This sed command matches the rows with the timestamp of the tweet and keeps only the date:

sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g' tweets1.txt

This sed command matches the rows with the link to the user's profile and keeps only the actual Twitter handle, removing "http://twitter.com/" and the "U" at the beginning of the line:

sed -i '' 's/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g' tweets1.txt

This sed command matches the tweets that include one or more mentions, removes the words that are not mentions and makes it much faster for us to later iterate through each word in a tweet:

sed -i '' 's/W.[^@]*\(@[^ :,.]*\)*/\1 /g' tweets1.txt

This sed command removes the "--" separators that grep -A inserts between matches:

sed -i '' '/--/d' tweets1.txt

Next, we combine the previous commands into one sed call and execute it for each of the 5 files.

- the -i argument indicates that we want to edit the file in place, overwriting it with the results
- the empty quotes after -i give an empty backup suffix, so no backup copy is kept (the argument is required on macOS/BSD sed; on GNU/Linux sed it can be omitted)

time sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets1.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets2.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets3.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets4.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets5.txt

The resulting files have the following format: for tweets that did not include a mention, the content has been cleared; for tweets with mentions, the content before and between each mention has been cleared. Each record takes 3 rows (date, user, tweet).

2009-07-04
dailynascar
2009-07-04
dcompanyau
@JezebellXOXO Please Come and See my Lasted pics http://short.to/h0r7
2009-07-04
donnamurrutia

Step 3: Put the data into tabular format

Next, we are ready to load the data into Python, put it into tabular format and generate the needed CSVs. To do that, we follow the process described below (the process is described in detail inside the .ipynb file; a hedged R sketch of the same logic is given after the sample below):

1. We read the file line by line.
2. We put the data into tabular format, by iterating through every 3 lines and putting the content into the appropriate column (i.e. Date, from, to).
3. We create a function that looks for every word that starts with @ (a mention) and splits multiple mentions into separate rows, keeping the Date and the user who made the mention the same across the multiple mentions of a single tweet.
4. We group the data by "from"-"to" pairs and take the size() of each group to find the frequency (weight) of mentions for each pair.
5. We extract the resulting data frame as a CSV.
6. We run this process for every file and end up with 5 CSV files, one for each day.

The result is 5 CSV files with the following format:

"from","to","weight"
"suddenlyjamie","dmscott",1
"aruanpc","danilogentili",1
"gloriahansen","janedavila",2
"uluvsheena","PreciousSoHot",1
"adreamon","jlovely69",1
"cin7415","sdriven1",1
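For illustration only, here is a minimal R sketch of the same reshaping logic. The actual processing was done in Python inside the notebook; the function name build_edges and the use of base R are my own assumptions, not code from the project.

# Hypothetical R equivalent of the Python notebook step (sketch only).
# Assumes the cleaned file has exactly 3 lines per record: date, user, tweet.
# The date column is dropped because each file already covers a single day.
build_edges <- function(path) {
  lines  <- readLines(path, encoding = "UTF-8")
  users  <- lines[seq(2, length(lines), by = 3)]
  tweets <- lines[seq(3, length(lines), by = 3)]
  # Extract every word starting with "@" and expand to one row per mention
  mentions <- regmatches(tweets, gregexpr("@[^ :,.]+", tweets))
  edges <- data.frame(from = rep(users, lengths(mentions)),
                      to   = sub("^@", "", unlist(mentions)),
                      stringsAsFactors = FALSE)
  # Count how many times each from-to pair appears (the edge weight)
  aggregate(list(weight = rep(1L, nrow(edges))),
            by = edges[c("from", "to")], FUN = sum)
}

# Example usage (hypothetical file names):
# write.csv(build_edges("tweets1.txt"), "tweets1.csv", row.names = FALSE)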
Finally, after having our CSV files ready, we load them into R and create the directed igraph objects.

# Reading the CSV files we have created
tweets1 = read.csv(file="tweets1.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets2 = read.csv(file="tweets2.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets3 = read.csv(file="tweets3.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets4 = read.csv(file="tweets4.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets5 = read.csv(file="tweets5.csv", header=T, sep=",", fileEncoding = "utf-8")

# Checking the structure of the data
str(tweets1)

DUPLICATE NODES

By taking a look at the dataset, we observe that some of the records are duplicates because of case sensitivity (e.g. 'OfficialTila' and 'officialtila' are treated as separate users, while they are the same account). On Twitter, a username does not depend on whether it is written in lowercase, uppercase, sentence case or any other case. To correct the problem, we make all characters lowercase and then merge the duplicate records that this creates. We need to be careful to sum up the weights of the merged records, so that they are represented correctly after merging.

# To correct this problem, first we make all characters lowercase
tweets1[,1:2] <- sapply(tweets1[,1:2], tolower)
tweets2[,1:2] <- sapply(tweets2[,1:2], tolower)
tweets3[,1:2] <- sapply(tweets3[,1:2], tolower)
tweets4[,1:2] <- sapply(tweets4[,1:2], tolower)
tweets5[,1:2] <- sapply(tweets5[,1:2], tolower)

# Then, we use ddply to merge duplicate rows and sum their weights, so that we don't lose any values
library(plyr)
tweets1 <- ddply(tweets1, ~from + to, summarise, weight=sum(weight))
tweets2 <- ddply(tweets2, ~from + to, summarise, weight=sum(weight))
tweets3 <- ddply(tweets3, ~from + to, summarise, weight=sum(weight))
tweets4 <- ddply(tweets4, ~from + to, summarise, weight=sum(weight))
tweets5 <- ddply(tweets5, ~from + to, summarise, weight=sum(weight))

# CREATING THE iGRAPH OBJECTS (READ FROM DATA FRAME)
g1 <- graph_from_data_frame(tweets1, directed = TRUE, vertices = NULL)
g2 <- graph_from_data_frame(tweets2, directed = TRUE, vertices = NULL)
g3 <- graph_from_data_frame(tweets3, directed = TRUE, vertices = NULL)
g4 <- graph_from_data_frame(tweets4, directed = TRUE, vertices = NULL)
g5 <- graph_from_data_frame(tweets5, directed = TRUE, vertices = NULL)

TASK 2: Average Degree over time

Our next task is to create plots that visualize the 5-day evolution of some important metrics of the network, such as the number of vertices, the number of edges, the diameter and the average degrees. To do that, we create a loop that takes each network, computes the needed metrics and writes them into a data frame (a sketch of this loop is given below). To get a better representation, we also plot the 7 metrics using ggplot. (In any directed graph the sum of in-degrees equals the sum of out-degrees, so we plot the average in-degree and the average out-degree on the same plot. Moreover, we also calculate the average weighted in/out-degree.)
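The exact loop is not reproduced in the report, so the following is only a sketch of how the per-day metrics table could be assembled; the column names and the use of sapply here are my own choices, not the project's code.

# Sketch of the metrics computation (assumed structure)
library(igraph)
graphs  <- list(g1, g2, g3, g4, g5)
metrics <- data.frame(
  date      = as.Date("2009-07-01") + 0:4,
  vertices  = sapply(graphs, vcount),
  edges     = sapply(graphs, ecount),
  diameter  = sapply(graphs, function(g) diameter(g)),
  avg_in    = sapply(graphs, function(g) mean(degree(g, mode = "in"))),
  avg_out   = sapply(graphs, function(g) mean(degree(g, mode = "out"))),
  avg_w_in  = sapply(graphs, function(g) mean(strength(g, mode = "in"))),
  avg_w_out = sapply(graphs, function(g) mean(strength(g, mode = "out")))
)
metrics
# The seven metric columns can then be reshaped to long format and plotted with ggplot2.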
There are a few things we can observe in the above graphs:

- The number of users involved in mentioning others or being mentioned is at its peak on Wednesday 01/07/2009 and decreases steadily every day until Sunday 05/07/2009.
- As expected, the number of mentions also decreases, following roughly the same curve.
- The diameter of the network starts at 71 on Wed 01/07, increases steadily to 75 by Friday, drops to its lowest point on Saturday and then reaches its highest point (85) on Sunday. What is fascinating about these numbers is that, even though we only took direct mentions into account, the diameter is still low for a global social network. Of course, this may have something to do with the data being from 2009, when Twitter was mainly well-known and widely used in the USA.
- The average in-degree and the average out-degree move in the opposite direction to the counts above, starting just below 1.19 on Wednesday, increasing to over 1.45 on Friday and decreasing to 1.23 on Sunday. This means that the average user mentions about one other user and is also mentioned about once.
- If we take the weights of the network into account, we can calculate the weighted average degree, which gives results very similar to the unweighted degree plot, only a little higher.

TASK 3: Important nodes

Next, we will identify the important nodes of the network for each day. We will select the nodes that rank highest each day on 3 key metrics:
- In-degree
- Out-degree
- PageRank

In-degree: shows us which users are mentioned the most on each day.

The process we follow is to calculate the Top 10 for each day and then bind them into a single data frame:

intop1 <- head(sort(degree(g1, mode="in"), decreasing=TRUE), 10)
intop2 <- head(sort(degree(g2, mode="in"), decreasing=TRUE), 10)
intop3 <- head(sort(degree(g3, mode="in"), decreasing=TRUE), 10)
intop4 <- head(sort(degree(g4, mode="in"), decreasing=TRUE), 10)
intop5 <- head(sort(degree(g5, mode="in"), decreasing=TRUE), 10)

intop <- data.frame(cbind(names(intop1), names(intop2), names(intop3), names(intop4), names(intop5)))
colnames(intop) <- c("01-07-2009", "02-07-2009", "03-07-2009", "04-07-2009", "05-07-2009")
intop

In-degree Top 10 per day

As we can see, the users who are mentioned the most are pretty much the same every day: accounts that post memes or news, such as tweetmeme, mashable, addthis, cnn, cnnbrk and breakingnews, and celebrities, such as mileycyrus, ddlovato, adamlambert, souljaboytellem and officialtila. There are some exceptions, which may have to do with something significant or newsworthy that happened on that day and caused an account to receive more mentions. Since we have a weighted network, it may make more sense to calculate the Top 10 using the weighted in-degree.
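The weighted rankings follow the same pattern with strength() in place of degree(). As an illustration only (the helper function top10 below is my own shorthand, not code from the report), the weighted in-degree table can be produced like this:

top10 <- function(g, mode = "in", weighted = FALSE) {
  # Rank vertices by (weighted) in- or out-degree and keep the 10 highest
  scores <- if (weighted) strength(g, mode = mode) else degree(g, mode = mode)
  names(head(sort(scores, decreasing = TRUE), 10))
}

graphs <- list(g1, g2, g3, g4, g5)
days   <- c("01-07-2009", "02-07-2009", "03-07-2009", "04-07-2009", "05-07-2009")

# Weighted in-degree Top 10 per day
win_top <- data.frame(sapply(graphs, top10, mode = "in", weighted = TRUE))
colnames(win_top) <- days
win_top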
Weighted in-degree Top 10 per day

If we do that, we see some variation in the positions of individual users inside the Top 10, but the accounts are mostly the same as in the previous results.

Out-degree: shows us which users mention other users the most on each day.

We follow the same process to calculate the Top 10 using the out-degree.

Out-degree Top 10 per day and Weighted out-degree Top 10 per day

In contrast, when we identify the Top 10 users for each day using the out-degree and the weighted out-degree, we see a lot of variation, with only a few users as exceptions. The explanation is that it takes a certain level of popularity to be the receiver of many mentions (in-degree), but any user, on any day, can mention as many users as they want, as long as they do not violate the limits imposed by Twitter.
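Continuing the illustrative top10 helper from above (again an assumption of mine, not the report's code), the out-degree tables only change the mode argument:

# Out-degree and weighted out-degree Top 10 per day (sketch)
out_top  <- data.frame(sapply(graphs, top10, mode = "out"))
wout_top <- data.frame(sapply(graphs, top10, mode = "out", weighted = TRUE))
colnames(out_top) <- colnames(wout_top) <- days
out_top
wout_top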
PageRank: shows us the users who accumulate the most PageRank value each day. It takes into account whether a user was mentioned by many users who were in turn mentioned by other users (e.g. influencers).

# PAGERANK
pgrnk1 <- page_rank(g1, algo="prpack", directed=FALSE)$vector
pgrnk2 <- page_rank(g2, algo="prpack", directed=FALSE)$vector
pgrnk3 <- page_rank(g3, algo="prpack", directed=FALSE)$vector
pgrnk4 <- page_rank(g4, algo="prpack", directed=FALSE)$vector
pgrnk5 <- page_rank(g5, algo="prpack", directed=FALSE)$vector

# Top 10 users per day, ranked in descending order by PageRank value
ranked1 <- head(sort(pgrnk1, decreasing=TRUE), 10)
ranked2 <- head(sort(pgrnk2, decreasing=TRUE), 10)
ranked3 <- head(sort(pgrnk3, decreasing=TRUE), 10)
ranked4 <- head(sort(pgrnk4, decreasing=TRUE), 10)
ranked5 <- head(sort(pgrnk5, decreasing=TRUE), 10)

ranked <- data.frame(cbind(names(ranked1), names(ranked2), names(ranked3), names(ranked4), names(ranked5)))
colnames(ranked) <- c("01-07-2009", "02-07-2009", "03-07-2009", "04-07-2009", "05-07-2009")
ranked

Top 10 users per day, based on PageRank

As far as PageRank is concerned, most of the Top 10 users are the same as the Top 10 users based on in-degree. This makes sense, since these users (e.g. celebrities) receive many mentions and retweets from other users but mention only a few other users themselves. In this way, they concentrate a lot of PageRank value.

TASK 4: Communities

Our final task is to identify communities, by applying fast greedy clustering, infomap clustering and Louvain clustering on the undirected versions of the 5 mention graphs.

# Making the graphs undirected
ug1 <- as.undirected(g1)
ug2 <- as.undirected(g2)
ug3 <- as.undirected(g3)
ug4 <- as.undirected(g4)
ug5 <- as.undirected(g5)

# Finding communities with fast greedy clustering
communities_fast_greedy1 <- cluster_fast_greedy(ug1)

# Finding communities with infomap clustering
communities_infomap1 <- cluster_infomap(ug1)

# Finding communities with louvain clustering
communities_louvain1 <- cluster_louvain(ug1)
communities_louvain2 <- cluster_louvain(ug2)
communities_louvain3 <- cluster_louvain(ug3)
communities_louvain4 <- cluster_louvain(ug4)
communities_louvain5 <- cluster_louvain(ug5)

However, as it turns out, fast greedy clustering takes too long to execute (I got results after about 45-50 minutes), and infomap clustering takes even longer. The only method able to produce results in a matter of seconds is the Louvain community detection algorithm: although it is based on a greedy optimization process, it includes an additional aggregation step that speeds up processing on very large networks.

compare(communities_fast_greedy1, communities_infomap1)
compare(communities_fast_greedy1, communities_louvain1)
compare(communities_infomap1, communities_louvain1)

Comparing different clustering methods

We can compare the resulting community structures using the compare() function. It seems that Louvain is closer to the fast greedy method.

EVOLUTION OF COMMUNITY MEMBERSHIP

Then, using the Louvain method, we will try to detect the evolution of the communities to which the user "KimKardashian" belongs. To do that, we first identify the community in which Kim Kardashian belongs in each graph (i.e. on each day) and then find the intersections of these communities.
# Detecting the evolution of the communities to which user "KimKardashian" belongs
c1 <- communities_louvain1[membership(communities_louvain1)["kimkardashian"]]
c2 <- communities_louvain2[membership(communities_louvain2)["kimkardashian"]]
c3 <- communities_louvain3[membership(communities_louvain3)["kimkardashian"]]
c4 <- communities_louvain4[membership(communities_louvain4)["kimkardashian"]]
c5 <- communities_louvain5[membership(communities_louvain5)["kimkardashian"]]

# Finding the common members between the daily communities
intersect(c1$`54008`, c2$`41188`)
intersect(c1$`54008`, c3$`22013`)
intersect(c1$`54008`, c4$`8036`)
intersect(c1$`54008`, c5$`21162`)
intersect(c2$`41188`, c3$`22013`)
intersect(c2$`41188`, c4$`8036`)
intersect(c2$`41188`, c5$`21162`)
intersect(c3$`22013`, c4$`8036`)
intersect(c3$`22013`, c5$`21162`)

As we can see from the results, the communities that are most similar (in terms of common members) are the communities of Day 3 and Day 5, followed by the community from Day 2. On the other hand, the community from Day 4 is very small and has no members in common with the other communities.
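To quantify these overlaps instead of reading through the printed member lists, one could also tabulate the pairwise intersection sizes. This is only a sketch added for illustration (the objects comms and overlap are not part of the original code); it uses c1[[1]], ..., c5[[1]] to grab the single community stored in each object without hard-coding its id:

# Number of common members between Kim Kardashian's communities on the five days
comms <- list(day1 = c1[[1]], day2 = c2[[1]], day3 = c3[[1]],
              day4 = c4[[1]], day5 = c5[[1]])
overlap <- outer(seq_along(comms), seq_along(comms),
                 Vectorize(function(i, j) length(intersect(comms[[i]], comms[[j]]))))
dimnames(overlap) <- list(names(comms), names(comms))
overlap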
VISUALIZING THE COMMUNITIES

In order to visualize the communities (we use the first day's graph as an example), we first need to:

- set a color for each community (represented as levels of a factor)
- check the sizes of the communities to select our filtering parameters
- filter to keep only some mid-sized communities
- induce a subgraph using this filter, to keep only the nodes that belong to these communities
- plot the subgraph and adjust the parameters to get a good visual result

# Setting colors for the different communities
V(g1)$color <- factor(membership(communities_louvain1))

# Get the sizes of each community of graph 1 (g1)
community_size <- sizes(communities_louvain1)
head(sort(community_size, decreasing=TRUE), 20)
head(sort(community_size, decreasing=FALSE), 20)
mean(community_size)
length(community_size)

# Keep only the mid-sized communities with more than 50 and fewer than 90 members
in_mid_community1 <- unlist(communities_louvain1[community_size > 50 & community_size < 90])

# Induce a subgraph of graph 1 using in_mid_community
sub_g1 <- induced.subgraph(g1, in_mid_community1)

# Plot those mid-sized communities
plot(sub_g1, vertex.label = NA, edge.arrow.width = 0.8, edge.arrow.size = 0.2,
     coords = layout_with_fr(sub_g1), margin = 0, vertex.size = 3)
Visualization of some mid-sized communities for each day (1 to 5). Each community is depicted in a different color.