SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
you need to complete the r code and a single-page document containing two figures, report the
parameters you estimate and discuss how well your power law fits the network data, and explain
the finding.
Question: images
incomplete r code:
# IDS 564 - Spring 2023
# Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network
#=========================================================================
=====================
# 0. INITIATION
==========================================================================
=
#=========================================================================
=====================
## You'll need VGAM for the zeta function
# install.packages("VGAM") ## When prompted to install from binary version, select no
library(VGAM)
## You'll need this when calculating goodness of fit
# install.packages("parallel")
library(parallel)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyr)
##------------------------------------------------------------------------------
## This function will calculate the zeta function for you. You don't need to worry about it! Run it
and continue.
## gen_zeta(gamma , shift) will give you a number
gen_zeta <- function (gamma, shift = 1, deriv = 0)
{
deriv.arg <- deriv
rm(deriv)
if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE))
stop("'deriv' must be a single non-negative integer")
if (deriv.arg < 0 || deriv.arg > 2)
stop("'deriv' must be 0, 1, or 2")
if (deriv.arg > 0)
return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg,
shift = shift), gamma, deriv.arg, shift))
if (any(special <- Re(gamma) <= 1)) {
ans <- gamma
ans[special] <- Inf
special3 <- Re(gamma) < 1
ans[special3] <- NA
special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0)
# ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift)
special2 <- Re(gamma) < 0
if (any(special2)) {
gamma2 <- gamma[special2]
cgamma <- 1 - gamma2
ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi *
gamma2/2) * gamma(cgamma) * Recall(cgamma)
}
if (any(!special)) {
ans[!special] <- Recall(gamma[!special])
}
return(zeta.specials(ans, gamma, deriv.arg, shift))
}
aa <- 12
ans <- 0
for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma
ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift)
ans[shift <= 0] <- NaN
zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift)
}
## example:
gen_zeta(2.1, 4)
##------------------------------------------------------------------------------
## The P_k (the CDF)
P_k = function(gamma, k, k_sat){
### fill the function
return(1 - ( gen_zeta(gamma, k) / ... ))
}
##------------------------------------------------------------------------------
my_theme <- theme_classic() +
theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal",
title = element_text(size = 18), axis.title = element_text(size = 14),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16),
strip.text = element_text(size = 14), strip.background = element_blank(),
strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14),
legend.title = element_text(size = 14), legend.text = element_text(size = 14))
set.seed(123)
#=========================================================================
=====================
# 00. LOADING DATA
========================================================================
#=========================================================================
=====================
## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the
R code remove the path part.
your_path = "your path"
pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv"))
head(pat_citation_deg)
tail(pat_citation_deg)
summary(pat_citation_deg)
## let's have a look at the Log-log degree distribution plot (nothing to fill)
p <- ggplot(pat_citation_deg, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme + theme(title = element_text(size = 12 ))
## fit a line to the Log-log degree distribution (nothing to fill)
p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the
binned data generated by ggplot to fit the line
mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha =
0.5)
#=========================================================================
=====================
# 1. EXERCISE PART 1 - Estimating Gamma
===================================================
#=========================================================================
=====================
## designate the data.frame to be used - and use standardized column names: id, degree (nothing
to fill)
my_df = pat_citation_deg %>% rename(id = patent_id)
##------------------------------------------------------------------------------
## you'll write a for loop over individual unique degrees in the data-set to find the corresponding
distance D
### let's create a data.frame with one column as each observed degree in our network;
### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill)
D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat)
## here you set up the maximum degree to check so that you do not have to do the computation
for all degrees
### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that
for you: (nothing to fill)
max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25)))
### Now discard the rows of D_df you do not need (that are above the max_degree_to_check).
Next line of code does it for you: (nothing to fill)
D_df = D_df[D_df$k_sat < max_degree_to_check,]
### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill)
head(D_df)
tail(D_df)
## Understand and fill parts of the code in this loop
## I recommend setting i = 1 and running each line of this loop on your own and checking what it
gives you. This will help you fill the gaps
##------------------------------------------------------------------------------
## let's work on the loop
for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses.
## let's show the current loop k_sat (so that we can see our progress): (nothing to fill)
print(paste0("at %", round(100 * i/nrow(D_df), 2)))
k_sat_temp = D_df$k_sat[i]
##----------------------------------------------------------------------------
## let's create a temporary copy of the network degree data that contains degrees equal or above
k_sat_temp: (nothing to fill)
temp_df = my_df[my_df$degree>k_sat_temp,]
##----------------------------------------------------------------------------
## step 1: estimate gamma for this loop and call it 'temp_gamma'
### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over
elements
### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing
to fill)
temp_vec_k_i = temp_df$degree/(k_sat_temp)
### now use the above vector in (4.41); remember N is the number of nodes in your network. N =
nrow(my_df) (nothing to fill)
temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41)
##----------------------------------------------------------------------------
## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to
pass the KS test in step 3:
### k will be a variable that KS test will use, so make it an argument of CDF_k;
### put gamma and k_sat of this loop in the body of the function
CDF_k = function(k) {
### FILL THIS FUNCTION ACCORDING TO (4.43)
return(1 - (gen_zeta(temp_gamma, k) / ...))
}
##----------------------------------------------------------------------------
## step 3: run KS test and pass the statistic as D to the corresponding column of D_df
KS_D = ks.test(temp_df$degree, ...) ### fill in with function you created above: just pass the
function name (without parantheses, or brackets, or quotes)
### * you can take a look here if you couldn't figure it out:
https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution-
function-to-use-with-kolmogorov-smirno
D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill)
## let's also store the gamma so that we do not have to compute it again once we have an optimal
k_sat (nothing to fill)
D_df[i,'gamma'] = temp_gamma
}
##------------------------------------------------------------------------------
## step 4: plot D against k_sat find the k_sat that minimizes D, and the corresponding gamma
### let's first take a look at the D_df we have formed (nothing to fill)
head(D_df, 10)
### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill)
optimal_k_sat = D_df[which.min(D_df$D),'k_sat']
### let's take a look at the D_df we have formed (nothing to fill)
ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) +
geom_point(size = 3, alpha = .5, color = "purple") +
geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") +
ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)), label =
paste0("Optimal k_sat = ",optimal_k_sat), color = "red") +
my_theme + labs(x = "k", y = "D")
### find the D corresponding to 'optimal_k_sat' (nothing to fill)
min(D_df$D)
### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill)
(optimal_gamma = D_df[which.min(D_df$D),'gamma'])
## Discard observations with degree below the best k_sat you found earlier. (nothing to fill)
post_data = my_df %>% filter(degree >= optimal_k_sat)
##------------------------------------------------------------------------------
## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points
(nothing to fill)
p_post <- ggplot(post_data, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme
## fit a line to the Log-log degree distribution (nothing to fill)
p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will
take the binned data generated by ggplot to fit the line
mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha =
0.5)
#=========================================================================
=====================
# 2. EXERCISE PART 2 - Goodness-of-fit
====================================================
#=========================================================================
=====================
## We are going to create a vector of synthetic sequences of degrees and repeat the process M
times
## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill)
M = 100
## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to
fill)
D_gof_df = data.frame(iter = 1:M, D_synthetic = NA)
##------------------------------------------------------------------------------
## step 1: store the distance you found in part 1 as D_real (nothing to fill)
D_real = min(D_df$D)
##------------------------------------------------------------------------------
## I. Let's walk through steps 2 and 3 once outside of the loop
##------------------------------------------------------------------------------
##------------------------------------------------------------------------------
## step 2: you will need to define the inverse of the CDF function (so that you generate random
probability values [0,1] and get degrees back)
### let's write the CDF that best fits the data (we did this in part 1):
CDF_k = function(k) {
### FILL THE FUNCTION ACCORDING TO (4.43)
return(1 - (gen_zeta(optimal_gamma, k) / ...))
}
### 2.1. Let's define the inverse of your CDF; (nothing to fill)
### if the next line is hard to understand, check here:
https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r
#### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0
up to a big higher than the highest observed degree in our data)
Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){
function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root
}
inverse_CDF = Inverse(CDF_k)
### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
inverse_CDF(0.4)
## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to
inverse_CDF
rand_p = runif(1)
inverse_CDF(rand_p)
### mini step 2.2. let's generate 5 random numbers betwee 0 and 1 and get 5 degrees back from
our inverse
#### (unfortunately we have to write this complex code because inverse_CDF does not accept a
vector; try inverse_CDF(c(0.1, 0.2)). )
rand_p = runif(5)
unlist(lapply(rand_p, function(p){inverse_CDF(p)}))
### step 2.2.ok, you know how to generate 5 random degrees. Let's create n random degrees,
where n is the number of degrees in our remaining data (nothing to fill)
rand_p = runif(nrow(post_data))
rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores =
parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great!
##------------------------------------------------------------------------------
## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated and
### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2
KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees)
as.numeric(KS_D$statistic)
##------------------------------------------------------------------------------
## II. Now let's write the loop
##------------------------------------------------------------------------------
for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with
3.5 GHz 6-Core and 64 GB memory. Lower to M = 20 if necessary
print(paste0("at %",100 * i/M))
##------------------------------------------------------------------------------
## step 2: generate a synthetic (random) sequence of degrees
rand_p = ... ### FILL AS WE DID ABOVE
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores =
parallel::detectCores() - 1)) ## this will make it a bit faster
##------------------------------------------------------------------------------
## step 3: find the distance between the synthetic sequence and CDF_k and store it
# KS_D = ks.test(rand_deg, CDF_k)
KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE ks.test(first are synthetic degrees, second
are real degrees)
D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic)
}
##------------------------------------------------------------------------------
## Let's plot the results
### let's take a look at the D_df we have formed
ggplot(D_gof_df, aes(x = D_synthetic)) +
geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) +
geom_vline(xintercept = D_real, size = 1, color = "brown") +
my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances")
# IDS 564 - Spring 2023
# Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network
#=========================================================================
=====================
# 0. INITIATION
==========================================================================
=
#=========================================================================
=====================
## You'll need VGAM for the zeta function
# install.packages("VGAM") ## When prompted to install from binary version, select no
library(VGAM)
## You'll need this when calculating goodness of fit
# install.packages("parallel")
library(parallel)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyr)
##------------------------------------------------------------------------------
## This function will calculate the zeta function for you. You don't need to worry about it! Run it
and continue.
## gen_zeta(gamma , shift) will give you a number
gen_zeta <- function (gamma, shift = 1, deriv = 0)
{
deriv.arg <- deriv
rm(deriv)
if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE))
stop("'deriv' must be a single non-negative integer")
if (deriv.arg < 0 || deriv.arg > 2)
stop("'deriv' must be 0, 1, or 2")
if (deriv.arg > 0)
return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg,
shift = shift), gamma, deriv.arg, shift))
if (any(special <- Re(gamma) <= 1)) {
ans <- gamma
ans[special] <- Inf
special3 <- Re(gamma) < 1
ans[special3] <- NA
special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0)
# ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift)
special2 <- Re(gamma) < 0
if (any(special2)) {
gamma2 <- gamma[special2]
cgamma <- 1 - gamma2
ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi *
gamma2/2) * gamma(cgamma) * Recall(cgamma)
}
if (any(!special)) {
ans[!special] <- Recall(gamma[!special])
}
return(zeta.specials(ans, gamma, deriv.arg, shift))
}
aa <- 12
ans <- 0
for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma
ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift)
ans[shift <= 0] <- NaN
zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift)
}
## example:
gen_zeta(2.1, 4)
##------------------------------------------------------------------------------
## The P_k (the CDF)
P_k = function(gamma, k, k_sat){
### fill the function
return(1 - ( gen_zeta(gamma, k) / ... ))
}
##------------------------------------------------------------------------------
my_theme <- theme_classic() +
theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal",
title = element_text(size = 18), axis.title = element_text(size = 14),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16),
strip.text = element_text(size = 14), strip.background = element_blank(),
strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14),
legend.title = element_text(size = 14), legend.text = element_text(size = 14))
set.seed(123)
#=========================================================================
=====================
# 00. LOADING DATA
========================================================================
#=========================================================================
=====================
## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the
R code remove the path part.
your_path = "your path"
pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv"))
head(pat_citation_deg)
tail(pat_citation_deg)
summary(pat_citation_deg)
## let's have a look at the Log-log degree distribution plot (nothing to fill)
p <- ggplot(pat_citation_deg, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme + theme(title = element_text(size = 12 ))
## fit a line to the Log-log degree distribution (nothing to fill)
p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the
binned data generated by ggplot to fit the line
mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha =
0.5)
#=========================================================================
=====================
# 1. EXERCISE PART 1 - Estimating Gamma
===================================================
#=========================================================================
=====================
## designate the data.frame to be used - and use standardized column names: id, degree (nothing
to fill)
my_df = pat_citation_deg %>% rename(id = patent_id)
##------------------------------------------------------------------------------
## you'll write a for loop over individual unique degrees in the data-set to find the corresponding
distance D
### let's create a data.frame with one column as each observed degree in our network;
### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill)
D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat)
## here you set up the maximum degree to check so that you do not have to do the computation
for all degrees
### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that
for you: (nothing to fill)
max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25)))
### Now discard the rows of D_df you do not need (that are above the max_degree_to_check).
Next line of code does it for you: (nothing to fill)
D_df = D_df[D_df$k_sat < max_degree_to_check,]
### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill)
head(D_df)
tail(D_df)
## Understand and fill parts of the code in this loop
## I recommend setting i = 1 and running each line of this loop on your own and checking what it
gives you. This will help you fill the gaps
##------------------------------------------------------------------------------
## let's work on the loop
for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses.
## let's show the current loop k_sat (so that we can see our progress): (nothing to fill)
print(paste0("at %", round(100 * i/nrow(D_df), 2)))
k_sat_temp = D_df$k_sat[i]
##----------------------------------------------------------------------------
## let's create a temporary copy of the network degree data that contains degrees equal or above
k_sat_temp: (nothing to fill)
temp_df = my_df[my_df$degree>k_sat_temp,]
##----------------------------------------------------------------------------
## step 1: estimate gamma for this loop and call it 'temp_gamma'
### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over
elements
### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing
to fill)
temp_vec_k_i = temp_df$degree/(k_sat_temp)
### now use the above vector in (4.41); remember N is the number of nodes in your network. N =
nrow(my_df) (nothing to fill)
temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41)
##----------------------------------------------------------------------------
## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to
pass the KS test in step 3:
### k will be a variable that KS test will use, so make it an argument of CDF_k;
### put gamma and k_sat of this loop in the body of the function
CDF_k = function(k) {
### FILL THIS FUNCTION ACCORDING TO (4.43)
return(1 - (gen_zeta(temp_gamma, k) / ...))
}
##----------------------------------------------------------------------------
## step 3: run KS test and pass the statistic as D to the corresponding column of D_df
KS_D = ks.test(temp_df$degree, ...) ### fill in with function you created above: just pass the
function name (without parantheses, or brackets, or quotes)
### * you can take a look here if you couldn't figure it out:
https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution-
function-to-use-with-kolmogorov-smirno
D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill)
## let's also store the gamma so that we do not have to compute it again once we have an optimal
k_sat (nothing to fill)
D_df[i,'gamma'] = temp_gamma
}
##------------------------------------------------------------------------------
## step 4: plot D against k_sat find the k_sat that minimizes D, and the corresponding gamma
### let's first take a look at the D_df we have formed (nothing to fill)
head(D_df, 10)
### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill)
optimal_k_sat = D_df[which.min(D_df$D),'k_sat']
### let's take a look at the D_df we have formed (nothing to fill)
ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) +
geom_point(size = 3, alpha = .5, color = "purple") +
geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") +
ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)), label =
paste0("Optimal k_sat = ",optimal_k_sat), color = "red") +
my_theme + labs(x = "k", y = "D")
### find the D corresponding to 'optimal_k_sat' (nothing to fill)
min(D_df$D)
### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill)
(optimal_gamma = D_df[which.min(D_df$D),'gamma'])
## Discard observations with degree below the best k_sat you found earlier. (nothing to fill)
post_data = my_df %>% filter(degree >= optimal_k_sat)
##------------------------------------------------------------------------------
## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points
(nothing to fill)
p_post <- ggplot(post_data, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme
## fit a line to the Log-log degree distribution (nothing to fill)
p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will
take the binned data generated by ggplot to fit the line
mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha =
0.5)
#=========================================================================
=====================
# 2. EXERCISE PART 2 - Goodness-of-fit
====================================================
#=========================================================================
=====================
## We are going to create a vector of synthetic sequences of degrees and repeat the process M
times
## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill)
M = 100
## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to
fill)
D_gof_df = data.frame(iter = 1:M, D_synthetic = NA)
##------------------------------------------------------------------------------
## step 1: store the distance you found in part 1 as D_real (nothing to fill)
D_real = min(D_df$D)
##------------------------------------------------------------------------------
## I. Let's walk through steps 2 and 3 once outside of the loop
##------------------------------------------------------------------------------
##------------------------------------------------------------------------------
## step 2: you will need to define the inverse of the CDF function (so that you generate random
probability values [0,1] and get degrees back)
### let's write the CDF that best fits the data (we did this in part 1):
CDF_k = function(k) {
### FILL THE FUNCTION ACCORDING TO (4.43)
return(1 - (gen_zeta(optimal_gamma, k) / ...))
}
### 2.1. Let's define the inverse of your CDF; (nothing to fill)
### if the next line is hard to understand, check here:
https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r
#### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0
up to a big higher than the highest observed degree in our data)
Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){
function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root
}
inverse_CDF = Inverse(CDF_k)
### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
inverse_CDF(0.4)
## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to
inverse_CDF
rand_p = runif(1)
inverse_CDF(rand_p)
### mini step 2.2. let's generate 5 random numbers betwee 0 and 1 and get 5 degrees back from
our inverse
#### (unfortunately we have to write this complex code because inverse_CDF does not accept a
vector; try inverse_CDF(c(0.1, 0.2)). )
rand_p = runif(5)
unlist(lapply(rand_p, function(p){inverse_CDF(p)}))
### step 2.2.ok, you know how to generate 5 random degrees. Let's create n random degrees,
where n is the number of degrees in our remaining data (nothing to fill)
rand_p = runif(nrow(post_data))
rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores =
parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great!
##------------------------------------------------------------------------------
## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated and
### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2
KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees)
as.numeric(KS_D$statistic)
##------------------------------------------------------------------------------
## II. Now let's write the loop
##------------------------------------------------------------------------------
for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with
3.5 GHz 6-Core and 64 GB memory. Lower to M = 20 if necessary
print(paste0("at %",100 * i/M))
##------------------------------------------------------------------------------
## step 2: generate a synthetic (random) sequence of degrees
rand_p = ... ### FILL AS WE DID ABOVE
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores =
parallel::detectCores() - 1)) ## this will make it a bit faster
##------------------------------------------------------------------------------
## step 3: find the distance between the synthetic sequence and CDF_k and store it
# KS_D = ks.test(rand_deg, CDF_k)
KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE ks.test(first are synthetic degrees, second are
real degrees)
D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic)
}
##------------------------------------------------------------------------------
## Let's plot the results
### let's take a look at the D_df we have formed
ggplot(D_gof_df, aes(x = D_synthetic)) +
geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) +
geom_vline(xintercept = D_real, size = 1, color = "brown") +
my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances")
Please read this instruction carefully before starting this exercise. To complete this lab, you have a
short, required reading included in this document that explains the challenges in finding the degree
exponent () of a power law for an observed network. The fitting procedure corrects such problems;
reading it will require you to follow and understand the exercise. You will fit a power law into a
network provided for this lab. The estimation process (starting on page 3 ) contains two parts. The
first part estimates the degree exponent (), and the second part offers the goodness of fit using a
simulation process. An attached R code handles the computational heavy lifting. However, filling in
and comapleting the code requires your understanding of the process. I annotated the code in
detail. The idea is to facilitate learning rather than implement the necessary steps. You still need to
run every line of code. However, the lines you do not need to change show (nothing to fill) in their
annotation. The data you will handle is a random sample of 80,000 US patents taken from a larger
corpus of about 8 million patents. You see a patent id number (first column) and a degree
representing the number of times the patent was cited by other patents (second column). So, the
data is the degree sequence of a network, summarized to help you handle the lab. The R code
provides you with a log-log plot to get a sense of the data. Here are the first ten* Patents without
any citations do not appear in the data. Unlike your previous labs, lab 4 does not require you to
complete a quiz. All you need to submit is your R code, and a single-page document containing
two figures (similar to Figure 4.24 parts b and c ), report two values ( and k sat ) and discuss how
well your power law fits the network data, and an explanation of your finding. Note that the second
part of the exercise will need a bit of time to run. Accommodate it by starting your lab 4 early.
Required Reading: Degree Distribution of Real Networks In real systems, we rarely observe a
degree distribution that follows a pure power law. Instead, for most real systems pk has the shape
shown in Figure 4.23a, with two recurring - Low-degree saturation is a common deviation from the
power-law behavior. Its problems: signature is a flattened Px for k<k sot. This indicates that we
have fewer small degree nodes than expected for a pure power law. The origin of the saturation
will be - High-degree cutoff appears as a rapid drop in pk for k>kcut,, indicating that we have
explained in Chapter 6. fewer high-degree nodes than expected in a pure power law. This limits
the size of the largest hub, making it smaller than predicted by what we derived in the class (kmax
=kminNr11). High-degree cutoffs emerge if there are inherent limitations in the number of links a
node can have. For example, in social networks individuals have difficulty maintaining meaningful
relationships with an exceptionally large number of acquaintances, or like the case of Facebook's
5klimitonfriends.Barabasi Figure 4.23. Rescaling the Degree Distribution a. In real networks the
degree distribution frequently deviates from a pure power law by showing a low degree b. By
plotting the rescaled pk in function of (k+ksat), as suggested by (4.40), the degree distribution
follows a saturation and high degree cutoff. power law for all degrees. Given the widespread
presence of such cutoffs the degree distribution is often fitted to: (4.39) Px=a(k+ksat)exp(kcutk)
where ksat accounts for degree saturation, and the exponential term accounts for high-k cutoff.
How do we deal with these two problems? To extract the full extent of the scaling we correct the
plot by multiplying a term as follows: Px=Pxexp(kcutk) If you look at the resulting function, Px=a(k+
ksat)r follows the power law form we would like to see as a function of k=k+ksot:PRkr, correcting
for the two cutoffs, as seen in Figure 4.23b. This is the idea. We will go through this correction in
the exercise. It is occasionally claimed that the presence of low-degree or high-degree cutoffs
implies that the network is not scale-free. This is a misunderstanding of the scale-free property:
Virtually all properties of scale-free networks are insensitive to the low-degree saturation. Only the
high-degree cutoff affects the system's properties by limiting therdivergence of the second
moment (that determines the variance). The presence of such cutoffs indicates the presence of
additional phenomena that need to be understood - that will not be discussed here. Exercise:
Estimating the Degree Exponent As the properties of scale-free networks depend on the degree
exponent, we need to determine the value of . We face several difficulties, however, when we try
to fit a power law into real data. The most important is the fact that the scaling is rarely valid for the
full range of the degree distribution. Rather we observe small- and high-degree cutoffs described
above, as ksot and kcut, within which we have a clear scaling region. Here we focus on estimating
the small degree cutoff ksat as the high degree cutoff can be determined in a similar fashion. The
reader is advised to consult the discussion on systematic problems provided at the end of this
section - as optional reading. Part 1: Fitting Procedure: As the degree distribution is typically
provided as a list of positive integers kmin,,kmax, we aim to estimate from a discrete set of data
points (only those observed in the data). Here you will see figures for an article citation network to
illustrate the procedure. You will implement the procedure on the patent data explained above.
The article citation network illustrated here consists of N=384,362 nodes, each node representing
a research paper published between 1890 and 2009 in journals pubblished by the American
Physical Society. The network has L=2,353,984 links, each representing a citation from a
published research paper to some other publication in the dataset (outside citations are ignored).
The steps of the fitting process are:1. Choose a value of ksot between kmin and kmax. Estimate
the value of the degree exponent corresponding to this k sot using: * Notice two things. First, in
practice, you do not need to check all degrees between kmin and kmax Indeed, you can check
degrees up to a certain quantile. In the code, you loop over kmin up to the degree at the 25th
percentile - this is already set up for you. 3 Second, in (4.41), the summation starts from i=1. This
applies to the data that has a degree at least equal to ksot for that loop. For instance, if you are
checking ksat=4, all data corresponding to values with a degree below 4 will be discarded and do
not enter the calculation of in (4.41) - this is also set up for you in R. 2. With the obtained (,ksat)
parameter pair, assume that the degree distribution has the form: Pk=k=0(k+ksat)1k First, note
that the kk in the denominator is called a zeta function. When running the exercise, a function I
have provided will calculate zeta for you, so you do not have to worry about it. Just learn the
notation so that you can pass the arguments: zeta(,a)=x=0(x+a) is the main parameter, and a is
the shift. You will pass these two parameters in the is the main parameter, and a is the shift. You
will pass these two parameters in the same order to the function provided in R. Second, notice that
Pk in (4.42) is like a power law we have seen in the class, if you take the C=zeta(,ksat)1. This is,
in fact, a constant (given parameters and ksat ) and is the discrete form of C (in slide 15 of session
9 we derived C for a continuous closed form; the discrete form is easier to work with here since
you have discrete data from your network). You do not need to implement Pk. Next is what you
need to write in R. With the Pk(4.42) as the probability density function, the cumulative distribution
function (CDF) is: Pk=1zeta(r,ksa)zeta(,k) You will write this in R using the zeta function). Your Pk
implementation in R should be something like this that takes k as input and has and kiat inside: C
DF(k)=1zeta(Y,k)/zeta(Y,k_sat) 3. Use the Kormogorov-Smirnov (KS) test to determine the
distance D between your network data, let's call it S(k), and the fitted model provided by (4.43)
with the selected (,ksat) parameter pairs. In R, ks.test(..)$statistic will give you the D. To
implement this in R, you can pass the degree distribution of your network, which we call S(k), as
the first parameter to the function ks.test. The second parameter should be the function Pk, you
just defined. The implementation of your KS test in R for the setup described will look like this:
ks.test (S(k),CDF(k)) * Both CDF(k) and ks.test written above are essentially pseudocode. Change
them as necessary in your R code.4. Repeat steps (13) by scanning the kwot range from kmin to
the degree at 25th percentile. We aim to identify the k sat value for which D provided by the test is
minimal and call it 'optimal ksot'. To illustrate the procedure, we plot D as a function of k
satforthepaper citation network (Figure 4.24b ). The plot indicates that D is minimal for ksat=49,
and the corresponding estimated by (4.41), representing the optimal fit, is =2.79. What you report
for part 1 on your network: - Plotting D against ksat for (similar to Figure 4.24 b). - The values of
ksot that minimizes D, the resulting D, and the corresponding . a. The degree distribution Pk of the
citation network, where the straight purple line represents the best fit based Barabasi Figure 4.24
Maximum Likelihood Estimation. b. The values of Kormogorov-Smimov test vs. kut for the citation
network. c. p ( D mantheric), for M=10,000 synthetic datasets, where the grey line corresponds to
the Dm mal value extracted for the on the model (4.39). citation network. Just because we
obtained a (,k sot ) pair that represents an optimal fit to our dataset, does not Part 2: Goodness of
Fit mean that the power law itself is a good model for the studied distribution. Therefore, we need
to use a goodness-of-fit test, which generates a p-value that quantifies the plausibility of the power
law hypothesis. The most often-used procedure consists of the following steps: 1. Use the
distance (KS statistic for the best ksat) you found in part 1 . Call it D real. For1. Use the distance (
KS statistic for the best ksot ) you found in part 1. Call it D Dreal. For instance, the selected ksot=
49 and the distance Dreal=0.01158 for the citation network. This is the distance Dreal is between
the observed data and our fit based on the parameter pair (,ksot). 2. Use (4.42) to generate a
degree sequence of N degrees (i.e. the same number of random numbers as the number of nodes
in the original dataset). This is synthetic data we have generated as a hypothetical degree
sequence. 3. Now, calculate the distance between the synthetic data and your actual data using
the Kormogorov-Smirnov test. Call the new distance, D Dynthetic, Hence, D pnthetic represents
the distance between a synthetically generated degree sequence, consistent with our degree
distribution, and the real data. The goal is to see if the obtained D'mentetic is comparable to Dreal
. For this, we repeat step (2) and (3)M times (say M=100 ). Each time, we generate a new degree
sequence and determine the corresponding D. nhthect. Eventually, we obtain the p(Dannthetic))
distribution (i.e., the histogram of all 100 Dsynthetic values you generated by repeating step 2. Plot
p(D ( r interic) and show as a vertical bar Dreal (Figure 4.24c). If Dreal is within the p(Dsynthetce)
distribution, it means that the distance between the model providing the best fit and the empirical
data is comparable with the distance expected from random degree samples chosen from the best
fit distribution. Hence the power law is a reasonable model for the data. If, however, Dreal falls
outside the p(Dynthetic) distribution, then the power law is not a good model - some other function
is expected to describe the original Pk better. While the distribution shown in Figure 4.24c may be,
in some cases, useful to illustrate theWhile the distribution shown in Figure 4.24c may be, in some
cases, useful to illustrate the statistical significance of the fit, in general it is better to assign a p-
number to the fit, by: p=DP(Dsynthetic)dDsynthetic The closer p is to 1 , the more likely that the
difference between the empirical data and the model can be attributed to statistical fluctuations
alone. If p is very small, the model is not a plausible fit to the data (Typically, the model is
accepted if p>1%.). You can skip this part as our M is too small to draw any meaningful statistical
inference. Based on the histogram you drew, does the power law fit the data well? What you
report for part 2 : - In which region we discussed in session 9 (slides 3740) does your of voptimal
land? - Plot histogram of Dsynthetic and where Dreal lands (similar to Figure 4.24 c). - Is scale
free is a good choice for the network? Based on paragraph above, why? - What could be the
reason for your finding be? This concludes your lab 4. Congrats! You did some serious network
analysis! For the citation network authors obtain p<104, indicating that a pure power law is not a
suitable model for the original degree distribution. This outcome is somewhat surprising, as the
power-law nature of citation data has been documented repeatedly since 1960s. This failure
indicates the limitation of the blind fitting to a power law, without an analytical understanding of the
underlying distribution. Barabasi discusses how to correct the problem: We note that the fitting
model (4.44) eliminates all the data points with k<ksot Choosing ksat=49 forces us to discard over
96% of the data points. Yet, there is statistically useful information for the data that falls in k<ksat
that is ignored by the previous fit. We must introduce an alternate model that resolves this
problem. I included their solution as part of an optional reading to this exercise. I included pages
since the beginning of this solution due to minor notation differences. You can find the discussion
at the end of (PDF) page 4/7. 6

Más contenido relacionado

Similar a you need to complete the r code and a singlepage document c.pdf

Assignment 5.2.pdf
Assignment 5.2.pdfAssignment 5.2.pdf
Assignment 5.2.pdfdash41
 
Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009mattsmiley
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfannikasarees
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RRajib Layek
 
I really need help with this assignment it is multiple parts# Part.pdf
I really need help with this assignment it is multiple parts# Part.pdfI really need help with this assignment it is multiple parts# Part.pdf
I really need help with this assignment it is multiple parts# Part.pdfillyasraja7
 
ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions Dr. Volkan OBAN
 
Functional Programming with Groovy
Functional Programming with GroovyFunctional Programming with Groovy
Functional Programming with GroovyArturo Herrero
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2Kevin Chun-Hsien Hsu
 
Simple Ways To Be A Better Programmer (OSCON 2007)
Simple Ways To Be A Better Programmer (OSCON 2007)Simple Ways To Be A Better Programmer (OSCON 2007)
Simple Ways To Be A Better Programmer (OSCON 2007)Michael Schwern
 

Similar a you need to complete the r code and a singlepage document c.pdf (20)

Assignment 5.2.pdf
Assignment 5.2.pdfAssignment 5.2.pdf
Assignment 5.2.pdf
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
Basic Query Tuning Primer
Basic Query Tuning PrimerBasic Query Tuning Primer
Basic Query Tuning Primer
 
Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009Basic Query Tuning Primer - Pg West 2009
Basic Query Tuning Primer - Pg West 2009
 
Programming Assignment Help
Programming Assignment HelpProgramming Assignment Help
Programming Assignment Help
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdf
 
Clojure basics
Clojure basicsClojure basics
Clojure basics
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Ruby 1.9
Ruby 1.9Ruby 1.9
Ruby 1.9
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
I really need help with this assignment it is multiple parts# Part.pdf
I really need help with this assignment it is multiple parts# Part.pdfI really need help with this assignment it is multiple parts# Part.pdf
I really need help with this assignment it is multiple parts# Part.pdf
 
ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions ggtimeseries-->ggplot2 extensions
ggtimeseries-->ggplot2 extensions
 
Javascript
JavascriptJavascript
Javascript
 
R basics
R basicsR basics
R basics
 
Functional Programming with Groovy
Functional Programming with GroovyFunctional Programming with Groovy
Functional Programming with Groovy
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2
 
Simple Ways To Be A Better Programmer (OSCON 2007)
Simple Ways To Be A Better Programmer (OSCON 2007)Simple Ways To Be A Better Programmer (OSCON 2007)
Simple Ways To Be A Better Programmer (OSCON 2007)
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 

Más de adnankhan605720

You perform a cross between a doubleheterozygous individual.pdf
You perform a cross between a doubleheterozygous individual.pdfYou perform a cross between a doubleheterozygous individual.pdf
You perform a cross between a doubleheterozygous individual.pdfadnankhan605720
 
You observe a Cepheid variable star with a pulsation period .pdf
You observe a Cepheid variable star with a pulsation period .pdfYou observe a Cepheid variable star with a pulsation period .pdf
You observe a Cepheid variable star with a pulsation period .pdfadnankhan605720
 
You needed to give the reader an explicit recursive sequence.pdf
You needed to give the reader an explicit recursive sequence.pdfYou needed to give the reader an explicit recursive sequence.pdf
You needed to give the reader an explicit recursive sequence.pdfadnankhan605720
 
You need to use the five apk samples that represent six andr.pdf
You need to use the five apk samples that represent six andr.pdfYou need to use the five apk samples that represent six andr.pdf
You need to use the five apk samples that represent six andr.pdfadnankhan605720
 
You need to design an ERP system for a circular economy that.pdf
You need to design an ERP system for a circular economy that.pdfYou need to design an ERP system for a circular economy that.pdf
You need to design an ERP system for a circular economy that.pdfadnankhan605720
 
You mightve heard of Clubhouse by naw Its unilikely youv.pdf
You mightve heard of Clubhouse by naw Its unilikely youv.pdfYou mightve heard of Clubhouse by naw Its unilikely youv.pdf
You mightve heard of Clubhouse by naw Its unilikely youv.pdfadnankhan605720
 
You manage a loading dock One of your employees has had an .pdf
You manage a loading dock One of your employees has had an .pdfYou manage a loading dock One of your employees has had an .pdf
You manage a loading dock One of your employees has had an .pdfadnankhan605720
 
You have landed your dream job working for Steve Evert Unfo.pdf
You have landed your dream job working for Steve Evert Unfo.pdfYou have landed your dream job working for Steve Evert Unfo.pdf
You have landed your dream job working for Steve Evert Unfo.pdfadnankhan605720
 
You have identified two genes Parl and Mst5 which you susp.pdf
You have identified two genes Parl and Mst5 which you susp.pdfYou have identified two genes Parl and Mst5 which you susp.pdf
You have identified two genes Parl and Mst5 which you susp.pdfadnankhan605720
 
You have chosen to use a Likert scale to rate respondents p.pdf
You have chosen to use a Likert scale to rate respondents p.pdfYou have chosen to use a Likert scale to rate respondents p.pdf
You have chosen to use a Likert scale to rate respondents p.pdfadnankhan605720
 
You have discovered a mutant teleost fish with a strange beh.pdf
You have discovered a mutant teleost fish with a strange beh.pdfYou have discovered a mutant teleost fish with a strange beh.pdf
You have discovered a mutant teleost fish with a strange beh.pdfadnankhan605720
 
You have been hired as consultants to research report and p.pdf
You have been hired as consultants to research report and p.pdfYou have been hired as consultants to research report and p.pdf
You have been hired as consultants to research report and p.pdfadnankhan605720
 
You have been given the network 101064020 Subnet this n.pdf
You have been given the network 101064020 Subnet this n.pdfYou have been given the network 101064020 Subnet this n.pdf
You have been given the network 101064020 Subnet this n.pdfadnankhan605720
 

Más de adnankhan605720 (13)

You perform a cross between a doubleheterozygous individual.pdf
You perform a cross between a doubleheterozygous individual.pdfYou perform a cross between a doubleheterozygous individual.pdf
You perform a cross between a doubleheterozygous individual.pdf
 
You observe a Cepheid variable star with a pulsation period .pdf
You observe a Cepheid variable star with a pulsation period .pdfYou observe a Cepheid variable star with a pulsation period .pdf
You observe a Cepheid variable star with a pulsation period .pdf
 
You needed to give the reader an explicit recursive sequence.pdf
You needed to give the reader an explicit recursive sequence.pdfYou needed to give the reader an explicit recursive sequence.pdf
You needed to give the reader an explicit recursive sequence.pdf
 
You need to use the five apk samples that represent six andr.pdf
You need to use the five apk samples that represent six andr.pdfYou need to use the five apk samples that represent six andr.pdf
You need to use the five apk samples that represent six andr.pdf
 
You need to design an ERP system for a circular economy that.pdf
You need to design an ERP system for a circular economy that.pdfYou need to design an ERP system for a circular economy that.pdf
You need to design an ERP system for a circular economy that.pdf
 
You mightve heard of Clubhouse by naw Its unilikely youv.pdf
You mightve heard of Clubhouse by naw Its unilikely youv.pdfYou mightve heard of Clubhouse by naw Its unilikely youv.pdf
You mightve heard of Clubhouse by naw Its unilikely youv.pdf
 
You manage a loading dock One of your employees has had an .pdf
You manage a loading dock One of your employees has had an .pdfYou manage a loading dock One of your employees has had an .pdf
You manage a loading dock One of your employees has had an .pdf
 
You have landed your dream job working for Steve Evert Unfo.pdf
You have landed your dream job working for Steve Evert Unfo.pdfYou have landed your dream job working for Steve Evert Unfo.pdf
You have landed your dream job working for Steve Evert Unfo.pdf
 
You have identified two genes Parl and Mst5 which you susp.pdf
You have identified two genes Parl and Mst5 which you susp.pdfYou have identified two genes Parl and Mst5 which you susp.pdf
You have identified two genes Parl and Mst5 which you susp.pdf
 
You have chosen to use a Likert scale to rate respondents p.pdf
You have chosen to use a Likert scale to rate respondents p.pdfYou have chosen to use a Likert scale to rate respondents p.pdf
You have chosen to use a Likert scale to rate respondents p.pdf
 
You have discovered a mutant teleost fish with a strange beh.pdf
You have discovered a mutant teleost fish with a strange beh.pdfYou have discovered a mutant teleost fish with a strange beh.pdf
You have discovered a mutant teleost fish with a strange beh.pdf
 
You have been hired as consultants to research report and p.pdf
You have been hired as consultants to research report and p.pdfYou have been hired as consultants to research report and p.pdf
You have been hired as consultants to research report and p.pdf
 
You have been given the network 101064020 Subnet this n.pdf
You have been given the network 101064020 Subnet this n.pdfYou have been given the network 101064020 Subnet this n.pdf
You have been given the network 101064020 Subnet this n.pdf
 

Último

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 

Último (20)

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 

you need to complete the r code and a singlepage document c.pdf

  • 1. you need to complete the r code and a single-page document containing two figures, report the parameters you estimate and discuss how well your power law fits the network data, and explain the finding. Question: images incomplete r code: # IDS 564 - Spring 2023 # Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network #========================================================================= ===================== # 0. INITIATION ========================================================================== = #========================================================================= ===================== ## You'll need VGAM for the zeta function # install.packages("VGAM") ## When prompted to install from binary version, select no library(VGAM) ## You'll need this when calculating goodness of fit # install.packages("parallel") library(parallel) library(ggplot2) library(ggthemes) library(dplyr) library(tidyr) ##------------------------------------------------------------------------------ ## This function will calculate the zeta function for you. You don't need to worry about it! Run it and continue. ## gen_zeta(gamma , shift) will give you a number gen_zeta <- function (gamma, shift = 1, deriv = 0) { deriv.arg <- deriv rm(deriv) if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE))
  • 2. stop("'deriv' must be a single non-negative integer") if (deriv.arg < 0 || deriv.arg > 2) stop("'deriv' must be 0, 1, or 2") if (deriv.arg > 0) return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg, shift = shift), gamma, deriv.arg, shift)) if (any(special <- Re(gamma) <= 1)) { ans <- gamma ans[special] <- Inf special3 <- Re(gamma) < 1 ans[special3] <- NA special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0) # ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift) special2 <- Re(gamma) < 0 if (any(special2)) { gamma2 <- gamma[special2] cgamma <- 1 - gamma2 ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi * gamma2/2) * gamma(cgamma) * Recall(cgamma) } if (any(!special)) { ans[!special] <- Recall(gamma[!special]) } return(zeta.specials(ans, gamma, deriv.arg, shift)) } aa <- 12 ans <- 0 for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift) ans[shift <= 0] <- NaN zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift) } ## example: gen_zeta(2.1, 4) ##------------------------------------------------------------------------------ ## The P_k (the CDF) P_k = function(gamma, k, k_sat){ ### fill the function return(1 - ( gen_zeta(gamma, k) / ... )) }
  • 3. ##------------------------------------------------------------------------------ my_theme <- theme_classic() + theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal", title = element_text(size = 18), axis.title = element_text(size = 14), axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16), strip.text = element_text(size = 14), strip.background = element_blank(), strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14), legend.title = element_text(size = 14), legend.text = element_text(size = 14)) set.seed(123) #========================================================================= ===================== # 00. LOADING DATA ======================================================================== #========================================================================= ===================== ## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the R code remove the path part. your_path = "your path" pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv")) head(pat_citation_deg) tail(pat_citation_deg) summary(pat_citation_deg) ## let's have a look at the Log-log degree distribution plot (nothing to fill) p <- ggplot(pat_citation_deg, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + labs(x = "Degree", y = "Frequency") + my_theme + theme(title = element_text(size = 12 )) ## fit a line to the Log-log degree distribution (nothing to fill) p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5)
  • 4. #========================================================================= ===================== # 1. EXERCISE PART 1 - Estimating Gamma =================================================== #========================================================================= ===================== ## designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill) my_df = pat_citation_deg %>% rename(id = patent_id) ##------------------------------------------------------------------------------ ## you'll write a for loop over individual unique degrees in the data-set to find the corresponding distance D ### let's create a data.frame with one column as each observed degree in our network; ### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill) D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat) ## here you set up the maximum degree to check so that you do not have to do the computation for all degrees ### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that for you: (nothing to fill) max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25))) ### Now discard the rows of D_df you do not need (that are above the max_degree_to_check). Next line of code does it for you: (nothing to fill) D_df = D_df[D_df$k_sat < max_degree_to_check,] ### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill) head(D_df) tail(D_df) ## Understand and fill parts of the code in this loop ## I recommend setting i = 1 and running each line of this loop on your own and checking what it gives you. This will help you fill the gaps ##------------------------------------------------------------------------------ ## let's work on the loop for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses. ## let's show the current loop k_sat (so that we can see our progress): (nothing to fill) print(paste0("at %", round(100 * i/nrow(D_df), 2))) k_sat_temp = D_df$k_sat[i]
  • 5. ##---------------------------------------------------------------------------- ## let's create a temporary copy of the network degree data that contains degrees equal or above k_sat_temp: (nothing to fill) temp_df = my_df[my_df$degree>k_sat_temp,] ##---------------------------------------------------------------------------- ## step 1: estimate gamma for this loop and call it 'temp_gamma' ### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over elements ### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing to fill) temp_vec_k_i = temp_df$degree/(k_sat_temp) ### now use the above vector in (4.41); remember N is the number of nodes in your network. N = nrow(my_df) (nothing to fill) temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41) ##---------------------------------------------------------------------------- ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to pass the KS test in step 3: ### k will be a variable that KS test will use, so make it an argument of CDF_k; ### put gamma and k_sat of this loop in the body of the function CDF_k = function(k) { ### FILL THIS FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(temp_gamma, k) / ...)) } ##---------------------------------------------------------------------------- ## step 3: run KS test and pass the statistic as D to the corresponding column of D_df KS_D = ks.test(temp_df$degree, ...) ### fill in with function you created above: just pass the function name (without parantheses, or brackets, or quotes) ### * you can take a look here if you couldn't figure it out: https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution- function-to-use-with-kolmogorov-smirno D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill) ## let's also store the gamma so that we do not have to compute it again once we have an optimal k_sat (nothing to fill) D_df[i,'gamma'] = temp_gamma
  • 6. } ##------------------------------------------------------------------------------ ## step 4: plot D against k_sat find the k_sat that minimizes D, and the corresponding gamma ### let's first take a look at the D_df we have formed (nothing to fill) head(D_df, 10) ### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill) optimal_k_sat = D_df[which.min(D_df$D),'k_sat'] ### let's take a look at the D_df we have formed (nothing to fill) ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) + geom_point(size = 3, alpha = .5, color = "purple") + geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") + ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)), label = paste0("Optimal k_sat = ",optimal_k_sat), color = "red") + my_theme + labs(x = "k", y = "D") ### find the D corresponding to 'optimal_k_sat' (nothing to fill) min(D_df$D) ### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill) (optimal_gamma = D_df[which.min(D_df$D),'gamma']) ## Discard observations with degree below the best k_sat you found earlier. (nothing to fill) post_data = my_df %>% filter(degree >= optimal_k_sat) ##------------------------------------------------------------------------------ ## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points (nothing to fill) p_post <- ggplot(post_data, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + labs(x = "Degree", y = "Frequency") + my_theme ## fit a line to the Log-log degree distribution (nothing to fill) p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will
  • 7. take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5) #========================================================================= ===================== # 2. EXERCISE PART 2 - Goodness-of-fit ==================================================== #========================================================================= ===================== ## We are going to create a vector of synthetic sequences of degrees and repeat the process M times ## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill) M = 100 ## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill) D_gof_df = data.frame(iter = 1:M, D_synthetic = NA) ##------------------------------------------------------------------------------ ## step 1: store the distance you found in part 1 as D_real (nothing to fill) D_real = min(D_df$D) ##------------------------------------------------------------------------------ ## I. Let's walk through steps 2 and 3 once outside of the loop ##------------------------------------------------------------------------------ ##------------------------------------------------------------------------------ ## step 2: you will need to define the inverse of the CDF function (so that you generate random probability values [0,1] and get degrees back) ### let's write the CDF that best fits the data (we did this in part 1): CDF_k = function(k) { ### FILL THE FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(optimal_gamma, k) / ...)) } ### 2.1. Let's define the inverse of your CDF; (nothing to fill) ### if the next line is hard to understand, check here: https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r
  • 8. #### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0 up to a big higher than the highest observed degree in our data) Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){ function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root } inverse_CDF = Inverse(CDF_k) ### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill) inverse_CDF(0.4) ## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to inverse_CDF rand_p = runif(1) inverse_CDF(rand_p) ### mini step 2.2. let's generate 5 random numbers betwee 0 and 1 and get 5 degrees back from our inverse #### (unfortunately we have to write this complex code because inverse_CDF does not accept a vector; try inverse_CDF(c(0.1, 0.2)). ) rand_p = runif(5) unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ### step 2.2.ok, you know how to generate 5 random degrees. Let's create n random degrees, where n is the number of degrees in our remaining data (nothing to fill) rand_p = runif(nrow(post_data)) rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!! rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great! ##------------------------------------------------------------------------------ ## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated and ### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2 KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees) as.numeric(KS_D$statistic) ##------------------------------------------------------------------------------ ## II. Now let's write the loop ##------------------------------------------------------------------------------ for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with 3.5 GHz 6-Core and 64 GB memory. Lower to M = 20 if necessary
  • 9. print(paste0("at %",100 * i/M)) ##------------------------------------------------------------------------------ ## step 2: generate a synthetic (random) sequence of degrees rand_p = ... ### FILL AS WE DID ABOVE rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster ##------------------------------------------------------------------------------ ## step 3: find the distance between the synthetic sequence and CDF_k and store it # KS_D = ks.test(rand_deg, CDF_k) KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE ks.test(first are synthetic degrees, second are real degrees) D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic) } ##------------------------------------------------------------------------------ ## Let's plot the results ### let's take a look at the D_df we have formed ggplot(D_gof_df, aes(x = D_synthetic)) + geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) + geom_vline(xintercept = D_real, size = 1, color = "brown") + my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances") # IDS 564 - Spring 2023 # Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network #========================================================================= ===================== # 0. INITIATION ========================================================================== = #========================================================================= ===================== ## You'll need VGAM for the zeta function # install.packages("VGAM") ## When prompted to install from binary version, select no
  • 10. library(VGAM) ## You'll need this when calculating goodness of fit # install.packages("parallel") library(parallel) library(ggplot2) library(ggthemes) library(dplyr) library(tidyr) ##------------------------------------------------------------------------------ ## This function will calculate the zeta function for you. You don't need to worry about it! Run it and continue. ## gen_zeta(gamma , shift) will give you a number gen_zeta <- function (gamma, shift = 1, deriv = 0) { deriv.arg <- deriv rm(deriv) if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE)) stop("'deriv' must be a single non-negative integer") if (deriv.arg < 0 || deriv.arg > 2) stop("'deriv' must be 0, 1, or 2") if (deriv.arg > 0) return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg,
  • 11. shift = shift), gamma, deriv.arg, shift)) if (any(special <- Re(gamma) <= 1)) { ans <- gamma ans[special] <- Inf special3 <- Re(gamma) < 1 ans[special3] <- NA special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0) # ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift) special2 <- Re(gamma) < 0 if (any(special2)) { gamma2 <- gamma[special2] cgamma <- 1 - gamma2 ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi * gamma2/2) * gamma(cgamma) * Recall(cgamma) } if (any(!special)) { ans[!special] <- Recall(gamma[!special]) } return(zeta.specials(ans, gamma, deriv.arg, shift)) } aa <- 12 ans <- 0
  • 12. for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift) ans[shift <= 0] <- NaN zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift) } ## example: gen_zeta(2.1, 4) ##------------------------------------------------------------------------------ ## The P_k (the CDF) P_k = function(gamma, k, k_sat){ ### fill the function return(1 - ( gen_zeta(gamma, k) / ... )) } ##------------------------------------------------------------------------------ my_theme <- theme_classic() + theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal", title = element_text(size = 18), axis.title = element_text(size = 14), axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16), strip.text = element_text(size = 14), strip.background = element_blank(), strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14), legend.title = element_text(size = 14), legend.text = element_text(size = 14))
  • 13. set.seed(123) #========================================================================= ===================== # 00. LOADING DATA ======================================================================== #========================================================================= ===================== ## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the R code remove the path part. your_path = "your path" pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv")) head(pat_citation_deg) tail(pat_citation_deg) summary(pat_citation_deg) ## let's have a look at the Log-log degree distribution plot (nothing to fill) p <- ggplot(pat_citation_deg, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) +
  • 14. labs(x = "Degree", y = "Frequency") + my_theme + theme(title = element_text(size = 12 )) ## fit a line to the Log-log degree distribution (nothing to fill) p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5) #========================================================================= ===================== # 1. EXERCISE PART 1 - Estimating Gamma =================================================== #========================================================================= ===================== ## designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill) my_df = pat_citation_deg %>% rename(id = patent_id) ##------------------------------------------------------------------------------ ## you'll write a for loop over individual unique degrees in the data-set to find the corresponding distance D ### let's create a data.frame with one column as each observed degree in our network; ### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill) D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat)
  • 15. ## here you set up the maximum degree to check so that you do not have to do the computation for all degrees ### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that for you: (nothing to fill) max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25))) ### Now discard the rows of D_df you do not need (that are above the max_degree_to_check). Next line of code does it for you: (nothing to fill) D_df = D_df[D_df$k_sat < max_degree_to_check,] ### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill) head(D_df) tail(D_df) ## Understand and fill parts of the code in this loop ## I recommend setting i = 1 and running each line of this loop on your own and checking what it gives you. This will help you fill the gaps ##------------------------------------------------------------------------------ ## let's work on the loop for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses. ## let's show the current loop k_sat (so that we can see our progress): (nothing to fill) print(paste0("at %", round(100 * i/nrow(D_df), 2))) k_sat_temp = D_df$k_sat[i] ##---------------------------------------------------------------------------- ## let's create a temporary copy of the network degree data that contains degrees equal or above
  • 16. k_sat_temp: (nothing to fill) temp_df = my_df[my_df$degree>k_sat_temp,] ##---------------------------------------------------------------------------- ## step 1: estimate gamma for this loop and call it 'temp_gamma' ### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over elements ### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing to fill) temp_vec_k_i = temp_df$degree/(k_sat_temp) ### now use the above vector in (4.41); remember N is the number of nodes in your network. N = nrow(my_df) (nothing to fill) temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41) ##---------------------------------------------------------------------------- ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to pass the KS test in step 3: ### k will be a variable that KS test will use, so make it an argument of CDF_k; ### put gamma and k_sat of this loop in the body of the function CDF_k = function(k) { ### FILL THIS FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(temp_gamma, k) / ...)) } ##---------------------------------------------------------------------------- ## step 3: run KS test and pass the statistic as D to the corresponding column of D_df
  • 17. KS_D = ks.test(temp_df$degree, ...) ### fill in with function you created above: just pass the function name (without parantheses, or brackets, or quotes) ### * you can take a look here if you couldn't figure it out: https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution- function-to-use-with-kolmogorov-smirno D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill) ## let's also store the gamma so that we do not have to compute it again once we have an optimal k_sat (nothing to fill) D_df[i,'gamma'] = temp_gamma } ##------------------------------------------------------------------------------ ## step 4: plot D against k_sat find the k_sat that minimizes D, and the corresponding gamma ### let's first take a look at the D_df we have formed (nothing to fill) head(D_df, 10) ### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill) optimal_k_sat = D_df[which.min(D_df$D),'k_sat'] ### let's take a look at the D_df we have formed (nothing to fill) ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) + geom_point(size = 3, alpha = .5, color = "purple") + geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") + ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)), label = paste0("Optimal k_sat = ",optimal_k_sat), color = "red") +
  • 18. my_theme + labs(x = "k", y = "D") ### find the D corresponding to 'optimal_k_sat' (nothing to fill) min(D_df$D) ### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill) (optimal_gamma = D_df[which.min(D_df$D),'gamma']) ## Discard observations with degree below the best k_sat you found earlier. (nothing to fill) post_data = my_df %>% filter(degree >= optimal_k_sat) ##------------------------------------------------------------------------------ ## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points (nothing to fill) p_post <- ggplot(post_data, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + labs(x = "Degree", y = "Frequency") + my_theme
  • 19. ## fit a line to the Log-log degree distribution (nothing to fill) p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5) #========================================================================= ===================== # 2. EXERCISE PART 2 - Goodness-of-fit ==================================================== #========================================================================= ===================== ## We are going to create a vector of synthetic sequences of degrees and repeat the process M times ## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill) M = 100 ## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill) D_gof_df = data.frame(iter = 1:M, D_synthetic = NA) ##------------------------------------------------------------------------------ ## step 1: store the distance you found in part 1 as D_real (nothing to fill) D_real = min(D_df$D) ##------------------------------------------------------------------------------
  • 20. ## I. Let's walk through steps 2 and 3 once outside of the loop ##------------------------------------------------------------------------------ ##------------------------------------------------------------------------------ ## step 2: you will need to define the inverse of the CDF function (so that you generate random probability values [0,1] and get degrees back) ### let's write the CDF that best fits the data (we did this in part 1): CDF_k = function(k) { ### FILL THE FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(optimal_gamma, k) / ...)) } ### 2.1. Let's define the inverse of your CDF; (nothing to fill) ### if the next line is hard to understand, check here: https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r #### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0 up to a big higher than the highest observed degree in our data) Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){ function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root } inverse_CDF = Inverse(CDF_k) ### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
  • 21. inverse_CDF(0.4) ## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to inverse_CDF rand_p = runif(1) inverse_CDF(rand_p) ### mini step 2.2. let's generate 5 random numbers betwee 0 and 1 and get 5 degrees back from our inverse #### (unfortunately we have to write this complex code because inverse_CDF does not accept a vector; try inverse_CDF(c(0.1, 0.2)). ) rand_p = runif(5) unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ### step 2.2.ok, you know how to generate 5 random degrees. Let's create n random degrees, where n is the number of degrees in our remaining data (nothing to fill) rand_p = runif(nrow(post_data)) rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!! rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great! ##------------------------------------------------------------------------------ ## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated and ### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2 KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees) as.numeric(KS_D$statistic)
  • 22. ##------------------------------------------------------------------------------ ## II. Now let's write the loop ##------------------------------------------------------------------------------ for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with 3.5 GHz 6-Core and 64 GB memory. Lower to M = 20 if necessary print(paste0("at %",100 * i/M)) ##------------------------------------------------------------------------------ ## step 2: generate a synthetic (random) sequence of degrees rand_p = ... ### FILL AS WE DID ABOVE rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster ##------------------------------------------------------------------------------ ## step 3: find the distance between the synthetic sequence and CDF_k and store it # KS_D = ks.test(rand_deg, CDF_k) KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE ks.test(first are synthetic degrees, second are real degrees) D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic) } ##------------------------------------------------------------------------------ ## Let's plot the results ### let's take a look at the D_df we have formed
  • 23. ggplot(D_gof_df, aes(x = D_synthetic)) + geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) + geom_vline(xintercept = D_real, size = 1, color = "brown") + my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances") Please read this instruction carefully before starting this exercise. To complete this lab, you have a short, required reading included in this document that explains the challenges in finding the degree exponent () of a power law for an observed network. The fitting procedure corrects such problems; reading it will require you to follow and understand the exercise. You will fit a power law into a network provided for this lab. The estimation process (starting on page 3 ) contains two parts. The first part estimates the degree exponent (), and the second part offers the goodness of fit using a simulation process. An attached R code handles the computational heavy lifting. However, filling in and comapleting the code requires your understanding of the process. I annotated the code in detail. The idea is to facilitate learning rather than implement the necessary steps. You still need to run every line of code. However, the lines you do not need to change show (nothing to fill) in their annotation. The data you will handle is a random sample of 80,000 US patents taken from a larger corpus of about 8 million patents. You see a patent id number (first column) and a degree representing the number of times the patent was cited by other patents (second column). So, the data is the degree sequence of a network, summarized to help you handle the lab. The R code provides you with a log-log plot to get a sense of the data. Here are the first ten* Patents without any citations do not appear in the data. Unlike your previous labs, lab 4 does not require you to complete a quiz. All you need to submit is your R code, and a single-page document containing two figures (similar to Figure 4.24 parts b and c ), report two values ( and k sat ) and discuss how well your power law fits the network data, and an explanation of your finding. Note that the second part of the exercise will need a bit of time to run. Accommodate it by starting your lab 4 early. Required Reading: Degree Distribution of Real Networks In real systems, we rarely observe a degree distribution that follows a pure power law. Instead, for most real systems pk has the shape shown in Figure 4.23a, with two recurring - Low-degree saturation is a common deviation from the power-law behavior. Its problems: signature is a flattened Px for k<k sot. This indicates that we have fewer small degree nodes than expected for a pure power law. The origin of the saturation will be - High-degree cutoff appears as a rapid drop in pk for k>kcut,, indicating that we have explained in Chapter 6. fewer high-degree nodes than expected in a pure power law. This limits the size of the largest hub, making it smaller than predicted by what we derived in the class (kmax =kminNr11). High-degree cutoffs emerge if there are inherent limitations in the number of links a node can have. For example, in social networks individuals have difficulty maintaining meaningful relationships with an exceptionally large number of acquaintances, or like the case of Facebook's 5klimitonfriends.Barabasi Figure 4.23. Rescaling the Degree Distribution a. In real networks the degree distribution frequently deviates from a pure power law by showing a low degree b. By
  • 24. plotting the rescaled pk in function of (k+ksat), as suggested by (4.40), the degree distribution follows a saturation and high degree cutoff. power law for all degrees. Given the widespread presence of such cutoffs the degree distribution is often fitted to: (4.39) Px=a(k+ksat)exp(kcutk) where ksat accounts for degree saturation, and the exponential term accounts for high-k cutoff. How do we deal with these two problems? To extract the full extent of the scaling we correct the plot by multiplying a term as follows: Px=Pxexp(kcutk) If you look at the resulting function, Px=a(k+ ksat)r follows the power law form we would like to see as a function of k=k+ksot:PRkr, correcting for the two cutoffs, as seen in Figure 4.23b. This is the idea. We will go through this correction in the exercise. It is occasionally claimed that the presence of low-degree or high-degree cutoffs implies that the network is not scale-free. This is a misunderstanding of the scale-free property: Virtually all properties of scale-free networks are insensitive to the low-degree saturation. Only the high-degree cutoff affects the system's properties by limiting therdivergence of the second moment (that determines the variance). The presence of such cutoffs indicates the presence of additional phenomena that need to be understood - that will not be discussed here. Exercise: Estimating the Degree Exponent As the properties of scale-free networks depend on the degree exponent, we need to determine the value of . We face several difficulties, however, when we try to fit a power law into real data. The most important is the fact that the scaling is rarely valid for the full range of the degree distribution. Rather we observe small- and high-degree cutoffs described above, as ksot and kcut, within which we have a clear scaling region. Here we focus on estimating the small degree cutoff ksat as the high degree cutoff can be determined in a similar fashion. The reader is advised to consult the discussion on systematic problems provided at the end of this section - as optional reading. Part 1: Fitting Procedure: As the degree distribution is typically provided as a list of positive integers kmin,,kmax, we aim to estimate from a discrete set of data points (only those observed in the data). Here you will see figures for an article citation network to illustrate the procedure. You will implement the procedure on the patent data explained above. The article citation network illustrated here consists of N=384,362 nodes, each node representing a research paper published between 1890 and 2009 in journals pubblished by the American Physical Society. The network has L=2,353,984 links, each representing a citation from a published research paper to some other publication in the dataset (outside citations are ignored). The steps of the fitting process are:1. Choose a value of ksot between kmin and kmax. Estimate the value of the degree exponent corresponding to this k sot using: * Notice two things. First, in practice, you do not need to check all degrees between kmin and kmax Indeed, you can check degrees up to a certain quantile. In the code, you loop over kmin up to the degree at the 25th percentile - this is already set up for you. 3 Second, in (4.41), the summation starts from i=1. This applies to the data that has a degree at least equal to ksot for that loop. For instance, if you are checking ksat=4, all data corresponding to values with a degree below 4 will be discarded and do not enter the calculation of in (4.41) - this is also set up for you in R. 2. With the obtained (,ksat) parameter pair, assume that the degree distribution has the form: Pk=k=0(k+ksat)1k First, note that the kk in the denominator is called a zeta function. When running the exercise, a function I have provided will calculate zeta for you, so you do not have to worry about it. Just learn the notation so that you can pass the arguments: zeta(,a)=x=0(x+a) is the main parameter, and a is the shift. You will pass these two parameters in the is the main parameter, and a is the shift. You
  • 25. will pass these two parameters in the same order to the function provided in R. Second, notice that Pk in (4.42) is like a power law we have seen in the class, if you take the C=zeta(,ksat)1. This is, in fact, a constant (given parameters and ksat ) and is the discrete form of C (in slide 15 of session 9 we derived C for a continuous closed form; the discrete form is easier to work with here since you have discrete data from your network). You do not need to implement Pk. Next is what you need to write in R. With the Pk(4.42) as the probability density function, the cumulative distribution function (CDF) is: Pk=1zeta(r,ksa)zeta(,k) You will write this in R using the zeta function). Your Pk implementation in R should be something like this that takes k as input and has and kiat inside: C DF(k)=1zeta(Y,k)/zeta(Y,k_sat) 3. Use the Kormogorov-Smirnov (KS) test to determine the distance D between your network data, let's call it S(k), and the fitted model provided by (4.43) with the selected (,ksat) parameter pairs. In R, ks.test(..)$statistic will give you the D. To implement this in R, you can pass the degree distribution of your network, which we call S(k), as the first parameter to the function ks.test. The second parameter should be the function Pk, you just defined. The implementation of your KS test in R for the setup described will look like this: ks.test (S(k),CDF(k)) * Both CDF(k) and ks.test written above are essentially pseudocode. Change them as necessary in your R code.4. Repeat steps (13) by scanning the kwot range from kmin to the degree at 25th percentile. We aim to identify the k sat value for which D provided by the test is minimal and call it 'optimal ksot'. To illustrate the procedure, we plot D as a function of k satforthepaper citation network (Figure 4.24b ). The plot indicates that D is minimal for ksat=49, and the corresponding estimated by (4.41), representing the optimal fit, is =2.79. What you report for part 1 on your network: - Plotting D against ksat for (similar to Figure 4.24 b). - The values of ksot that minimizes D, the resulting D, and the corresponding . a. The degree distribution Pk of the citation network, where the straight purple line represents the best fit based Barabasi Figure 4.24 Maximum Likelihood Estimation. b. The values of Kormogorov-Smimov test vs. kut for the citation network. c. p ( D mantheric), for M=10,000 synthetic datasets, where the grey line corresponds to the Dm mal value extracted for the on the model (4.39). citation network. Just because we obtained a (,k sot ) pair that represents an optimal fit to our dataset, does not Part 2: Goodness of Fit mean that the power law itself is a good model for the studied distribution. Therefore, we need to use a goodness-of-fit test, which generates a p-value that quantifies the plausibility of the power law hypothesis. The most often-used procedure consists of the following steps: 1. Use the distance (KS statistic for the best ksat) you found in part 1 . Call it D real. For1. Use the distance ( KS statistic for the best ksot ) you found in part 1. Call it D Dreal. For instance, the selected ksot= 49 and the distance Dreal=0.01158 for the citation network. This is the distance Dreal is between the observed data and our fit based on the parameter pair (,ksot). 2. Use (4.42) to generate a degree sequence of N degrees (i.e. the same number of random numbers as the number of nodes in the original dataset). This is synthetic data we have generated as a hypothetical degree sequence. 3. Now, calculate the distance between the synthetic data and your actual data using the Kormogorov-Smirnov test. Call the new distance, D Dynthetic, Hence, D pnthetic represents the distance between a synthetically generated degree sequence, consistent with our degree distribution, and the real data. The goal is to see if the obtained D'mentetic is comparable to Dreal . For this, we repeat step (2) and (3)M times (say M=100 ). Each time, we generate a new degree sequence and determine the corresponding D. nhthect. Eventually, we obtain the p(Dannthetic))
  • 26. distribution (i.e., the histogram of all 100 Dsynthetic values you generated by repeating step 2. Plot p(D ( r interic) and show as a vertical bar Dreal (Figure 4.24c). If Dreal is within the p(Dsynthetce) distribution, it means that the distance between the model providing the best fit and the empirical data is comparable with the distance expected from random degree samples chosen from the best fit distribution. Hence the power law is a reasonable model for the data. If, however, Dreal falls outside the p(Dynthetic) distribution, then the power law is not a good model - some other function is expected to describe the original Pk better. While the distribution shown in Figure 4.24c may be, in some cases, useful to illustrate theWhile the distribution shown in Figure 4.24c may be, in some cases, useful to illustrate the statistical significance of the fit, in general it is better to assign a p- number to the fit, by: p=DP(Dsynthetic)dDsynthetic The closer p is to 1 , the more likely that the difference between the empirical data and the model can be attributed to statistical fluctuations alone. If p is very small, the model is not a plausible fit to the data (Typically, the model is accepted if p>1%.). You can skip this part as our M is too small to draw any meaningful statistical inference. Based on the histogram you drew, does the power law fit the data well? What you report for part 2 : - In which region we discussed in session 9 (slides 3740) does your of voptimal land? - Plot histogram of Dsynthetic and where Dreal lands (similar to Figure 4.24 c). - Is scale free is a good choice for the network? Based on paragraph above, why? - What could be the reason for your finding be? This concludes your lab 4. Congrats! You did some serious network analysis! For the citation network authors obtain p<104, indicating that a pure power law is not a suitable model for the original degree distribution. This outcome is somewhat surprising, as the power-law nature of citation data has been documented repeatedly since 1960s. This failure indicates the limitation of the blind fitting to a power law, without an analytical understanding of the underlying distribution. Barabasi discusses how to correct the problem: We note that the fitting model (4.44) eliminates all the data points with k<ksot Choosing ksat=49 forces us to discard over 96% of the data points. Yet, there is statistically useful information for the data that falls in k<ksat that is ignored by the previous fit. We must introduce an alternate model that resolves this problem. I included their solution as part of an optional reading to this exercise. I included pages since the beginning of this solution due to minor notation differences. You can find the discussion at the end of (PDF) page 4/7. 6