You need to complete the R code and produce a single-page document containing two figures; report the parameters you estimate, discuss how well your power law fits the network data, and explain the finding.
Incomplete R code:
# IDS 564 - Spring 2023
# Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network
#==============================================================================
# 0. INITIATION ===============================================================
#==============================================================================
## You'll need VGAM for the zeta function
# install.packages("VGAM") ## When prompted to install from binary version, select no
library(VGAM)
## You'll need this when calculating goodness of fit
# install.packages("parallel")
library(parallel)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyr)
##------------------------------------------------------------------------------
## This function will calculate the zeta function for you. You don't need to worry about it! Run it and continue.
## gen_zeta(gamma, shift) will give you a number
gen_zeta <- function(gamma, shift = 1, deriv = 0)
{
  deriv.arg <- deriv
  rm(deriv)
  if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE))
    stop("'deriv' must be a single non-negative integer")
  if (deriv.arg < 0 || deriv.arg > 2)
    stop("'deriv' must be 0, 1, or 2")
  if (deriv.arg > 0)
    return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg,
                                         shift = shift), gamma, deriv.arg, shift))
  if (any(special <- Re(gamma) <= 1)) {
    ans <- gamma
    ans[special] <- Inf
    special3 <- Re(gamma) < 1
    ans[special3] <- NA
    special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0)
    # ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift)
    special2 <- Re(gamma) < 0
    if (any(special2)) {
      gamma2 <- gamma[special2]
      cgamma <- 1 - gamma2
      ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi * gamma2/2) *
        gamma(cgamma) * Recall(cgamma)
    }
    if (any(!special)) {
      ans[!special] <- Recall(gamma[!special])
    }
    return(zeta.specials(ans, gamma, deriv.arg, shift))
  }
  aa <- 12
  ans <- 0
  for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma
  ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift)
  ans[shift <= 0] <- NaN
  zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift)
}
## example:
gen_zeta(2.1, 4)
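## Quick numeric check (optional aside, not part of the lab; it assumes gen_zeta
## implements the Hurwitz zeta described in the reading, zeta(gamma, a) = sum over
## x >= 0 of (x + a)^(-gamma)): a long partial sum should come close.
sum((0:1e6 + 4)^(-2.1)) ## should be close to gen_zeta(2.1, 4), up to truncation error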
##------------------------------------------------------------------------------
## The P_k (the CDF)
P_k = function(gamma, k, k_sat){
  ## filled per (4.43) in the reading: P(k) = 1 - zeta(gamma, k) / zeta(gamma, k_sat)
  return(1 - ( gen_zeta(gamma, k) / gen_zeta(gamma, k_sat) ))
}
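## Quick sanity check of the completed CDF (optional aside, not required by the lab):
## it should be 0 at k = k_sat and approach 1 as k grows.
P_k(2.5, 4, 4)   ## exactly 0
P_k(2.5, 200, 4) ## close to 1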
##------------------------------------------------------------------------------
my_theme <- theme_classic() +
theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal",
title = element_text(size = 18), axis.title = element_text(size = 14),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16),
strip.text = element_text(size = 14), strip.background = element_blank(),
strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14),
legend.title = element_text(size = 14), legend.text = element_text(size = 14))
set.seed(123)
#==============================================================================
# 00. LOADING DATA ============================================================
#==============================================================================
## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the R code, remove the path part.
your_path = "your path"
pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv"))
head(pat_citation_deg)
tail(pat_citation_deg)
summary(pat_citation_deg)
## let's have a look at the Log-log degree distribution plot (nothing to fill)
p <- ggplot(pat_citation_deg, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme + theme(title = element_text(size = 12 ))
## fit a line to the Log-log degree distribution (nothing to fill)
p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line
                mapping = aes(x = exp(x), y = exp(y)), method = "lm", se = FALSE, color = "red", size = 0.75, alpha = 0.5)
#==============================================================================
# 1. EXERCISE PART 1 - Estimating Gamma =======================================
#==============================================================================
## designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill)
my_df = pat_citation_deg %>% rename(id = patent_id)
##------------------------------------------------------------------------------
## you'll write a for loop over individual unique degrees in the data-set to find the corresponding distance D
### let's create a data.frame with one column as each observed degree in our network;
### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill)
D_df = data.frame(k_sat = unique(my_df$degree), D = NA, gamma = NA) %>% arrange(k_sat)
## here you set up the maximum degree to check so that you do not have to do the computation for all degrees
### Let's set it up as the 25th percentile of the unique observed degrees. Next line of code does that for you: (nothing to fill)
max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25)))
### Now discard the rows of D_df you do not need (those above the max_degree_to_check). Next line of code does it for you: (nothing to fill)
D_df = D_df[D_df$k_sat < max_degree_to_check,]
### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill)
head(D_df)
tail(D_df)
## Understand and fill parts of the code in this loop
## I recommend setting i = 1 and running each line of this loop on your own and checking what it gives you. This will help you fill the gaps
##------------------------------------------------------------------------------
## let's work on the loop
for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses.
  ## let's show the current loop k_sat (so that we can see our progress): (nothing to fill)
  print(paste0("at %", round(100 * i/nrow(D_df), 2)))
  k_sat_temp = D_df$k_sat[i]
  ##----------------------------------------------------------------------------
  ## let's create a temporary copy of the network degree data that contains degrees equal or above k_sat_temp: (nothing to fill)
  temp_df = my_df[my_df$degree >= k_sat_temp,] ## ">=" so degrees equal to k_sat_temp are kept, matching the comment above and the reading
  ##----------------------------------------------------------------------------
  ## step 1: estimate gamma for this loop and call it 'temp_gamma'
  ### create a vector of k_i/(k_sat) so that you can feed it to the natural logarithm and sum over elements
  ### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing to fill)
  temp_vec_k_i = temp_df$degree/(k_sat_temp)
  ### now use the above vector in (4.41); remember N is the number of nodes in your network. N = nrow(my_df) (nothing to fill)
  temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41)
  ##----------------------------------------------------------------------------
  ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to pass to the KS test in step 3:
  ### k will be a variable that the KS test will use, so make it an argument of CDF_k;
  ### put the gamma and k_sat of this loop in the body of the function
  CDF_k = function(k) {
    ## filled according to (4.43): 1 - zeta(gamma, k) / zeta(gamma, k_sat)
    return(1 - (gen_zeta(temp_gamma, k) / gen_zeta(temp_gamma, k_sat_temp)))
  }
  ##----------------------------------------------------------------------------
  ## step 3: run the KS test and pass the statistic as D to the corresponding column of D_df
  ### filled with the function created above: just the function name (without parentheses, brackets, or quotes)
  ### * you can take a look here if you couldn't figure it out:
  ### https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution-function-to-use-with-kolmogorov-smirno
  KS_D = ks.test(temp_df$degree, CDF_k)
  D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill)
  ## let's also store the gamma so that we do not have to compute it again once we have an optimal k_sat (nothing to fill)
  D_df[i,'gamma'] = temp_gamma
}
##------------------------------------------------------------------------------
## step 4: plot D against k_sat, find the k_sat that minimizes D, and the corresponding gamma
### let's first take a look at the D_df we have formed (nothing to fill)
head(D_df, 10)
### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill)
optimal_k_sat = D_df[which.min(D_df$D),'k_sat']
### let's take a look at the D_df we have formed (nothing to fill)
ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) +
geom_point(size = 3, alpha = .5, color = "purple") +
geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") +
ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D, .85)), label = paste0("Optimal k_sat = ", optimal_k_sat), color = "red") +
my_theme + labs(x = "k", y = "D")
### find the D corresponding to 'optimal_k_sat' (nothing to fill)
min(D_df$D)
### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill)
(optimal_gamma = D_df[which.min(D_df$D),'gamma'])
## Discard observations with degree below the best k_sat you found earlier. (nothing to fill)
post_data = my_df %>% filter(degree >= optimal_k_sat)
##------------------------------------------------------------------------------
## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points (nothing to fill)
p_post <- ggplot(post_data, aes(x = degree)) +
geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+
scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)),
labels = scales::trans_format("log", scales::math_format(e^.x))) +
labs(x = "Degree", y = "Frequency") +
my_theme
## fit a line to the Log-log degree distribution (nothing to fill)
p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line
                     mapping = aes(x = exp(x), y = exp(y)), method = "lm", se = FALSE, color = "red", size = 0.75, alpha = 0.5)
#==============================================================================
# 2. EXERCISE PART 2 - Goodness-of-fit ========================================
#==============================================================================
## We are going to create a vector of synthetic sequences of degrees and repeat the process M times
## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill)
M = 100
## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill)
D_gof_df = data.frame(iter = 1:M, D_synthetic = NA)
##------------------------------------------------------------------------------
## step 1: store the distance you found in part 1 as D_real (nothing to fill)
D_real = min(D_df$D)
##------------------------------------------------------------------------------
## I. Let's walk through steps 2 and 3 once outside of the loop
##------------------------------------------------------------------------------
##------------------------------------------------------------------------------
## step 2: you will need to define the inverse of the CDF function (so that you can generate random probability values in [0,1] and get degrees back)
### let's write the CDF that best fits the data (we did this in part 1):
CDF_k = function(k) {
  ## filled according to (4.43), using the optimal parameters from part 1
  return(1 - (gen_zeta(optimal_gamma, k) / gen_zeta(optimal_gamma, optimal_k_sat)))
}
### 2.1. Let's define the inverse of your CDF; (nothing to fill)
### if the next line is hard to understand, check here: https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r
#### mini step 2.1.1. this piece of code will create an inverse for you (uniroot searches from just below the smallest observed degree up to the highest observed degree in our data, extending the interval if needed)
Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){
function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root
}
inverse_CDF = Inverse(CDF_k)
### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
inverse_CDF(0.4)
## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to inverse_CDF
rand_p = runif(1)
inverse_CDF(rand_p)
### mini step 2.2. let's generate 5 random numbers between 0 and 1 and get 5 degrees back from our inverse
#### (unfortunately we have to write this complex code because inverse_CDF does not accept a vector; try inverse_CDF(c(0.1, 0.2)). )
rand_p = runif(5)
unlist(lapply(rand_p, function(p){inverse_CDF(p)}))
### step 2.2. ok, you know how to generate 5 random degrees. Let's create n random degrees, where n is the number of degrees in our remaining data (nothing to fill)
rand_p = runif(nrow(post_data))
rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great!
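## Optional convenience (an aside, not part of the lab): base R's Vectorize() wraps
## inverse_CDF so it accepts a vector directly. Note it just loops via mapply
## internally, so it is no faster than lapply - only tidier to call.
inverse_CDF_vec = Vectorize(inverse_CDF)
inverse_CDF_vec(c(0.1, 0.2, 0.4))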
##------------------------------------------------------------------------------
## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated
### filled according to the discussion in step 3 of part 2
KS_D = ks.test(rand_deg, post_data$degree) ## ks.test(first are synthetic degrees, second are real degrees)
as.numeric(KS_D$statistic)
##------------------------------------------------------------------------------
## II. Now let's write the loop
##------------------------------------------------------------------------------
for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with a 3.5 GHz 6-core CPU and 64 GB memory. Lower to M = 20 if necessary
print(paste0("at %", 100 * i/M))
##------------------------------------------------------------------------------
## step 2: generate a synthetic (random) sequence of degrees
rand_p = runif(nrow(post_data)) ### filled as we did above: one uniform draw per node in the remaining data
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster
##------------------------------------------------------------------------------
## step 3: find the distance between the synthetic sequence and CDF_k and store it
# KS_D = ks.test(rand_deg, CDF_k)
KS_D = ks.test(rand_deg, post_data$degree) ### filled as we did above: ks.test(first are synthetic degrees, second are real degrees)
D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic)
}
##------------------------------------------------------------------------------
## Let's plot the results
### let's take a look at the D_gof_df we have formed
ggplot(D_gof_df, aes(x = D_synthetic)) +
geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) +
geom_vline(xintercept = D_real, size = 1, color = "brown") +
my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances")
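## Optional extension (an aside, not required by the lab): the reading defines a
## p-value as the mass of p(D_synthetic) above D_real. With a finite sample of M
## synthetic distances, a rough empirical estimate is the fraction of D_synthetic
## values at least as large as D_real. With M = 100 this is only indicative.
p_value = mean(D_gof_df$D_synthetic >= D_real, na.rm = TRUE)
p_value ## the power law is conventionally accepted if p > 0.01 (per the reading)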
Please read these instructions carefully before starting this exercise. To complete this lab, there is a short, required reading included in this document that explains the challenges in finding the degree exponent (γ) of a power law for an observed network and how the fitting procedure corrects such problems; you will need the reading to follow and understand the exercise. You will fit a power law to a network provided for this lab. The estimation process (starting on page 3) contains two parts: the first part estimates the degree exponent (γ), and the second part assesses the goodness of fit using a simulation process. An attached R code handles the computational heavy lifting. However, filling in and completing the code requires your understanding of the process. I annotated the code in detail; the idea is to facilitate learning rather than merely implement the necessary steps. You still need to run every line of code, but the lines you do not need to change show (nothing to fill) in their annotation.
The data you will handle is a random sample of 80,000 US patents taken from a larger corpus of about 8 million patents. You see a patent id number (first column) and a degree representing the number of times the patent was cited by other patents (second column). So the data is the degree sequence of a network, summarized to help you handle the lab. The R code provides you with a log-log plot to get a sense of the data. Here are the first ten rows. [table omitted] * Patents without any citations do not appear in the data.
Unlike your previous labs, lab 4 does not require you to complete a quiz. All you need to submit is your R code and a single-page document containing two figures (similar to Figure 4.24 parts b and c), reporting two values (γ and k_sat), a discussion of how well your power law fits the network data, and an explanation of your finding. Note that the second part of the exercise will need a bit of time to run; accommodate it by starting your lab 4 early.
Required Reading: Degree Distribution of Real Networks
In real systems we rarely observe a degree distribution that follows a pure power law. Instead, for most real systems p_k has the shape shown in Figure 4.23a, with two recurring deviations:
- Low-degree saturation is a common deviation from the power-law behavior. Its signature is a flattened p_k for k < k_sat. This indicates that we have fewer small-degree nodes than expected for a pure power law. The origin of the saturation will be explained in Chapter 6.
- High-degree cutoff appears as a rapid drop in p_k for k > k_cut, indicating that we have fewer high-degree nodes than expected in a pure power law. This limits the size of the largest hub, making it smaller than predicted by what we derived in class (k_max = k_min N^(1/(γ-1))). High-degree cutoffs emerge if there are inherent limitations in the number of links a node can have; for example, in social networks individuals have difficulty maintaining meaningful relationships with an exceptionally large number of acquaintances, or, as in the case of Facebook, there is a 5,000-friend limit.
[Barabási, Figure 4.23, Rescaling the Degree Distribution: a. In real networks the degree distribution frequently deviates from a pure power law by showing a low-degree saturation and a high-degree cutoff. b. By plotting the rescaled p_k as a function of (k + k_sat), as suggested by (4.40), the degree distribution follows a power law for all degrees.]
Given the widespread presence of such cutoffs, the degree distribution is often fitted to:
p_k = a (k + k_sat)^(-γ) exp(-k / k_cut)   (4.39)
where k_sat accounts for degree saturation and the exponential term accounts for the high-k cutoff. How do we deal with these two problems? To extract the full extent of the scaling, we correct the plot by multiplying by a term as follows:
p'_k = p_k exp(k / k_cut)   (4.40)
If you look at the resulting function, p'_k = a (k + k_sat)^(-γ) follows the power-law form we would like to see as a function of k' = k + k_sat: p'_k ~ (k')^(-γ), correcting for the two cutoffs, as seen in Figure 4.23b. This is the idea; we will go through this correction in the exercise.
It is occasionally claimed that the presence of low-degree or high-degree cutoffs implies that the network is not scale-free. This is a misunderstanding of the scale-free property: virtually all properties of scale-free networks are insensitive to the low-degree saturation. Only the high-degree cutoff affects the system's properties, by limiting the divergence of the second moment (which determines the variance). The presence of such cutoffs indicates the presence of additional phenomena that need to be understood - these will not be discussed here.
Exercise: Estimating the Degree Exponent
As the properties of scale-free networks depend on the degree exponent, we need to determine the value of γ. We face several difficulties, however, when we try to fit a power law to real data. The most important is the fact that the scaling is rarely valid for the full range of the degree distribution; rather, we observe the small- and high-degree cutoffs described above, k_sat and k_cut, within which we have a clear scaling region. Here we focus on estimating the small-degree cutoff k_sat, as the high-degree cutoff can be determined in a similar fashion. The reader is advised to consult the discussion on systematic problems provided at the end of this section - as optional reading.
Part 1: Fitting Procedure
As the degree distribution is typically provided as a list of positive integers k_min, ..., k_max, we aim to estimate γ from a discrete set of data points (only those observed in the data). Here you will see figures for an article citation network that illustrate the procedure; you will implement the procedure on the patent data explained above. The article citation network illustrated here consists of N = 384,362 nodes, each node representing a research paper published between 1890 and 2009 in journals published by the American Physical Society. The network has L = 2,353,984 links, each representing a citation from a published research paper to some other publication in the dataset (outside citations are ignored). The steps of the fitting process are:
1. Choose a value of k_sat between k_min and k_max. Estimate the value of the degree exponent corresponding to this k_sat using:
   γ = 1 + N [ Σ_{i=1}^{N} ln( k_i / (k_sat - 1/2) ) ]^(-1)   (4.41)
   * Notice two things. First, in practice you do not need to check all degrees between k_min and k_max; indeed, you can check degrees up to a certain quantile. In the code, you loop over k_min up to the degree at the 25th percentile - this is already set up for you. Second, in (4.41) the summation starts from i = 1 and runs over the data that has a degree at least equal to the k_sat of that loop. For instance, if you are checking k_sat = 4, all observations with a degree below 4 are discarded and do not enter the calculation of γ in (4.41) - this is also set up for you in R.
2. With the obtained (γ, k_sat) parameter pair, assume that the degree distribution has the form:
   p_k = k^(-γ) / Σ_{k'=0}^{∞} (k' + k_sat)^(-γ)   (4.42)
   First, note that the sum in the denominator is called a zeta function. When running the exercise, a function I have provided will calculate zeta for you, so you do not have to worry about it. Just learn the notation so that you can pass the arguments: zeta(γ, a) = Σ_{x=0}^{∞} (x + a)^(-γ), where γ is the main parameter and a is the shift. You will pass these two parameters in the same order to the function provided in R. Second, notice that p_k in (4.42) is like a power law we have seen in class if you take C = zeta(γ, k_sat)^(-1). This is, in fact, a constant (given the parameters γ and k_sat) and is the discrete form of C (in slide 15 of session 9 we derived C in continuous closed form; the discrete form is easier to work with here since you have discrete data from your network). You do not need to implement p_k. Next is what you need to write in R. With p_k (4.42) as the probability density function, the cumulative distribution function (CDF) is:
   P_k = 1 - zeta(γ, k) / zeta(γ, k_sat)   (4.43)
   You will write this in R using the zeta function. Your P_k implementation in R should be something like this, taking k as input and with γ and k_sat inside: CDF(k) = 1 - zeta(γ, k) / zeta(γ, k_sat)
3. Use the Kolmogorov-Smirnov (KS) test to determine the distance D between your network data, let's call it S(k), and the fitted model provided by (4.43) with the selected (γ, k_sat) parameter pair. In R, ks.test(..)$statistic will give you the D. To implement this, you can pass the degree distribution of your network, which we call S(k), as the first parameter to the function ks.test. The second parameter should be the function P_k you just defined. The implementation of your KS test in R for the setup described will look like this: ks.test(S(k), CDF(k)). * Both CDF(k) and ks.test written above are essentially pseudocode; change them as necessary in your R code.
4. Repeat steps (1-3) by scanning the k_sat range from k_min to the degree at the 25th percentile. We aim to identify the k_sat value for which the D provided by the test is minimal, and call it the optimal k_sat. To illustrate the procedure, we plot D as a function of k_sat for the paper citation network (Figure 4.24b). The plot indicates that D is minimal for k_sat = 49, and the corresponding γ estimated by (4.41), representing the optimal fit, is γ = 2.79.
What you report for part 1 on your network:
- A plot of D against k_sat (similar to Figure 4.24b).
- The value of k_sat that minimizes D, the resulting D, and the corresponding γ.
[Barabási, Figure 4.24, Maximum Likelihood Estimation: a. The degree distribution p_k of the citation network, where the straight purple line represents the best fit based on the model (4.39). b. The value of the Kolmogorov-Smirnov test vs. k_sat for the citation network. c. p(D_synthetic) for M = 10,000 synthetic datasets, where the grey line corresponds to the D_real value extracted for the citation network.]
Part 2: Goodness of Fit
Just because we obtained a (γ, k_sat) pair that represents an optimal fit to our dataset does not mean that the power law itself is a good model for the studied distribution. Therefore, we need to use a goodness-of-fit test, which generates a p-value that quantifies the plausibility of the power-law hypothesis. The most often-used procedure consists of the following steps:
1. Use the distance (the KS statistic for the best k_sat) you found in part 1. Call it D_real. For instance, the selected k_sat = 49 gives the distance D_real = 0.01158 for the citation network. This distance D_real is between the observed data and our fit based on the parameter pair (γ, k_sat).
2. Use (4.42) to generate a degree sequence of N degrees (i.e., the same number of random numbers as the number of nodes in the original dataset). This is synthetic data we have generated as a hypothetical degree sequence.
3. Now calculate the distance between the synthetic data and your actual data using the Kolmogorov-Smirnov test. Call the new distance D_synthetic. Hence, D_synthetic represents the distance between a synthetically generated degree sequence, consistent with our degree distribution, and the real data.
The goal is to see if the obtained D_synthetic is comparable to D_real. For this, we repeat steps (2) and (3) M times (say M = 100). Each time, we generate a new degree sequence and determine the corresponding D_synthetic. Eventually, we obtain the p(D_synthetic) distribution, i.e., the histogram of all 100 D_synthetic values you generated by repeating step 2. Plot p(D_synthetic) and show D_real as a vertical bar (Figure 4.24c). If D_real is within the p(D_synthetic) distribution, it means that the distance between the model providing the best fit and the empirical data is comparable with the distance expected from random degree samples chosen from the best-fit distribution. Hence the power law is a reasonable model for the data. If, however, D_real falls outside the p(D_synthetic) distribution, then the power law is not a good model - some other function is expected to describe the original p_k better.
While the distribution shown in Figure 4.24c may be, in some cases, useful to illustrate the statistical significance of the fit, in general it is better to assign a p-number to the fit, by:
p = ∫_{D_real}^{∞} p(D_synthetic) dD_synthetic
The closer p is to 1, the more likely it is that the difference between the empirical data and the model can be attributed to statistical fluctuations alone. If p is very small, the model is not a plausible fit to the data (typically, the model is accepted if p > 1%). You can skip this part, as our M is too small to draw any meaningful statistical inference. Based on the histogram you drew, does the power law fit the data well?
What you report for part 2:
- In which region we discussed in session 9 (slides 37-40) does your optimal γ land?
- A histogram of D_synthetic showing where D_real lands (similar to Figure 4.24c).
- Is scale-free a good choice for the network? Based on the paragraph above, why?
- What could be the reason for your finding?
This concludes your lab 4. Congrats! You did some serious network analysis!
For the citation network the authors obtain p < 10^(-4), indicating that a pure power law is not a suitable model for the original degree distribution. This outcome is somewhat surprising, as the power-law nature of citation data has been documented repeatedly since the 1960s. This failure indicates the limitation of blindly fitting a power law without an analytical understanding of the underlying distribution. Barabási discusses how to correct the problem: the fitting model (4.44) eliminates all the data points with k < k_sat, and choosing k_sat = 49 forces us to discard over 96% of the data points. Yet there is statistically useful information in the data that falls in k < k_sat that is ignored by the previous fit. An alternate model must be introduced to resolve this problem. I included their solution as part of an optional reading to this exercise, with the pages from the beginning of that solution included due to minor notation differences. You can find the discussion at the end of (PDF) page 4/7.

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 

Último (20)

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 

you need to complete the r code and a singlepage document c.pdf

## Load data - fill in the path of the folder where you put the file. If the file is in the
## same folder as the R code, remove the path part.
your_path = "your path"
pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv"))

head(pat_citation_deg)
tail(pat_citation_deg)
summary(pat_citation_deg)

## Let's have a look at the log-log degree distribution plot (nothing to fill)
p <- ggplot(pat_citation_deg, aes(x = degree)) +
  geom_point(stat = 'bin', color = "blue", size = 2.5,
             bins = 3 * ceiling(log(nrow(pat_citation_deg)))) +
  scale_x_continuous(trans = "log",
                     breaks = scales::trans_breaks("log", function(x) exp(x)),
                     labels = scales::trans_format("log", scales::math_format(e^.x))) +
  scale_y_continuous(trans = "log",
                     breaks = scales::trans_breaks("log", function(x) exp(x)),
                     labels = scales::trans_format("log", scales::math_format(e^.x))) +
  labs(x = "Degree", y = "Frequency") +
  my_theme + theme(title = element_text(size = 12))

## Fit a line to the log-log degree distribution (nothing to fill)
p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)),
                ## this takes the binned data generated by ggplot to fit the line
                mapping = aes(x = exp(x), y = exp(y)),
                method = "lm", se = FALSE, color = "red", size = 0.75, alpha = 0.5)
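Before estimating anything, it can help to glance at a few summary numbers for the degree
sequence. This is an editor's sketch, not part of the original lab code; it only uses the
'degree' column loaded above.

## Quick sanity check of the degree sequence (sketch; not required by the lab)
c(N      = nrow(pat_citation_deg),
  k_min  = min(pat_citation_deg$degree),
  k_max  = max(pat_citation_deg$degree),
  mean_k = mean(pat_citation_deg$degree))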
#==============================================================================
# 1. EXERCISE PART 1 - Estimating Gamma
#==============================================================================

## Designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill)
my_df = pat_citation_deg %>% rename(id = patent_id)

##------------------------------------------------------------------------------
## You'll write a for loop over the individual unique degrees in the data set to find the
## corresponding distance D.
### Let's create a data.frame with one column holding each observed degree in our network;
### we will fill the other columns, D and gamma, as we calculate them in the loop (nothing to fill)
D_df = data.frame(k_sat = unique(my_df$degree), D = NA, gamma = NA) %>% arrange(k_sat)

## Here you set the maximum degree to check, so that you do not have to do the computation
## for all degrees.
### Let's set it to the 25th percentile of the unique observed degrees. The next line of code
### does that for you: (nothing to fill)
max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25)))

### Now discard the rows of D_df you do not need (those above max_degree_to_check). The next
### line of code does it for you: (nothing to fill)
D_df = D_df[D_df$k_sat < max_degree_to_check,]

### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill)
head(D_df)
tail(D_df)

## Understand and fill in parts of the code in this loop.
## I recommend setting i = 1 and running each line of this loop on your own and checking what
## it gives you. This will help you fill the gaps.
##------------------------------------------------------------------------------
## Let's work on the loop
for (i in 1:(nrow(D_df))) { ## note: the loop starts slowly but will speed up as it progresses.

  ## Let's show the current loop's k_sat (so that we can see our progress): (nothing to fill)
  print(paste0("at %", round(100 * i/nrow(D_df), 2)))
  k_sat_temp = D_df$k_sat[i]

  ##----------------------------------------------------------------------------
  ## Let's create a temporary copy of the network degree data that contains degrees equal to
  ## or above k_sat_temp: (nothing to fill)
  temp_df = my_df[my_df$degree >= k_sat_temp,]

  ##----------------------------------------------------------------------------
  ## step 1: estimate gamma for this loop and call it 'temp_gamma'
  ### Create a vector of k_i / (k_sat - 1/2) so that you can feed it to the natural logarithm
  ### and sum over its elements. k_i is each observed node degree at least equal to the k_sat
  ### of this loop (nothing to fill)
  temp_vec_k_i = temp_df$degree / (k_sat_temp - 1/2)

  ### Now use the above vector in (4.41); N in (4.41) is the number of nodes entering the
  ### sum, i.e. N = nrow(temp_df) (nothing to fill)
  temp_gamma = 1 + nrow(temp_df) / sum(log(temp_vec_k_i)) ## (4.41)

  ##----------------------------------------------------------------------------
  ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to
  ## pass to the KS test in step 3:
  ### k will be a variable that the KS test will use, so make it an argument of CDF_k;
  ### put the gamma and k_sat of this loop in the body of the function
  CDF_k = function(k) {
    ### FILL THIS FUNCTION ACCORDING TO (4.43)
    return(1 - (gen_zeta(temp_gamma, k) / ...))
  }

  ##----------------------------------------------------------------------------
  ## step 3: run the KS test and pass the statistic as D to the corresponding column of D_df
  KS_D = ks.test(temp_df$degree, ...)
  ### fill in with the function you created above: just pass the function name (without
  ### parentheses, brackets, or quotes)
  ### * you can take a look here if you couldn't figure it out:
  ### https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution-function-to-use-with-kolmogorov-smirno
  D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill)

  ## Let's also store the gamma so that we do not have to compute it again once we have an
  ## optimal k_sat (nothing to fill)
  D_df[i,'gamma'] = temp_gamma
}

##------------------------------------------------------------------------------
## step 4: plot D against k_sat, find the k_sat that minimizes D, and the corresponding gamma
### Let's first take a look at the D_df we have formed (nothing to fill)
head(D_df, 10)

### Find the optimal k_sat that yields the minimum D and call it 'optimal_k_sat' (nothing to fill)
optimal_k_sat = D_df[which.min(D_df$D),'k_sat']

### Let's take a look at the D_df we have formed (nothing to fill)
ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) +
  geom_point(size = 3, alpha = .5, color = "purple") +
  geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") +
  ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D, .85)),
                    label = paste0("Optimal k_sat = ", optimal_k_sat), color = "red") +
  my_theme + labs(x = "k", y = "D")

### Find the D corresponding to 'optimal_k_sat' (nothing to fill)
min(D_df$D)

### Find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill)
(optimal_gamma = D_df[which.min(D_df$D),'gamma'])

## Discard observations with degree below the best k_sat you found earlier. (nothing to fill)
post_data = my_df %>% filter(degree >= optimal_k_sat)

##------------------------------------------------------------------------------
## Let's take a look at the resulting log-log degree distribution plot for the remaining
## data points (nothing to fill)
p_post <- ggplot(post_data, aes(x = degree)) +
  geom_point(stat = 'bin', color = "blue", size = 2.5,
             bins = 3 * ceiling(log(nrow(post_data)))) +
  scale_x_continuous(trans = "log",
                     breaks = scales::trans_breaks("log", function(x) exp(x)),
                     labels = scales::trans_format("log", scales::math_format(e^.x))) +
  scale_y_continuous(trans = "log",
                     breaks = scales::trans_breaks("log", function(x) exp(x)),
                     labels = scales::trans_format("log", scales::math_format(e^.x))) +
  labs(x = "Degree", y = "Frequency") +
  my_theme

## Fit a line to the log-log degree distribution (nothing to fill)
p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)),
                     ## this takes the binned data generated by ggplot to fit the line
                     mapping = aes(x = exp(x), y = exp(y)),
                     method = "lm", se = FALSE, color = "red", size = 0.75, alpha = 0.5)
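For self-checking, here is one plausible way to fill the two gaps inside the loop above. This
is an editor's sketch of a completion, not the official solution: it simply transcribes
(4.43) from the required reading below using the lab's gen_zeta(), and passes the CDF
function by name to ks.test(), as the linked Stack Exchange thread suggests.

## Inside the for loop, the two '...' gaps could be filled like this (sketch):
CDF_k = function(k) {
  ## (4.43): P_k = 1 - zeta(gamma, k) / zeta(gamma, k_sat)
  return(1 - (gen_zeta(temp_gamma, k) / gen_zeta(temp_gamma, k_sat_temp)))
}
KS_D = ks.test(temp_df$degree, CDF_k) ## S(k) first, then the fitted CDF passed by name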
#==============================================================================
# 2. EXERCISE PART 2 - Goodness-of-fit
#==============================================================================

## We are going to create synthetic sequences of degrees and repeat the process M times.
## Usually M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill)
M = 100

## So let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill)
D_gof_df = data.frame(iter = 1:M, D_synthetic = NA)

##------------------------------------------------------------------------------
## step 1: store the distance you found in part 1 as D_real (nothing to fill)
D_real = min(D_df$D)

##------------------------------------------------------------------------------
## I. Let's walk through steps 2 and 3 once outside of the loop
##------------------------------------------------------------------------------

##------------------------------------------------------------------------------
## step 2: you will need to define the inverse of the CDF function (so that you can generate
## random probability values in [0,1] and get degrees back)
### Let's write the CDF that best fits the data (we did this in part 1):
CDF_k = function(k) {
  ### FILL THE FUNCTION ACCORDING TO (4.43)
  return(1 - (gen_zeta(optimal_gamma, k) / ...))
}

### 2.1. Let's define the inverse of your CDF; (nothing to fill)
### if the next lines are hard to understand, check here:
### https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r

#### mini step 2.1.1. this piece of code will create an inverse for you (that searches an
#### interval reaching up to the highest observed degree in our data, extending it if needed)
Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){
  function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root
}
inverse_CDF = Inverse(CDF_k)

#### mini step 2.1.2. let's try it for a couple of numbers (nothing to fill)
inverse_CDF(0.4)

#### mini step 2.1.3. runif(1) will generate a random real number between 0 and 1. Let's pass
#### that to inverse_CDF
rand_p = runif(1)
inverse_CDF(rand_p)

### mini step 2.2. let's generate 5 random numbers between 0 and 1 and get 5 degrees back
### from our inverse
#### (unfortunately we have to write this somewhat complex code because inverse_CDF does not
#### accept a vector; try inverse_CDF(c(0.1, 0.2)).)
rand_p = runif(5)
unlist(lapply(rand_p, function(p){inverse_CDF(p)}))

### step 2.3. OK, you know how to generate 5 random degrees. Let's create n random degrees,
### where n is the number of degrees in our remaining data (nothing to fill)
rand_p = runif(nrow(post_data))
rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)},
                           mc.cores = parallel::detectCores() - 1)) ## this makes it a bit faster, but still not great!

##------------------------------------------------------------------------------
## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated
### FILL THE ks.test ACCORDING TO THE DISCUSSION IN STEP 3 OF PART 2
KS_D = ks.test(rand_deg, ...) ## ks.test(first the synthetic degrees, second the real degrees)
as.numeric(KS_D$statistic)

##------------------------------------------------------------------------------
## II. Now let's write the loop
##------------------------------------------------------------------------------
for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer
  ## with a 3.5 GHz 6-core CPU and 64 GB of memory. Lower to M = 20 if necessary.

  print(paste0("at %", 100 * i/M))

  ##----------------------------------------------------------------------------
  ## step 2: generate a synthetic (random) sequence of degrees
  rand_p = ... ### FILL AS WE DID ABOVE
  rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)},
                             mc.cores = parallel::detectCores() - 1)) ## this makes it a bit faster

  ##----------------------------------------------------------------------------
  ## step 3: find the distance between the synthetic sequence and the real data and store it
  # KS_D = ks.test(rand_deg, CDF_k)
  KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE: ks.test(first the synthetic degrees, second the real degrees)
  D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic)
}

##------------------------------------------------------------------------------
## Let's plot the results
### Let's take a look at the D_gof_df we have formed
ggplot(D_gof_df, aes(x = D_synthetic)) +
  geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) +
  geom_vline(xintercept = D_real, size = 1, color = "brown") +
  my_theme +
  labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances")
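Likewise, here is one plausible completion of the Part 2 gaps, again as an editor's sketch
rather than the official solution. The CDF applies (4.43) at the optimal parameter pair from
part 1, and the KS call follows the comment above (synthetic degrees first, real degrees
second), making it a two-sample test against the empirical data.

## Part 2 gaps, filled as a sketch:
CDF_k = function(k) {
  ## (4.43) at the optimal (gamma, k_sat) pair from part 1
  return(1 - (gen_zeta(optimal_gamma, k) / gen_zeta(optimal_gamma, optimal_k_sat)))
}
## and inside the goodness-of-fit loop:
rand_p = runif(nrow(post_data))
KS_D = ks.test(rand_deg, post_data$degree) ## synthetic degrees vs. real degrees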
Please read these instructions carefully before starting the exercise. To complete this lab,
you have a short required reading, included in this document, that explains the challenges in
estimating the degree exponent (γ) of a power law from an observed network and the fitting
procedure that corrects for them; reading it is necessary to follow and understand the
exercise.

You will fit a power law to a network provided for this lab. The estimation process (starting
on page 3) contains two parts. The first part estimates the degree exponent (γ), and the
second part assesses the goodness of fit using a simulation process. The attached R code
handles the computational heavy lifting, but filling in and completing the code requires your
understanding of the process. I annotated the code in detail; the idea is to facilitate
learning rather than merely implement the necessary steps. You still need to run every line
of code. The lines you do not need to change show (nothing to fill) in their annotation.

The data you will handle is a random sample of 80,000 US patents taken from a larger corpus
of about 8 million patents. You see a patent id number (first column) and a degree
representing the number of times the patent was cited by other patents (second column). So
the data is the degree sequence of a network, summarized to help you handle the lab. The R
code provides you with a log-log plot to get a sense of the data. [The original document
shows the first ten rows of the data here.] *Patents without any citations do not appear in
the data.

Unlike your previous labs, lab 4 does not require you to complete a quiz. All you need to
submit is your R code and a single-page document containing two figures (similar to Figure
4.24, parts b and c), reporting two values (γ and k_sat), a discussion of how well your power
law fits the network data, and an explanation of your finding. Note that the second part of
the exercise will need a bit of time to run; accommodate it by starting your lab 4 early.

Required Reading: Degree Distribution of Real Networks

In real systems we rarely observe a degree distribution that follows a pure power law.
Instead, for most real systems p_k has the shape shown in Figure 4.23a, with two recurring
problems:

- Low-degree saturation is a common deviation from power-law behavior. Its signature is a
flattened p_k for k < k_sat, indicating that we have fewer small-degree nodes than expected
for a pure power law. The origin of the saturation will be explained in Chapter 6.

- High-degree cutoff appears as a rapid drop in p_k for k > k_cut, indicating that we have
fewer high-degree nodes than expected for a pure power law. This limits the size of the
largest hub, making it smaller than predicted by what we derived in class,
k_max = k_min N^(1/(γ-1)) (a quick numeric illustration follows this list). High-degree
cutoffs emerge if there are inherent limitations on the number of links a node can have. For
example, in social networks individuals have difficulty maintaining meaningful relationships
with an exceptionally large number of acquaintances, as in the case of Facebook's
5,000-friend limit.
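To make the hub-size formula above concrete, here is a small editor's sketch (not part of the
lab) with purely illustrative values: k_min = 1, γ = 2.79 (the exponent of the citation
network discussed later in this reading), and N = 80,000 as in the lab sample.

## Expected largest hub under a pure power law: k_max = k_min * N^(1/(gamma - 1))
k_min = 1; N = 80000; gamma = 2.79
k_min * N^(1/(gamma - 1)) ## roughly 550; a high-degree cutoff makes the observed k_max smaller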
[Barabasi Figure 4.23, Rescaling the Degree Distribution: a. In real networks the degree
distribution frequently deviates from a pure power law by showing a low-degree saturation and
a high-degree cutoff. b. By plotting the rescaled p_k as a function of (k + k_sat), as
suggested by (4.40), the degree distribution follows a power law for all degrees.]

Given the widespread presence of such cutoffs, the degree distribution is often fitted to:

    p_k = a (k + k_sat)^(-γ) exp(-k / k_cut)    (4.39)

where k_sat accounts for the degree saturation and the exponential term accounts for the
high-k cutoff. How do we deal with these two problems? To extract the full extent of the
scaling, we correct the plot by multiplying by a term as follows:

    σ_k = p_k exp(k / k_cut)    (4.40)

If you look at the resulting function, σ_k = a (k + k_sat)^(-γ) follows the power-law form we
would like to see as a function of k* = k + k_sat: σ_k ~ (k*)^(-γ), correcting for the two
cutoffs, as seen in Figure 4.23b. This is the idea; we will go through this correction in the
exercise.

It is occasionally claimed that the presence of low-degree or high-degree cutoffs implies
that the network is not scale-free. This is a misunderstanding of the scale-free property:
virtually all properties of scale-free networks are insensitive to the low-degree saturation.
Only the high-degree cutoff affects the system's properties, by limiting the divergence of
the second moment (which determines the variance). The presence of such cutoffs indicates the
presence of additional phenomena that need to be understood - these will not be discussed
here.

Exercise: Estimating the Degree Exponent

As the properties of scale-free networks depend on the degree exponent, we need to determine
the value of γ. We face several difficulties, however, when we try to fit a power law to real
data. The most important is the fact that the scaling is rarely valid for the full range of
the degree distribution. Rather, we observe the small- and high-degree cutoffs described
above, k_sat and k_cut, between which we have a clear scaling region. Here we focus on
estimating the small-degree cutoff k_sat, as the high-degree cutoff can be determined in a
similar fashion. The reader is advised to consult the discussion on systematic problems
provided at the end of this section - as optional reading.

Part 1: Fitting Procedure

As the degree distribution is typically provided as a list of positive integers
k_min, ..., k_max, we aim to estimate γ from a discrete set of data points (only those
observed in the data). Here you will see figures for an article citation network that
illustrate the procedure; you will implement the procedure on the patent data explained
above. The article citation network illustrated here consists of N = 384,362 nodes, each node
representing a research paper published between 1890 and 2009 in journals published by the
American Physical Society. The network has L = 2,353,984 links, each representing a citation
from a published research paper to some other publication in the dataset (outside citations
are ignored). The steps of the fitting process are:

1. Choose a value of k_sat between k_min and k_max. Estimate the value of the degree exponent
corresponding to this k_sat using:

    γ = 1 + N [ Σ_{i=1}^{N} ln( k_i / (k_sat - 1/2) ) ]^(-1)    (4.41)

*Notice two things. First, in practice you do not need to check all degrees between k_min and
k_max; you can check degrees up to a certain quantile. In the code, you loop over k_min up to
the degree at the 25th percentile - this is already set up for you. Second, in (4.41) the
summation starts from i = 1 and runs over the data with degree at least equal to the k_sat of
that loop. For instance, if you are checking k_sat = 4, all observations with a degree below
4 are discarded and do not enter the calculation of γ in (4.41) - this is also set up for you
in R.

2. With the obtained (γ, k_sat) parameter pair, assume that the degree distribution has the
form:

    p_k = k^(-γ) / Σ_{x=0}^{∞} (x + k_sat)^(-γ)    (4.42)

First, note that the sum in the denominator is called a zeta function. When running the
exercise, a function I have provided will calculate zeta for you, so you do not have to worry
about it. Just learn the notation so that you can pass the arguments:

    zeta(γ, a) = Σ_{x=0}^{∞} (x + a)^(-γ)

where γ is the main parameter and a is the shift. You will pass these two parameters in the
same order to the function provided in R. Second, notice that p_k in (4.42) is like a power
law we have seen in class if you take C = zeta(γ, k_sat)^(-1). This is, in fact, a constant
(given the parameters γ and k_sat) and is the discrete form of C (in slide 15 of session 9 we
derived C in a continuous closed form; the discrete form is easier to work with here since
you have discrete data from your network). You do not need to implement p_k. Next is what you
need to write in R. With p_k (4.42) as the probability density function, the cumulative
distribution function (CDF) is:

    P_k = 1 - zeta(γ, k) / zeta(γ, k_sat)    (4.43)

You will write this in R using the zeta function. Your P_k implementation in R should be
something like this, taking k as input and with γ and k_sat inside:

    CDF(k) = 1 - zeta(γ, k) / zeta(γ, k_sat)

3. Use the Kolmogorov-Smirnov (KS) test to determine the distance D between your network
data, let's call it S(k), and the fitted model provided by (4.43) with the selected
(γ, k_sat) parameter pair. In R, ks.test(..)$statistic will give you the D. To implement this
in R, you can pass the degree distribution of your network, which we call S(k), as the first
parameter to the function ks.test. The second parameter should be the function P_k you just
defined. The implementation of your KS test in R for the setup described will look like this:

    ks.test(S(k), CDF(k))

*Both CDF(k) and ks.test written above are essentially pseudocode. Change them as necessary
in your R code.

4. Repeat steps (1-3), scanning the k_sat range from k_min to the degree at the 25th
percentile. We aim to identify the k_sat value for which the D provided by the test is
minimal, and call it the 'optimal k_sat'. To illustrate the procedure, we plot D as a
function of k_sat for the paper citation network (Figure 4.24b). The plot indicates that D is
minimal for k_sat = 49, and the corresponding γ estimated by (4.41), representing the optimal
fit, is γ = 2.79.

What you report for part 1 on your network:
- A plot of D against k_sat (similar to Figure 4.24b).
- The value of k_sat that minimizes D, the resulting D, and the corresponding γ.

[Barabasi Figure 4.24, Maximum Likelihood Estimation: a. The degree distribution p_k of the
citation network, where the straight purple line represents the best fit based on the model
(4.39). b. The value of the Kolmogorov-Smirnov test vs. k_sat for the citation network. c.
p(D_synthetic) for M = 10,000 synthetic datasets, where the grey line corresponds to the
D_real value extracted for the citation network.]
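As a quick sanity check of (4.43) before coding the KS step, note that the CDF must equal 0
at k = k_sat and approach 1 for large k. The following is an editor's sketch, not part of the
lab: it assumes the gen_zeta() helper from the lab code and borrows the illustrative
citation-network values γ = 2.79 and k_sat = 49 quoted above.

## Sanity-checking the CDF in (4.43) at the citation network's illustrative parameters
gamma_test = 2.79
k_sat_test = 49
CDF_test = function(k) 1 - gen_zeta(gamma_test, k) / gen_zeta(gamma_test, k_sat_test)
CDF_test(49)  ## exactly 0: zeta(gamma, k_sat) / zeta(gamma, k_sat) = 1
CDF_test(500) ## close to 1, since zeta(gamma, k) shrinks as k grows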
Part 2: Goodness of Fit

Just because we obtained a (γ, k_sat) pair that represents an optimal fit to our dataset does
not mean that the power law itself is a good model for the studied distribution. Therefore,
we need to use a goodness-of-fit test, which generates a p-value that quantifies the
plausibility of the power-law hypothesis. The most often used procedure consists of the
following steps:

1. Use the distance (the KS statistic for the best k_sat) you found in part 1. Call it
D_real. For instance, the selected k_sat = 49 and the distance D_real = 0.01158 for the
citation network. D_real is the distance between the observed data and our fit based on the
parameter pair (γ, k_sat).

2. Use (4.42) to generate a degree sequence of N degrees (i.e., the same number of random
numbers as the number of nodes in the original dataset). This is synthetic data we have
generated as a hypothetical degree sequence.

3. Now calculate the distance between the synthetic data and your actual data using the
Kolmogorov-Smirnov test. Call the new distance D_synthetic. Hence, D_synthetic represents the
distance between a synthetically generated degree sequence, consistent with our fitted degree
distribution, and the real data.

The goal is to see whether the obtained D_synthetic is comparable to D_real. For this, we
repeat steps (2) and (3) M times (say M = 100). Each time, we generate a new degree sequence
and determine the corresponding D_synthetic. Eventually, we obtain the p(D_synthetic)
distribution (i.e., the histogram of all 100 D_synthetic values you generated by repeating
steps 2 and 3). Plot p(D_synthetic) and show D_real as a vertical bar (Figure 4.24c). If
D_real is within the p(D_synthetic) distribution, it means that the distance between the
model providing the best fit and the empirical data is comparable with the distance expected
from random degree samples drawn from the best-fit distribution. Hence the power law is a
reasonable model for the data. If, however, D_real falls outside the p(D_synthetic)
distribution, then the power law is not a good model - some other function is expected to
describe the original p_k better.

While the distribution shown in Figure 4.24c may be, in some cases, useful to illustrate the
statistical significance of the fit, in general it is better to assign a p-number to the fit,
by:

    p = ∫_{D_real}^{∞} p(D_synthetic) dD_synthetic

The closer p is to 1, the more likely it is that the difference between the empirical data
and the model can be attributed to statistical fluctuations alone. If p is very small, the
model is not a plausible fit to the data (typically, the model is accepted if p > 1%). You
can skip this part, as our M is too small to draw any meaningful statistical inference; based
on the histogram you drew, does the power law fit the data well? (A sketch for computing an
empirical p follows.)
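Even with a small M, the integral above has a simple empirical analogue: the fraction of
synthetic distances at least as large as D_real. The following is an editor's sketch, not a
lab deliverable; it assumes the D_gof_df and D_real objects created in Part 2 of the code.

## Empirical p-value: the share of synthetic runs whose distance is at least the real one
p_hat = mean(D_gof_df$D_synthetic >= D_real, na.rm = TRUE)
p_hat ## the power-law model is typically accepted if this is comfortably above 0.01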
What you report for part 2:
- In which region discussed in session 9 (slides 37-40) does your optimal γ land?
- A histogram of D_synthetic and where D_real lands (similar to Figure 4.24c).
- Is scale-free a good model for the network? Based on the paragraph above, why?
- What could be the reason for your finding?

This concludes your lab 4. Congrats! You did some serious network analysis!

For the citation network, the authors obtain p < 10^-4, indicating that a pure power law is
not a suitable model for the original degree distribution. This outcome is somewhat
surprising, as the power-law nature of citation data has been documented repeatedly since the
1960s. This failure indicates the limitation of blindly fitting a power law without an
analytical understanding of the underlying distribution. Barabasi discusses how to correct
the problem: the fitting model (4.44) eliminates all the data points with k < k_sat, and
choosing k_sat = 49 forces us to discard over 96% of the data points. Yet there is
statistically useful information in the data that falls below k_sat that is ignored by the
previous fit, so we must introduce an alternate model that resolves this problem. I included
their solution as an optional reading for this exercise, together with the pages from the
beginning of that solution, due to minor notation differences. You can find the discussion at
the end of (PDF) page 4/7.