
adnankhan605720

Mar 22, 2023 • 0 likes • 2 views

Education

You need to complete the R code and a single-page document containing two figures: report the parameters you estimate, discuss how well your power law fits the network data, and explain the findings. Question: images. The incomplete R code (IDS 564, Spring 2023, Lab 4: Estimating the Degree Exponent of a Scale-free Network) is reproduced in the slide transcript below.
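The code comments in the transcript cite equations (4.41) and (4.43) without reproducing them. Assuming these numbers follow the degree-exponent estimation procedure in Barabási's Network Science textbook (an assumption; the source does not name the book), the relevant expressions would be roughly as follows. Here K_min plays the role of `k_sat` in the code, N counts the nodes with degree at least K_min, ζ is the Hurwitz zeta function computed by `gen_zeta`, and S(k) is the empirical CDF of the data.

```latex
% (4.41) maximum-likelihood estimate of the degree exponent
\gamma = 1 + N \left[ \sum_{i=1}^{N} \ln \frac{k_i}{K_{\min} - \tfrac{1}{2}} \right]^{-1}

% (4.43) CDF of the fitted discrete power law
P_k = 1 - \frac{\zeta(\gamma, k)}{\zeta(\gamma, K_{\min})}

% Kolmogorov-Smirnov distance minimised over candidate K_min values
D = \max_{k \ge K_{\min}} \left| S(k) - P_k \right|
```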

- 1. Incomplete R code:

    # IDS 564 - Spring 2023
    # Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network
    #==============================================================================
    # 0. INITIATION ===============================================================
    #==============================================================================
    ## You'll need VGAM for the zeta function
    # install.packages("VGAM")
    ## When prompted to install from binary version, select no
    library(VGAM)
    ## You'll need this when calculating goodness of fit
    # install.packages("parallel")
    library(parallel)
    library(ggplot2)
    library(ggthemes)
    library(dplyr)
    library(tidyr)
    ##------------------------------------------------------------------------------
    ## This function will calculate the zeta function for you. You don't need to worry about it! Run it and continue.
    ## gen_zeta(gamma , shift) will give you a number
    gen_zeta <- function (gamma, shift = 1, deriv = 0) {
      deriv.arg <- deriv
      rm(deriv)
      if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE))
- 2.
        stop("'deriv' must be a single non-negative integer")
      if (deriv.arg < 0 || deriv.arg > 2)
        stop("'deriv' must be 0, 1, or 2")
      if (deriv.arg > 0)
        return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg, shift = shift), gamma, deriv.arg, shift))
      if (any(special <- Re(gamma) <= 1)) {
        ans <- gamma
        ans[special] <- Inf
        special3 <- Re(gamma) < 1
        ans[special3] <- NA
        special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0)
        # ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift)
        special2 <- Re(gamma) < 0
        if (any(special2)) {
          gamma2 <- gamma[special2]
          cgamma <- 1 - gamma2
          ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi * gamma2/2) * gamma(cgamma) * Recall(cgamma)
        }
        if (any(!special)) {
          ans[!special] <- Recall(gamma[!special])
        }
        return(zeta.specials(ans, gamma, deriv.arg, shift))
      }
      aa <- 12
      ans <- 0
      for (ii in 0:(aa - 1))
        ans <- ans + 1/(shift + ii)^gamma
      ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift)
      ans[shift <= 0] <- NaN
      zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift)
    }
    ## example: gen_zeta(2.1, 4)
    ##------------------------------------------------------------------------------
    ## The P_k (the CDF)
    P_k = function(gamma, k, k_sat){
      ### fill the function
      return(1 - ( gen_zeta(gamma, k) / ... ))
    }
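One way the blank in `P_k` could be filled is sketched below, assuming (4.43) has the form P_k = 1 - ζ(γ, k) / ζ(γ, k_sat). So that the block runs without VGAM, it uses `hurwitz_zeta`, a hypothetical base-R stand-in for `gen_zeta` (a truncated sum plus an Euler-Maclaurin tail correction, valid only for real gamma > 1). In the lab itself one would presumably call `gen_zeta` instead. `P_k_sketch` is likewise an illustrative name, not part of the template.

```r
# Hypothetical stand-in for gen_zeta(gamma, shift): Hurwitz zeta computed as a
# truncated sum plus an Euler-Maclaurin tail correction (real gamma > 1 only).
hurwitz_zeta <- function(gamma, shift = 1, n_terms = 20) {
  stopifnot(all(gamma > 1), all(shift > 0))
  head_sum <- 0
  for (n in 0:(n_terms - 1)) head_sum <- head_sum + (shift + n)^(-gamma)
  a <- shift + n_terms
  head_sum + a^(1 - gamma) / (gamma - 1) + 0.5 * a^(-gamma) +
    gamma * a^(-gamma - 1) / 12
}

# Assumed form of (4.43): P_k = 1 - zeta(gamma, k) / zeta(gamma, k_sat)
P_k_sketch <- function(gamma, k, k_sat) {
  1 - hurwitz_zeta(gamma, k) / hurwitz_zeta(gamma, k_sat)
}

# The value is 0 at k = k_sat and increases towards 1 as k grows.
P_k_sketch(2.5, k = c(5, 10, 20, 100), k_sat = 5)
```

Because the function is vectorised over `k`, it can be passed to routines that evaluate a CDF at many points at once.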
- 3.
    ##------------------------------------------------------------------------------
    my_theme <- theme_classic() +
      theme(legend.position = "bottom",
            legend.box = "horizontal",
            legend.direction = "horizontal",
            title = element_text(size = 18),
            axis.title = element_text(size = 14),
            axis.text.y = element_text(size = 16),
            axis.text.x = element_text(size = 16),
            strip.text = element_text(size = 14),
            strip.background = element_blank(),
            strip.text.x = element_text(size = 14),
            strip.text.y = element_text(size = 14),
            legend.title = element_text(size = 14),
            legend.text = element_text(size = 14))
    set.seed(123)
    #==============================================================================
    # 00. LOADING DATA ============================================================
    #==============================================================================
    ## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the R code remove the path part.
    your_path = "your path"
    pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv"))
    head(pat_citation_deg)
    tail(pat_citation_deg)
    summary(pat_citation_deg)
    ## let's have a look at the Log-log degree distribution plot (nothing to fill)
    p <- ggplot(pat_citation_deg, aes(x = degree)) +
      geom_point(stat = 'bin', color = "blue", size = 2.5,
                 bins = 3 * ceiling(log(nrow(pat_citation_deg)))) +
      scale_x_continuous(trans = "log",
                         breaks = scales::trans_breaks("log", function(x) exp(x)),
                         labels = scales::trans_format("log", scales::math_format(e^.x))) +
      scale_y_continuous(trans = "log",
                         breaks = scales::trans_breaks("log", function(x) exp(x)),
                         labels = scales::trans_format("log", scales::math_format(e^.x))) +
      labs(x = "Degree", y = "Frequency") +
      my_theme +
      theme(title = element_text(size = 12 ))
    ## fit a line to the Log-log degree distribution (nothing to fill)
    p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line
                    mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5)
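The red line in the plot above is a least-squares fit on log-log axes, so its slope gives only a rough visual estimate of the exponent and can be noticeably biased by the noisy tail. The sketch below compares such a slope with a maximum-likelihood estimate on synthetic data (not the patent data). It assumes (4.41) has the form gamma = 1 + N / sum(ln(k_i / (K_min - 1/2))). The names `true_gamma`, `k_min`, `support`, `gamma_ols` and `gamma_mle` are illustrative. The template's own `temp_gamma` line later in the lab applies the `- 1/2` outside the ratio, so it is worth checking which form your course notes intend.

```r
set.seed(42)

# Synthetic discrete power-law sample (illustration only, not the patent data).
true_gamma <- 2.5
k_min <- 10                      # hypothetical lower cutoff
support <- k_min:100000          # truncated support, an approximation
deg <- sample(support, size = 5000, replace = TRUE,
              prob = support^(-true_gamma))

# Rough estimate: negative slope of an OLS line through log(count) vs log(degree).
counts <- table(deg)
k_vals <- as.numeric(names(counts))
ols_fit <- lm(log(as.numeric(counts)) ~ log(k_vals))
gamma_ols <- -unname(coef(ols_fit)[2])

# Maximum-likelihood estimate under the assumed form of (4.41).
gamma_mle <- 1 + length(deg) / sum(log(deg / (k_min - 0.5)))

c(ols = gamma_ols, mle = gamma_mle)
```

On data like this the maximum-likelihood value should land close to `true_gamma`, while the regression slope depends heavily on how the sparse tail is binned.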
- 4.
    #==============================================================================
    # 1. EXERCISE PART 1 - Estimating Gamma =======================================
    #==============================================================================
    ## designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill)
    my_df = pat_citation_deg %>% rename(id = patent_id)
    ##------------------------------------------------------------------------------
    ## you'll write a for loop over individual unique degrees in the data-set to find the corresponding distance D
    ### let's create a data.frame with one column as each observed degree in our network;
    ### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill)
    D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat)
    ## here you set up the maximum degree to check so that you do not have to do the computation for all degrees
    ### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that for you: (nothing to fill)
    max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25)))
    ### Now discard the rows of D_df you do not need (that are above the max_degree_to_check). Next line of code does it for you: (nothing to fill)
    D_df = D_df[D_df$k_sat < max_degree_to_check,]
    ### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill)
    head(D_df)
    tail(D_df)
    ## Understand and fill parts of the code in this loop
    ## I recommend setting i = 1 and running each line of this loop on your own and checking what it gives you. This will help you fill the gaps
    ##------------------------------------------------------------------------------
    ## let's work on the loop
    for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses.
      ## let's show the current loop k_sat (so that we can see our progress): (nothing to fill)
      print(paste0("at %", round(100 * i/nrow(D_df), 2)))
      k_sat_temp = D_df$k_sat[i]
- 5.
      ##----------------------------------------------------------------------------
      ## let's create a temporary copy of the network degree data that contains degrees equal or above k_sat_temp: (nothing to fill)
      temp_df = my_df[my_df$degree>k_sat_temp,]
      ##----------------------------------------------------------------------------
      ## step 1: estimate gamma for this loop and call it 'temp_gamma'
      ### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over elements
      ### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing to fill)
      temp_vec_k_i = temp_df$degree/(k_sat_temp)
      ### now use the above vector in (4.41); remember N is the number of nodes in your network. N = nrow(my_df) (nothing to fill)
      temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41)
      ##----------------------------------------------------------------------------
      ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to pass the KS test in step 3:
      ### k will be a variable that KS test will use, so make it an argument of CDF_k;
      ### put gamma and k_sat of this loop in the body of the function
      CDF_k = function(k) {
        ### FILL THIS FUNCTION ACCORDING TO (4.43)
        return(1 - (gen_zeta(temp_gamma, k) / ...))
      }
      ##----------------------------------------------------------------------------
      ## step 3: run KS test and pass the statistic as D to the corresponding column of D_df
      KS_D = ks.test(temp_df$degree, ...)
      ### fill in with function you created above: just pass the function name (without parentheses, or brackets, or quotes)
      ### * you can take a look here if you couldn't figure it out: https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution-function-to-use-with-kolmogorov-smirno
      D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill)
      ## let's also store the gamma so that we do not have to compute it again once we have an optimal k_sat (nothing to fill)
      D_df[i,'gamma'] = temp_gamma
- 6.
    }
    ##------------------------------------------------------------------------------
    ## step 4: plot D against k_sat, find the k_sat that minimizes D, and the corresponding gamma
    ### let's first take a look at the D_df we have formed (nothing to fill)
    head(D_df, 10)
    ### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill)
    optimal_k_sat = D_df[which.min(D_df$D),'k_sat']
    ### let's take a look at the D_df we have formed (nothing to fill)
    ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) +
      geom_point(size = 3, alpha = .5, color = "purple") +
      geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") +
      ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)),
                        label = paste0("Optimal k_sat = ",optimal_k_sat), color = "red") +
      my_theme +
      labs(x = "k", y = "D")
    ### find the D corresponding to 'optimal_k_sat' (nothing to fill)
    min(D_df$D)
    ### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill)
    (optimal_gamma = D_df[which.min(D_df$D),'gamma'])
    ## Discard observations with degree below the best k_sat you found earlier. (nothing to fill)
    post_data = my_df %>% filter(degree >= optimal_k_sat)
    ##------------------------------------------------------------------------------
    ## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points (nothing to fill)
    p_post <- ggplot(post_data, aes(x = degree)) +
      geom_point(stat = 'bin', color = "blue", size = 2.5,
                 bins = 3 * ceiling(log(nrow(post_data)))) +
      scale_x_continuous(trans = "log",
                         breaks = scales::trans_breaks("log", function(x) exp(x)),
                         labels = scales::trans_format("log", scales::math_format(e^.x))) +
      scale_y_continuous(trans = "log",
                         breaks = scales::trans_breaks("log", function(x) exp(x)),
                         labels = scales::trans_format("log", scales::math_format(e^.x))) +
      labs(x = "Degree", y = "Frequency") +
      my_theme
    ## fit a line to the Log-log degree distribution (nothing to fill)
    p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will
- 7.
                         ## take the binned data generated by ggplot to fit the line
                         mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5)
    #==============================================================================
    # 2. EXERCISE PART 2 - Goodness-of-fit ========================================
    #==============================================================================
    ## We are going to create a vector of synthetic sequences of degrees and repeat the process M times
    ## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill)
    M = 100
    ## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill)
    D_gof_df = data.frame(iter = 1:M, D_synthetic = NA)
    ##------------------------------------------------------------------------------
    ## step 1: store the distance you found in part 1 as D_real (nothing to fill)
    D_real = min(D_df$D)
    ##------------------------------------------------------------------------------
    ## I. Let's walk through steps 2 and 3 once outside of the loop
    ##------------------------------------------------------------------------------
    ##------------------------------------------------------------------------------
    ## step 2: you will need to define the inverse of the CDF function (so that you generate random probability values [0,1] and get degrees back)
    ### let's write the CDF that best fits the data (we did this in part 1):
    CDF_k = function(k) {
      ### FILL THE FUNCTION ACCORDING TO (4.43)
      return(1 - (gen_zeta(optimal_gamma, k) / ...))
    }
    ### 2.1. Let's define the inverse of your CDF; (nothing to fill)
    ### if the next line is hard to understand, check here: https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r
- 8.
    #### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0 up to a bit higher than the highest observed degree in our data)
    Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){
      function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root
    }
    inverse_CDF = Inverse(CDF_k)
    ### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
    inverse_CDF(0.4)
    ## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to inverse_CDF
    rand_p = runif(1)
    inverse_CDF(rand_p)
    ### mini step 2.2. let's generate 5 random numbers between 0 and 1 and get 5 degrees back from our inverse
    #### (unfortunately we have to write this complex code because inverse_CDF does not accept a vector; try inverse_CDF(c(0.1, 0.2)). )
    rand_p = runif(5)
    unlist(lapply(rand_p, function(p){inverse_CDF(p)}))
    ### step 2.2. OK, you know how to generate 5 random degrees. Let's create n random degrees, where n is the number of degrees in our remaining data (nothing to fill)
    rand_p = runif(nrow(post_data))
    rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
    rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster, but still not great!
    ##------------------------------------------------------------------------------
    ## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated and
    ### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2
    KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees)
    as.numeric(KS_D$statistic)
    ##------------------------------------------------------------------------------
    ## II. Now let's write the loop
    ##------------------------------------------------------------------------------
    for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with 3.5 GHz 6-Core and 64 GB memory. Lower to M = 20 if necessary
- 9. print(paste0("at %",100 * i/M)) ##------------------------------------------------------------------------------ ## step 2: generate a synthetic (random) sequence of degrees rand_p = ... ### FILL AS WE DID ABOVE rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster ##------------------------------------------------------------------------------ ## step 3: find the distance between the synthetic sequence and CDF_k and store it # KS_D = ks.test(rand_deg, CDF_k) KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE ks.test(first are synthetic degrees, second are real degrees) D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic) } ##------------------------------------------------------------------------------ ## Let's plot the results ### let's take a look at the D_df we have formed ggplot(D_gof_df, aes(x = D_synthetic)) + geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) + geom_vline(xintercept = D_real, size = 1, color = "brown") + my_theme + labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances") # IDS 564 - Spring 2023 # Lab 4 R Code - Estimating the Degree Exponent of a Scale-free Network #========================================================================= ===================== # 0. INITIATION ========================================================================== = #========================================================================= ===================== ## You'll need VGAM for the zeta function # install.packages("VGAM") ## When prompted to install from binary version, select no
- 10. library(VGAM) ## You'll need this when calculating goodness of fit # install.packages("parallel") library(parallel) library(ggplot2) library(ggthemes) library(dplyr) library(tidyr) ##------------------------------------------------------------------------------ ## This function will calculate the zeta function for you. You don't need to worry about it! Run it and continue. ## gen_zeta(gamma , shift) will give you a number gen_zeta <- function (gamma, shift = 1, deriv = 0) { deriv.arg <- deriv rm(deriv) if (!is.Numeric(deriv.arg, length.arg = 1, integer.valued = TRUE)) stop("'deriv' must be a single non-negative integer") if (deriv.arg < 0 || deriv.arg > 2) stop("'deriv' must be 0, 1, or 2") if (deriv.arg > 0) return(zeta.specials(Zeta.derivative(gamma, deriv.arg = deriv.arg,
- 11. shift = shift), gamma, deriv.arg, shift)) if (any(special <- Re(gamma) <= 1)) { ans <- gamma ans[special] <- Inf special3 <- Re(gamma) < 1 ans[special3] <- NA special4 <- (0 < Re(gamma)) & (Re(gamma) < 1) & (Im(gamma) == 0) # ans[special4] <- Zeta.derivative(gamma[special4], deriv.arg = deriv.arg, shift = shift) special2 <- Re(gamma) < 0 if (any(special2)) { gamma2 <- gamma[special2] cgamma <- 1 - gamma2 ans[special2] <- 2^(gamma2) * pi^(gamma2 - 1) * sin(pi * gamma2/2) * gamma(cgamma) * Recall(cgamma) } if (any(!special)) { ans[!special] <- Recall(gamma[!special]) } return(zeta.specials(ans, gamma, deriv.arg, shift)) } aa <- 12 ans <- 0
- 12. for (ii in 0:(aa - 1)) ans <- ans + 1/(shift + ii)^gamma ans <- ans + Zeta.aux(shape = gamma, aa, shift = shift) ans[shift <= 0] <- NaN zeta.specials(ans, gamma, deriv.arg = deriv.arg, shift = shift) } ## example: gen_zeta(2.1, 4) ##------------------------------------------------------------------------------ ## The P_k (the CDF) P_k = function(gamma, k, k_sat){ ### fill the function return(1 - ( gen_zeta(gamma, k) / ... )) } ##------------------------------------------------------------------------------ my_theme <- theme_classic() + theme(legend.position = "bottom", legend.box = "horizontal", legend.direction = "horizontal", title = element_text(size = 18), axis.title = element_text(size = 14), axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16), strip.text = element_text(size = 14), strip.background = element_blank(), strip.text.x = element_text(size = 14), strip.text.y = element_text(size = 14), legend.title = element_text(size = 14), legend.text = element_text(size = 14))
- 13. set.seed(123) #========================================================================= ===================== # 00. LOADING DATA ======================================================================== #========================================================================= ===================== ## Load data - fill the path of the folder where you put the file. If the file is in the same folder as the R code remove the path part. your_path = "your path" pat_citation_deg = read.csv(paste0(your_path, "lab 4 data - sample_patant_citation_deg.csv")) head(pat_citation_deg) tail(pat_citation_deg) summary(pat_citation_deg) ## let's have a look at the Log-log degree distribution plot (nothing to fill) p <- ggplot(pat_citation_deg, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(pat_citation_deg))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) +
- 14. labs(x = "Degree", y = "Frequency") + my_theme + theme(title = element_text(size = 12 )) ## fit a line to the Log-log degree distribution (nothing to fill) p + geom_smooth(data = ggplot_build(p)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5) #========================================================================= ===================== # 1. EXERCISE PART 1 - Estimating Gamma =================================================== #========================================================================= ===================== ## designate the data.frame to be used - and use standardized column names: id, degree (nothing to fill) my_df = pat_citation_deg %>% rename(id = patent_id) ##------------------------------------------------------------------------------ ## you'll write a for loop over individual unique degrees in the data-set to find the corresponding distance D ### let's create a data.frame with one column as each observed degree in our network; ### We will fill the other columns D and gamma as we calculate them in the loop (nothing to fill) D_df = data.frame(k_sat = unique(my_df$degree), D= NA, gamma = NA) %>% arrange(k_sat)
- 15. ## here you set up the maximum degree to check so that you do not have to do the computation for all degrees ### Let's set it up as 25th percentile of the unique observed degrees. Next line of code does that for you: (nothing to fill) max_degree_to_check = as.numeric(ceiling(quantile(unique(my_df$degree), 0.25))) ### Now discard the rows of D_df you do not need (that are above the max_degree_to_check). Next line of code does it for you: (nothing to fill) D_df = D_df[D_df$k_sat < max_degree_to_check,] ### Let's take a look at the distance data.frame we are about to fill in the loop: (nothing to fill) head(D_df) tail(D_df) ## Understand and fill parts of the code in this loop ## I recommend setting i = 1 and running each line of this loop on your own and checking what it gives you. This will help you fill the gaps ##------------------------------------------------------------------------------ ## let's work on the loop for (i in 1:(nrow(D_df))) { ## note: the loop starts slower but will speed up as it progresses. ## let's show the current loop k_sat (so that we can see our progress): (nothing to fill) print(paste0("at %", round(100 * i/nrow(D_df), 2))) k_sat_temp = D_df$k_sat[i] ##---------------------------------------------------------------------------- ## let's create a temporary copy of the network degree data that contains degrees equal or above
- 16. k_sat_temp: (nothing to fill) temp_df = my_df[my_df$degree>k_sat_temp,] ##---------------------------------------------------------------------------- ## step 1: estimate gamma for this loop and call it 'temp_gamma' ### create a vector of k_i/ (k_sat) so that you can feed it to natural logarithm and sum over elements ### k_i is each observed node degree in your data; k_sat refers to the k_sat of this loop (nothing to fill) temp_vec_k_i = temp_df$degree/(k_sat_temp) ### now use the above vector in (4.41); remember N is the number of nodes in your network. N = nrow(my_df) (nothing to fill) temp_gamma = 1 + (nrow(temp_df) / sum(log(temp_vec_k_i)) - 1/2) ## (4.41) ##---------------------------------------------------------------------------- ## step 2: now use (temp_gamma, k_sat_temp) to write (4.43) as a function called 'CDF_k' to pass the KS test in step 3: ### k will be a variable that KS test will use, so make it an argument of CDF_k; ### put gamma and k_sat of this loop in the body of the function CDF_k = function(k) { ### FILL THIS FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(temp_gamma, k) / ...)) } ##---------------------------------------------------------------------------- ## step 3: run KS test and pass the statistic as D to the corresponding column of D_df
- 17. KS_D = ks.test(temp_df$degree, ...) ### fill in with function you created above: just pass the function name (without parantheses, or brackets, or quotes) ### * you can take a look here if you couldn't figure it out: https://stats.stackexchange.com/questions/47730/r-defining-a-new-continuous-distribution- function-to-use-with-kolmogorov-smirno D_df[i,'D'] = as.numeric(KS_D$statistic) ### (nothing to fill) ## let's also store the gamma so that we do not have to compute it again once we have an optimal k_sat (nothing to fill) D_df[i,'gamma'] = temp_gamma } ##------------------------------------------------------------------------------ ## step 4: plot D against k_sat find the k_sat that minimizes D, and the corresponding gamma ### let's first take a look at the D_df we have formed (nothing to fill) head(D_df, 10) ### find the optimal k_sat that yields minimum D and call it 'optimal_k_sat' (nothing to fill) optimal_k_sat = D_df[which.min(D_df$D),'k_sat'] ### let's take a look at the D_df we have formed (nothing to fill) ggplot(D_df %>% drop_na(), aes(x = k_sat, y = D)) + geom_point(size = 3, alpha = .5, color = "purple") + geom_vline(xintercept = optimal_k_sat, size = 1, color = "red") + ggplot2::annotate("text", x = optimal_k_sat + 15, y = as.numeric(quantile(D_df$D,.85)), label = paste0("Optimal k_sat = ",optimal_k_sat), color = "red") +
- 18. my_theme + labs(x = "k", y = "D") ### find the D corresponding to 'optimal_k_sat' (nothing to fill) min(D_df$D) ### find the gamma corresponding to the optimal k_sat and call it 'optimal_gamma' (nothing to fill) (optimal_gamma = D_df[which.min(D_df$D),'gamma']) ## Discard observations with degree below the best k_sat you found earlier. (nothing to fill) post_data = my_df %>% filter(degree >= optimal_k_sat) ##------------------------------------------------------------------------------ ## let's take a look at the resulting Log-log degree distribution plot for the remaining data-points (nothing to fill) p_post <- ggplot(post_data, aes(x = degree)) + geom_point(stat = 'bin', color = "blue", size = 2.5, bins = 3 * ceiling(log(nrow(post_data))))+ scale_x_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + scale_y_continuous(trans = "log", breaks = scales::trans_breaks("log", function(x) exp(x)), labels = scales::trans_format("log", scales::math_format(e^.x))) + labs(x = "Degree", y = "Frequency") + my_theme
- 19. ## fit a line to the Log-log degree distribution (nothing to fill) p_post + geom_smooth(data = ggplot_build(p_post)$data[[1]] %>% filter(!is.infinite(y)), ## this will take the binned data generated by ggplot to fit the line mapping = aes(x=exp(x), y= exp(y)), method = "lm", se=FALSE, color = "red", size = 0.75, alpha = 0.5) #========================================================================= ===================== # 2. EXERCISE PART 2 - Goodness-of-fit ==================================================== #========================================================================= ===================== ## We are going to create a vector of synthetic sequences of degrees and repeat the process M times ## Usually, M is pretty big, like M = 10,000. For now, let's set M = 100: (nothing to fill) M = 100 ## so let's create a data.frame of M rows, one for every D_synthetic we will generate (nothing to fill) D_gof_df = data.frame(iter = 1:M, D_synthetic = NA) ##------------------------------------------------------------------------------ ## step 1: store the distance you found in part 1 as D_real (nothing to fill) D_real = min(D_df$D) ##------------------------------------------------------------------------------
- 20. ## I. Let's walk through steps 2 and 3 once outside of the loop ##------------------------------------------------------------------------------ ##------------------------------------------------------------------------------ ## step 2: you will need to define the inverse of the CDF function (so that you generate random probability values [0,1] and get degrees back) ### let's write the CDF that best fits the data (we did this in part 1): CDF_k = function(k) { ### FILL THE FUNCTION ACCORDING TO (4.43) return(1 - (gen_zeta(optimal_gamma, k) / ...)) } ### 2.1. Let's define the inverse of your CDF; (nothing to fill) ### if the next line is hard to understand, check here: https://stackoverflow.com/questions/23258482/use-inverse-cdf-to-generate-random-variable-in-r #### mini step 2.1.1. this piece of code will create an inverse for you (that searches the interval 0 up to a big higher than the highest observed degree in our data) Inverse = function(f, interval = c(sqrt(min(post_data$degree)), max(post_data$degree))){ function(y) uniroot((function(x) f(x) - y), interval = interval, extendInt = "yes")$root } inverse_CDF = Inverse(CDF_k) ### mini step 2.2.1 let's try it for a couple of numbers (nothing to fill)
- 21. inverse_CDF(0.4)

## mini step 2.2.2. runif(1) will generate a random real number between 0 and 1. Let's pass that to inverse_CDF
rand_p = runif(1)
inverse_CDF(rand_p)

### mini step 2.2. let's generate 5 random numbers between 0 and 1 and get 5 degrees back from our inverse
#### (unfortunately we have to write this more complex code because inverse_CDF does not accept a vector; try inverse_CDF(c(0.1, 0.2)))
rand_p = runif(5)
unlist(lapply(rand_p, function(p){inverse_CDF(p)}))

### step 2.2. ok, you know how to generate 5 random degrees. Let's create n random degrees, where n is the number of degrees in our remaining data (nothing to fill)
rand_p = runif(nrow(post_data))
rand_deg = unlist(lapply(rand_p, function(p){inverse_CDF(p)})) ## unfortunately this is a bit slow!!!
## this will make it a bit faster, but still not great! (on Windows, mclapply only supports mc.cores = 1)
rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1))

##------------------------------------------------------------------------------
## step 3: use ks.test to get a D_synthetic for the rand_deg you just generated
### FILL THE ks.test ACCORDING TO THE DISCUSSION ON STEP 3 PART 2
KS_D = ks.test(rand_deg, ...) ## ks.test(first are synthetic degrees, second are real degrees)
as.numeric(KS_D$statistic)
- 22. ##------------------------------------------------------------------------------
## II. Now let's write the loop
##------------------------------------------------------------------------------
for (i in 1:M){ ## this may take a while!!! It took me 15 minutes to run M = 100 on a computer with a 3.5 GHz 6-core CPU and 64 GB memory. Lower to M = 20 if necessary
  print(paste0("at %", 100 * i/M))
  ##------------------------------------------------------------------------------
  ## step 2: generate a synthetic (random) sequence of degrees
  rand_p = ... ### FILL AS WE DID ABOVE
  rand_deg = unlist(mclapply(rand_p, function(p){inverse_CDF(p)}, mc.cores = parallel::detectCores() - 1)) ## this will make it a bit faster
  ##------------------------------------------------------------------------------
  ## step 3: find the distance between the synthetic sequence and CDF_k and store it
  # KS_D = ks.test(rand_deg, CDF_k)
  KS_D = ks.test(..., ...) ### FILL AS WE DID ABOVE: ks.test(first are synthetic degrees, second are real degrees)
  D_gof_df[i,'D_synthetic'] = as.numeric(KS_D$statistic)
}
##------------------------------------------------------------------------------
## Let's plot the results
### let's take a look at the D_gof_df we have formed
- 23. ggplot(D_gof_df, aes(x = D_synthetic)) +
  geom_histogram(bins = 20, color = "white", fill = "green", alpha = 0.9) +
  geom_vline(xintercept = D_real, size = 1, color = "brown") +
  my_theme +
  labs(x = "Distance", y = "Frequency", title = "Synthetic and Real Distances")

Please read these instructions carefully before starting this exercise. To complete this lab, you have a short, required reading included in this document that explains the challenges in finding the degree exponent ($\gamma$) of a power law for an observed network. The fitting procedure corrects such problems; you will need to read it to follow and understand the exercise. You will fit a power law to a network provided for this lab. The estimation process (starting on page 3) contains two parts. The first part estimates the degree exponent ($\gamma$), and the second part assesses the goodness of fit using a simulation process. An attached R code handles the computational heavy lifting. However, filling in and completing the code requires your understanding of the process. I annotated the code in detail. The idea is to facilitate learning rather than to have you implement every necessary step. You still need to run every line of code. However, the lines you do not need to change show (nothing to fill) in their annotation.

The data you will handle is a random sample of 80,000 US patents taken from a larger corpus of about 8 million patents. Each row shows a patent id number (first column) and a degree representing the number of times the patent was cited by other patents (second column). So, the data is the degree sequence of a network, summarized to help you handle the lab. The R code provides you with a log-log plot to get a sense of the data. Here are the first ten rows.*
* Patents without any citations do not appear in the data.

Unlike your previous labs, lab 4 does not require you to complete a quiz.
All you need to submit is your R code and a single-page document that contains two figures (similar to Figure 4.24 parts b and c), reports two values ($\gamma$ and $k_{sat}$), discusses how well your power law fits the network data, and explains your finding. Note that the second part of the exercise will need a bit of time to run. Accommodate it by starting your lab 4 early.

Required Reading: Degree Distribution of Real Networks

In real systems, we rarely observe a degree distribution that follows a pure power law. Instead, for most real systems $p_k$ has the shape shown in Figure 4.23a, with two recurring problems:
- Low-degree saturation is a common deviation from the power-law behavior. Its signature is a flattened $p_k$ for $k < k_{sat}$. This indicates that we have fewer small-degree nodes than expected for a pure power law. The origin of the saturation will be explained in Chapter 6.
- High-degree cutoff appears as a rapid drop in $p_k$ for $k > k_{cut}$, indicating that we have fewer high-degree nodes than expected in a pure power law. This limits the size of the largest hub, making it smaller than predicted by what we derived in class ($k_{max} = k_{min} N^{1/(\gamma - 1)}$). High-degree cutoffs emerge if there are inherent limitations in the number of links a node can have. For example, in social networks individuals have difficulty maintaining meaningful relationships with an exceptionally large number of acquaintances, or a platform may impose a cap, as in the case of Facebook's 5k limit on friends.

Barabasi Figure 4.23. Rescaling the Degree Distribution
a. In real networks the degree distribution frequently deviates from a pure power law by showing a low-degree saturation and a high-degree cutoff.
b. By plotting the rescaled $\tilde{p}_k$ as a function of $(k + k_{sat})$, as suggested by (4.40), the degree distribution follows a power law for all degrees.
- 24. Given the widespread presence of such cutoffs, the degree distribution is often fitted to:
(4.39) $p_k = a (k + k_{sat})^{-\gamma} \exp(-k / k_{cut})$
where $k_{sat}$ accounts for degree saturation, and the exponential term accounts for the high-$k$ cutoff. How do we deal with these two problems? To extract the full extent of the scaling, we correct the plot by multiplying by a term as follows:
(4.40) $\tilde{p}_k = p_k \exp(k / k_{cut})$
If you look at the resulting function, $\tilde{p}_k = a (k + k_{sat})^{-\gamma}$ follows the power-law form we would like to see as a function of $\tilde{k} = k + k_{sat}$: $\tilde{p}_k \sim \tilde{k}^{-\gamma}$, correcting for the two cutoffs, as seen in Figure 4.23b. This is the idea. We will go through this correction in the exercise.

It is occasionally claimed that the presence of low-degree or high-degree cutoffs implies that the network is not scale-free. This is a misunderstanding of the scale-free property: virtually all properties of scale-free networks are insensitive to the low-degree saturation. Only the high-degree cutoff affects the system's properties, by limiting the divergence of the second moment (which determines the variance). The presence of such cutoffs indicates the presence of additional phenomena that need to be understood - that will not be discussed here.

Exercise: Estimating the Degree Exponent

As the properties of scale-free networks depend on the degree exponent, we need to determine the value of $\gamma$. We face several difficulties, however, when we try to fit a power law to real data. The most important is the fact that the scaling is rarely valid for the full range of the degree distribution. Rather, we observe the small- and high-degree cutoffs described above, $k_{sat}$ and $k_{cut}$, between which we have a clear scaling region. Here we focus on estimating the small-degree cutoff $k_{sat}$, as the high-degree cutoff can be determined in a similar fashion.
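Returning to the rescaling in (4.39) and (4.40), here is a minimal base-R sketch of why the correction straightens the curve. The parameter values (`a`, `gamma`, `k_sat`, `k_cut`) and the degree range are hypothetical, chosen only for illustration; they are not estimates from the lab data.

```r
# Hypothetical parameter values, chosen only for illustration
a     <- 1
gamma <- 2.5
k_sat <- 10
k_cut <- 500

k <- 1:2000

# (4.39): power law with low-degree saturation and high-degree cutoff
p_k <- a * (k + k_sat)^(-gamma) * exp(-k / k_cut)

# (4.40): undo the exponential cutoff and shift the degree axis
p_tilde <- p_k * exp(k / k_cut)
k_tilde <- k + k_sat

# On log-log axes the rescaled curve is a straight line with slope -gamma
slope_rescaled <- unname(coef(lm(log(p_tilde) ~ log(k_tilde)))[2])

# The raw curve bends at both ends, so its fitted slope differs from -gamma
slope_raw <- unname(coef(lm(log(p_k) ~ log(k)))[2])

print(c(rescaled = slope_rescaled, raw = slope_raw))
```

Comparing the two slopes shows what the rescaling buys: only the rescaled curve recovers the exponent that was put in.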
The reader is advised to consult the discussion on systematic problems provided at the end of this section as optional reading.

Part 1: Fitting Procedure

As the degree distribution is typically provided as a list of positive integers $k_{min}, \ldots, k_{max}$, we aim to estimate $\gamma$ from a discrete set of data points (only those observed in the data). Here you will see figures for an article citation network to illustrate the procedure. You will implement the procedure on the patent data explained above. The article citation network illustrated here consists of N = 384,362 nodes, each node representing a research paper published between 1890 and 2009 in journals published by the American Physical Society. The network has L = 2,353,984 links, each representing a citation from a published research paper to some other publication in the dataset (outside citations are ignored). The steps of the fitting process are:

1. Choose a value of $k_{sat}$ between $k_{min}$ and $k_{max}$. Estimate the value of the degree exponent corresponding to this $k_{sat}$ using:
(4.41) $\gamma = 1 + N \left[ \sum_{i=1}^{N} \ln \frac{k_i}{k_{sat} - 1/2} \right]^{-1}$
* Notice two things. First, in practice, you do not need to check all degrees between $k_{min}$ and $k_{max}$. Indeed, you can check degrees up to a certain quantile. In the code, you loop over $k_{min}$ up to the degree at the 25th percentile - this is already set up for you. Second, in (4.41), the summation starts from i = 1. This applies to the data that has a degree at least equal to $k_{sat}$ for that loop. For instance, if you are checking $k_{sat} = 4$, all data corresponding to values with a degree below 4 will be discarded and do not enter the calculation of $\gamma$ in (4.41) - this is also set up for you in R.

2. With the obtained ($\gamma$, $k_{sat}$) parameter pair, assume that the degree distribution has the form:
(4.42) $p_k = \frac{k^{-\gamma}}{\sum_{x=0}^{\infty} (x + k_{sat})^{-\gamma}}$
First, note that the sum in the denominator is called a zeta function. When running the exercise, a function I have provided will calculate zeta for you, so you do not have to worry about it. Just learn the notation so that you can pass the arguments: $zeta(\gamma, a) = \sum_{x=0}^{\infty} (x + a)^{-\gamma}$, where $\gamma$ is the main parameter and $a$ is the shift.
- 25. You will pass these two parameters in the same order to the function provided in R. Second, notice that $p_k$ in (4.42) is like the power law we have seen in class if you take $C = zeta(\gamma, k_{sat})^{-1}$. This is, in fact, a constant (given parameters $\gamma$ and $k_{sat}$) and is the discrete form of C (in slide 15 of session 9 we derived C for a continuous closed form; the discrete form is easier to work with here since you have discrete data from your network). You do not need to implement $p_k$. Next is what you need to write in R. With $p_k$ from (4.42) as the probability density function, the cumulative distribution function (CDF) is:
(4.43) $P_k = 1 - \frac{zeta(\gamma, k)}{zeta(\gamma, k_{sat})}$
You will write this in R using the zeta function. Your $P_k$ implementation in R should be something like this, taking k as input and having $\gamma$ and $k_{sat}$ inside:
CDF(k) = 1 - zeta(gamma, k) / zeta(gamma, k_sat)

3. Use the Kolmogorov-Smirnov (KS) test to determine the distance D between your network data, let's call it S(k), and the fitted model provided by (4.43) with the selected ($\gamma$, $k_{sat}$) parameter pair. In R, ks.test(..)$statistic will give you the D. To implement this in R, you can pass the degree distribution of your network, which we call S(k), as the first parameter to the function ks.test. The second parameter should be the function $P_k$ you just defined. The implementation of your KS test in R for the setup described will look like this:
ks.test(S(k), CDF(k))
* Both CDF(k) and ks.test written above are essentially pseudocode. Change them as necessary in your R code.

4. Repeat steps (1-3) by scanning the $k_{sat}$ range from $k_{min}$ to the degree at the 25th percentile. We aim to identify the $k_{sat}$ value for which the D provided by the test is minimal and call it 'optimal $k_{sat}$'. To illustrate the procedure, we plot D as a function of $k_{sat}$ for the paper citation network (Figure 4.24b). The plot indicates that D is minimal for $k_{sat} = 49$, and the corresponding $\gamma$ estimated by (4.41), representing the optimal fit, is $\gamma = 2.79$.
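Steps 1 to 4 can be sketched in base R as follows. This is an illustration under stated assumptions, not the lab's solution: `hurwitz_zeta` is a hypothetical stand-in for the provided `gen_zeta` (a truncated sum plus an Euler-Maclaurin tail correction), the distance D is computed directly as the largest gap between the empirical and model CDFs instead of through `ks.test` (a substitution made here to sidestep tie handling on integer degrees), and `toy_degrees` is simulated data, not the patent sample.

```r
# Hypothetical stand-in for gen_zeta: zeta(s, a) = sum_{x >= 0} (x + a)^(-s),
# valid for s > 1 and a > 0. Direct sum of the first n_terms terms plus an
# Euler-Maclaurin correction for the remaining tail.
hurwitz_zeta <- function(s, a, n_terms = 50) {
  x <- 0:(n_terms - 1)
  b <- a + n_terms
  sum((x + a)^(-s)) + b^(1 - s) / (s - 1) + 0.5 * b^(-s) + s * b^(-s - 1) / 12
}

# Step 1, equation (4.41): exponent estimate from the degrees >= k_sat
estimate_gamma <- function(degrees, k_sat) {
  k <- degrees[degrees >= k_sat]
  1 + length(k) / sum(log(k / (k_sat - 0.5)))
}

# Step 2, equation (4.43): model CDF, vectorised over k
power_law_cdf <- function(k, gamma_hat, k_sat) {
  uk <- unique(k)
  z  <- vapply(uk, function(x) hurwitz_zeta(gamma_hat, x), numeric(1))
  1 - z[match(k, uk)] / hurwitz_zeta(gamma_hat, k_sat)
}

# Step 3: largest gap between the empirical CDF S(k) and the model CDF.
# (4.43) is the probability of a degree strictly below k, so the empirical
# CDF is evaluated at k - 1 to use the same convention.
ks_distance <- function(degrees, gamma_hat, k_sat) {
  k    <- degrees[degrees >= k_sat]
  grid <- k_sat:(max(k) + 1)
  S    <- ecdf(k)(grid - 1)
  max(abs(S - power_law_cdf(grid, gamma_hat, k_sat)))
}

# Step 4: repeat steps 1-3 for every candidate k_sat
scan_k_sat <- function(degrees, candidates) {
  do.call(rbind, lapply(candidates, function(k_sat) {
    g <- estimate_gamma(degrees, k_sat)
    data.frame(k_sat = k_sat, gamma = g, D = ks_distance(degrees, g, k_sat))
  }))
}

# Toy degree sequence (simulated, for illustration only)
set.seed(564)
toy_degrees <- sample(1:1000, 3000, replace = TRUE, prob = (1:1000)^(-2.5))

D_scan <- scan_k_sat(toy_degrees, candidates = 1:8)
best   <- D_scan[which.min(D_scan$D), ]
print(best)
```

Plotting `D_scan$D` against `D_scan$k_sat` gives the analogue of Figure 4.24b for the toy data; on real data the candidate range would run from the minimum degree to the degree at the 25th percentile, as described in step 4.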
What you report for part 1 on your network:
- A plot of D against $k_{sat}$ (similar to Figure 4.24b).
- The value of $k_{sat}$ that minimizes D, the resulting D, and the corresponding $\gamma$.

Barabasi Figure 4.24. Maximum Likelihood Estimation
a. The degree distribution $p_k$ of the citation network, where the straight purple line represents the best fit based on the model (4.39).
b. The values of the Kolmogorov-Smirnov test vs. $k_{sat}$ for the citation network.
c. $p(D_{synthetic})$ for M = 10,000 synthetic datasets, where the grey line corresponds to the $D_{real}$ value extracted for the citation network.

Part 2: Goodness of Fit

Just because we obtained a ($\gamma$, $k_{sat}$) pair that represents an optimal fit to our dataset does not mean that the power law itself is a good model for the studied distribution. Therefore, we need to use a goodness-of-fit test, which generates a p-value that quantifies the plausibility of the power-law hypothesis. The most often-used procedure consists of the following steps:

1. Use the distance (KS statistic for the best $k_{sat}$) you found in part 1. Call it $D_{real}$. For instance, the selected $k_{sat} = 49$ and the distance $D_{real} = 0.01158$ for the citation network. $D_{real}$ is the distance between the observed data and our fit based on the parameter pair ($\gamma$, $k_{sat}$).
2. Use (4.42) to generate a degree sequence of N degrees (i.e., the same number of random numbers as the number of nodes in the original dataset). This is synthetic data we have generated as a hypothetical degree sequence.
3. Now, calculate the distance between the synthetic data and your actual data using the Kolmogorov-Smirnov test. Call the new distance $D_{synthetic}$. Hence, $D_{synthetic}$ represents the distance between a synthetically generated degree sequence, consistent with our degree distribution, and the real data. The goal is to see if the obtained $D_{synthetic}$ is comparable to $D_{real}$.
For this, we repeat steps (2) and (3) M times (say M = 100). Each time, we generate a new degree sequence and determine the corresponding $D_{synthetic}$. Eventually, we obtain the $p(D_{synthetic})$ distribution (i.e., the histogram of all 100 $D_{synthetic}$ values you generated by repeating steps 2 and 3).
- 26. Plot $p(D_{synthetic})$ and show $D_{real}$ as a vertical bar (Figure 4.24c). If $D_{real}$ is within the $p(D_{synthetic})$ distribution, it means that the distance between the model providing the best fit and the empirical data is comparable with the distance expected from random degree samples chosen from the best-fit distribution. Hence the power law is a reasonable model for the data. If, however, $D_{real}$ falls outside the $p(D_{synthetic})$ distribution, then the power law is not a good model - some other function is expected to describe the original $p_k$ better.

While the distribution shown in Figure 4.24c may be, in some cases, useful to illustrate the statistical significance of the fit, in general it is better to assign a p-value to the fit:
$p = \int_{D_{real}}^{\infty} P(D_{synthetic}) \, dD_{synthetic}$
The closer p is to 1, the more likely it is that the difference between the empirical data and the model can be attributed to statistical fluctuations alone. If p is very small, the model is not a plausible fit to the data (typically, the model is accepted if p > 1%). You can skip this part, as our M is too small to draw any meaningful statistical inference. Based on the histogram you drew, does the power law fit the data well?

What you report for part 2:
- In which region we discussed in session 9 (slides 37-40) does your optimal $\gamma$ land?
- Plot the histogram of $D_{synthetic}$ and where $D_{real}$ lands (similar to Figure 4.24c).
- Is scale-free a good choice for the network? Based on the paragraph above, why?
- What could be the reason for your finding?

This concludes your lab 4. Congrats! You did some serious network analysis!

For the citation network, the authors obtain $p < 10^{-4}$, indicating that a pure power law is not a suitable model for the original degree distribution.
This outcome is somewhat surprising, as the power-law nature of citation data has been documented repeatedly since the 1960s. This failure indicates the limitation of blind fitting to a power law without an analytical understanding of the underlying distribution. Barabasi discusses how to correct the problem: We note that the fitting model (4.44) eliminates all the data points with $k < k_{sat}$. Choosing $k_{sat} = 49$ forces us to discard over 96% of the data points. Yet, there is statistically useful information in the data that falls in $k < k_{sat}$ that is ignored by the previous fit. We must introduce an alternate model that resolves this problem. I included their solution as part of an optional reading to this exercise. I included the pages from the beginning of this solution due to minor notation differences. You can find the discussion at the end of (PDF) page 4/7.
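The goodness-of-fit procedure from Part 2 can be sketched in base R as follows. Everything here is an assumption made for illustration: `gamma_hat` and `k_sat` are hypothetical values standing in for the output of part 1, the support of (4.42) is truncated at a hypothetical `k_upper` so that `sample()` can draw from it directly (instead of the `uniroot`-based inverse used in the lab code), and `real_degrees` is simulated from the model itself, so this toy run says nothing about the patent data.

```r
set.seed(564)

# Hypothetical best-fit parameters standing in for the output of part 1
gamma_hat <- 2.5
k_sat     <- 5
k_upper   <- 5000   # assumption: truncate the support to keep the sketch simple

support <- k_sat:k_upper
pmf     <- support^(-gamma_hat) / sum(support^(-gamma_hat))   # truncated (4.42)

# Step 2: draw n degrees from the fitted model
draw_degrees <- function(n) sample(support, n, replace = TRUE, prob = pmf)

# Toy stand-in for the real degrees (simulated, not the lab data)
real_degrees <- draw_degrees(2000)

# Step 1: D_real as the largest gap between the empirical and model CDFs,
# both taken as the probability of a degree <= k on the support
model_cdf <- cumsum(pmf)
D_real    <- max(abs(ecdf(real_degrees)(support) - model_cdf))

# Steps 2 and 3 repeated M times: distance between synthetic and real degrees
M <- 100
D_synthetic <- replicate(M, {
  synthetic <- draw_degrees(length(real_degrees))
  # ks.test warns about ties on integer data; only the statistic is used here
  as.numeric(suppressWarnings(ks.test(synthetic, real_degrees)$statistic))
})

# Empirical counterpart of the p-value integral: the share of synthetic
# distances that are at least as large as D_real
p_value <- mean(D_synthetic >= D_real)

hist(D_synthetic, breaks = 20, main = "Synthetic and Real Distances",
     xlab = "Distance")
abline(v = D_real, col = "brown", lwd = 2)
```

With M this small the resulting `p_value` is coarse (it can only take multiples of 1/M), which matches the caveat in the text about not drawing statistical conclusions from M = 100.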