Purpose
The main purpose is to be able to compute the share of resources used by node i
in relation to its neighbors:
r_i / Σ_{j ∈ N(i)} r_j

where r_i is node i's resources and the denominator Σ_{j ∈ N(i)} r_j is the sum of the resources of i's neighbors N(i).
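As a minimal Python illustration of the quantity itself (the `neighbors` and `resources` dicts below are hypothetical toy data, not from the question):

```python
# Minimal sketch: compute r_i / sum of neighbors' resources.
# `neighbors` and `resources` are hypothetical example data.
neighbors = {"i": ["j", "k"], "j": ["i"], "k": ["i"]}
resources = {"i": 2.0, "j": 3.0, "k": 5.0}

def share(node):
    # Denominator: sum of the resources of node's neighbors.
    denom = sum(resources[n] for n in neighbors[node])
    return resources[node] / denom

print(share("i"))  # 2.0 / (3.0 + 5.0) = 0.25
```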
I am open to any R, Python, or even Stata solution that can achieve this task, which I am close to giving up on. See the snippets below with my previous attempts.
To achieve this goal, I am trying to perform a search of this type:
| node | col1 | col2 | col3 |
|---|---|---|---|
| i | [A] | [list] | [list] |
| j | [A, B, i] | | |

Search for node i in the col1 lists of the other rows; whenever i is found in some row j's list, update i's own col1 by appending j:

| node | col1 | col2 | col3 |
|---|---|---|---|
| i | [A, j] | [list] | [list] |
| j | [A, B, i] | | |
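A hedged pandas sketch of this search-and-update step (the `node`/`col1` names mirror the tables above; the two-row dataframe is hypothetical): build a reverse index from list members to the rows that mention them, then append each mentioning row back into the mentioned node's list.

```python
import pandas as pd

# Hypothetical two-row example mirroring the tables above.
df = pd.DataFrame({
    "node": ["i", "j"],
    "col1": [["A"], ["A", "B", "i"]],
})

# Reverse index: for each member m of a row's list, remember that row's node.
reverse = {}
for node, members in zip(df["node"], df["col1"]):
    for m in members:
        reverse.setdefault(m, []).append(node)

# Append every node j that lists i back into i's own list.
df["col1"] = [members + reverse.get(node, [])
              for node, members in zip(df["node"], df["col1"])]
print(df.loc[df["node"] == "i", "col1"].iloc[0])  # ['A', 'j']
```

This is a single pass over the lists rather than a per-row search, so it scales linearly in the total number of list entries.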
Data
The dataframe has about 700k rows, and the lists can have at most 20 elements. The lists in col1–col3 may be empty. Entries look like '1579301860' and are stored as strings.
The first 10 entries of the df:
```
df[["ID","s22_12","s22_09","s22_04"]].head(10)

,ID,s22_12,s22_09,s22_04
0,547232925,[],[],[]
1,1195452119,[],[],[]
2,543827523,[],[],[]
3,1195453927,[],[],[]
4,1195456863,[],[],[]
5,403735824,[],[],[]
6,403985344,[],[],[]
7,1522725190,"['547232925', '1561895862', '1195453927', '1473969746', '1576299336', '1614620375', '1526127302', '1523072827', '398988727', '1393784634', '1628271142', '1562369345', '1615273511', '1465706815', '1546795725']","['1550103038', '547232925', '1614620375', '1500554025', '1526127302', '1523072827', '1554793443', '1393784634', '1603417699', '1560658585', '1533511207', '1439071476', '1527861165', '1539382728', '1545880720']","['1529732185', '1241865116', '1524579382', '1523072827', '1526127302', '1560851415', '1535455909', '1457280850', '1577015775', '1600877852', '1549989930', '1528007558', '1533511207', '1527861165', '1591602766']"
8,789656124,[],[],[]
9,662539468,[1195453927],[],[]
```
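Since the neighbor lists arrive as strings such as "['547232925', '1561895862']", they need to be parsed into real lists before any graph work. A common sketch uses `ast.literal_eval`, assuming the cells really hold Python-list literals (the `parse_list` helper is hypothetical, not from the question):

```python
import ast

def parse_list(cell):
    """Turn a string like "['547232925', '1561895862']" into a list;
    pass cells that are already lists through unchanged."""
    if isinstance(cell, list):
        return cell
    return ast.literal_eval(cell)

print(parse_list("['547232925', '1561895862']"))  # ['547232925', '1561895862']
print(parse_list("[]"))  # []
```

On a pandas dataframe this would typically be applied column-wise, e.g. `df["s22_12"] = df["s22_12"].map(parse_list)`.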
What I tried
R attempts: I exploded the lists and put the data in a long format. Then I tried two main approaches in R:
- loading the long data into igraph, applying neighbors() to each node of the graph, saving the results into lists, and using plyr to build a neighbor_df (works, but processing just 2 nodes takes 67 seconds):

```r
# Initialize the result data frame
result <- data.frame(Node = nodes)
#result <- as.data.frame(matrix(NA, nrow = n_nodes, ncol = 0))
neighbor_lists <- lapply(nodes, function(x) {
  neighbors <- names(neighbors(graph, x))
  if (length(neighbors) == 0) {
    neighbors <- NA
  }
  return(neighbors)
})
neighbor_df <- plyr::ldply(neighbor_lists, rbind)
names(neighbor_df) <- paste0("Neighbor", 1:ncol(neighbor_df))
result <- cbind(result, neighbor_df)
```
- reading the long format with data.table, splitting, and applying dcast to the splits via lapply (memory overload):

```r
result_long <- edges[, .(to = to, Node = from)][, rn := .I][, .(Node, Neighbor = to, Number = rn)][order(Number), ]
result_long[, cast_cat := findInterval(Number, seq(100000, 6000000, 100000))]
# reshape to wide
result_wide <- dcast(result_long, Node ~ Number, value.var = "Neighbor", fill = "")
# Only tested on sample data; the target data is 19 mln rows, so dcast must be
# run on splits, but then it consumes 200 GB of RAM
result_wide[, (2:ncol(result_wide)) := lapply(.SD, function(x) ifelse(x == "", NA, x)), .SDcols = 2:ncol(result_wide)]
result_wide <- na_move(result_wide, cols = names(result_wide[, !1]))
result_wide <- Filter(function(x) !all(is.na(x)), result_wide)
```
I posted this code at Andy's request, though I think it clutters the question.
Answer
Thanks to the comment of @Stefano Barbi:
```r
# extract the attribute of interest
r <- vertex_attr(g, "rcount", index = V(g))

# create a dgC sparse matrix from the graph
m <- get.adjacency(g)

# premultiply the adjacency matrix to find the sum of the neighbors' resources
sum_of_rj <- r %*% m

# add the node's own resources
sum_of_r <- sum_of_rj + r

# find the vector of shares
share <- r / sum_of_r@x
sh_tab <- data.table(i = sum_of_r@Dimnames[[2]], sh = share)
sh_tab
```
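Since the question also welcomed Python, the same premultiplication trick can be sketched with scipy sparse matrices (the 3-node toy graph and resource vector below are hypothetical; the accepted answer itself uses R/igraph):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy undirected graph with edges 0-1 and 0-2.
adj = csr_matrix(np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]))
r = np.array([2.0, 3.0, 5.0])  # per-node resources

# Multiplying the (symmetric) adjacency matrix by r gives, for each node,
# the sum of its neighbors' resources.
sum_of_rj = adj @ r          # [8.0, 2.0, 2.0]
sum_of_r = sum_of_rj + r     # include the node's own resources, as in the answer
share = r / sum_of_r
print(share)                 # shares: 0.2, 0.6, ~0.714
```

As in the R answer, the whole computation is one sparse matrix-vector product plus elementwise operations, so it should scale to the 700k-node dataframe without materializing any wide neighbor table.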