Updating columns of lists based on match

Purpose

The main purpose is to be able to compute the share of resources used by node $i$ in relation to its neighbors:

$r_i / \sum_{j \in N(i)} r_j$

where $r_i$ is node $i$'s resources and $\sum_{j \in N(i)} r_j$ is the sum of the resources of $i$'s neighbors $N(i)$.
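As a quick numeric illustration (toy values, not from the data), the share for a node with $r_i = 2$ and neighbor resources 3 and 5:

r_i <- 2
r_neighbors <- c(3, 5)
r_i / sum(r_neighbors)  # 0.25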

I am open to any R, Python, or even Stata solution that can achieve this task, on which I am almost giving up. See the snippets below for my previous attempts.

To achieve this goal, I am trying to perform a search of this type:

node  col1       col2    col3
i     [A]        [list]  [list]
j     [A, B, i]

Search for i in col1; wherever it is found (here, in node j's list), append the matching node to i's col1 (a minimal sketch of this update follows the tables):

node  col1       col2    col3
i     [A, j]     [list]  [list]
j     [A, B, i]
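A minimal sketch of that update in R with data.table, assuming a table `dt` with a character column `node` and a list column `col1` (these names are hypothetical, not from the real data):

library(data.table)

dt <- data.table(node = c("i", "j"),
                 col1 = list(c("A"), c("A", "B", "i")))

# explode the list column: one row per (node, neighbor) pair
long <- dt[, .(neighbor = unlist(col1)), by = node]

# reverse each pair: if i appears in j's list, then j belongs in i's list
reversed <- long[, .(node = neighbor, neighbor = node)]

# keep both directions, drop duplicates, rebuild the list column
all_pairs <- unique(rbind(long, reversed))
updated <- all_pairs[node %in% dt$node, .(col1 = list(neighbor)), by = node]

updated[node == "i"]$col1  # now contains "A" and "j"

Staying in long format keeps the whole update vectorized, which matters at 700k rows.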

Data

The dataframe is about 700k rows, and the lists can have at most 20 elements. The lists in col1-col3 may be empty. Entries look like '1579301860' and are stored as strings.

The first 10 entries of the df:

df[["ID","s22_12","s22_09","s22_04"]].head(10)
,ID,s22_12,s22_09,s22_04
0,547232925,[],[],[]
1,1195452119,[],[],[]
2,543827523,[],[],[]
3,1195453927,[],[],[]
4,1195456863,[],[],[]
5,403735824,[],[],[]
6,403985344,[],[],[]
7,1522725190,"['547232925', '1561895862', '1195453927', '1473969746', '1576299336', '1614620375', '1526127302', '1523072827', '398988727', '1393784634', '1628271142', '1562369345', '1615273511', '1465706815', '1546795725']","['1550103038', '547232925', '1614620375', '1500554025', '1526127302', '1523072827', '1554793443', '1393784634', '1603417699', '1560658585', '1533511207', '1439071476', '1527861165', '1539382728', '1545880720']","['1529732185', '1241865116', '1524579382', '1523072827', '1526127302', '1560851415', '1535455909', '1457280850', '1577015775', '1600877852', '1549989930', '1528007558', '1533511207', '1527861165', '1591602766']"
8,789656124,[],[],[]
9,662539468,[1195453927],[],[]
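When this dump is read back from CSV, the list columns arrive as literal strings such as "['547232925', ...]". A minimal sketch of one way to turn them back into character vectors in R (assuming the single-quoted, comma-separated format shown above):

# parse a Python-style list literal into a character vector
parse_list <- function(s) {
  if (s == "[]") return(character(0))
  strsplit(gsub("\\[|\\]|'|\\s", "", s), ",")[[1]]
}

parse_list("['547232925', '1561895862']")
# [1] "547232925"  "1561895862"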

What I tried

R attempts: I exploded the lists and put the data in long format. Then I tried two main approaches in R:

  1. Loading the long data into igraph, applying neighbors() to each node of the graph, saving the results into lists, and using plyr to build a neighbor_df (works, but 2 nodes take 67 seconds):
library(igraph)

# Initialize the result data frame (nodes and graph come from the long data)
result <- data.frame(Node = nodes)
#result <- as.data.frame(matrix(NA, nrow = n_nodes, ncol = 0))

# For each node, collect the names of its neighbors (NA if it has none);
# a name other than `neighbors` avoids shadowing the igraph function
neighbor_lists <- lapply(nodes, function(x) {
  nbrs <- names(neighbors(graph, x))
  if (length(nbrs) == 0) nbrs <- NA
  nbrs
})

# row-bind the ragged lists into a wide neighbor data frame
neighbor_df <- plyr::ldply(neighbor_lists, rbind)
names(neighbor_df) <- paste0("Neighbor", 1:ncol(neighbor_df))
result <- cbind(result, neighbor_df)
  2. Reading the long format with data.table, splitting, and lapply-ing dcast over the splits (memory overload):
library(data.table)

# long format: one row per (Node, Neighbor) edge, numbered for splitting
result_long <- edges[, .(to = to, Node = from)][, rn := .I][, .(Node, Neighbor = to, Number = rn)][order(Number), ]
# bucket the row numbers so the dcast can be run chunk by chunk
result_long[, cast_cat := findInterval(Number, seq(100000, 6000000, 100000))]

# reshape to wide
result_wide <- dcast(result_long, Node ~ Number, value.var = "Neighbor", fill = "")
# Only tested on sample data; the target data is 19 million rows, so the dcast
# has to be split, but it then consumes 200 GB of RAM
result_wide[, (2:ncol(result_wide)) := lapply(.SD, function(x) ifelse(x == "", NA, x)), .SDcols = 2:ncol(result_wide)]
# na_move shifts NA entries within each row (helper not shown here)
result_wide <- na_move(result_wide, cols = names(result_wide[, !1]))
result_wide <- Filter(function(x) !all(is.na(x)), result_wide)
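Incidentally, the width explosion comes from casting on the global row number Number, which yields one column per edge; a per-node index caps the width at the maximum list length (20 here). A minimal sketch, reusing result_long from above:

# number neighbors within each node rather than globally
result_long[, idx := rowid(Node)]

# the wide result now has at most one column per neighbor slot
result_wide <- dcast(result_long, Node ~ idx, value.var = "Neighbor")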

I posted the code as per Andy's request, though I think it clutters the question.


Answer

Thanks to @Stefano Barbi's comment:

library(igraph)
library(data.table)

# extract the vertex attribute holding each node's resources
r <- vertex_attr(g, "rcount", index = V(g))

# create a dgCMatrix sparse adjacency matrix from the graph
# (get.adjacency is the older name of as_adjacency_matrix)
m <- get.adjacency(g)

# premultiply the adjacency matrix: entry k is the sum of the
# resources of node k's neighbors
sum_of_rj <- r %*% m

# add the node's own resources to the denominator
sum_of_r <- sum_of_rj + r

# find the vector of shares
share <- r / sum_of_r@x

sh_tab <- data.table(i = sum_of_r@Dimnames[[2]], sh = share)
sh_tab
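For completeness, a minimal sketch of how g and its rcount attribute might be set up from the exploded edge list before running the snippet above (object names assumed from context, not from the original post):

library(igraph)

# toy edge list in long format (one row per node-neighbor pair)
edges <- data.frame(from = c("i", "j", "j"),
                    to   = c("A", "A", "B"))

g <- graph_from_data_frame(edges, directed = FALSE)

# attach each node's resources as a vertex attribute,
# aligned with the vertex order in V(g)$name
V(g)$rcount <- c(2, 3, 1, 4)  # toy values

With an undirected graph, r %*% m gives a 1 x |V| row vector whose k-th entry sums rcount over the neighbors of vertex k, which is exactly the denominator term before each node's own resources are added back.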