I have faced the following subtask while working with graph data:
I need to construct graph connectivity matrices in COO format for graphs with several fully-connected components from arrays of “border” indices.
As an example, given array
borders = [0, 2, 5]
the resulting COO matrix should be
coo_matrix = [[0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4]].
That is, borders array contains ranges of nodes that should form fully-connected subgraphs (starting index included and ending index excluded).
I came up with the following algorithm, but I suspect that the perfomance could be improved:
import numpy as np def get_coo(borders): edge_list = [] for s, e in zip(borders, borders[1:]): # create fully-connected subgraph arr = np.arange(s, e) t = np.array(np.meshgrid(arr, arr)).T.reshape(-1, 2) t = t.T edge_list.append(t) edge_list = np.concatenate(edge_list, axis=1) return edge_list
I feel there may be a faster solution, maybe using some numpy vectorized operations.
Does anyone have any ideas?
Advertisement
Answer
Since your goal is a faster solution than what you have, you can explore itertools
for solving this efficiently. This approach benchmarks approximately 25 times faster than your current approach as tested on larger border
lists.
import numpy as np from itertools import product, chain def get_coo(borders): edges = chain(*[product(range(*i),repeat=2) for i in zip(borders, borders[1:])]) return list(edges) output = get_coo(borders) ## NOTE: Remember to can convert to array and Transpose for all approaches below to get Coo format. np.array(output).T
array([[0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4]])
Alternate approaches and benchmarks:
Note: These have been benchmarked on your current small list as well as on a larger border list as generated by
borders = np.arange(300)[np.random.randint(0,2,(300,),dtype=bool)]
Disjoint union of complete graphs
What you are trying to do is essentially combine disjoint complete graphs. Adjacency matrices for such graphs have complete connections for selective items along it’s diagonal. You can use networkx
to solve these.
While slower than your current solution, you will find that working on these graphs objects would be much much easier and rewarding than using NumPy to represent graphs.
Approach 1:
- Assuming that nodes are in sequence, calculate the number of nodes in each subgraph as
i
- Create a complete matrix filled with 1s of the shape
i*i
- Combine the graphs using
nx.disjoint_union_all
- Fetch the edges of this graph
import numpy as np import networkx as nx def get_coo(borders): graphs = [nx.from_numpy_matrix(np.ones((i,i))).to_directed() for i in np.diff(borders)] edges = nx.disjoint_union_all(graphs).edges() return edges %timeit get_coo(borders) #Small- 277 µs ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) #Large- 300 ms ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Approach 2:
- Iterate over the rolling 2-gram tuples of the
borders
usingzip
- Create
nx.complete_graph
usingranges(start, end)
from these tuples - Combine the graphs using
nx.disjoint_union_all
- Fetch the edges of this graph
import numpy as np import networkx as nx def get_coo(borders): graphs = [nx.complete_graph(range(*i),nx.DiGraph()) for i in zip(borders, borders[1:])] edges = nx.disjoint_union_all(graphs).edges() return edges %timeit get_coo(borders) #Small- 116 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) #Large- 207 ms ± 35.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Output is a bit faster than previous but lacks the self-loops on the nodes which would have to be added separately
Using itertools.product
Approach 3:
- Iterate over the rolling 2-gram tuples of the
borders
usingzip
- Use
itertools.product
completely connected edge list for each subgraph - Use
itertools.chain
to “append” the two iterators - Return them as edges
import numpy as np from itertools import product, chain def get_coo(borders): edges = chain(*[product(range(*i),repeat=2) for i in zip(borders, borders[1:])]) return list(edges) %timeit get_coo(borders) #Small- 3.91 µs ± 787 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) #Large- 183 µs ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This approach is approximately 25 times faster than your current approach
Your current approach – benchmark
def get_coo(borders): edge_list = [] for s, e in zip(borders, borders[1:]): # create fully-connected subgraph arr = np.arange(s, e) t = np.array(np.meshgrid(arr, arr)).T.reshape(-1, 2) t = t.T edge_list.append(t) edge_list = np.concatenate(edge_list, axis=1) return edge_list %timeit get_coo(borders) #Small- 95.1 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) #Large- 3.91 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)