I’m trying to agglomerate adjacent cells (and their neighbours) that have the same type (an integer from 1 to 10) into new clusters by assigning them a cluster id.
As visualised here for some of the clusters:
Currently, I use an adaptation of Breadth-First Search to go through all neighbours and their neighbours and then assign a cluster id to all found neighbours of the same type. Since my data set is fairly big (just under 3 million grid cells), this operation would take days in its current form.
My function, which I run in a loop, choosing start cells at random until all cells have been assigned a cluster id:
```python
def bfs(df, grid_id):
    visited = []
    queue = []
    visited.append(grid_id)
    queue.append(grid_id)
    base_pred = df.loc[grid_id, "class"]
    while queue:
        _ = queue.pop(0)
        n, e = get_n_e_index(grid_id)
        neighbours = get_same_neighbours(base_pred, n, e, agglo_df)
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
        queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)
    return visited
```
`visited` is used to assign the same cluster id.
`get_same_neighbours` just returns the neighbours of a given cell that have the same type.
Before I ended up validating the approach, I came to the conclusion that it’s ultimately too slow, and was wondering if anybody knows of an algorithm fast enough to solve this. Searching online for a solution didn’t help.
Edit: That it took so long wasn’t an issue of the algorithm, but of me forgetting to reduce the possible start cells to only those that hadn’t been touched before.
Edit2: `get_n_e_index` just converts a string ID to north and east integers.
```python
def get_same_neighbours(base_class, north, east, df):
    neighbours = []
    grid_ids = df["ids"].values
    for i in [-1, 1, 0]:
        for j in [-1, 1, 0]:
            n = int(north + i)
            e = int(east + j)
            grid_id = get_grid_id(n, e)
            invalid = [(-1, 1), (1, 1), (0, 0), (1, -1), (-1, -1)]  # invalid combinations
            if grid_id in grid_ids and (j, i) not in invalid and base_class == df.loc[grid_id, "class"]:
                neighbours.append(grid_id)
    return neighbours
```
`get_grid_id` converts north (n) and east (e) integers to the string id of the grid cell.
Answer
You’re making a couple of mistakes. Note that for ~3 million cells, this should work within about a second.
As pointed out in the comments, `visited` should be a set for O(1) lookup.
Second, `queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)` can get very slow if the queue gets long, as it creates a copy of the entire queue on every iteration. Instead, simply `append` the new items.
Finally, your `queue` is not actually a queue but a list, so popping the first element takes O(n) time. Instead you can use a real queue (`collections.deque` in Python), or pop the last element instead, which turns breadth-first search into depth-first search and should work just as well here.
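Putting those three fixes together, a corrected search might look like the following sketch. The demo grid, the `same_neighbours` helper, and the 4-neighbourhood are stand-ins for the question’s dataframe and `get_same_neighbours`, so adapt them to your own helpers:

```python
from collections import deque

def bfs(start, same_neighbours):
    """Collect all cells reachable from `start` through same-class
    neighbours. `same_neighbours(cell)` plays the role of the
    question's get_same_neighbours."""
    visited = {start}        # a set: O(1) membership tests
    queue = deque([start])   # a deque: O(1) pops from the front
    while queue:
        current = queue.popleft()
        for neighbour in same_neighbours(current):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)  # append only, no list copies
    return visited

# Tiny demo grid: (north, east) -> class
grid = {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 1}

def same_neighbours(cell):
    n, e = cell
    return [(n + dn, e + de)
            for dn, de in [(-1, 0), (1, 0), (0, -1), (0, 1)]
            if grid.get((n + dn, e + de)) == grid[cell]]

cluster = bfs((0, 0), same_neighbours)
print(cluster)  # the three class-1 cells form one cluster
```

Note that the popped `current` cell is the one whose neighbours are expanded each iteration; in the original code the start cell’s neighbours were queried on every pass.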
Apart from this, I think the only remaining problems are in the graph/grid representation. Storing it as a dataframe seems to require scanning the entire dataframe to find a cell. If you store it as a multidimensional array (be it nested `list`s or a NumPy `array`) or even as a `dict`, you can find any cell in O(1) time. Therefore I’d suggest first preprocessing the data from the dataframe into another data structure that grants easy access.
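As a minimal sketch of that preprocessing (the column names `ids`, `north`, `east`, and `class` are assumptions based on the question; your real dataframe has millions of rows), one dict comprehension replaces the repeated dataframe scans:

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe.
df = pd.DataFrame({
    "ids":   ["n0e0", "n0e1", "n1e0"],
    "north": [0, 0, 1],
    "east":  [0, 1, 0],
    "class": [3, 3, 7],
})

# One pass over the dataframe; afterwards every cell lookup is O(1).
cell_class = {(n, e): c
              for n, e, c in zip(df["north"], df["east"], df["class"])}

print(cell_class[(0, 1)])  # → 3
```

With this dict in place, `get_same_neighbours` reduces to a handful of hash lookups per cell instead of a `df.loc` call and an `in grid_ids` scan.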