
Agglomerate adjacent cells and their neighbours of the same type into clusters with Python

I’m trying to agglomerate adjacent cells (and their neighbours) that have the same type (integer from 1 to 10) into new clusters by assigning them to a cluster id.

As visualised here for some of the clusters:

[image: some of the resulting clusters highlighted on the grid]

Currently, I use an adaptation of Breadth-First Search to go through all neighbours and their neighbours and then assign a cluster id to all found neighbours of the same type. Since my data set is fairly big (~3 million grid cells), this operation would take days in its current form.

My function, which I run in a loop that picks cells at random until all cells have been assigned a cluster id:

def bfs(df, grid_id):
    visited = []
    queue = []
    visited.append(grid_id)
    queue.append(grid_id)
    base_pred = df.loc[grid_id, "class"]
    while queue:
        current = queue.pop(0)
        n, e = get_n_e_index(current)
        neighbours = get_same_neighbours(base_pred, n, e, agglo_df)
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
                queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)

    return visited

visited is then used to assign the same cluster id to all found cells

get_same_neighbours just returns neighbours of a given cell that have the same type

Before I got around to validating the approach, I came to the conclusion that it's ultimately too slow, and was wondering if anybody knows of an algorithm fast enough to solve this. Searching online for a solution didn't help.

Edit: That it took so long wasn't an issue of the algorithm, but of me forgetting to reduce the possible start cells to only those that haven't been touched before.

Edit2:

get_n_e_index just converts a string ID to North East Integers

def get_same_neighbours(base_class, north, east, df):
    # (j, i) offsets that are not direct (4-connected) neighbours: the
    # four diagonals and the cell itself
    invalid = [(-1, 1), (1, 1), (0, 0), (1, -1), (-1, -1)]
    neighbours = []
    grid_ids = df["ids"].values
    for i in [-1, 1, 0]:
        for j in [-1, 1, 0]:
            n = int(north + i)
            e = int(east + j)
            grid_id = get_grid_id(n, e)
            if grid_id in grid_ids and (j, i) not in invalid and base_class == df.loc[grid_id, "class"]:
                neighbours.append(grid_id)
    return neighbours

get_grid_id converts north (n) and east (e) integers to the string id of the grid cell.


Answer

You’re making a couple of mistakes. Note that for ~3 million cells, this should work within about a second.

As pointed out in the comments, visited should be a set for O(1) lookup.

Second, queue = queue + get_same_neighbours(base_pred, n, e, agglo_df) can get very slow if the queue gets very long, as it creates a copy of the queue. Instead, simply append the new items.
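A minimal sketch of the difference (toy values, not the question's data):

```python
queue = [1, 2, 3]
new_items = [4, 5]

# queue = queue + new_items rebuilds the whole list, costing
# O(len(queue) + len(new_items)) each time; extend (or +=) appends
# in place in amortized O(len(new_items)).
queue.extend(new_items)
print(queue)  # [1, 2, 3, 4, 5]
```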

Finally, your queue is not actually a queue, but a list. Therefore, popping the first element takes O(n) time. Instead you can use a real queue (collections.deque in Python), or you can pop the last element instead, which turns Breadth-first search into Depth-first search and works just as well here.
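Putting these fixes together, the traversal could look like the sketch below. The grid and neighbour helper are hypothetical stand-ins for the question's dataframe and get_same_neighbours:

```python
from collections import deque

def bfs_cluster(start, same_neighbours):
    """Collect all cells connected to start, where same_neighbours(cell)
    yields the adjacent cells of the same class."""
    visited = {start}           # set: O(1) membership tests
    queue = deque([start])      # deque: O(1) pops from the left
    while queue:
        cell = queue.popleft()  # expand the popped cell, not the start cell
        for neighbour in same_neighbours(cell):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)  # append each new cell exactly once
    return visited

# Toy stand-in data: (north, east) -> class
grid = {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 1, (2, 1): 1}

def same_neighbours(cell):
    n, e = cell
    base = grid[cell]
    candidates = [(n - 1, e), (n + 1, e), (n, e - 1), (n, e + 1)]
    return [c for c in candidates if grid.get(c) == base]

print(sorted(bfs_cluster((0, 0), same_neighbours)))
# [(0, 0), (0, 1), (1, 1), (2, 1)]
```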

Apart from this, I think the only remaining problems are in the graph/grid representation. Storing it as a dataframe often means scanning the entire dataframe to find a cell. If you store it as a multidimensional array (be it lists or numpy arrays) or even as a dict, you can find any cell in O(1) time. Therefore I'd suggest first preprocessing the data from the dataframe into another data structure that grants fast access.
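A sketch of the whole pipeline under that suggestion: preprocess the cells into a dict keyed by (north, east) for O(1) lookup, then flood-fill starting only from cells that have no cluster id yet. The coordinates and classes below are hypothetical stand-ins for the question's data:

```python
from collections import deque

# Hypothetical stand-in for the dataframe: (north, east) -> class
cells = {
    (0, 0): 1, (0, 1): 1, (0, 2): 3,
    (1, 0): 2, (1, 1): 1, (1, 2): 3,
}

def assign_clusters(cells):
    """Label every 4-connected group of same-class cells with a cluster id."""
    cluster_of = {}
    next_id = 0
    for start in cells:
        if start in cluster_of:      # only untouched cells start a new fill
            continue
        base = cells[start]
        cluster_of[start] = next_id
        queue = deque([start])
        while queue:
            n, e = queue.popleft()
            for nb in ((n - 1, e), (n + 1, e), (n, e - 1), (n, e + 1)):
                if nb not in cluster_of and cells.get(nb) == base:
                    cluster_of[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return cluster_of

clusters = assign_clusters(cells)
# class-1 cells (0,0), (0,1), (1,1) share one id; the class-3 column
# and the lone class-2 cell each get their own id
assert clusters[(0, 0)] == clusters[(1, 1)]
```

If the data is a dense rectangular grid, a per-class connected-component labelling routine such as scipy.ndimage.label (run once per class value) would do the same job in compiled code.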

User contributions licensed under: CC BY-SA