Basically I have a DataFrame with a huge number of boxes, each defined by an (xmin, ymin, xmax, ymax) tuple.
```
   xmin  ymin  xmax  ymax
0    66    88   130   151
1   143   390   236   468
2    77   331   143   423
3   289   112   337   157
4   343   282   405   352
...
```
My task is to remove all nested boxes, i.e. any box that lies entirely within another box has to be removed.
My current method:
- construct GeoDataFrame with box geometry
- sort by box size (descending)
- iteratively find smaller boxes within a larger box.
Sandbox: https://www.kaggle.com/code/easzil/remove-nested-bbox/
```python
import shapely.geometry
import geopandas as gpd


def remove_nested_bbox(df):
    # add an extra unique 'id'
    df['__id'] = range(0, len(df))
    # create geometry
    df['__geometry'] = df.apply(
        lambda x: shapely.geometry.box(x.xmin, x.ymin, x.xmax, x.ymax), axis=1
    )
    gdf = gpd.GeoDataFrame(df, geometry='__geometry')
    # sort by area
    gdf['__area'] = gdf.__geometry.area
    gdf.sort_values('__area', ascending=False, inplace=True)

    nested_id = set()
    for iloc in range(len(gdf)):
        # skip boxes already identified as nested
        if gdf.iloc[iloc]['__id'] in nested_id:
            continue
        bbox = gdf.iloc[iloc]        # current target (larger) bbox
        tests = gdf.iloc[iloc + 1:]  # all bboxes smaller than the current target
        tests = tests[~tests['__id'].isin(nested_id)]  # skip already identified ones
        nested = tests[tests['__geometry'].within(bbox['__geometry'])]
        nested_id.update(list(nested['__id']))

    df = df[~df['__id'].isin(nested_id)]
    del df['__id']
    del df['__geometry']
    del df['__area']
    return df
```
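For what it's worth, the core test in the loop above is shapely's `within()`; a minimal illustration using row 0 of the sample data plus one hypothetical extra box that sits entirely inside it:

```python
import shapely.geometry

outer = shapely.geometry.box(66, 88, 130, 151)  # row 0 of the sample data
inner = shapely.geometry.box(70, 90, 120, 140)  # hypothetical box fully inside row 0

print(inner.within(outer))  # True -> "inner" is nested and would be removed
```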
Is there a better way to optimize this task and make it faster? The current method is pretty slow on large datasets.
I would also consider other approaches, such as implementing it in C or CUDA.
Answer
- your sample data is not large and has no instances of boxes within boxes, so I have generated some randomly
- I have used an approach based on `loc`, checking whether the dimensions are bigger. Not sure if this is faster than your approach; timing details below:
%timeit gdf["within"] = gdf.apply(within, args=(gdf,), axis=1) print(f"""number of polygons: {len(gdf)} number kept: {len(gdf.loc[lambda d: ~d["within"]])} """)
```
2.37 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
number of polygons: 2503
number kept: 241
```
visuals
full code
```python
import pandas as pd
import numpy as np
import geopandas as gpd
import io
import shapely

df = pd.read_csv(
    io.StringIO(
        """ xmin  ymin  xmax  ymax
0    66    88   130   151
1   143   390   236   468
2    77   331   143   423
3   289   112   337   157
4   343   282   405   352"""
    ),
    sep=r"\s+",
)

# randomly generate some boxes, check they are valid
df = pd.DataFrame(
    np.random.randint(1, 200, [10000, 4]), columns=["xmin", "ymin", "xmax", "ymax"]
).loc[lambda d: (d["xmax"] > d["xmin"]) & (d["ymax"] > d["ymin"])]

gdf = gpd.GeoDataFrame(
    df, geometry=df.apply(lambda r: shapely.geometry.box(*r), axis=1)
)
gdf.plot(edgecolor="black", alpha=0.6)


# somewhat optimised by limiting the polygons that are considered by looking at the dimensions
def within(r, gdf):
    for g in gdf.loc[
        ~(gdf.index == r.name)
        & gdf["xmin"].lt(r["xmin"])
        & gdf["ymin"].lt(r["ymin"])
        & gdf["xmax"].gt(r["xmax"])
        & gdf["ymax"].gt(r["ymax"]),
        "geometry",
    ]:
        if r["geometry"].within(g):
            return True
    return False


gdf["within"] = gdf.apply(within, args=(gdf,), axis=1)
gdf.loc[lambda d: ~d["within"]].plot(edgecolor="black", alpha=0.6)
```
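If you want the surviving boxes back as a plain DataFrame rather than a plot, a small add-on along these lines should do it (the assumption being that you only want the four coordinate columns back):

```python
# keep only boxes not flagged as within another box, drop the helper columns
kept = gdf.loc[~gdf["within"], ["xmin", "ymin", "xmax", "ymax"]].reset_index(drop=True)
print(len(kept))
```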
approach 2
- using the sample data you provided on Kaggle
- this returns in about half the time (5 s) compared to the previous version
- the concept is similar: a box is within another box if its xmin and ymin are greater than those of the other box and its xmax and ymax are less
```python
import functools

df = pd.read_csv(
    "https://storage.googleapis.com/kagglesdsdata/datasets/2015126/3336994/sample.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20220322%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220322T093633Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=3cc7824afe45313fe152858a6b8d79f93b0d90237ad82737fcf28949b9314df4be2f247821934a371d09cff4b463d69fc2422d8d7f746d6fccf014605b2e0f2cba54c23fba012c2531c4cd714436545bd83db0e880072fa049b116106ba4e296c259c32bc19267a15b9b9af78494bb6859cb53ffe4388c3b8c375a330e09008bb1d9c839f8ab4c14a8f01c38179ba31dc9f4ea9fa11f5ecc7e6ba87757edbe48577d60988349b948ceb70e885be5d6ebc36abe438a5275fa683ee4e318e21661ea032af7d8e2f488020288a1a2ff15af8aa153bb8ac33a0b827dd53c928ddf3abb024f2972ba6ef21bc9a0034e504706a2b3fc78be9ea3bb9190437d98a8ab35"
)


def within_np(df):
    d = {}
    for c in df.columns[0:4]:
        # n x n matrix where a[i, j] holds column c of box j
        a = np.tile(df[c].values.T, (len(df), 1))
        # a.T[i, j] holds column c of box i, so each row compares box i against every box j
        d[c] = a.T > a if c[1:] == "min" else a.T < a
    # box i is nested if all four comparisons hold against some other box j
    aa = functools.reduce(np.logical_and, (aa for aa in d.values()))
    return aa.sum(axis=1) > 0


df.loc[~within_np(df)]
```
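One caveat with `within_np` is that it builds full n × n comparison matrices, so memory grows quadratically with the number of boxes. If that ever becomes a problem on a very large dataset, the same strict-containment comparison can be done in row chunks. This is only a rough sketch under that assumption; `within_np_chunked` and its `chunk_size` parameter are hypothetical names, not part of the answer above.

```python
import numpy as np


def within_np_chunked(df, chunk_size=2000):
    # same strict "inside another box" test as within_np, but compares one
    # chunk of rows at a time so each matrix is chunk_size x n, not n x n
    xmin, ymin, xmax, ymax = (
        df[c].to_numpy() for c in ["xmin", "ymin", "xmax", "ymax"]
    )
    nested = np.zeros(len(df), dtype=bool)
    for start in range(0, len(df), chunk_size):
        end = start + chunk_size
        inside = (
            (xmin[start:end, None] > xmin[None, :])
            & (ymin[start:end, None] > ymin[None, :])
            & (xmax[start:end, None] < xmax[None, :])
            & (ymax[start:end, None] < ymax[None, :])
        )
        # a box is nested if it is strictly inside at least one other box
        nested[start:end] = inside.any(axis=1)
    return nested


# usage mirrors within_np:
# df.loc[~within_np_chunked(df)]
```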