find nested boxes from huge dataset with geopandas (or other tools)

Basically I have a DataFrame with a huge number of boxes, each defined by an (xmin, ymin, xmax, ymax) tuple.

    xmin    ymin    xmax    ymax
0   66      88      130     151
1   143     390     236     468
2   77      331     143     423
3   289     112     337     157
4   343     282     405     352
.....

My task is to remove all nested boxes. (I.e. any box that lies within another box has to be removed.)

My current method:

  • construct GeoDataFrame with box geometry
  • sort by box size (descending)
  • iteratively find smaller boxes within a larger box.

Sandbox: https://www.kaggle.com/code/easzil/remove-nested-bbox/

import shapely.geometry
import geopandas as gpd

def remove_nested_bbox(df):
  # add a unique 'id' column
  df['__id'] = range(len(df))
  # build box geometries from the coordinate columns
  df['__geometry'] = df.apply(lambda x: shapely.geometry.box(x.xmin, x.ymin, x.xmax, x.ymax), axis=1)
  gdf = gpd.GeoDataFrame(df, geometry='__geometry')
  # sort by area, largest first, so each box only needs to be tested
  # against the smaller boxes that follow it
  gdf['__area'] = gdf.__geometry.area
  gdf.sort_values('__area', ascending=False, inplace=True)

  nested_id = set()
  for iloc in range(len(gdf)):
    # skip boxes already identified as nested
    if gdf.iloc[iloc]['__id'] in nested_id:
      continue
    bbox  = gdf.iloc[iloc]    # current (larger) target bbox
    tests = gdf.iloc[iloc+1:] # all bboxes smaller than the current target
    tests = tests[~tests['__id'].isin(nested_id)]  # skip already identified ones
    nested = tests[tests['__geometry'].within(bbox['__geometry'])]
    nested_id.update(list(nested['__id']))

  df = df[~df['__id'].isin(nested_id)]

  del df['__id']
  del df['__geometry']
  del df['__area']

  return df

Is there any better way to optimize the task and make it faster? The current method is pretty slow on large datasets.

I would also consider other approaches, such as an implementation in C or CUDA.


Answer

  • your sample data is not large and contains no boxes nested within other boxes, so I have generated some randomly
  • I used an approach based on loc, checking whether another box's dimensions are strictly bigger
  • not sure if this is faster than your approach; timing details:
%timeit gdf["within"] = gdf.apply(within, args=(gdf,), axis=1)
print(f"""number of polygons: {len(gdf)}
number kept: {len(gdf.loc[lambda d: ~d["within"]])}
""")
2.37 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
number of polygons: 2503
number kept: 241

visuals

(plots produced by the gdf.plot calls below: all generated boxes, then the boxes kept after removal)

full code

import pandas as pd
import numpy as np
import geopandas as gpd
import io
import shapely

df = pd.read_csv(
    io.StringIO(
        """    xmin    ymin    xmax    ymax
0   66      88      130     151
1   143     390     236     468
2   77      331     143     423
3   289     112     337     157
4   343     282     405     352"""
    ),
    sep=r"\s+",
)

# randomly generate some boxes, check they are valid
df = pd.DataFrame(
    np.random.randint(1, 200, [10000, 4]), columns=["xmin", "ymin", "xmax", "ymax"]
).loc[lambda d: (d["xmax"] > d["xmin"]) & (d["ymax"] > d["ymin"])]

gdf = gpd.GeoDataFrame(
    df, geometry=df.apply(lambda r: shapely.geometry.box(*r), axis=1)
)

gdf.plot(edgecolor="black", alpha=0.6)

# somewhat optimised by limiting polygons that are considered by looking at dimensions
def within(r, gdf):
    for g in gdf.loc[
        ~(gdf.index == r.name)
        & gdf["xmin"].lt(r["xmin"])
        & gdf["ymin"].lt(r["ymin"])
        & gdf["xmax"].gt(r["xmax"])
        & gdf["ymax"].gt(r["ymax"]),
        "geometry",
    ]:
        if r["geometry"].within(g):
            return True
    return False


gdf["within"] = gdf.apply(within, args=(gdf, ), axis=1)

gdf.loc[lambda d: ~d["within"]].plot(edgecolor="black", alpha=0.6)
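As an aside (not part of the original answer): geopandas can also vectorize the within test with a spatial self-join, letting its spatial index do the candidate filtering. A minimal sketch, assuming geopandas ≥ 0.10 (where the keyword is `predicate`; older versions used `op`), using the question's five boxes plus one hypothetical box nested inside box 0:

```python
import pandas as pd
import geopandas as gpd
import shapely.geometry

# hypothetical data: the question's five boxes plus one box nested inside box 0
df = pd.DataFrame(
    [[66, 88, 130, 151], [143, 390, 236, 468], [77, 331, 143, 423],
     [289, 112, 337, 157], [343, 282, 405, 352], [70, 90, 120, 140]],
    columns=["xmin", "ymin", "xmax", "ymax"],
)
gdf = gpd.GeoDataFrame(
    df, geometry=df.apply(lambda r: shapely.geometry.box(*r), axis=1)
)

# join the frame against itself; every box is within itself,
# so drop self-pairs to keep only proper containment
joined = gpd.sjoin(gdf, gdf, predicate="within")
nested = joined.index[joined.index != joined["index_right"]].unique()
kept = gdf.drop(nested)  # keeps every row except the nested box at index 5
```

Note that, unlike the sort-by-area loop in the question, exact duplicate boxes would remove each other here, since each is within the other.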

approach 2

  • using the sample data you provided on Kaggle
  • this returns in about half the time (5 s) compared to the previous version
  • the concept is similar: a box is within another box if its xmin & ymin are greater than the other box's, and its xmax & ymax are less
import functools

df = pd.read_csv("https://storage.googleapis.com/kagglesdsdata/datasets/2015126/3336994/sample.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20220322%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220322T093633Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=3cc7824afe45313fe152858a6b8d79f93b0d90237ad82737fcf28949b9314df4be2f247821934a371d09cff4b463d69fc2422d8d7f746d6fccf014605b2e0f2cba54c23fba012c2531c4cd714436545bd83db0e880072fa049b116106ba4e296c259c32bc19267a15b9b9af78494bb6859cb53ffe4388c3b8c375a330e09008bb1d9c839f8ab4c14a8f01c38179ba31dc9f4ea9fa11f5ecc7e6ba87757edbe48577d60988349b948ceb70e885be5d6ebc36abe438a5275fa683ee4e318e21661ea032af7d8e2f488020288a1a2ff15af8aa153bb8ac33a0b827dd53c928ddf3abb024f2972ba6ef21bc9a0034e504706a2b3fc78be9ea3bb9190437d98a8ab35")

def within_np(df):
    # build an n x n boolean matrix per coordinate column; entry [i, j]
    # answers "is box i inside box j along this edge?"
    d = {}
    for c in df.columns[0:4]:
        a = np.tile(df[c].values.T, (len(df), 1))
        # min columns: box i's edge must be greater; max columns: less
        d[c] = a.T > a if c[1:] == "min" else a.T < a

    # a box is nested if all four edge conditions hold against some other box
    aa = functools.reduce(np.logical_and, d.values())
    return aa.sum(axis=1) > 0

df.loc[~within_np(df)]
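As a quick sanity check of within_np (restated here so the snippet is self-contained), run it on the question's five sample boxes plus one hypothetical box nested inside box 0:

```python
import functools
import numpy as np
import pandas as pd

def within_np(df):
    # entry [i, j] of each matrix: "is box i inside box j along this edge?"
    d = {}
    for c in df.columns[0:4]:
        a = np.tile(df[c].values.T, (len(df), 1))
        d[c] = a.T > a if c[1:] == "min" else a.T < a
    aa = functools.reduce(np.logical_and, d.values())
    return aa.sum(axis=1) > 0

# hypothetical data: the question's five boxes plus one box nested inside box 0
df = pd.DataFrame(
    [[66, 88, 130, 151], [143, 390, 236, 468], [77, 331, 143, 423],
     [289, 112, 337, 157], [343, 282, 405, 352], [70, 90, 120, 140]],
    columns=["xmin", "ymin", "xmax", "ymax"],
)
mask = within_np(df)
print(df.loc[~mask])  # keeps every row except the nested box at index 5
```

Because the comparisons are strict, a box touching its container's edge is not counted as nested, matching shapely's `within` for boxes that share a boundary only partially.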
User contributions licensed under: CC BY-SA