I have a Pandas dataframe of intervals defined by 2 numerical coordinates, ‘start’ and ‘end’.
I am trying to collapse all intervals that are overlapping, and keep the inner coordinates.
index  start  end
0      10     40
1      13     34
2      50     100
3      44     94
Output: The same Pandas dataframe with collapsed intervals and inner coordinates. Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.
E.g. the intervals with row index [0, 1] overlap. I want to collapse these 2 intervals into a new interval with new_start == max([10, 13]) and new_end == min([40, 34]), so the collapsed interval for rows [0, 1] has new_start = 13 and new_end = 34. Likewise, rows [2, 3] collapse to new_start = max([50, 44]) = 50 and new_end = min([100, 94]) = 94.
index  start  end
0      13     34
1      50     94
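The overlap and collapse rule described above can be sketched for a single pair of closed intervals. This is only an illustration of the rule, not a solution for the full dataframe; the helper names `overlaps` and `collapse` are made up for this example.

```python
def overlaps(a, b):
    """Return True if closed intervals a=(start, end) and b=(start, end) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def collapse(a, b):
    """Return the inner coordinates (intersection) of two overlapping closed intervals."""
    if not overlaps(a, b):
        raise ValueError("intervals do not overlap")
    return (max(a[0], b[0]), min(a[1], b[1]))


print(collapse((10, 40), (13, 34)))   # (13, 34)
print(collapse((50, 100), (44, 94)))  # (50, 94)
print(overlaps((13, 34), (50, 94)))   # False
```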
The dataframe has 2M rows, so I am also looking for an efficient way to do this.
Answer
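It can be done with a groupby on a running group id: a new group starts whenever the previous row's end is less than the current row's start. Since you want the inner coordinates, aggregate with the max of start and the min of end (min of start and max of end would give the outer bounds instead):
df.groupby(((df["end"].shift() - df["start"]) < 0).cumsum()).agg({"start": "max", "end": "min"})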
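Here is a self-contained sketch of the groupby approach on the sample data from the question, aggregating with max of start and min of end so that the inner coordinates are kept. It assumes the rows are already ordered so that overlapping intervals sit next to each other and each non-overlapping interval begins after the previous row ends; the variable names `new_group`, `group_id`, and `collapsed` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"start": [10, 13, 50, 44], "end": [40, 34, 100, 94]})

# A new group starts when the previous row's end lies strictly before this row's start.
# The first row is compared against NaN, which evaluates to False, so it stays in group 0.
new_group = df["end"].shift() < df["start"]
group_id = new_group.cumsum()

# Inner coordinates: largest start and smallest end within each group.
collapsed = (
    df.groupby(group_id)
      .agg(start=("start", "max"), end=("end", "min"))
      .reset_index(drop=True)
)

print(collapsed.values.tolist())  # [[13, 34], [50, 94]]
```

Everything here is vectorized, so it should scale to a couple of million rows. Note that each row is only compared with the one directly before it, so if the real data is not ordered as assumed, it would need sorting (or a different grouping rule) first.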