Skip to content
Advertisement

How do I split a Pandas DataFrame into sub-arrays (specific use case outlined in detail)?

I apologize for the title, but I don’t know enough to properly condense my question into a single line.

Here is the use case:

  • I have a pd.DataFrame with arbitrary index values and a column, ‘timestamp’.
  • I have an ordered List of timestamp values.
  • I want to split the DataFrame into chunks with ‘timestamp’ values that are:
    1. less than List[0] (the lowest value in the List)
    2. in between each consecutive timestamps in the List (lesser value inclusive, greater value exclusive)
    3. greater than or equal to List[-1] (the greatest value in the List)

I’ve already made of list-of-dataframes, chunking the original with a While Statement. However, I feel like this is inefficient and there should be a way to use np.split() or df.groupby() along with the List of timestamps to do this more elegantly and efficiently. Then again, I could be wrong.

So I guess my question boils down to: “what is the most time efficient method, having the most elegant presentation, to achieve the goals stated above”?

@KU99 Mentioned providing an example and the output:

df =

colA colB timestamp
First row 1
Second row 2
First row 3
Second row 4
First row 5
Second row 6
First row 7
Second row 8
First row 9
Second row 10
First row 11
Second row 12

List = [3, 7, 8, 9]

output =

colA colB timestamp
First row 1
Second row 2
colA colB timestamp
First row 3
Second row 4
First row 5
Second row 6
colA colB timestamp
First row 7
colA colB timestamp
Second row 8
colA colB timestamp
First row 9
Second row 10
First row 11
Second row 12

The type of output is going to be dependent on the method, but I don’t care if it’s a list, a dictionary, or some other indexable type.

Advertisement

Answer

Try pd.cut + .groupby:

bins = [3, 7, 8, 9]

for _, g in df.groupby(
    pd.cut(df.timestamp, [float("-inf")] + bins + [float("+inf")], right=False)
):
    print(g)
    print("-" * 80)

Prints:

     colA colB  timestamp
0   First  row          1
1  Second  row          2
--------------------------------------------------------------------------------
     colA colB  timestamp
2   First  row          3
3  Second  row          4
4   First  row          5
5  Second  row          6
--------------------------------------------------------------------------------
    colA colB  timestamp
6  First  row          7
--------------------------------------------------------------------------------
     colA colB  timestamp
7  Second  row          8
--------------------------------------------------------------------------------
      colA colB  timestamp
8    First  row          9
9   Second  row         10
10   First  row         11
11  Second  row         12
--------------------------------------------------------------------------------
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement