I apologize for the title, but I don’t know enough to properly condense my question into a single line.
Here is the use case:
- I have a pd.DataFrame with arbitrary index values and a column, ‘timestamp’.
- I have an ordered List of timestamp values.
- I want to split the DataFrame into chunks with ‘timestamp’ values that are:
- less than List[0] (the lowest value in the List)
- in between each consecutive timestamps in the List (lesser value inclusive, greater value exclusive)
- greater than or equal to List[-1] (the greatest value in the List)
I’ve already made of list-of-dataframes, chunking the original with a While Statement. However, I feel like this is inefficient and there should be a way to use np.split() or df.groupby() along with the List of timestamps to do this more elegantly and efficiently. Then again, I could be wrong.
So I guess my question boils down to: “what is the most time efficient method, having the most elegant presentation, to achieve the goals stated above”?
@KU99 Mentioned providing an example and the output:
df =
colA | colB | timestamp |
---|---|---|
First | row | 1 |
Second | row | 2 |
First | row | 3 |
Second | row | 4 |
First | row | 5 |
Second | row | 6 |
First | row | 7 |
Second | row | 8 |
First | row | 9 |
Second | row | 10 |
First | row | 11 |
Second | row | 12 |
List = [3, 7, 8, 9]
output =
colA | colB | timestamp |
---|---|---|
First | row | 1 |
Second | row | 2 |
colA | colB | timestamp |
---|---|---|
First | row | 3 |
Second | row | 4 |
First | row | 5 |
Second | row | 6 |
colA | colB | timestamp |
---|---|---|
First | row | 7 |
colA | colB | timestamp |
---|---|---|
Second | row | 8 |
colA | colB | timestamp |
---|---|---|
First | row | 9 |
Second | row | 10 |
First | row | 11 |
Second | row | 12 |
The type of output is going to be dependent on the method, but I don’t care if it’s a list, a dictionary, or some other indexable type.
Advertisement
Answer
Try pd.cut
+ .groupby
:
bins = [3, 7, 8, 9] for _, g in df.groupby( pd.cut(df.timestamp, [float("-inf")] + bins + [float("+inf")], right=False) ): print(g) print("-" * 80)
Prints:
colA colB timestamp 0 First row 1 1 Second row 2 -------------------------------------------------------------------------------- colA colB timestamp 2 First row 3 3 Second row 4 4 First row 5 5 Second row 6 -------------------------------------------------------------------------------- colA colB timestamp 6 First row 7 -------------------------------------------------------------------------------- colA colB timestamp 7 Second row 8 -------------------------------------------------------------------------------- colA colB timestamp 8 First row 9 9 Second row 10 10 First row 11 11 Second row 12 --------------------------------------------------------------------------------