I have a dataframe containing a transaction list. What I want to do is create something like a P&L per transaction and additionally make it suitable for entry into another software.
Basically, I would like to get a “Buy” transaction and find the next row with a “Sell” transaction. Then take that “Sell” transaction and append it to the end of the “Buy” row. Alternatively, it can be short action: in such case, I’d be taking the “Sell” transaction, search for the next (closest in time) “Buy” transaction, and add it to the end of the “Buy” row.
Input dataframe:
| |Side|Qty|Price |Time |Commission| ID | 5min_time | |-------|----|---|-------|------------------------------------|--------------------------------------| |228 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00| |229 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00| |230 |Buy |1 |4098.75|2022-06-10 00:57:00+08:00|0.4 | 727141021| 2022-06-10 00:55:00+08:00| |231 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00| |232 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00| |233 |Buy |1 |4104.25|2022-06-10 00:59:00+08:00|0.4 | 727141284| 2022-06-10 00:55:00+08:00| |234 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |235 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |236 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |237 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |238 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00| |239 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
Desired outcome:
| |Side|Qty|Price |Time |Commission| ID | 5min_time | |-------|----|---|-------|------------------------------------|--------------------------------------| |228 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|231 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00| |229 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|232 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00| |230 |Buy |1 |4098.75|2022-06-10 00:57:00+08:00|0.4 | 727141021| 2022-06-10 00:55:00+08:00|234 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |233 |Buy |1 |4104.25|2022-06-10 00:59:00+08:00|0.4 | 727141284| 2022-06-10 00:55:00+08:00|235 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00| |236 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|238 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00| |237 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|239 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
The only idea that comes to mind is to loop through all rows and individually append them to the end of others but it is hardly an efficient way. Can someone recommend me a Pythonic way of achieving the desired outcome?
Advertisement
Answer
You can use:
out = df.reset_index().set_index(df.groupby('Side').cumcount()) dup = out.index.duplicated() out = pd.concat([out[dup], out[~dup]], axis=1)
Output:
>>> out index Side Qty Price Time Commission ID 5min_time index Side Qty Price Time Commission ID 5min_time 0 231 Sell 1 4094.25 2022-06-10 00:56:00+08:00 0.4 727140488 2022-06-10 00:55:00+08:00 228 Buy 1 4087.00 2022-06-10 00:16:00+08:00 0.4 727127819 2022-06-10 00:15:00+08:00 1 232 Sell 1 4094.25 2022-06-10 00:56:00+08:00 0.4 727140488 2022-06-10 00:55:00+08:00 229 Buy 1 4087.00 2022-06-10 00:16:00+08:00 0.4 727127819 2022-06-10 00:15:00+08:00 2 234 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00 230 Buy 1 4098.75 2022-06-10 00:57:00+08:00 0.4 727141021 2022-06-10 00:55:00+08:00 3 235 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00 233 Buy 1 4104.25 2022-06-10 00:59:00+08:00 0.4 727141284 2022-06-10 00:55:00+08:00 4 238 Buy 1 4059.75 2022-06-10 03:04:00+08:00 0.4 727156591 2022-06-10 03:00:00+08:00 236 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00 5 239 Buy 1 4059.75 2022-06-10 03:04:00+08:00 0.4 727156591 2022-06-10 03:00:00+08:00 237 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00
First, we count the occurrence of each side. The first occurrence of “Sell” has index index 0, the second one has index 1 and so on. It’s the same for “Buy” because we use groupby
on Side
column.
Side Idx 228 Buy 0 # 1st occ. of Buy 229 Buy 1 # 2nd occ. of Buy 230 Buy 2 # 3rd occ. of Buy 231 Sell 0 # 1st occ. of Sell 232 Sell 1 # 1st occ. of Sell 233 Buy 3 # 4th occ. of Buy 234 Sell 2 # 3rd occ. of Sell 235 Sell 3 # 4st occ. of Sell 236 Sell 4 # 5st occ. of Sell 237 Sell 5 # 6st occ. of Sell 238 Buy 4 # 5st occ. of Buy 239 Buy 5 # 6st occ. of Buy
Now we have the row index of your new dataframe but we have to align each index of one side to other side (0 -> 0, 1 -> 1, …). As you have 2 sides, we can use duplicated
because the first occurrence of 0 is False and the second occurrence of 0 is True. It acts as a boolean mask:
Side Idx Dup 228 Buy 0 False # Left 229 Buy 1 False # Left 230 Buy 2 False # Left 231 Sell 0 True # Right 232 Sell 1 True # Right 233 Buy 3 False # Left 234 Sell 2 True # Right 235 Sell 3 True # Right 236 Sell 4 False # Left 237 Sell 5 False # Left 238 Buy 4 True # Right 239 Buy 5 True # Right