I have a dataframe containing a transaction list. What I want to do is create something like a P&L per transaction and additionally make it suitable for entry into another software.
Basically, I would like to get a “Buy” transaction and find the next row with a “Sell” transaction. Then take that “Sell” transaction and append it to the end of the “Buy” row. Alternatively, it can be short action: in such case, I’d be taking the “Sell” transaction, search for the next (closest in time) “Buy” transaction, and add it to the end of the “Buy” row.
Input dataframe:
| |Side|Qty|Price |Time |Commission| ID | 5min_time |
|-------|----|---|-------|------------------------------------|--------------------------------------|
|228 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|
|229 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|
|230 |Buy |1 |4098.75|2022-06-10 00:57:00+08:00|0.4 | 727141021| 2022-06-10 00:55:00+08:00|
|231 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00|
|232 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00|
|233 |Buy |1 |4104.25|2022-06-10 00:59:00+08:00|0.4 | 727141284| 2022-06-10 00:55:00+08:00|
|234 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|235 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|236 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|237 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|238 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
|239 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
Desired outcome:
| |Side|Qty|Price |Time |Commission| ID | 5min_time |
|-------|----|---|-------|------------------------------------|--------------------------------------|
|228 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|231 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00|
|229 |Buy |1 |4087.00|2022-06-10 00:16:00+08:00|0.4 | 727127819| 2022-06-10 00:15:00+08:00|232 |Sell|1 |4094.25|2022-06-10 00:56:00+08:00|0.4 | 727140488| 2022-06-10 00:55:00+08:00|
|230 |Buy |1 |4098.75|2022-06-10 00:57:00+08:00|0.4 | 727141021| 2022-06-10 00:55:00+08:00|234 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|233 |Buy |1 |4104.25|2022-06-10 00:59:00+08:00|0.4 | 727141284| 2022-06-10 00:55:00+08:00|235 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|
|236 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|238 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
|237 |Sell|1 |4085.75|2022-06-10 01:43:00+08:00|0.4 | 727156519| 2022-06-10 01:40:00+08:00|239 |Buy |1 |4059.75|2022-06-10 03:04:00+08:00|0.4 | 727156591| 2022-06-10 03:00:00+08:00|
The only idea that comes to mind is to loop through all rows and individually append them to the end of others but it is hardly an efficient way. Can someone recommend me a Pythonic way of achieving the desired outcome?
Advertisement
Answer
You can use:
out = df.reset_index().set_index(df.groupby('Side').cumcount())
dup = out.index.duplicated()
out = pd.concat([out[dup], out[~dup]], axis=1)
Output:
>>> out
index Side Qty Price Time Commission ID 5min_time index Side Qty Price Time Commission ID 5min_time
0 231 Sell 1 4094.25 2022-06-10 00:56:00+08:00 0.4 727140488 2022-06-10 00:55:00+08:00 228 Buy 1 4087.00 2022-06-10 00:16:00+08:00 0.4 727127819 2022-06-10 00:15:00+08:00
1 232 Sell 1 4094.25 2022-06-10 00:56:00+08:00 0.4 727140488 2022-06-10 00:55:00+08:00 229 Buy 1 4087.00 2022-06-10 00:16:00+08:00 0.4 727127819 2022-06-10 00:15:00+08:00
2 234 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00 230 Buy 1 4098.75 2022-06-10 00:57:00+08:00 0.4 727141021 2022-06-10 00:55:00+08:00
3 235 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00 233 Buy 1 4104.25 2022-06-10 00:59:00+08:00 0.4 727141284 2022-06-10 00:55:00+08:00
4 238 Buy 1 4059.75 2022-06-10 03:04:00+08:00 0.4 727156591 2022-06-10 03:00:00+08:00 236 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00
5 239 Buy 1 4059.75 2022-06-10 03:04:00+08:00 0.4 727156591 2022-06-10 03:00:00+08:00 237 Sell 1 4085.75 2022-06-10 01:43:00+08:00 0.4 727156519 2022-06-10 01:40:00+08:00
First, we count the occurrence of each side. The first occurrence of “Sell” has index index 0, the second one has index 1 and so on. It’s the same for “Buy” because we use groupby
on Side
column.
Side Idx
228 Buy 0 # 1st occ. of Buy
229 Buy 1 # 2nd occ. of Buy
230 Buy 2 # 3rd occ. of Buy
231 Sell 0 # 1st occ. of Sell
232 Sell 1 # 1st occ. of Sell
233 Buy 3 # 4th occ. of Buy
234 Sell 2 # 3rd occ. of Sell
235 Sell 3 # 4st occ. of Sell
236 Sell 4 # 5st occ. of Sell
237 Sell 5 # 6st occ. of Sell
238 Buy 4 # 5st occ. of Buy
239 Buy 5 # 6st occ. of Buy
Now we have the row index of your new dataframe but we have to align each index of one side to other side (0 -> 0, 1 -> 1, …). As you have 2 sides, we can use duplicated
because the first occurrence of 0 is False and the second occurrence of 0 is True. It acts as a boolean mask:
Side Idx Dup
228 Buy 0 False # Left
229 Buy 1 False # Left
230 Buy 2 False # Left
231 Sell 0 True # Right
232 Sell 1 True # Right
233 Buy 3 False # Left
234 Sell 2 True # Right
235 Sell 3 True # Right
236 Sell 4 False # Left
237 Sell 5 False # Left
238 Buy 4 True # Right
239 Buy 5 True # Right