How to calculate average returns over separate consecutive ranges determined by another column in Python?

Tags: , ,



I currently have a Pandas DataFrame which contains a time series of asset prices and a column containing a “state”. There are three states -1, 0, 1 that occur at various points in the data.

I am trying to find the average return on the asset in each of these states, ideally using a vectorised method.

Here is an example of the DataFrame:

| Timestamp                        | mid_price | state |
|----------------------------------|-----------|-------|
| 2020-01-01 00:05:25+00:00        | 528.50    | 0     |
| 2020-01-01 00:05:25.500000+00:00 | 529.00    | 0     |
| 2020-01-01 00:05:26+00:00        | 527.50    | 1     |
| 2020-01-01 00:05:26.500000+00:00 | 528.00    | 1     |
| 2020-01-01 00:05:27+00:00        | 529.00    | 1     |
| 2020-01-01 00:05:27.500000+00:00 | 531.50    | -1    |
| 2020-01-01 00:05:28+00:00        | 531.50    | -1    |
| 2020-01-01 00:05:28.500000+00:00 | 532.50    | 1     |
| 2020-01-01 00:05:29+00:00        | 532.50    | 1     |
| 2020-01-01 00:05:29.500000+00:00 | 530.00    | 1     |

I am trying to calculate the average return for each state, so for example for state 1:

First sequence: (529 - 527.50) / 527.50 = 0.0028...

Second sequence: (530 - 532.50) / 532.50 = -0.0047...

Average return: (0.0028 - 0.0047) / 2 = -0.00095

Is there a neat, vectorised way to do this?

Answer

Creating and ID for each sequence and using groupby:

df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each sequence
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state

The Seq column is computed so that all the rows belonging to a sequence have the same ID:

                         Timestamp  mid_price  state  Seq
0        2020-01-01 00:05:25+00:00      528.5      0    1
1 2020-01-01 00:05:25.500000+00:00      529.0      0    1
2        2020-01-01 00:05:26+00:00      527.5      1    2
3 2020-01-01 00:05:26.500000+00:00      528.0      1    2
4        2020-01-01 00:05:27+00:00      529.0      1    2
5 2020-01-01 00:05:27.500000+00:00      531.5     -1    3
6        2020-01-01 00:05:28+00:00      531.5     -1    3
7 2020-01-01 00:05:28.500000+00:00      532.5      1    4
8        2020-01-01 00:05:29+00:00      532.5      1    4
9 2020-01-01 00:05:29.500000+00:00      530.0      1    4

The main idea is to identify the elements of the state column where the value change of state change using diff: if the state is different w.r.t. the previous column, the difference will be different than zero. Then, you can create an increasing ID using cumsum, i.e. the cumulative sum. This works also because the first element returned by the diff is NaN, that is different from zero.

Once you have the Seq column, the solution of your problem is very simple: firstly, group by state and Seq and get the return for each sequence and then compute the average of returns for each state.

Here the complete code and the result:

import pandas as pd

df = pd.DataFrame(
    columns=["Timestamp", "mid_price", "state"],
    data=[
        ["2020-01-01 00:05:25+00:00", 528.50, 0],
        ["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
        ["2020-01-01 00:05:26+00:00", 527.50, 1],
        ["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
        ["2020-01-01 00:05:27+00:00", 529.00, 1],
        ["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
        ["2020-01-01 00:05:28+00:00", 531.50, -1],
        ["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
        ["2020-01-01 00:05:29+00:00", 532.50, 1],
        ["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
    ],
)

df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state
state
-1    0.000000
 0    0.000946
 1   -0.000926
Name: mid_price, dtype: float64


Source: stackoverflow