# How to calculate average returns over separate consecutive ranges determined by another column in Python?

#### Tags: dataframe, pandas, python

I currently have a Pandas DataFrame which contains a time series of asset prices and a column containing a “state”. There are three states -1, 0, 1 that occur at various points in the data.

I am trying to find the average return on the asset in each of these states, ideally using a vectorised method.

Here is an example of the DataFrame:

```| Timestamp                        | mid_price | state |
|----------------------------------|-----------|-------|
| 2020-01-01 00:05:25+00:00        | 528.50    | 0     |
| 2020-01-01 00:05:25.500000+00:00 | 529.00    | 0     |
| 2020-01-01 00:05:26+00:00        | 527.50    | 1     |
| 2020-01-01 00:05:26.500000+00:00 | 528.00    | 1     |
| 2020-01-01 00:05:27+00:00        | 529.00    | 1     |
| 2020-01-01 00:05:27.500000+00:00 | 531.50    | -1    |
| 2020-01-01 00:05:28+00:00        | 531.50    | -1    |
| 2020-01-01 00:05:28.500000+00:00 | 532.50    | 1     |
| 2020-01-01 00:05:29+00:00        | 532.50    | 1     |
| 2020-01-01 00:05:29.500000+00:00 | 530.00    | 1     |
```

I am trying to calculate the average return for each state, so for example for state 1:

```First sequence: (529 - 527.50) / 527.50 = 0.0028...

Second sequence: (530 - 532.50) / 532.50 = -0.0047...

Average return: (0.0028 - 0.0047) / 2 = -0.00095
```

Is there a neat, vectorised way to do this?

## Answer

Creating and ID for each sequence and using groupby:

```df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each sequence
df.groupby(['Seq', 'state']).agg(
{'mid_price': lambda x: (x.values[-1] - x.values) / x.values}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state
```

The `Seq` column is computed so that all the rows belonging to a sequence have the same ID:

```                         Timestamp  mid_price  state  Seq
0        2020-01-01 00:05:25+00:00      528.5      0    1
1 2020-01-01 00:05:25.500000+00:00      529.0      0    1
2        2020-01-01 00:05:26+00:00      527.5      1    2
3 2020-01-01 00:05:26.500000+00:00      528.0      1    2
4        2020-01-01 00:05:27+00:00      529.0      1    2
5 2020-01-01 00:05:27.500000+00:00      531.5     -1    3
6        2020-01-01 00:05:28+00:00      531.5     -1    3
7 2020-01-01 00:05:28.500000+00:00      532.5      1    4
8        2020-01-01 00:05:29+00:00      532.5      1    4
9 2020-01-01 00:05:29.500000+00:00      530.0      1    4
```

The main idea is to identify the elements of the `state` column where the value change of `state` change using `diff`: if the state is different w.r.t. the previous column, the difference will be different than zero. Then, you can create an increasing ID using `cumsum`, i.e. the cumulative sum. This works also because the first element returned by the diff is NaN, that is different from zero.

Once you have the `Seq` column, the solution of your problem is very simple: firstly, group by `state` and `Seq` and get the return for each sequence and then compute the average of returns for each state.

Here the complete code and the result:

```import pandas as pd

df = pd.DataFrame(
columns=["Timestamp", "mid_price", "state"],
data=[
["2020-01-01 00:05:25+00:00", 528.50, 0],
["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
["2020-01-01 00:05:26+00:00", 527.50, 1],
["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
["2020-01-01 00:05:27+00:00", 529.00, 1],
["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
["2020-01-01 00:05:28+00:00", 531.50, -1],
["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
["2020-01-01 00:05:29+00:00", 532.50, 1],
["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
],
)

df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()
df.groupby(['Seq', 'state']).agg(
{'mid_price': lambda x: (x.values[-1] - x.values) / x.values}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state
```
```state
-1    0.000000
0    0.000946
1   -0.000926
Name: mid_price, dtype: float64
```

Source: stackoverflow