How to calculate average returns over separate consecutive ranges determined by another column in Python?

Question

I currently have a Pandas DataFrame which contains a time series of asset prices and a column containing a "state". There are three states -1, 0, 1 that occur at various points in the data. I am trying to find the average return on the asset in each of these states, ideally using a vectorised method. Here is an example

Accepted Answer

Creating and ID for each sequence and using groupby:df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each sequencedf.groupby(['Seq', 'state']).agg(    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence).groupby('state').mid_price.mean()  # compute average return for each stateThe Seq column is computed so that all the rows belonging to a sequence have the same ID:                         Timestamp  mid_price  state  Seq0        2020-01-01 00:05:25+00:00      528.5      0    11 2020-01-01 00:05:25.500000+00:00      529.0      0    12        2020-01-01 00:05:26+00:00      527.5      1    23 2020-01-01 00:05:26.500000+00:00      528.0      1    24        2020-01-01 00:05:27+00:00      529.0      1    25 2020-01-01 00:05:27.500000+00:00      531.5     -1    36        2020-01-01 00:05:28+00:00      531.5     -1    37 2020-01-01 00:05:28.500000+00:00      532.5      1    48        2020-01-01 00:05:29+00:00      532.5      1    49 2020-01-01 00:05:29.500000+00:00      530.0      1    4The main idea is to identify the elements of the state column where the value change of state change using diff: if the state is different w.r.t. the previous column, the difference will be different than zero. Then, you can create an increasing ID using cumsum, i.e. the cumulative sum. This works also because the first element returned by the diff is NaN, that is different from zero.Once you have the Seq column, the solution of your problem is very simple: firstly, group by state and Seq and get the return for each sequence and then compute the average of returns for each state.Here the complete code and the result:import pandas as pddf = pd.DataFrame(    columns=["Timestamp", "mid_price", "state"],    data=[        ["2020-01-01 00:05:25+00:00", 528.50, 0],        ["2020-01-01 00:05:25.500000+00:00", 529.00, 0],        ["2020-01-01 00:05:26+00:00", 527.50, 1],        ["2020-01-01 00:05:26.500000+00:00", 528.00, 1],        ["2020-01-01 00:05:27+00:00", 529.00, 1],        ["2020-01-01 00:05:27.500000+00:00", 531.50, -1],        ["2020-01-01 00:05:28+00:00", 531.50, -1],        ["2020-01-01 00:05:28.500000+00:00", 532.50, 1],        ["2020-01-01 00:05:29+00:00", 532.50, 1],        ["2020-01-01 00:05:29.500000+00:00", 530.00, 1],    ],)df['Timestamp'] = pd.to_datetime(df.Timestamp)df['Seq'] = (df.state.diff() != 0).cumsum()df.groupby(['Seq', 'state']).agg(    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence).groupby('state').mid_price.mean()  # compute average return for each statestate-1    0.000000 0    0.000946 1   -0.000926Name: mid_price, dtype: float64

Advertisement

Answer