when using group_by: TypeError: incompatible index of inserted column with frame index

Question

I have a df that I've read from SQL. I then try to get the average of the last 5 days and add it to a new column, which gives me the error in the title. Any ideas what's going wrong here? Previously this worked and now it's throwing an error, and I can't figure out why.

Accepted Answer

Chain Series.to_numpy to add the values as a np.array, and make sure to add sort=False inside df.groupby:

    df['volume_5_day'] = df.groupby('stock_id', sort=False)['volume'].rolling(5).mean().to_numpy()
    print(df)

          id  stock_id symbol        date  ...      low  close  volume  volume_5_day
    0      1        35   ABSI  2022-09-28  ...   3.0400   3.27  217040           NaN
    1      2        35   ABSI  2022-09-29  ...   3.0300   3.12  187309           NaN
    2      3        35   ABSI  2022-09-30  ...   3.0700   3.13  196566           NaN
    3      4        35   ABSI  2022-10-03  ...   2.8600   2.97  310441           NaN
    4      5        35   ABSI  2022-10-04  ...   2.9600   3.27  361082      254487.6
    383  384        16    VVI  2022-10-03  ...  31.3050  33.60  151357           NaN
    384  385        16    VVI  2022-10-04  ...  34.1900  35.39  105773           NaN
    385  386        16    VVI  2022-10-05  ...  34.5000  34.86   59605           NaN
    386  387        16    VVI  2022-10-06  ...  34.3850  34.50   55323           NaN
    387  388        16    VVI  2022-10-07  ...  33.3409  33.70   45187       83449.0

Your initial approach fails because the df.groupby call that you are using returns a pd.Series with a different index than your df. E.g.:

    print(df.groupby('stock_id')['volume'].rolling(5).mean().index)

    MultiIndex([(16, 383),
                (16, 384),
                (16, 385),
                (16, 386),
                (16, 387),
                (35,   0),
                (35,   1),
                (35,   2),
                (35,   3),
                (35,   4)],
               names=['stock_id', None])

So, it is saying it is unable to map this onto:

    print(df.index)

    Int64Index([0, 1, 2, 3, 4, 383, 384, 385, 386, 387], dtype='int64')

With a np.array you don't have this problem, because the values are assigned by position rather than aligned on the index. That is also why sort=False is needed: by default groupby sorts the group keys (16 before 35 here), which would put the values in a different order than the rows of df.

You could also have used:

    df['volume_5_day'] = df.groupby('stock_id', as_index=False)['volume'].rolling(5).mean()['volume']

In this case, you don't need to add sort=False, as it will match correctly on the index values.
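To try both fixes without the original SQL table, here is a minimal sketch on toy data. The frame below is an assumption built only from the stock_id and volume values shown in the output above; the name volume_5_day_alt is hypothetical and used only to compare the two approaches. Note that the to_numpy variant assigns by position, so it assumes each stock's rows sit together in df, as they do here.

```python
import pandas as pd

# Toy stand-in for the SQL result (assumption): two stocks, five rows each,
# with the same non-contiguous row labels as in the answer's output.
df = pd.DataFrame(
    {
        "stock_id": [35] * 5 + [16] * 5,
        "volume": [217040, 187309, 196566, 310441, 361082,
                   151357, 105773, 59605, 55323, 45187],
    },
    index=[0, 1, 2, 3, 4, 383, 384, 385, 386, 387],
)

# The grouped rolling mean carries a two-level index (stock_id, original label),
# which does not match the single-level index of df.
rolled = df.groupby("stock_id")["volume"].rolling(5).mean()
print(rolled.index.nlevels)
print(df.index.nlevels)

# Fix 1: keep groups in order of appearance and assign the raw values by position.
df["volume_5_day"] = (
    df.groupby("stock_id", sort=False)["volume"].rolling(5).mean().to_numpy()
)

# Fix 2: as_index=False keeps the original row labels, so pandas aligns on the index.
df["volume_5_day_alt"] = (
    df.groupby("stock_id", as_index=False)["volume"].rolling(5).mean()["volume"]
)

print(df)
```

Both new columns should hold the same values: NaN for the first four rows of each stock and the 5-row mean on the fifth.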
