I have a df that I’ve read from sql:
id stock_id symbol date open high low close volume 0 1 35 ABSI 2022-09-28 3.06 3.33 3.0400 3.27 217040 1 2 35 ABSI 2022-09-29 3.19 3.19 3.0300 3.12 187309 2 3 35 ABSI 2022-09-30 3.11 3.27 3.0700 3.13 196566 3 4 35 ABSI 2022-10-03 3.16 3.16 2.8600 2.97 310441 4 5 35 ABSI 2022-10-04 3.04 3.37 2.9600 3.27 361082 .. ... ... ... ... ... ... ... ... ... 383 384 16 VVI 2022-10-03 31.93 33.85 31.3050 33.60 151357 384 385 16 VVI 2022-10-04 34.41 35.46 34.1900 35.39 105773 385 386 16 VVI 2022-10-05 34.67 35.30 34.5000 34.86 59605 386 387 16 VVI 2022-10-06 34.80 35.14 34.3850 34.50 55323 387 388 16 VVI 2022-10-07 33.99 33.99 33.3409 33.70 45187 [388 rows x 9 columns]
I then try and get the average of the last 5 days and add it to a new column:
df['volume_5_day'] = df.groupby('stock_id')['volume'].rolling(5).mean()
Which gives me the following error:
Traceback (most recent call last): File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11003, in _reindex_for_setitem reindexed_value = value.reindex(index)._values File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/series.py", line 4672, in reindex return super().reindex(**kwargs) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4966, in reindex return self._reindex_axes( File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4981, in _reindex_axes new_index, indexer = ax.reindex( File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 4237, in reindex target = self._wrap_reindex_result(target, indexer, preserve_names) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2520, in _wrap_reindex_result target = MultiIndex.from_tuples(target) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 204, in new_meth return meth(self_or_cls, *args, **kwargs) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 559, in from_tuples arrays = list(lib.tuples_to_object_array(tuples).T) File "pandas/_libs/lib.pyx", line 2930, in pandas._libs.lib.tuples_to_object_array ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/dan/Documents/code/wolfhound/add_indicators_daily.py", line 10, in <module> df['volume_10_day'] = df.groupby('stock_id')['volume'].rolling(1).mean() File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3655, in __setitem__ self._set_item(key, value) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3832, in _set_item value = self._sanitize_column(value) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 4535, in _sanitize_column return _reindex_for_setitem(value, self.index) File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11010, in _reindex_for_setitem raise TypeError( TypeError: incompatible index of inserted column with frame index
Any ideas what’s going wrong here? Previously this worked and now it’s throwing an error – and I can’t seem to figure out why
Advertisement
Answer
Chain Series.to_numpy
to add the values as a np.array
and make sure to add sort=False
inside df.groupby
:
df['volume_5_day'] = df.groupby('stock_id', sort=False)['volume'] .rolling(5).mean().to_numpy() print(df) id stock_id symbol date ... low close volume volume_5_day 0 1 35 ABSI 2022-09-28 ... 3.0400 3.27 217040 NaN 1 2 35 ABSI 2022-09-29 ... 3.0300 3.12 187309 NaN 2 3 35 ABSI 2022-09-30 ... 3.0700 3.13 196566 NaN 3 4 35 ABSI 2022-10-03 ... 2.8600 2.97 310441 NaN 4 5 35 ABSI 2022-10-04 ... 2.9600 3.27 361082 254487.6 383 384 16 VVI 2022-10-03 ... 31.3050 33.60 151357 NaN 384 385 16 VVI 2022-10-04 ... 34.1900 35.39 105773 NaN 385 386 16 VVI 2022-10-05 ... 34.5000 34.86 59605 NaN 386 387 16 VVI 2022-10-06 ... 34.3850 34.50 55323 NaN 387 388 16 VVI 2022-10-07 ... 33.3409 33.70 45187 83449.0
Your initial approach fails, because the df.groupby
method that you are using, returns a pd.Series
with a different index than your df
. E.g.:
print(df.groupby('stock_id')['volume'].rolling(5).mean().index) MultiIndex([(16, 383), (16, 384), (16, 385), (16, 386), (16, 387), (35, 0), (35, 1), (35, 2), (35, 3), (35, 4)], names=['stock_id', None])
So, it is saying it is unable to map this onto:
print(df.index) Int64Index([0, 1, 2, 3, 4, 383, 384, 385, 386, 387], dtype='int64')
With a np.array
you don’t have this problem. You could also have used:
df['volume_5_day'] = df.groupby('stock_id', as_index=False)['volume'] .rolling(5).mean()['volume']
In this case, you don’t need to add sort=False
, as it will match correctly on the index values.