I have a df that I’ve read from sql:
JavaScript
x
15
15
1
id stock_id symbol date open high low close volume
2
0 1 35 ABSI 2022-09-28 3.06 3.33 3.0400 3.27 217040
3
1 2 35 ABSI 2022-09-29 3.19 3.19 3.0300 3.12 187309
4
2 3 35 ABSI 2022-09-30 3.11 3.27 3.0700 3.13 196566
5
3 4 35 ABSI 2022-10-03 3.16 3.16 2.8600 2.97 310441
6
4 5 35 ABSI 2022-10-04 3.04 3.37 2.9600 3.27 361082
7
..
8
383 384 16 VVI 2022-10-03 31.93 33.85 31.3050 33.60 151357
9
384 385 16 VVI 2022-10-04 34.41 35.46 34.1900 35.39 105773
10
385 386 16 VVI 2022-10-05 34.67 35.30 34.5000 34.86 59605
11
386 387 16 VVI 2022-10-06 34.80 35.14 34.3850 34.50 55323
12
387 388 16 VVI 2022-10-07 33.99 33.99 33.3409 33.70 45187
13
14
[388 rows x 9 columns]
15
I then try and get the average of the last 5 days and add it to a new column:
JavaScript
1
2
1
df['volume_5_day'] = df.groupby('stock_id')['volume'].rolling(5).mean()
2
Which gives me the following error:
JavaScript
1
35
35
1
Traceback (most recent call last):
2
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11003, in _reindex_for_setitem
3
reindexed_value = value.reindex(index)._values
4
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/series.py", line 4672, in reindex
5
return super().reindex(**kwargs)
6
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4966, in reindex
7
return self._reindex_axes(
8
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4981, in _reindex_axes
9
new_index, indexer = ax.reindex(
10
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 4237, in reindex
11
target = self._wrap_reindex_result(target, indexer, preserve_names)
12
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2520, in _wrap_reindex_result
13
target = MultiIndex.from_tuples(target)
14
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 204, in new_meth
15
return meth(self_or_cls, *args, **kwargs)
16
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 559, in from_tuples
17
arrays = list(lib.tuples_to_object_array(tuples).T)
18
File "pandas/_libs/lib.pyx", line 2930, in pandas._libs.lib.tuples_to_object_array
19
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
20
21
The above exception was the direct cause of the following exception:
22
23
Traceback (most recent call last):
24
File "/home/dan/Documents/code/wolfhound/add_indicators_daily.py", line 10, in <module>
25
df['volume_10_day'] = df.groupby('stock_id')['volume'].rolling(1).mean()
26
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3655, in __setitem__
27
self._set_item(key, value)
28
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3832, in _set_item
29
value = self._sanitize_column(value)
30
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 4535, in _sanitize_column
31
return _reindex_for_setitem(value, self.index)
32
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11010, in _reindex_for_setitem
33
raise TypeError(
34
TypeError: incompatible index of inserted column with frame index
35
Any ideas what’s going wrong here? Previously this worked and now it’s throwing an error – and I can’t seem to figure out why
Advertisement
Answer
Chain Series.to_numpy
to add the values as a np.array
and make sure to add sort=False
inside df.groupby
:
JavaScript
1
17
17
1
df['volume_5_day'] = df.groupby('stock_id', sort=False)['volume']
2
.rolling(5).mean().to_numpy()
3
4
print(df)
5
6
id stock_id symbol date low close volume volume_5_day
7
0 1 35 ABSI 2022-09-28 3.0400 3.27 217040 NaN
8
1 2 35 ABSI 2022-09-29 3.0300 3.12 187309 NaN
9
2 3 35 ABSI 2022-09-30 3.0700 3.13 196566 NaN
10
3 4 35 ABSI 2022-10-03 2.8600 2.97 310441 NaN
11
4 5 35 ABSI 2022-10-04 2.9600 3.27 361082 254487.6
12
383 384 16 VVI 2022-10-03 31.3050 33.60 151357 NaN
13
384 385 16 VVI 2022-10-04 34.1900 35.39 105773 NaN
14
385 386 16 VVI 2022-10-05 34.5000 34.86 59605 NaN
15
386 387 16 VVI 2022-10-06 34.3850 34.50 55323 NaN
16
387 388 16 VVI 2022-10-07 33.3409 33.70 45187 83449.0
17
Your initial approach fails, because the df.groupby
method that you are using, returns a pd.Series
with a different index than your df
. E.g.:
JavaScript
1
13
13
1
print(df.groupby('stock_id')['volume'].rolling(5).mean().index)
2
MultiIndex([(16, 383),
3
(16, 384),
4
(16, 385),
5
(16, 386),
6
(16, 387),
7
(35, 0),
8
(35, 1),
9
(35, 2),
10
(35, 3),
11
(35, 4)],
12
names=['stock_id', None])
13
So, it is saying it is unable to map this onto:
JavaScript
1
3
1
print(df.index)
2
Int64Index([0, 1, 2, 3, 4, 383, 384, 385, 386, 387], dtype='int64')
3
With a np.array
you don’t have this problem. You could also have used:
JavaScript
1
3
1
df['volume_5_day'] = df.groupby('stock_id', as_index=False)['volume']
2
.rolling(5).mean()['volume']
3
In this case, you don’t need to add sort=False
, as it will match correctly on the index values.