I was explaining[1] in-place operations vs out-of-place operations to a new user of Pandas. This resulted in us discussing passing objects by reference or by value.
Naturally, I wanted to show pandas.DataFrame.values, as I thought it shared the memory location of the underlying data of the DataFrame. However, I was surprised by, and then sidetracked by, the results of the following code segment.
    import pandas as pd

    df = pd.DataFrame({'x': [1,2,3,4], 'y': [5,4,3,2]})
    print(df)

    df.values += 1  # raises AttributeError

Output:

       x  y
    0  1  5
    1  2  4
    2  3  3
    3  4  2
    <ipython-input-126-9fa9f393972b>:8: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
      df.values += 1
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
       5169                 else:
    -> 5170                     object.__setattr__(self, name, value)
       5171             except (AttributeError, TypeError):

    AttributeError: can't set attribute

    During handling of the above exception, another exception occurred:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-126-9fa9f393972b> in <module>
          6 print(df)
          7
    ----> 8 df.values += 1

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
       5178                     stacklevel=2,
       5179                 )
    -> 5180             object.__setattr__(self, name, value)
       5181
       5182     def _dir_additions(self):

    AttributeError: can't set attribute
However, despite this error, if we re-examine the df, it has changed.
print(df)
       x  y
    0  2  6
    1  3  5
    2  4  4
    3  5  3
Here is my attempt to explain this behavior.
First, we can write df.values += 1 as df.values = df.values.__iadd__(1).
The right-hand side of this expression evaluates properly, so the underlying data is changed in place. Then, re-assigning the result to df.values raises the exception.
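The expansion described above can be reproduced without Pandas. The sketch below is only an illustration under stated assumptions: the Holder class is hypothetical, it uses a plain list instead of a NumPy array, and it is not how Pandas implements .values. It shows that an augmented assignment on a read-only property runs the in-place step first and only then fails on the assignment.

```python
class Holder:
    """Hypothetical stand-in for a DataFrame with a read-only property."""

    def __init__(self, data):
        self._data = data

    @property
    def values(self):
        # Return the underlying list itself, not a copy.
        return self._data


holder = Holder([1, 2, 3])
raised = False
try:
    # Roughly: tmp = holder.values; tmp = tmp.__iadd__([4]); holder.values = tmp
    holder.values += [4]
except AttributeError:
    # The final assignment fails because the property has no setter.
    raised = True

print(raised)        # True
print(holder._data)  # [1, 2, 3, 4]: the in-place step had already run
```

The exact wording of the AttributeError message differs between Python versions, so the sketch only checks the exception type.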
If I break up these two operations, no error is raised and the underlying data is changed.
    print(df)
    values = df.values
    values += 1
    print(df)

Output:

       x  y
    0  2  6
    1  3  5
    2  4  4
    3  5  3
       x  y
    0  3  7
    1  4  6
    2  5  5
    3  6  4
Is this a bug?
Should .values be treated differently by __getattr__/__setattr__?
Part of me wants to say this is not a bug, as the user should read the documentation and use the recommended replacement, pandas.DataFrame.to_numpy.
However, part of me says that it is pretty unintuitive to see an “AttributeError: can’t set attribute” but have the underlying operation actually work. That being said, I can’t think of a solution that allows these operations to work in the proper situations while still preventing improper use.
Does anyone have any insights into this?
[1]: Until I got derailed by this issue and [Insert Link] potential issue.
Answer
Pass-by-value vs. pass-by-reference in Python is a knotty topic. See "Emulating pass-by-value behaviour in python", and also read the comments under that question.
This is the ‘state of the art’:
Not quite. Python passes arguments neither by reference nor by value, but by assignment.
Source: https://realpython.com/python-pass-by-reference/
A similar explanation is at https://www.geeksforgeeks.org/pass-by-reference-vs-value-in-python/
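A minimal sketch of what "by assignment" means in the quote above. The function names mutate and rebind are made up for illustration. Mutating the passed object is visible to the caller, while rebinding the parameter name is not.

```python
def mutate(items):
    # Changes the object the caller passed in; the caller sees it.
    items.append(99)


def rebind(items):
    # Builds a new list and rebinds only the local name.
    items = items + [99]
    return items


original = [1, 2, 3]
mutate(original)
print(original)   # [1, 2, 3, 99]

result = rebind(original)
print(original)   # [1, 2, 3, 99]: unchanged by rebind
print(result)     # [1, 2, 3, 99, 99]
```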
Building on this, I think this behavior is not a bug, but it is in a grey zone. In my opinion, it is rooted in the link between Pandas and NumPy. df.values returns a NumPy representation (an array) of the DataFrame ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html ); call it a. a += 1 is valid syntax for incrementing every element of a NumPy array in place (https://scipy-lectures.org/intro/numpy/operations.html). On the other hand, according to the warning in the output, Pandas [!] does not allow new columns to be created via a new attribute name. The error emerges from the re-assignment step hidden in df.values += 1, which behaves like df.values = df.values.__iadd__(1): the right-hand side increments the NumPy array in place (which is valid), and the result is then assigned back to df.values, which raises the error shown in the question.
The data still changes because the in-place step alters the same memory location, the same object, before the assignment fails. So it is not essentially a bug, but it is also not purely white; it is grey.