
Questions about In-place memory operations in pandas (1/2)

I was explaining[1] in-place operations vs out-of-place operations to a new user of pandas. This led us to discuss passing objects by reference or by value.

Naturally, I wanted to show pandas.DataFrame.values, as I thought it shared the memory location of the DataFrame's underlying data. However, I was surprised, and then sidetracked, by the results of the following code segment.

import pandas as pd
df = pd.DataFrame({'x': [1,2,3,4],
                   'y': [5,4,3,2]})
print(df)
df.values += 1 # raises AttributeError
   x  y
0  1  5
1  2  4
2  3  3
3  4  2
<ipython-input-126-9fa9f393972b>:8: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.values += 1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   5169                 else:
-> 5170                     object.__setattr__(self, name, value)
   5171             except (AttributeError, TypeError):

AttributeError: can't set attribute

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-126-9fa9f393972b> in <module>
      6 print(df)
      7 
----> 8 df.values += 1

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   5178                         stacklevel=2,
   5179                     )
-> 5180                 object.__setattr__(self, name, value)
   5181 
   5182     def _dir_additions(self):

AttributeError: can't set attribute

However, despite this error, if we re-examine the df, it has changed.

print(df)
   x  y
0  2  6
1  3  5
2  4  4
3  5  3

My attempt to explain this behavior.

First, we can write df.values += 1 as df.values = df.values.__iadd__(1)

That means the RHS of this expression evaluates properly, changing the underlying data in place. Then the re-assignment of df.values to the result raises the exception.
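The same two-step behavior can be reproduced without pandas. The following is a minimal sketch, assuming a hypothetical class named Holder whose read-only values property returns a mutable list; it is not how pandas is implemented, only an illustration of how Python handles += on an attribute that has no setter.

```python
class Holder:
    """Hypothetical stand-in for a DataFrame (not pandas code)."""

    def __init__(self, data):
        self._data = data

    @property
    def values(self):
        # Read-only property: returns the same list object each time,
        # and no setter is defined.
        return self._data


holder = Holder([1, 2, 3])

try:
    # Step 1: list.__iadd__ extends the list in place.
    # Step 2: Python tries to assign the result back to holder.values,
    #         which fails because the property has no setter.
    holder.values += [4]
except AttributeError as exc:
    error = exc
else:
    error = None

print(type(error).__name__)
print(holder.values)
```

The exception is raised only by the second step, so the mutation from the first step has already happened by the time the error appears.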

If I break up these two operations, no error is raised and the underlying data is changed.

print(df)

values = df.values

values += 1

print(df)
   x  y
0  2  6
1  3  5
2  4  4
3  5  3
   x  y
0  3  7
1  4  6
2  5  5
3  6  4

Is this a bug?

Should .values be treated differently than with __getattr__/__setattr__?

Part of me wants to say this is not a bug, as the user should read the documentation and use the recommended replacement, pandas.DataFrame.to_numpy.
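For comparison, here is a small sketch of two approaches that do not rely on whether the returned array shares memory with the DataFrame. It assumes the same df as in the question; the variable names arr, unchanged and incremented are only illustrative. Whether a plain df.values or df.to_numpy() call returns a view or a copy can depend on the dtypes involved and on the pandas version, so the sketch requests a copy explicitly.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 4, 3, 2]})

# An explicit copy is independent of the frame, so editing it
# does not touch the DataFrame.
arr = df.to_numpy(copy=True)
arr += 1
unchanged = df["x"].tolist()

# To change the frame itself, operate on the frame rather than
# on the array it hands out.
df += 1
incremented = df["x"].tolist()

print(unchanged)
print(incremented)
```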

However, part of me says that it is pretty unintuitive to see an “AttributeError: can’t set attribute” but have the underlying operation actually work. That being said, I can’t think of a solution that allows these operations to work in the proper situations while still preventing improper use.

Does anyone have any insights into this?

[1]: Until I got derailed by this issue and [Insert Link] potential issue.


Answer

Pass-by-value vs. pass-by-reference in Python is a knotty topic; see "Emulating pass-by-value behaviour in python", and also read the comments under that question.

This is the ‘state of the art’:

Not quite. Python passes arguments neither by reference nor by value, but by assignment.

source : https://realpython.com/python-pass-by-reference/

A similar explanation is at https://www.geeksforgeeks.org/pass-by-reference-vs-value-in-python/
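As a rough illustration of "pass by assignment", the sketch below uses two hypothetical functions, rebind and mutate. Both receive the same list object; the difference is whether the function rebinds its local name to a new object or mutates the object it was given.

```python
def rebind(items):
    # `items + [99]` builds a new list and rebinds the local name;
    # the caller's list is left untouched.
    items = items + [99]
    return items


def mutate(items):
    # `+=` on a list calls list.__iadd__, which extends the
    # caller's list in place.
    items += [99]
    return items


original = [1, 2]

rebound = rebind(original)
after_rebind = list(original)   # snapshot of the caller's list

mutated = mutate(original)

print(after_rebind)
print(original)
```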

Starting from this, I think the behavior is not a bug, but it sits in a grey zone. In my opinion it is rooted in the link between pandas and NumPy. df.values returns a NumPy representation (an array) of the DataFrame ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html ); call it a. a + 1 is valid syntax for incrementing every element of a NumPy array ( https://scipy-lectures.org/intro/numpy/operations.html ), and a += 1 does the same in place. On the other hand, according to the warning, pandas [!] does not allow new columns to be created via a new attribute name. The error emerges from the re-assignment step in df.values += 1, which Python treats roughly as df.values = df.values.__iadd__(1): the array returned by df.values is first incremented in place (which is valid), and the result is then assigned back to df.values.

The assignment back to df.values is what raises the known error, because values is a read-only property. The data has nevertheless changed because, in this case, the array shares memory with the DataFrame, so the in-place increment has already altered the same memory location, the same object, before the assignment fails. So it is not really a bug, but it is not purely white either; it is grey.
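The difference between a + 1 and a += 1 on a NumPy array can be checked directly. This is a small sketch using plain NumPy, independent of pandas; the names alias, b, before and same_object are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
alias = a                 # second name bound to the same array

b = a + 1                 # allocates a new array; `a` is untouched
before = a.tolist()

a += 1                    # ndarray.__iadd__ writes into the existing buffer
same_object = a is alias  # the name `a` still refers to the same array

print(before)
print(alias.tolist())
print(np.shares_memory(a, b))
```

Only the in-place form is visible through other references to the same array, which is why the DataFrame in the question changed even though the assignment failed.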

User contributions licensed under: CC BY-SA