Questions about In-place memory operations in pandas (1/2)

I was explaining[1] in-place vs. out-of-place operations to a new pandas user. This led us to discuss passing objects by reference or by value.

Naturally, I wanted to show pandas.DataFrame.values, as I thought it shared the memory location of the DataFrame's underlying data. However, I was surprised, and then sidetracked, by what happens when df.values += 1 is run on a DataFrame df: it raises an AttributeError.

However, despite this error, if we re-examine the df, it has changed.

My attempt to explain this behavior.

First, we can write df.values += 1 as df.values = df.values.__iadd__(1)

That means the RHS of this expression evaluates properly, changing the underlying data in place. Then, re-assigning the result to df.values raises the exception.
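The desugaring described above can be sketched without pandas. The classes `Buffer` and `Frame` below are hypothetical stand-ins (not pandas or NumPy APIs): `Buffer` supports in-place addition like an array, and `Frame` exposes it through a read-only property, as assumed for `df.values`. Whether the real `.values` shares memory with the DataFrame depends on the dtypes and the pandas version, so treat this only as an illustration of the mechanism.

```python
class Buffer:
    """Hypothetical stand-in for an array that supports in-place addition."""

    def __init__(self, data):
        self.data = list(data)

    def __iadd__(self, other):
        # Mutate the existing storage and return the same object.
        for i in range(len(self.data)):
            self.data[i] += other
        return self


class Frame:
    """Hypothetical stand-in for a DataFrame with a read-only `values` property."""

    def __init__(self, data):
        self._buf = Buffer(data)

    @property
    def values(self):
        # No setter is defined, so assigning to `values` raises AttributeError.
        return self._buf


frame = Frame([1, 2, 3])
error = None
try:
    # Step 1: frame.values.__iadd__(1) runs and mutates the buffer.
    # Step 2: the result is assigned back to frame.values, which fails.
    frame.values += 1
except AttributeError as exc:
    error = exc

print(type(error).__name__)  # AttributeError
print(frame.values.data)     # [2, 3, 4]
```

The exception comes from the second step only, which is why the data has already changed by the time it is raised.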

If I break up these two operations, no error is raised and the underlying data is changed.
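The same split can be reproduced with plain Python objects. As an analogy only (an assumption, not the pandas code path), the sketch below uses a tuple holding a list: the tuple rejects item assignment, much like a property without a setter rejects attribute assignment.

```python
holder = ([1, 2, 3],)  # a tuple (no item assignment) holding a mutable list

# Combined form: the list is extended in place, then the assignment
# back into the tuple fails.
combined_error = None
try:
    holder[0] += [4]
except TypeError as exc:
    combined_error = exc

# Split form: fetch the object first, then mutate it in place. Nothing is
# assigned back into the tuple, so no exception is raised.
inner = holder[0]
inner += [5]

print(type(combined_error).__name__)  # TypeError
print(holder)                         # ([1, 2, 3, 4, 5],)
```

Both updates reach the list; only the combined form also attempts the forbidden re-assignment.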

Is this a bug?

Should .values be treated differently with respect to __getattr__/__setattr__?

Part of me wants to say this is not a bug, as the user should read the documentation and use the recommended replacement, pandas.DataFrame.to_numpy.

However, part of me says that it is pretty unintuitive to see an “AttributeError: can’t set attribute” but have the underlying operation actually work. That being said, I can’t think of a solution that allows these operations to work in the proper situations while still preventing improper use.

Does anyone have any insights into this?

[1]: Until I got derailed by this issue and [Insert Link] potential issue.

Answer

Pass-by-value vs. pass-by-reference in Python is a knotty topic; see “Emulating pass-by-value behaviour in python” and also read the comments under that question.

This is the ‘state of the art’:

Not quite. Python passes arguments neither by reference nor by value, but by assignment.

Source: https://realpython.com/python-pass-by-reference/

A similar explanation is at https://www.geeksforgeeks.org/pass-by-reference-vs-value-in-python/
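To make “pass by assignment” concrete, here is a minimal sketch. The function names `mutate` and `rebind` are made up for illustration; the point is only that the parameter name is bound to the caller's object, so in-place changes are shared while rebinding the name is not.

```python
def mutate(items):
    # The parameter refers to the caller's list, so an in-place change
    # is visible outside the function.
    items.append(4)


def rebind(items):
    # `items + [4]` builds a new list; assigning it only rebinds the
    # local name and leaves the caller's list untouched.
    items = items + [4]
    return items


original = [1, 2, 3]
mutate(original)
print(original)  # [1, 2, 3, 4]

result = rebind(original)
print(original)  # [1, 2, 3, 4]
print(result)    # [1, 2, 3, 4, 4]
```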

Starting from this, I think the behavior is not a bug, but it is in a grey zone. In my opinion, it is rooted in the link between pandas and NumPy. df.values returns a NumPy representation (an array) of the DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html). Call it a; then a + 1 is valid syntax for incrementing every element of a NumPy array (https://scipy-lectures.org/intro/numpy/operations.html), and a += 1 does the same in place. On the other hand, according to the error message, pandas [!] does not allow new columns to be created via a new attribute. This error message emerges from the re-assignment step in df.values += 1, which behaves like df.values = df.values.__iadd__(1): df.values is a NumPy array, and incrementing it in place is valid.

The resulting NumPy array is then re-assigned to df.values, which raises the known error message. The data still changes because the in-place addition alters the same memory location, the same object, before the assignment fails. So it is not really a bug; however, it is not purely white either, but grey…

User contributions licensed under: CC BY-SA