I always assumed that apply does not change the original pandas DataFrame and that you need to assign its return value to keep the changes. Could anyone help explain why the following happens?
    import pandas as pd

    def f(row):
        row['a'] = 10
        row['b'] = 20

    df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1]}) #, 'd':[[1,2],[1,2],[1,2]]
    df_x.apply(f, axis = 1)
    df_x
returns
        a   b  c
    0  10  20  1
    1  10  20  1
    2  10  20  1
So apply changed the original pd.DataFrame even though the function returns nothing. But if the DataFrame contains a column of a non-basic type, nothing changes:
    def f(row):
        row['a'] = 10
        row['b'] = 20
        row['d'] = [0]

    df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1], 'd':[[1,2],[1,2],[1,2]]})
    df_x.apply(f, axis = 1)
    df_x
This returns the DataFrame unchanged:
        a  b  c       d
    0  10  3  1  [1, 2]
    1  11  4  1  [1, 2]
    2  12  5  1  [1, 2]
Could anyone explain this or point to a reference? Thanks.
Answer
Series are mutable objects. If you modify them during an operation, the changes will be reflected if no copy is made.
This is what happens in the first case. My guess: no copy is made because your DataFrame has a homogeneous dtype (integer), so the whole DataFrame is stored internally as a single array and each row is passed to the function as a view on it.
In the second case, at least one item is a list. That column gets dtype object, so the DataFrame no longer has a single dtype, and apply must build a new Series for each row before calling the function because of the mixed types. The function then modifies that new Series, not the original data.
You can actually reproduce this just by changing a single element to another type:
    def f(row):
        row['a'] = 10
        row['b'] = 20

    df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1.,1]}) # float
    df_x.apply(f, axis = 1)
    df_x

    # different types
    # no mutation
        a  b    c
    0  10  3  1.0
    1  11  4  1.0
    2  12  5  1.0
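To see which of these situations a given DataFrame is in, it can help to count its distinct column dtypes before calling apply. Here is a minimal sketch that rebuilds the three frames from this thread. The names df_int, df_list and df_float are made up for illustration, and whether a single-dtype frame is actually mutated may still depend on your pandas version.

```python
import pandas as pd

# All-integer frame (first case in the question)
df_int = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1, 1]})

# Frame with a column of lists (second case): 'd' has dtype object
df_list = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1, 1],
                        'd': [[1, 2], [1, 2], [1, 2]]})

# Frame with one float value in 'c' (example from the answer)
df_float = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1., 1]})

# Count the distinct column dtypes in each frame
n_int = df_int.dtypes.nunique()
n_list = df_list.dtypes.nunique()
n_float = df_float.dtypes.nunique()

print(df_list.dtypes)
print(n_int, n_list, n_float)
```

Only the first frame has a single dtype; the other two mix integer columns with an object or float column.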
Take home message: never modify a mutable input in a function (unless you want it and know what you’re doing).
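One way to follow that advice is to have the function work on a copy of the row, return it, and assign the result of apply. The sketch below assumes that approach; f_safe is a name invented for this example, and the assign call is just an alternative for the constant values used in the question.

```python
import pandas as pd

def f_safe(row):
    # Work on a copy so the input row is never modified
    row = row.copy()
    row['a'] = 10
    row['b'] = 20
    return row

df_x = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1, 1]})

# apply returns a new DataFrame built from the returned rows
df_new = df_x.apply(f_safe, axis=1)

# For constant values, assign gives the same result without a row-wise function
df_const = df_x.assign(a=10, b=20)

print(df_new)
print(df_x)  # original data is left as it was
```

Because the changes live in the returned object, the outcome no longer depends on whether pandas hands the function a view or a fresh Series.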