I have been trying to use method chaining in Pandas, but a few things about how you reference a DataFrame or its columns keep tripping me up.
For example, in the code below I filter the dataset and then want to create a new column that sums the columns remaining after the filter. However, I don't know how to reference the DataFrame that has just been created by the filter. df in the example below refers to the original DataFrame.
df = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)
df = (
    df
    .filter(like='x')
    .assign(n = df.sum(axis=1))
)
df.head(6)
Or what about this instance, where the DataFrame is created inside the method chain? This would normally be a pd.read_csv step rather than generating the DataFrame by hand. This piece of code naturally does not work, because df2 has not been created yet.
df2 = (
    pd.DataFrame(
        {
            'xx': [1, 2, 3, 4, 5, 6],
            'xy': [1, 2, 3, 4, 5, 6],
            'z': [1, 2, 3, 4, 5, 6],
        }
    )
    .assign(
        xx = df2['xx'].mask(df2['xx'] > 2, 0)
    )
)
df2.head(6)
Interestingly enough, the issue above is not a problem here, as df3['xx'] appears to refer to the df3 that has been queried. That makes some sense in the context of the second example, but it does not fit with the first example.
df3 = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)
df3 = (
    df3
    .query('xx > 3')
    .assign(
        xx = df3['xx'].mask(df3['xx'] > 4, 0)
    )
)
df3.head(6)
I have worked in other languages/libraries such as R and PySpark, where method chaining is quite flexible and does not appear to have these barriers. Perhaps I am missing something about how it is meant to be done in Pandas, or about how you are meant to reference df['xx'] in some other manner.
Lastly, I understand that the example problems are easily worked around, but I am trying to understand whether there is a set method chaining syntax for referencing these columns that I am not aware of.
Answer
For referencing the DataFrame produced by a previous computation, an anonymous function (lambda) helps:
df.filter(like='x').assign(n = lambda df: df.sum(1))

   xx  xy   n
0   1   1   2
1   2   2   4
2   3   3   6
3   4   4   8
4   5   5  10
5   6   6  12
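The lambda receives the DataFrame as it exists at that point in the chain (here, the filtered result) rather than the original df. This works with assign because assign accepts callables.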
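The same idea should carry over to the second and third examples from the question. The snippet below is a minimal sketch rather than the only way to do it; the names data and d are illustrative and not from the original code.

```python
import pandas as pd

# Same toy data as in the question (the name `data` is illustrative).
data = {
    "xx": [1, 2, 3, 4, 5, 6],
    "xy": [1, 2, 3, 4, 5, 6],
    "z": [1, 2, 3, 4, 5, 6],
}

# Second example: the DataFrame is created inside the chain, so there is
# no outer variable to refer to. The lambda receives it as `d` instead.
df2 = (
    pd.DataFrame(data)
    .assign(xx=lambda d: d["xx"].mask(d["xx"] > 2, 0))
)

# Third example: `d` is the queried DataFrame, not the original one.
df3 = (
    pd.DataFrame(data)
    .query("xx > 3")
    .assign(xx=lambda d: d["xx"].mask(d["xx"] > 4, 0))
)

print(df2)
print(df3)
```

In the question's third example, df3['xx'] is still the original column; it only appears to work because assign aligns the new values on the index of the queried rows. Using a lambda avoids relying on that alignment.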
The pipe method is another option, where you can chain methods while referencing the computed DataFrame.
The example below is superfluous, but hopefully it shows how pipe works:
df3.pipe(lambda df: df.assign(r = 2))

Out[37]:
   xx  xy  z  r
0   1   1  1  2
1   2   2  2  2
2   3   3  3  2
3   4   4  4  2
4   5   5  5  2
5   6   6  6  2
Not all Pandas functions support chaining; this is where the pipe function can come in handy. You can even write custom functions and pass them to pipe.
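As a rough illustration of passing a custom function to pipe, here is a small sketch. The helper add_row_total and its name parameter are hypothetical, made up for this example; pipe simply calls the function with the current DataFrame as the first argument and forwards any extra arguments.

```python
import pandas as pd


def add_row_total(frame, name="n"):
    """Hypothetical helper: return a copy of `frame` with a row-sum column."""
    return frame.assign(**{name: frame.sum(axis=1)})


df = pd.DataFrame(
    {
        "xx": [1, 2, 3, 4, 5, 6],
        "xy": [1, 2, 3, 4, 5, 6],
        "z": [1, 2, 3, 4, 5, 6],
    }
)

# pipe hands the filtered DataFrame to add_row_total, so the sum only
# covers the columns that survived the filter.
result = df.filter(like="x").pipe(add_row_total, name="n")
print(result)
```

Keeping the logic in a named function makes the chain easier to read and lets the same step be reused in other chains.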
All of this information is in the docs: assign; pipe; function application; assignment in method chaining