Is there a better more readable way to coalese columns in pandas

Tags: ,



I often need a new column that is the best I can achieve from other columns and I have a specific list of preference priorities. I am willing to take the first non null value.

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},                   
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print df

Results

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

this code works (and the result are what I want) but it is not very fast.
I get to pick my priorities if I need to [[‘second’,’third’,’first’]]

Coalesce somewhat like the function of the same name from tsql.
I suspect that I may have overlooked an easy way to achieve it with good performance on large DataFrames (+400,000 rows)

I know there are lots of ways to fill in missing data which I often use on axis=0 this is what makes me think I may have missed an easy option for axis=1

Can you suggest something nicer/faster… or confirm that this is as good as it gets.

Answer

You could use pd.isnull to find the null — in this case None — values:

In [169]: pd.isnull(df)
Out[169]: 
   first second  third
0  False  False  False
1   True  False  False
2   True   True  False
3   True   True   True
4  False   True  False

and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:

In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])

Then you could select the desired values from df using NumPy integer-indexing:

In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,

import pandas as pd
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},                   
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

yields

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

Using argmin instead of df3.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:

df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 ┬Ás per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop


Source: stackoverflow