Is there a better, more readable way to coalesce columns in pandas?

Question

I often need a new column that holds the best value I can get from other columns, and I have a specific list of preference priorities. I am willing to take the first non-null value. This code works (and the results are what I want), but it is not very fast. I get to pick my priorities if I

Accepted Answer

You could use pd.isnull to find the null (in this case None) values:

    In [169]: pd.isnull(df)
    Out[169]:
       first second  third
    0  False  False  False
    1   True  False  False
    2   True   True  False
    3   True   True   True
    4  False   True  False

and then use np.argmin to find the index of the first non-null value in each row. Since False sorts before True, the first False in a row is its minimum. If all the values in a row are null, np.argmin returns 0:

    In [186]: np.argmin(pd.isnull(df).values, axis=1)
    Out[186]: array([0, 1, 2, 0, 0])

Then you could select the desired values from df using NumPy integer-indexing:

    In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
    Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,

    import numpy as np
    import pandas as pd

    # columns= pins the column order; recent pandas versions would otherwise
    # keep the dict key order (third, first, second).
    df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                       {'third':'B','first':None,'second':'C'},
                       {'third':'B','first':None,'second':None},
                       {'third':None,'first':None,'second':None},
                       {'third':'B','first':'A','second':None}],
                      columns=['first', 'second', 'third'])

    # combo1: priority is the column order first, second, third
    mask = pd.isnull(df).values
    df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]

    # combo2: priority is second, third, first (column positions 1, 2, 0)
    order = np.array([1,2,0])
    mask = mask[:, order]
    df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

yields

      first second third combo1 combo2
    0     A      C     B      A      C
    1  None      C     B      C      C
    2  None   None     B      B      B
    3  None   None  None   None   None
    4     A   None     B      A      B

Using argmin instead of df3.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows (more than 100 times faster in the timing below):

    df2 = pd.concat([df]*1000)

    In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
    1000 loops, best of 3: 617 µs per loop

    In [231]: %timeit df2.apply(coalesce, axis=1)
    10 loops, best of 3: 84.1 ms per loop
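
If you need this for several priority lists, the argmin approach from this answer can be wrapped in a small helper. The following is only a sketch: the function name first_non_null and its priority parameter are made up for illustration and are not part of pandas or of the original answer. It takes the priority as a list of column names rather than integer positions, which may read more easily than the order array.

```python
import numpy as np
import pandas as pd


def first_non_null(df, priority):
    """Hypothetical helper: per row, the first non-null value among the
    columns named in `priority`, checked in that order."""
    # Select the columns in preference order and drop down to NumPy.
    values = df[priority].to_numpy()
    mask = pd.isnull(values)
    # For a boolean row, argmin is the position of the first False, i.e. the
    # first non-null entry. It is 0 when the whole row is null, so such a row
    # simply yields the (null) value of its first-priority column.
    first = np.argmin(mask, axis=1)
    return pd.Series(values[np.arange(len(df)), first], index=df.index)


# Same data as the example in the answer, built column by column.
df = pd.DataFrame(
    {
        "first": ["A", None, None, None, "A"],
        "second": ["C", "C", None, None, None],
        "third": ["B", "B", "B", None, "B"],
    }
)

combo1 = first_non_null(df, ["first", "second", "third"])
combo2 = first_non_null(df, ["second", "third", "first"])
```

Because the helper returns a Series aligned to df.index, the results can be assigned straight back, for example df['combo1'] = combo1. Row 3, where every column is null, stays null in both results.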

Answer