I often need a new column that is the best value I can extract from other columns, following a specific list of priorities: I am willing to take the first non-null value.

```python
import pandas as pd

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second', 'third', 'first']].apply(coalesce, axis=1)
print(df)
```

Results

```
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
```

This code works (and the results are what I want), but it is not very fast.

I get to pick my priority order if I need to, e.g. `['second', 'third', 'first']`.

`coalesce` works somewhat like the function of the same name from T-SQL.

I suspect that I may have overlooked an easy way to achieve this with good performance on large DataFrames (400,000+ rows).

I know there are lots of ways to fill in missing data, which I often use on axis=0; this is what makes me think I may have missed an easy option for axis=1.

Can you suggest something nicer/faster, or confirm that this is as good as it gets?

You could use `pd.isnull` to find the null values (in this case, `None`):

```
In [169]: pd.isnull(df)
Out[169]:
   first second  third
0  False  False  False
1   True  False  False
2   True   True  False
3   True   True   True
4  False   True  False
```

and then use `np.argmin` to find the index of the first non-null value. If all the values are null, `np.argmin` returns 0:

```
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
```
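The all-null edge case works out conveniently: `np.argmin` breaks ties by returning the first position, so an all-null row falls back to column 0, whose value is itself null. A quick check:

```python
import numpy as np

# All values equal (all True = all null), so argmin returns the
# first position -- the row still resolves to a null value.
print(np.argmin(np.array([True, True, True])))  # 0
```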

Then you could select the desired values from `df` using NumPy integer indexing:

```
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
```

For example,

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
order = np.array([1, 2, 0])        # column positions of second, third, first
mask = mask[:, order]              # rearrange mask columns into priority order
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
```

yields

```
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
```
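The steps above can be wrapped into a reusable helper that takes the priority order as a list of column names. This is my own sketch (the name `coalesce_numpy` and its signature are not from the answer above):

```python
import numpy as np
import pandas as pd

def coalesce_numpy(df, priority):
    # Integer positions of the priority columns within df
    order = np.array([df.columns.get_loc(c) for c in priority])
    # True where a value is missing, columns rearranged into priority order
    mask = pd.isnull(df).values[:, order]
    # argmin finds the first False (first non-null) per row; an all-null
    # row gives 0, which still selects a null value, so the result is None
    return df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None}])
print(coalesce_numpy(df, ['second', 'third', 'first']))  # ['C' 'B' None]
```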

Using `argmin` instead of `df.apply(coalesce, ...)` is significantly quicker if the DataFrame has a lot of rows:

```
df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
```
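As an aside, the axis=1 fill methods the question alludes to can also express this directly: back-filling across each row pulls the first non-null value into the leftmost column. A sketch, assuming a pandas version where `bfill(axis=1)` supports object-dtype frames:

```python
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])

# Select the columns in priority order, back-fill across each row,
# then keep the leftmost column: the first non-null value per row.
df['combo2'] = df[['second', 'third', 'first']].bfill(axis=1).iloc[:, 0]
print(df['combo2'].tolist())
```

This is the most readable option, though for very large frames the NumPy `argmin` approach above may still be faster.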
