I often need a new column that is the best value I can get from several other columns, according to a specific list of preference priorities, and I am willing to take the first non-null value.
import pandas as pd

def coalesce(values):
    # return the first non-null value, or None if every value is null
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'first': 'A', 'second': 'C', 'third': 'B'},
                   {'first': None, 'second': 'C', 'third': 'B'},
                   {'first': None, 'second': None, 'third': 'B'},
                   {'first': None, 'second': None, 'third': None},
                   {'first': 'A', 'second': None, 'third': 'B'}])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second', 'third', 'first']].apply(coalesce, axis=1)
print(df)
Results
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
This code works (and the results are what I want), but it is not very fast.
I also get to pick my priorities when I need to, e.g. [['second', 'third', 'first']]. coalesce works somewhat like the function of the same name from T-SQL.
I suspect that I may have overlooked an easy way to achieve this with good performance on large DataFrames (400,000+ rows).
I know there are lots of ways to fill in missing data, which I often use on axis=0; this is what makes me think I may have missed an easy option for axis=1.
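For instance, the axis=0 idioms I mean are things like the following (a quick sketch with made-up data):

import pandas as pd

s = pd.DataFrame({'x': [1.0, None, None, 4.0]})
s.ffill()      # propagate the last valid value down each column
s.fillna(0)    # or replace missing values with a constant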
Can you suggest something nicer/faster, or confirm that this is as good as it gets?
Answer
You could use pd.isnull to find the null (in this case, None) values:
In [169]: pd.isnull(df)
Out[169]:
   first second  third
0  False  False  False
1   True  False  False
2   True   True  False
3   True   True   True
4  False   True  False
and then use np.argmin to find the index of the first non-null value in each row. If all the values in a row are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
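This works because booleans compare as integers (False is 0, True is 1), so np.argmin returns the position of the first False, i.e. the first non-null value. A minimal illustration (my own example, not from the session above):

import numpy as np
np.argmin(np.array([True, False, False]))   # -> 1, the position of the first False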
Then you could select the desired values from df using NumPy integer indexing. (Note that for the all-null row, argmin's 0 still selects a null value, which is exactly the desired output.)
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame([{'first': 'A', 'second': 'C', 'third': 'B'},
                   {'first': None, 'second': 'C', 'third': 'B'},
                   {'first': None, 'second': None, 'third': 'B'},
                   {'first': None, 'second': None, 'third': None},
                   {'first': 'A', 'second': None, 'third': 'B'}])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]

# the priority ['second', 'third', 'first'] as positional column indices
order = np.array([1, 2, 0])
mask = mask[:, order]   # null-mask with columns rearranged into priority order
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
yields
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
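If you use this pattern a lot, it can be wrapped in a small helper that looks the priority columns up by name instead of hard-coding their positions (a sketch; coalesce_columns is my own name, not a pandas API):

import numpy as np
import pandas as pd

def coalesce_columns(df, columns):
    # positional indices of the priority columns, in priority order
    order = np.array([df.columns.get_loc(c) for c in columns])
    mask = pd.isnull(df).values[:, order]    # null-mask, columns in priority order
    first = order[np.argmin(mask, axis=1)]   # original position of the first non-null per row
    return df.values[np.arange(len(df)), first]

df['combo2'] = coalesce_columns(df, ['second', 'third', 'first'])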
Using argmin instead of df.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
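As an aside, pandas' built-in bfill can express the same row-wise coalesce in one line when the priority columns share a dtype, as the object columns here do (combo2_alt is an illustrative name; I have not benchmarked this against the argmin approach):

# select columns in priority order, backfill across each row, take the first column
df['combo2_alt'] = df[['second', 'third', 'first']].bfill(axis=1).iloc[:, 0]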