I have a big dataframe with more than 2500 columns but the structure is very similar than this:
A B C D E 0 1 0 8 0 0 1 0 0 0 0 5 2 1 2 3 0 0 3 0 2 0 1 0
I need to detect all the columns with a value greater than 0 and I have done with this:
df['X'] = df.gt(0).dot(df.columns + ',')
And then I get this:
A B C D E X 0 1 0 8 0 0 A,C 1 0 0 0 0 5 E 2 1 2 9 0 0 A,B,C 3 0 3 0 1 0 B,D
The problem is that I don’t need in ‘X’ the name of the columns but the value of the columns and I need to perform the following mathematical operation:
I want it to detect only the 2 lowest values and subtract the lowest value from the second lowest. In the case that there were not more than 2 values in ‘X’ it would be enough for me to show the value of the column only.
The final result in my example would be the following:
A B C D E X 0 1 0 8 0 0 7 1 0 0 0 0 5 5 2 1 2 9 0 0 1 3 0 3 0 1 0 2
Any idea how to solve it or any orientation?
Advertisement
Answer
We can do nsmallest
then follow by np.ptp
and condition for those row only have one value not equal to 0
df['new'] = df.apply(lambda x : np.ptp(pd.Series.nsmallest(x[x!=0],2)) if sum(x!=0) != 1 else x[x!=0].iloc[0],axis=1) Out[520]: 0 7 1 5 2 1 3 1 dtype: int64
Or doing two steps
df['new'] = df[df.ne(0).sum(1)>1].apply(lambda x : np.ptp(pd.Series.nsmallest(x,2)),axis=1) df['new'].fillna(df.max(1),inplace=True) df Out[530]: A B C D E new 0 1 0 8 0 0 7.0 1 0 0 0 0 5 5.0 2 1 2 3 0 0 1.0 3 0 2 0 1 0 1.0