Skip to content
Advertisement

Pandas unique values per row, variable number of columns with data

Consider the below dataframe:

import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]

df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])

    num0    num1    num2    num3
0   111.0   NaN     NaN     111.0
1   112.0   112.0   NaN     115.0
2   113.0   NaN     NaN     NaN
3   NaN     NaN     NaN     NaN
4   118.0   110.0   117.0   NaN

Assuming my index is unique, I’m looking to retrieve the unique values per index row, to an output like the one below. I wish to keep the empty rows.

    num1    num2    num3
0   111.0   NaN     NaN
1   112.0   115.0   NaN
2   113.0   NaN     NaN
3   NaN     NaN     NaN
4   110.0   117.0   118.0

I have a working, albeit slow, solution, see below. The output number order is not relevant, as long all values are presented to the leftmost column and nulls to the right. I’m looking for best practices and potential ideas to speed up the code. Thank you in advance.

def arrange_row(row):
    values = list(set(row.dropna(axis=1).values[0]))
    values = [nan] if not values else values
    series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values)+1)])
    return series

df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
pd.version == '1.2.3'

Advertisement

Answer

Another option, albeit longer:

outcome = (df.melt(ignore_index= False) # keep the index as a tracker
             .reset_index()
            # get the unique rows
             .drop_duplicates(subset=['index','value'])
             .dropna()
            # use this to build the new column names
             .assign(counter = lambda df: df.groupby('index').cumcount() + 1)
             .pivot('index', 'counter', 'value')
             .add_prefix('num')
             .reindex(df.index)
             .rename_axis(columns=None)
) 

outcome 

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0

If you want it to exactly match your output, you can dump it into numpy, sort and return to pandas:

pd.DataFrame(np.sort(outcome, axis = 1), columns = outcome.columns)

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  110.0  117.0  118.0

Another option is to do the sorting within numpy before reshaping in Pandas:

(pd.DataFrame(np.sort(df, axis = 1))
   .apply(pd.unique, axis=1)
   .apply(pd.Series)
   .dropna(how='all',axis=1)
   .set_axis(['num1', 'num2','num3'], axis=1)
) 
    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  110.0  117.0  118.0
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement