Consider the below dataframe:
import pandas as pd from numpy import nan data = [ (111, nan, nan, 111), (112, 112, nan, 115), (113, nan, nan, nan), (nan, nan, nan, nan), (118, 110, 117, nan), ] df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))]) num0 num1 num2 num3 0 111.0 NaN NaN 111.0 1 112.0 112.0 NaN 115.0 2 113.0 NaN NaN NaN 3 NaN NaN NaN NaN 4 118.0 110.0 117.0 NaN
Assuming my index is unique, I’m looking to retrieve the unique values per index row, to an output like the one below. I wish to keep the empty rows.
num1 num2 num3 0 111.0 NaN NaN 1 112.0 115.0 NaN 2 113.0 NaN NaN 3 NaN NaN NaN 4 110.0 117.0 118.0
I have a working, albeit slow, solution, see below. The output number order is not relevant, as long all values are presented to the leftmost column and nulls to the right. I’m looking for best practices and potential ideas to speed up the code. Thank you in advance.
def arrange_row(row): values = list(set(row.dropna(axis=1).values[0])) values = [nan] if not values else values series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values)+1)]) return series df.groupby(level=-1).apply(arrange_row).unstack(level=-1) pd.version == '1.2.3'
Advertisement
Answer
Another option, albeit longer:
outcome = (df.melt(ignore_index= False) # keep the index as a tracker .reset_index() # get the unique rows .drop_duplicates(subset=['index','value']) .dropna() # use this to build the new column names .assign(counter = lambda df: df.groupby('index').cumcount() + 1) .pivot('index', 'counter', 'value') .add_prefix('num') .reindex(df.index) .rename_axis(columns=None) ) outcome num1 num2 num3 0 111.0 NaN NaN 1 112.0 115.0 NaN 2 113.0 NaN NaN 3 NaN NaN NaN 4 118.0 110.0 117.0
If you want it to exactly match your output, you can dump it into numpy, sort and return to pandas:
pd.DataFrame(np.sort(outcome, axis = 1), columns = outcome.columns) num1 num2 num3 0 111.0 NaN NaN 1 112.0 115.0 NaN 2 113.0 NaN NaN 3 NaN NaN NaN 4 110.0 117.0 118.0
Another option is to do the sorting within numpy before reshaping in Pandas:
(pd.DataFrame(np.sort(df, axis = 1)) .apply(pd.unique, axis=1) .apply(pd.Series) .dropna(how='all',axis=1) .set_axis(['num1', 'num2','num3'], axis=1) ) num1 num2 num3 0 111.0 NaN NaN 1 112.0 115.0 NaN 2 113.0 NaN NaN 3 NaN NaN NaN 4 110.0 117.0 118.0