Consider the below dataframe:
JavaScript
x
20
20
1
import pandas as pd
2
from numpy import nan
3
4
data = [
5
(111, nan, nan, 111),
6
(112, 112, nan, 115),
7
(113, nan, nan, nan),
8
(nan, nan, nan, nan),
9
(118, 110, 117, nan),
10
]
11
12
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])
13
14
num0 num1 num2 num3
15
0 111.0 NaN NaN 111.0
16
1 112.0 112.0 NaN 115.0
17
2 113.0 NaN NaN NaN
18
3 NaN NaN NaN NaN
19
4 118.0 110.0 117.0 NaN
20
Assuming my index is unique, I’m looking to retrieve the unique values per index row, to an output like the one below. I wish to keep the empty rows.
JavaScript
1
7
1
num1 num2 num3
2
0 111.0 NaN NaN
3
1 112.0 115.0 NaN
4
2 113.0 NaN NaN
5
3 NaN NaN NaN
6
4 110.0 117.0 118.0
7
I have a working, albeit slow, solution, see below. The output number order is not relevant, as long all values are presented to the leftmost column and nulls to the right. I’m looking for best practices and potential ideas to speed up the code. Thank you in advance.
JavaScript
1
9
1
def arrange_row(row):
2
values = list(set(row.dropna(axis=1).values[0]))
3
values = [nan] if not values else values
4
series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values)+1)])
5
return series
6
7
df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
8
pd.version == '1.2.3'
9
Advertisement
Answer
Another option, albeit longer:
JavaScript
1
22
22
1
outcome = (df.melt(ignore_index= False) # keep the index as a tracker
2
.reset_index()
3
# get the unique rows
4
.drop_duplicates(subset=['index','value'])
5
.dropna()
6
# use this to build the new column names
7
.assign(counter = lambda df: df.groupby('index').cumcount() + 1)
8
.pivot('index', 'counter', 'value')
9
.add_prefix('num')
10
.reindex(df.index)
11
.rename_axis(columns=None)
12
)
13
14
outcome
15
16
num1 num2 num3
17
0 111.0 NaN NaN
18
1 112.0 115.0 NaN
19
2 113.0 NaN NaN
20
3 NaN NaN NaN
21
4 118.0 110.0 117.0
22
If you want it to exactly match your output, you can dump it into numpy, sort and return to pandas:
JavaScript
1
9
1
pd.DataFrame(np.sort(outcome, axis = 1), columns = outcome.columns)
2
3
num1 num2 num3
4
0 111.0 NaN NaN
5
1 112.0 115.0 NaN
6
2 113.0 NaN NaN
7
3 NaN NaN NaN
8
4 110.0 117.0 118.0
9
Another option is to do the sorting within numpy before reshaping in Pandas:
JavaScript
1
13
13
1
(pd.DataFrame(np.sort(df, axis = 1))
2
.apply(pd.unique, axis=1)
3
.apply(pd.Series)
4
.dropna(how='all',axis=1)
5
.set_axis(['num1', 'num2','num3'], axis=1)
6
)
7
num1 num2 num3
8
0 111.0 NaN NaN
9
1 112.0 115.0 NaN
10
2 113.0 NaN NaN
11
3 NaN NaN NaN
12
4 110.0 117.0 118.0
13