I have a df
which as following
JavaScript
x
15
15
1
import pandas as pd
2
3
df = pd.DataFrame(
4
{'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
5
'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
6
'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
7
'number_C2_E2': [None, None, '3', None, None, None, None, None],
8
'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
9
'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
10
'number_C3_E1': [None, None, '3', '4', None, None, None, None],
11
'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
12
'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
13
}
14
)
15
Here what I want to do is combine those columns and we have two rules:
- If a column removes
_C{0~9}
or_C{0~9}{0~9}
or_C{0~9}{0~9}{0~9}
is equal to another column, these two columns can be combined.
Let’s take
number_C1_E1
number_C2_E2
number_C3_E1
as an example, herenumber_C1_E1
andnumber_C3_E1
can be combined because they are bothnumber_E1
afterremoving _C{0~9}
.
- The two combined columns should get rid of the
None
values.
The desired result is
JavaScript
1
10
10
1
number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
2
0 1 apple tom None None None
3
1 2 banana jerry None None None
4
2 3 blueberry anthony 3 blueberry anthony
5
3 4 strawberry terry None None None
6
4 5 watermelon paul None None None
7
5 6 peach edward None None None
8
6 7 orange reggie None None None
9
7 8 lemon nicholas None None None
10
Anyone has a good solution?
Advertisement
Answer
Use the same as your previous question, but also compute a renamer for your columns:
JavaScript
1
8
1
group = df.columns.str.replace(r'_Cd+', '', regex=True)
2
3
names = df.columns.to_series().groupby(group).first()
4
5
out = (df.groupby(group, axis=1, sort=False).first()
6
.rename(columns=names)
7
)
8
Alternative:
JavaScript
1
6
1
group = df.columns.str.replace(r'_Cd+', '', regex=True)
2
3
out = (df.groupby(group, axis=1, sort=False).first()
4
.set_axis(df.columns[~group.duplicated()], axis=1)
5
)
6
Output:
JavaScript
1
10
10
1
number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
2
0 1 apple tom None None None
3
1 2 banana jerry None None None
4
2 3 blueberry anthony 3 blueberry anthony
5
3 4 strawberry terry None None None
6
4 5 watermelon paul None None None
7
5 6 peach edward None None None
8
6 7 orange reggie None None None
9
7 8 lemon nicholas None None None
10