Skip to content
Advertisement

How to merge multiple columns of a dataframe using regex?

I have a df which as following

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

Here what I want to do is combine those columns and we have two rules:

  1. If a column removes _C{0~9} or _C{0~9}{0~9} or _C{0~9}{0~9}{0~9} is equal to another column, these two columns can be combined.

Let’s take number_C1_E1 number_C2_E2 number_C3_E1 as an example, here number_C1_E1 and number_C3_E1 can be combined because they are both number_E1 after removing _C{0~9}.

  1. The two combined columns should get rid of the None values.

The desired result is

  number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0              1          apple            tom           None           None           None
1              2         banana          jerry           None           None           None
2              3      blueberry        anthony              3      blueberry        anthony
3              4     strawberry          terry           None           None           None
4              5     watermelon           paul           None           None           None
5              6          peach         edward           None           None           None
6              7         orange         reggie           None           None           None
7              8          lemon       nicholas           None           None           None

Anyone has a good solution?

Advertisement

Answer

Use the same as your previous question, but also compute a renamer for your columns:

group = df.columns.str.replace(r'_Cd+', '', regex=True)

names = df.columns.to_series().groupby(group).first()

out = (df.groupby(group, axis=1, sort=False).first()
         .rename(columns=names)
       )

Alternative:

group = df.columns.str.replace(r'_Cd+', '', regex=True)

out = (df.groupby(group, axis=1, sort=False).first()
         .set_axis(df.columns[~group.duplicated()], axis=1)
       )

Output:

  number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0            1        apple          tom         None         None         None
1            2       banana        jerry         None         None         None
2            3    blueberry      anthony            3    blueberry      anthony
3            4   strawberry        terry         None         None         None
4            5   watermelon         paul         None         None         None
5            6        peach       edward         None         None         None
6            7       orange       reggie         None         None         None
7            8        lemon     nicholas         None         None         None
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement