Python newbie attempting complex pandas dataframe logic
I have multiple dataframes I need to join, but I'll show two below for the example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping the duplicate columns, I need their values moved into an extra copy of the row, as seen below.
My first thought is to get a list of unique column names and create an empty dataframe with those columns, then have a for loop that checks whether each column exists and, if so, appends its value, and does the same for the column_duplicate version. But I'm unsure how to create this dataframe (a rough sketch of how far I get is just below the column list).
 List_of_columns = ["a", "b", "c", "d", "id"]
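This is roughly as far as I get (the variable names here are just placeholders for illustration):

import pandas as pd

# Unique target columns for the combined dataframe
List_of_columns = ["a", "b", "c", "d", "id"]

# Empty dataframe with those columns, which I would then try to fill row by row
combined = pd.DataFrame(columns=List_of_columns)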
Dataframe 1 (X):
| a | b | a_duplicate | b_duplicate | c | id | 
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | id1 | 
Dataframe 2 (Y):
| a | c | a_duplicate | c_duplicate | d | id | 
|---|---|---|---|---|---|
| 6 | 7 | 8 | 9 | 10 | id2 | 
Created dataframe:
| a | b | c | d | id | 
|---|---|---|---|---|
| 1 | 2 | 5 | Null | id1 | 
| 3 | 4 | 5 | Null | id1 | 
| 6 | Null | 7 | 10 | id2 | 
| 8 | Null | 9 | 10 | id2 | 
Answer
This is a very silly way of doing it and I am hoping someone comes up with a better way… but it does work:
import pandas as pd

##################### Recreate OP's dataframe ###########################
data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################
# Stack the renamed "_duplicate" columns underneath the original columns
# (pd.concat is used because DataFrame.append was removed in pandas 2.x)
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(columns={"a_duplicate": "a", "c_duplicate": "c"}),
])
# Concatenate the resulting dataframes, reset the index, then put the columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3
Output:
   a    b  c     d   id
0  1  2.0  5   NaN  id1
1  3  4.0  5   NaN  id1
2  6  NaN  7  10.0  id2
3  8  NaN  9  10.0  id2
~~ For OP’s Comment ~~
This is pretty hacky, but it should be able to go through all of your 50 dataframes, correct them, and then combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this code places them in a dataframeList and then cycles through those dataframes; there is a rough sketch of one way to build that list after the output below). I don't know how long it will take, as I don't know how big your data is, but it's worth a shot.
import pandas as pd

data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
data3 = {"a": 3, "b": 2, "c": 7, "a_duplicate": 15, "b_duplicate": 20, "c_duplicate": 9, "d": 10, "id": "id3"}
data4 = {"a": 4, "c": 5, "a_duplicate": 7, "d_duplicate": 15, "c_duplicate": 9, "d": 10, "id": "id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])
dataframeList = [df1, df2, df3, df4]

finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])
for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]                   # Column names that are not "_duplicate" columns
    isDup = list(set(df.columns) - set(notDup))                                 # The "_duplicate" column names
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})   # Columns that make up the extra (duplicate) row
    dupDF = df[dupColumns]                                                      # Dataframe holding only those columns
    for dup in isDup:                                                           # Rename every "_duplicate" column back to its base name
        base = dup.split("_")[0]
        dupDF = dupDF.rename(columns={dup: base})
    df = pd.concat([df[notDup], dupDF])                                         # Stack the original row and the duplicate row
    finalDF = pd.concat([finalDF, df], ignore_index=True)                       # Concatenate everything into one master dataframe
finalDF
Output:
    a    b  c    d   id
0   1    2  5  NaN  id1
1   3    4  5  NaN  id1
2   6  NaN  7   10  id2
3   8  NaN  9   10  id2
4   3    2  7   10  id3
5  15   20  9   10  id3
6   4  NaN  5   10  id4
7   7  NaN  9   15  id4
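In case it helps, here is one hypothetical way to build dataframeList, assuming your 50 dataframes sit in a folder as CSV files (the folder path and file pattern below are placeholders, not something from your setup):

import glob
import pandas as pd

# Read every CSV in the (hypothetical) folder into a list of dataframes,
# then run the loop above over this list instead of the four example frames
dataframeList = [pd.read_csv(path) for path in sorted(glob.glob("dataframes/*.csv"))]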
