Skip to content
Advertisement

Create pandas dataframe on column name conditions

Python newbie attempting a complex pandas dataframe logic

I have multiple dataframes I need to join but I’ll show two below for the example. The dataframe have duplicate columns labelled with suffix ‘_duplicate’. I need to replicate the row instead of having the duplicate column as seen below.

My first thought is to get a list of unique column names then create an empty dataframe with those columns. Then have a for loop checking if column exists if so append, if the column_duplicate also append etc but unsure how to create this dataframe.

List_of_columns = ["a", "b", "c", "d", "id"]

Dataframe1:X

a b a_duplicate b_duplicate c id
1 2 3 4 5 id1

Dataframe2:Y

a c a_duplicate c_duplicate d id
6 7 8 9 10 id2

Created dataframe:

a b c d id
1 2 5 Null id1
3 4 5 Null id1
6 Null 7 10 id2
8 Null 7 10 id2

Advertisement

Answer

This is a very silly way of doing it and I am hoping someone comes up with a better way… but it does work:

##################### Recreate OP's dataframe ###########################
data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################

# Append columns together while renaming the duplicate columns
df1 = df1[["a", "b", "c", "id"]].append(df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(columns={"a_duplicate": "a", "b_duplicate": "b"}))
df2 = df2[["a", "c", "d", "id"]].append(df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(columns={"a_duplicate": "a", "c_duplicate": "c"}))

# Concatenate the resulting datafraames, reset the index, then put it in the correct column order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3

Output:

    a   b   c   d       id
0   1   2.0 5   NaN     id1
1   3   4.0 5   NaN     id1
2   6   NaN 7   10.0    id2
3   8   NaN 9   10.0    id2

~~ For OP’s Comment ~~

This is pretty hacky but should be able to go through all of your 50 dataframes and correct them then combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this codes places all of them in a dataframeList then cycles through those dataframes). I don’t know how long it will take as I don’t know how big your data is but… it’ worth a shot.

data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
data3 = {"a":3, "b":2, "c":7, "a_duplicate":15,"b_duplicate":20, "c_duplicate":9,"d":10, "id":"id3"}
data4 = {"a":4, "d":3, "c":5, "a_duplicate":7,"d_duplicate":15, "c_duplicate":9,"d":10, "id":"id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])
dataframeList = [df1, df2, df3, df4]

finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])

for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]                   # Find column names that are not duplicated
    isDup  = list(set(df.columns)-set(notDup))                                  # Find duplicate column names
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})   # Create list of column names for duplicated dataframe
    dupDF = df[dupColumns]                                                      # set the duplicate dataframe to be these columns

    for dup in isDup:                                                           # Cycle through every duplicated column name and rename it
        letter = dup.split("_")[0]                                              # to just the column name without "_duplicate"
        dupDF = dupDF.rename(columns={dup:letter})

    df = df[notDup].append(dupDF)                                               # Append the not duplicated columns with the duplicated columns

    finalDF = pd.concat([finalDF, df], ignore_index = True)                     # Concatenate all of them into one master dataframe

Output:

a   b   c   d   id
0   1   2   5   NaN id1
1   3   4   5   NaN id1
2   6   NaN 7   10  id2
3   8   NaN 9   10  id2
4   3   2   7   10  id3
5   15  20  9   10  id3
6   4   NaN 5   10  id4
7   7   NaN 9   15  id4
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement