Python newbie attempting complex pandas dataframe logic
I have multiple dataframes I need to join, but I'll show two below for the example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping the duplicate columns, I need their values moved into an extra copy of the row, as seen below.
My first thought is to get a list of unique column names and create an empty dataframe with those columns, then have a for loop that checks whether each column exists and, if so, appends its value, and does the same for the column_duplicate version. But I'm unsure how to create this dataframe (a rough sketch of how far I get is just below the column list).
 List_of_columns = ["a", "b", "c", "d", "id"]
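This is roughly as far as I get (the variable names here are just placeholders for illustration):

import pandas as pd

# Unique target columns for the combined dataframe
List_of_columns = ["a", "b", "c", "d", "id"]

# Empty dataframe with those columns, which I would then try to fill row by row
combined = pd.DataFrame(columns=List_of_columns)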
Dataframe 1 (X):
| a | b | a_duplicate | b_duplicate | c | id | 
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | id1 | 
Dataframe 2 (Y):
| a | c | a_duplicate | c_duplicate | d | id | 
|---|---|---|---|---|---|
| 6 | 7 | 8 | 9 | 10 | id2 | 
Created dataframe:
| a | b | c | d | id | 
|---|---|---|---|---|
| 1 | 2 | 5 | Null | id1 | 
| 3 | 4 | 5 | Null | id1 | 
| 6 | Null | 7 | 10 | id2 | 
| 8 | Null | 9 | 10 | id2 | 
Answer
This is a very silly way of doing it and I am hoping someone comes up with a better way… but it does work:
import pandas as pd

##################### Recreate OP's dataframe ###########################
data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################
# Stack the renamed "_duplicate" columns underneath the original columns
# (pd.concat is used because DataFrame.append was removed in pandas 2.x)
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(columns={"a_duplicate": "a", "c_duplicate": "c"}),
])
# Concatenate the resulting dataframes, reset the index, then put the columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3
Output:
   a    b  c     d   id
0  1  2.0  5   NaN  id1
1  3  4.0  5   NaN  id1
2  6  NaN  7  10.0  id2
3  8  NaN  9  10.0  id2
~~ For OP’s Comment ~~
This is pretty hacky, but it should be able to go through all of your 50 dataframes, correct them, and then combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this code places them in a dataframeList and then cycles through those dataframes; there is a rough sketch of one way to build that list after the output below). I don't know how long it will take, as I don't know how big your data is, but it's worth a shot.
import pandas as pd

data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
data3 = {"a": 3, "b": 2, "c": 7, "a_duplicate": 15, "b_duplicate": 20, "c_duplicate": 9, "d": 10, "id": "id3"}
data4 = {"a": 4, "c": 5, "a_duplicate": 7, "d_duplicate": 15, "c_duplicate": 9, "d": 10, "id": "id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])
dataframeList = [df1, df2, df3, df4]

finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])
for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]                   # Column names that are not "_duplicate" columns
    isDup = list(set(df.columns) - set(notDup))                                 # The "_duplicate" column names
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})   # Columns that make up the extra (duplicate) row
    dupDF = df[dupColumns]                                                      # Dataframe holding only those columns
    for dup in isDup:                                                           # Rename every "_duplicate" column back to its base name
        base = dup.split("_")[0]
        dupDF = dupDF.rename(columns={dup: base})
    df = pd.concat([df[notDup], dupDF])                                         # Stack the original row and the duplicate row
    finalDF = pd.concat([finalDF, df], ignore_index=True)                       # Concatenate everything into one master dataframe
finalDF
Output:
    a    b  c    d   id
0   1    2  5  NaN  id1
1   3    4  5  NaN  id1
2   6  NaN  7   10  id2
3   8  NaN  9   10  id2
4   3    2  7   10  id3
5  15   20  9   10  id3
6   4  NaN  5   10  id4
7   7  NaN  9   15  id4
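In case it helps, here is one hypothetical way to build dataframeList, assuming your 50 dataframes sit in a folder as CSV files (the folder path and file pattern below are placeholders, not something from your setup):

import glob
import pandas as pd

# Read every CSV in the (hypothetical) folder into a list of dataframes,
# then run the loop above over this list instead of the four example frames
dataframeList = [pd.read_csv(path) for path in sorted(glob.glob("dataframes/*.csv"))]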
