Python newbie attempting complex pandas dataframe logic
I have multiple dataframes I need to join, but I'll show two below for the example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping each duplicate column, I need to move its values into an extra row, as seen below.
My first thought is to get a list of unique column names, then create an empty dataframe with those columns, then have a for loop checking whether each column exists and appending it (and its `column_duplicate` counterpart) as separate rows, but I'm unsure how to create this dataframe.
List_of_columns = ["a", "b", "c", "d", "id"]
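The empty frame described above can be created directly from the list of unique column names. A minimal sketch (the variable name `combined` is my own):

```python
import pandas as pd

# An empty frame with the unique column names, ready to have
# rows added to it with pd.concat later on.
list_of_columns = ["a", "b", "c", "d", "id"]
combined = pd.DataFrame(columns=list_of_columns)
```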
Dataframe1:X
a | b | a_duplicate | b_duplicate | c | id |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | id1 |
Dataframe2:Y
a | c | a_duplicate | c_duplicate | d | id |
---|---|---|---|---|---|
6 | 7 | 8 | 9 | 10 | id2 |
Created dataframe:
a | b | c | d | id |
---|---|---|---|---|
1 | 2 | 5 | Null | id1 |
3 | 4 | 5 | Null | id1 |
6 | Null | 7 | 10 | id2 |
8 | Null | 9 | 10 | id2 |
Answer
This is a very silly way of doing it and I am hoping someone comes up with a better way… but it does work:
```python
import pandas as pd

##################### Recreate OP's dataframes ###########################
data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
##########################################################################

# Stack the duplicate columns under the originals, renaming them as we go
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(
        columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(
        columns={"a_duplicate": "a", "c_duplicate": "c"}),
])

# Concatenate the resulting dataframes, reset the index,
# then put the columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3
```
Output:
```
   a    b  c     d   id
0  1  2.0  5   NaN  id1
1  3  4.0  5   NaN  id1
2  6  NaN  7  10.0  id2
3  8  NaN  9  10.0  id2
```
~~ For OP’s Comment ~~
This is pretty hacky, but it should be able to go through all of your 50 dataframes, correct them, and combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this code places all of them in a `dataframeList` and then cycles through those dataframes). I don't know how long it will take, as I don't know how big your data is, but it's worth a shot.
```python
import pandas as pd

data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
data3 = {"a": 3, "b": 2, "c": 7, "a_duplicate": 15, "b_duplicate": 20, "c_duplicate": 9, "d": 10, "id": "id3"}
data4 = {"a": 4, "d": 10, "c": 5, "a_duplicate": 7, "d_duplicate": 15, "c_duplicate": 9, "id": "id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])

dataframeList = [df1, df2, df3, df4]
finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])

for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]  # Column names that are not duplicates
    isDup = list(set(df.columns) - set(notDup))                # Duplicate column names
    # Columns for the duplicated rows: the *_duplicate columns plus any
    # non-duplicated columns that have no duplicate counterpart (e.g. "id")
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})
    dupDF = df[dupColumns]
    for dup in isDup:               # Rename every duplicated column back to
        letter = dup.split("_")[0]  # its base name without "_duplicate"
        dupDF = dupDF.rename(columns={dup: letter})
    # Stack the non-duplicated columns on top of the duplicated ones
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df = pd.concat([df[notDup], dupDF])
    finalDF = pd.concat([finalDF, df], ignore_index=True)  # Collect into one master dataframe
```
Output:
```
    a    b  c    d   id
0   1    2  5  NaN  id1
1   3    4  5  NaN  id1
2   6  NaN  7   10  id2
3   8  NaN  9   10  id2
4   3    2  7   10  id3
5  15   20  9   10  id3
6   4  NaN  5   10  id4
7   7  NaN  9   15  id4
```
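The same idea can be packed into a reusable helper. This is only a sketch of an alternative, not the accepted code above: the function name `expand_duplicates` is my own, and it assumes every duplicate column name ends in exactly `"_duplicate"` (it also needs Python 3.9+ for `str.removesuffix`):

```python
import pandas as pd

def expand_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Turn each set of *_duplicate columns into an extra row."""
    base_cols = [c for c in df.columns if not c.endswith("_duplicate")]
    dup_cols = [c for c in df.columns if c.endswith("_duplicate")]
    if not dup_cols:
        return df
    # Columns like "id" that have no duplicate carry over to the new row
    dup_bases = [c.removesuffix("_duplicate") for c in dup_cols]
    shared = [c for c in base_cols if c not in dup_bases]
    dup = df[dup_cols + shared].rename(
        columns={c: c.removesuffix("_duplicate") for c in dup_cols}
    )
    return pd.concat([df[base_cols], dup], ignore_index=True)

# Usage on the two example frames from the question
data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
frames = [pd.DataFrame(d, index=[0]) for d in (data1, data2)]
out = pd.concat([expand_duplicates(f) for f in frames], ignore_index=True)
out = out.reindex(columns=["a", "b", "c", "d", "id"])
```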