Skip to content
Advertisement

How does rsuffix and lsuffix work while joining multiple dataframes?

I have written the following code however I am unable to understand how to name the rsuffix and lsuffix parameters

dfs_list = []
for cycle in email_df.cycle_end_date.unique():
    temp = email_df[email_df.cycle_end_date == cycle].transpose()
                    .join(flash_df[flash_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')
                    .join(sms_df[sms_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')
                    .join(upi_df[upi_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuf', rsuffix='rsuf')
                    .join(ivr_df[ivr_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuff', rsuffix='rsuff')
    dfs_list.append(temp)

All my dfs have same column names

example:

cycle_end_date | triggered | delivered | cost | payment_value | delivery%%
2021-15-01  | 34 | 32 | 4 | 7899 | 5%%
2021-31-01  | 45 | 49 | 8 | 1500 | 4%%

When I am printing dfs_list[2].reset_index() I do get my expected output but I am unable to comprehend the suffix names. How do we define it?

output:

**index |   2email |    1lsuff |    2flash |    2   | 1rsuff**
0   absolute_cost   3.00    9.40    9.40    0.00    6.00
1   bill_paid_percent   3.28    0.33    1.87    68139.72    0.28
2   bill_paid_using_reminder    21.20   0.70    9.45    1.78    0.64
3   bounced_email   5018    NaN NaN NaN NaN
4   clicked_email   13385   NaN NaN NaN NaN
5   cycle_end_date  2022-02-28  2022-02-28  2022-02-28  2022-02-28  2022-02-28

Can someone throw light on how to name the suffix to get the exact numbers of the dataframe considered?

Advertisement

Answer

lsuffix and rsuffix only work when there are overlapping columns between joined dataframes.

Let’s look at the continous join in your script

temp = email_df[email_df.cycle_end_date == cycle].transpose()
         .join(flash_df[flash_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')
         .join(sms_df[sms_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')
         .join(upi_df[upi_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuf', rsuffix='rsuf')
         .join(ivr_df[ivr_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuff', rsuffix='rsuff')

Since you do a transpose to all your joined dataframe, so actually you are using dataframe where column names are original index.

                         1
cycle_end_date  2021-31-01
triggered               45
delivered               49
cost                     8
payment_value         1500
delivery%%               4%%

After you join email_df with flash_df, the 1 might overlap, so the joined df is

                    1email       1flash
cycle_end_date  2021-15-01  2021-15-01
triggered               34          34
delivered               32          32
cost                     4           4
payment_value         7899        7899
delivery%%               5%%          5%%

In the next join with sms_df whose column name is index doesn’t overlap with above joined df, so the output might like

                    0email      0flash           1
cycle_end_date  2021-15-01  2021-15-01  2021-15-01
triggered               34          34          34
delivered               32          32          32
cost                     4           4           4
payment_value         7899        7899        7899
delivery%%               5%%          5%%          5%%

And this process goes on…

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement