Map 2 df but column to value instead of value to value for each ID

I have a table with the top 3 reasons for each ID (Table 1) and another table with the category bin each variable falls into (Table 2). I am trying to map the category bins into the reason table, as shown in Table 3.

Table 1: top 3 reasons, i.e. which 3 variables were the most important for each register number

      RegNo    Reason1  Reason2 Reason3
        1111         v3       v2     v6
        2222         v2       v3     v5
        3333         v3       v2     v6
        4444         v3       v6     v2
        5555         v3       v2     v5


Table 2: The category bin for each variable 

    RegNo           v2              v3        v4          v5                 v6
    1111    (0.44, 0.64]    (0.0, 60.0]     missing    (-inf, 3.0]   (102.0, 175.0]
    2222    (-inf, 0.33]    (0.0, 60.0]     welldone   missing       missing
    3333    (0.44, 0.64]    (0.0, 60.0]     rare       missing       missing
    4444    (0.44, 0.64]    (0.0, 60.0]     missing    missing       (20.0, 102.0]
    5555     (0.64, inf]    (0.0, 60.0]     missing    missing       (-inf, 20.0]


Table 3: desired output. For each RegNo, look up the category bin label of each reason variable and replace Reason1, Reason2 and Reason3 with those labels

       RegNo            Reason1         Reason2         Reason3
        1111         (0.0, 60.0]    (0.44, 0.64]     (102.0, 175.0]
        2222         (-inf, 0.33]    (0.0, 60.0]       missing
        3333         (0.0, 60.0]    (0.44, 0.64]       missing
        4444         (0.0, 60.0]    (20.0, 102.0]    (0.44, 0.64]
        5555         (0.0, 60.0]     (0.64, inf]        missing

Answer

Approach

  • index the two data frames in a way that works with join()
  • then it's a pd.concat() of the join output for each of the reason columns

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""      RegNo    Reason1  Reason2 Reason3
        1111         v3       v2     v6
        2222         v2       v3     v5
        3333         v3       v2     v6
        4444         v3       v6     v2
        5555         v3       v2     v5
"""), sep="s+")

df2 = pd.read_csv(io.StringIO("""    RegNo           v2              v3        v4          v5                 v6
    1111    (0.44, 0.64]    (0.0, 60.0]     missing    (-inf, 3.0]   (102.0, 175.0]
    2222    (-inf, 0.33]    (0.0, 60.0]     welldone   missing       missing
    3333    (0.44, 0.64]    (0.0, 60.0]     rare       missing       missing
    4444    (0.44, 0.64]    (0.0, 60.0]     missing    missing       (20.0, 102.0]
    5555     (0.64, inf]    (0.0, 60.0]     missing    missing       (-inf, 20.0]
"""),sep="ss+", engine="python")

# index the main df by RegNo
df1 = df1.set_index("RegNo")
# reshape the bins into a long Series indexed by (RegNo, variable name)
dfm = df2.set_index("RegNo").stack()
df3 = pd.concat([
    # move the appropriate reason column into the index, drop the other columns
    (df1.set_index(c, append=True).loc[:, []]
     # now it's a straightforward join on (RegNo, variable name)
     .join(dfm.rename_axis(["RegNo", c]).to_frame())
     # clean up the index and rename the joined column
     .droplevel(1).rename(columns={0: c}))
    for c in ["Reason1", "Reason2", "Reason3"]], axis=1)
    RegNo        Reason1        Reason2         Reason3
     1111    (0.0, 60.0]   (0.44, 0.64]  (102.0, 175.0]
     2222   (-inf, 0.33]    (0.0, 60.0]         missing
     3333    (0.0, 60.0]   (0.44, 0.64]         missing
     4444    (0.0, 60.0]  (20.0, 102.0]    (0.44, 0.64]
     5555    (0.0, 60.0]    (0.64, inf]         missing
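
For comparison, here is a merge-based sketch of the same transformation (not part of the approach above; only a sketch assuming the df1 and df2 loaded earlier, with df1 already indexed by RegNo, and the names long_reasons, long_bins and alt are illustrative): melt the reason columns to long form, attach the bin labels by merging on RegNo plus the variable name, then pivot back to wide.

# melt the reasons to long form: one row per (RegNo, reason slot, variable name)
long_reasons = (df1.reset_index()
                   .melt(id_vars="RegNo", var_name="Reason", value_name="var"))
# stack the bin table to long form: one row per (RegNo, variable name, bin label)
long_bins = (df2.set_index("RegNo")
                .stack()
                .rename_axis(["RegNo", "var"])
                .rename("label")
                .reset_index())
# merge on RegNo + variable name, then pivot the reason slots back into columns
alt = (long_reasons.merge(long_bins, on=["RegNo", "var"], how="left")
                   .pivot(index="RegNo", columns="Reason", values="label"))

Both routes produce the same table; the join-based version above avoids the final pivot by building each reason column separately and concatenating them.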