I have a table with top 3 reasons (Table 1) and another table with the category it belongs to for each variable (Table 2). I am trying to match the category bins into the reason table like in table 3.
Table 1: top 3 reasons, basically which 3 variable was the most important for that Register number RegNo Reason1 Reason2 Reason3 1111 v3 v2 v6 2222 v2 v3 v5 3333 v3 v2 v6 4444 v3 v6 v2 5555 v3 v2 v5 Table 2: The category bin for each variable RegNo v2 v3 v4 v5 v6 1111 (0.44, 0.64] (0.0, 60.0] missing (-inf, 3.0] (102.0, 175.0] 2222 (-inf, 0.33] (0.0, 60.0] welldone missing missing 3333 (0.44, 0.64] (0.0, 60.0] rare missing missing 4444 (0.44, 0.64] (0.0, 60.0] missing missing (20.0, 102.0] 5555 (0.64, inf] (0.0, 60.0] missing missing (-inf, 20.0] Table 3: For each ID, find category bin label and replace reason1,2 & 3 with the labels RegNo Reason1 Reason2 Reason3 1111 (0.0, 60.0] (0.44, 0.64] (102.0, 175.0] 2222 (-inf, 0.33] (0.0, 60.0] missing 3333 (0.0, 60.0] (0.44, 0.64] missing 4444 (0.0, 60.0] (20.0, 102.0] (0.44, 0.64] 5555 (0.0, 60.0] (0.64, inf] missing
Advertisement
Answer
Approach
- index two data frames in way that works with
join()
- then it’s a
pd.concat()
of each of the reasons from output of join
df1 = pd.read_csv(io.StringIO(""" RegNo Reason1 Reason2 Reason3 1111 v3 v2 v6 2222 v2 v3 v5 3333 v3 v2 v6 4444 v3 v6 v2 5555 v3 v2 v5 """), sep="s+") df2 = pd.read_csv(io.StringIO(""" RegNo v2 v3 v4 v5 v6 1111 (0.44, 0.64] (0.0, 60.0] missing (-inf, 3.0] (102.0, 175.0] 2222 (-inf, 0.33] (0.0, 60.0] welldone missing missing 3333 (0.44, 0.64] (0.0, 60.0] rare missing missing 4444 (0.44, 0.64] (0.0, 60.0] missing missing (20.0, 102.0] 5555 (0.64, inf] (0.0, 60.0] missing missing (-inf, 20.0] """),sep="ss+", engine="python") # index main df df1 = df1.set_index("RegNo") # reshape and index reasons dfm = df2.set_index("RegNo").stack() df3 = pd.concat([ # add apprpriate column into index, don't want othert columns (df1.set_index(c, append="True").loc[:,[]] # now it's a straight forward join .join(dfm.rename_axis(["RegNo",c]).to_frame()) # cleanup index and rename columns of joined DF .droplevel(1).rename(columns={0:c})) for c in ["Reason1","Reason2","Reason3"]], axis=1)
RegNo | Reason1 | Reason2 | Reason3 |
---|---|---|---|
1111 | (0.0, 60.0] | (0.44, 0.64] | (102.0, 175.0] |
2222 | (-inf, 0.33] | (0.0, 60.0] | missing |
3333 | (0.0, 60.0] | (0.44, 0.64] | missing |
4444 | (0.0, 60.0] | (20.0, 102.0] | (0.44, 0.64] |
5555 | (0.0, 60.0] | (0.64, inf] | missing |