I have a table with top 3 reasons (Table 1) and another table with the category it belongs to for each variable (Table 2). I am trying to match the category bins into the reason table like in table 3.
JavaScript
x
29
29
1
Table 1: top 3 reasons, basically which 3 variable was the most important for that Register number
2
3
RegNo Reason1 Reason2 Reason3
4
1111 v3 v2 v6
5
2222 v2 v3 v5
6
3333 v3 v2 v6
7
4444 v3 v6 v2
8
5555 v3 v2 v5
9
10
11
Table 2: The category bin for each variable
12
13
RegNo v2 v3 v4 v5 v6
14
1111 (0.44, 0.64] (0.0, 60.0] missing (-inf, 3.0] (102.0, 175.0]
15
2222 (-inf, 0.33] (0.0, 60.0] welldone missing missing
16
3333 (0.44, 0.64] (0.0, 60.0] rare missing missing
17
4444 (0.44, 0.64] (0.0, 60.0] missing missing (20.0, 102.0]
18
5555 (0.64, inf] (0.0, 60.0] missing missing (-inf, 20.0]
19
20
21
Table 3: For each ID, find category bin label and replace reason1,2 & 3 with the labels
22
23
RegNo Reason1 Reason2 Reason3
24
1111 (0.0, 60.0] (0.44, 0.64] (102.0, 175.0]
25
2222 (-inf, 0.33] (0.0, 60.0] missing
26
3333 (0.0, 60.0] (0.44, 0.64] missing
27
4444 (0.0, 60.0] (20.0, 102.0] (0.44, 0.64]
28
5555 (0.0, 60.0] (0.64, inf] missing
29
Advertisement
Answer
Approach
- index two data frames in way that works with
join()
- then it’s a
pd.concat()
of each of the reasons from output of join
JavaScript
1
29
29
1
df1 = pd.read_csv(io.StringIO(""" RegNo Reason1 Reason2 Reason3
2
1111 v3 v2 v6
3
2222 v2 v3 v5
4
3333 v3 v2 v6
5
4444 v3 v6 v2
6
5555 v3 v2 v5
7
"""), sep="s+")
8
9
df2 = pd.read_csv(io.StringIO(""" RegNo v2 v3 v4 v5 v6
10
1111 (0.44, 0.64] (0.0, 60.0] missing (-inf, 3.0] (102.0, 175.0]
11
2222 (-inf, 0.33] (0.0, 60.0] welldone missing missing
12
3333 (0.44, 0.64] (0.0, 60.0] rare missing missing
13
4444 (0.44, 0.64] (0.0, 60.0] missing missing (20.0, 102.0]
14
5555 (0.64, inf] (0.0, 60.0] missing missing (-inf, 20.0]
15
"""),sep="ss+", engine="python")
16
17
# index main df
18
df1 = df1.set_index("RegNo")
19
# reshape and index reasons
20
dfm = df2.set_index("RegNo").stack()
21
df3 = pd.concat([
22
# add apprpriate column into index, don't want othert columns
23
(df1.set_index(c, append="True").loc[:,[]]
24
# now it's a straight forward join
25
.join(dfm.rename_axis(["RegNo",c]).to_frame())
26
# cleanup index and rename columns of joined DF
27
.droplevel(1).rename(columns={0:c}))
28
for c in ["Reason1","Reason2","Reason3"]], axis=1)
29
RegNo | Reason1 | Reason2 | Reason3 |
---|---|---|---|
1111 | (0.0, 60.0] | (0.44, 0.64] | (102.0, 175.0] |
2222 | (-inf, 0.33] | (0.0, 60.0] | missing |
3333 | (0.0, 60.0] | (0.44, 0.64] | missing |
4444 | (0.0, 60.0] | (20.0, 102.0] | (0.44, 0.64] |
5555 | (0.0, 60.0] | (0.64, inf] | missing |