Resolving conflicts in Pandas dataframe

Question

I am performing record linkage on a dataframe with the columns ID_1, ID_2, Predicted Link and Probability. When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link), I want to resolve the conflicts based on the Probability value. If one predicted link has a higher probability than the other, I want to keep a 1 for that,

Accepted Answer

For each ID_1, you want to keep one and only one row, so grouping is a good start.

First, let's construct our data:

    import pandas as pd
    from io import StringIO

    # Tab-separated sample data (the \t escapes become real tabs)
    csvfile = StringIO("""ID_1\tID_2\tPredicted Link\tProbability
    1\t0\t1\t0.9
    1\t1\t1\t0.5
    1\t2\t0\t0
    2\t1\t1\t0.8
    2\t5\t1\t0.8
    3\t1\t0\t0
    3\t2\t1\t0.5""")

    df = pd.read_csv(csvfile, sep='\t', engine='python')

We want a group for each value of ID_1, and then we look for the row holding the max value of Probability within that group. Let's create a mask:

    mask_max_proba = df.groupby("ID_1")["Probability"].transform(lambda x: x.eq(x.max()))
    mask_max_proba
    Out[196]:
    0     True
    1    False
    2    False
    3     True
    4     True
    5    False
    6     True
    Name: Probability, dtype: bool

Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (only one max for that ID_1 value), but rows 3 and 4 are not. Let's build a mask that considers both conditions: True if the row holds the max value and that max occurs only once.

To be more accurate, for each ID_1, a Probability value that is duplicated can't be a candidate for the max. We will then build a mask that excludes duplicated Probability values for each ID_1 value:

    mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x: len(x) == 1)
    mask_unique
    Out[284]:
    0     True
    1     True
    2     True
    3    False
    4    False
    5     True
    6     True
    Name: Probability, dtype: bool

Finally, let's combine our two masks:

    # 1 only where the row is the max of its group and that max is not tied
    df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique)
    df
    Out[285]:
       ID_1  ID_2  Predicted Link  Probability
    0     1     0               1          0.9
    1     1     1               0          0.5
    2     1     2               0          0.0
    3     2     1               0          0.8
    4     2     5               0          0.8
    5     3     1               0          0.0
    6     3     2               1          0.5
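The same two conditions can probably be written without lambdas, which may be faster on large frames. The following is only a sketch of that idea, not part of the original answer: the helper name resolve_links and its parameters are made up for illustration, and it swaps the lambda-based masks for the built-in "max" aggregation and DataFrame.duplicated.

```python
import pandas as pd


def resolve_links(df, key="ID_1", prob="Probability", link="Predicted Link"):
    """Hypothetical helper: keep a 1 only on the unique max-probability row per key."""
    out = df.copy()
    # True where the row's probability equals the maximum of its group.
    is_max = out[prob].eq(out.groupby(key)[prob].transform("max"))
    # True where the same (key, probability) pair appears on more than one row.
    is_tied = out.duplicated(subset=[key, prob], keep=False)
    out[link] = (is_max & ~is_tied).astype(int)
    return out


# Same sample data as in the answer, built directly as a DataFrame.
df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

resolved = resolve_links(df)
print(resolved["Predicted Link"].tolist())  # [1, 0, 0, 0, 0, 0, 1]
```

Like the masks in the answer, this ignores the incoming Predicted Link values and derives the result from Probability alone, so a group whose links were all 0 would still get a 1 on its unique maximum. If that is not wanted, the result could additionally be combined with the original column.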

Answer