I am performing record linkage on a dataframe such as:
ID_1 ID_2 Predicted Link Probability 1 0 1 0.9 1 1 1 0.5 1 2 0 0 2 1 1 0.8 2 5 1 0.8 3 1 0 0 3 2 1 0.5
When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link) I want to resolve the conflicts based on the Probability-value. If one predicted link has a higher probability than the other I want to keep a 1 for that, but reverse the other prediction link values for that ID_1 to 0. If the (highest) probabilities are of equal value I want to reverse all the predicted link values to 0. If only one predicted link then the predicted values should be left as they are.
The resulting dataframe would look like this:
ID_1 ID_2 Predicted Link Probability 1 0 1 0.9 1 1 0 0.5 1 2 0 0 2 1 0 0.8 2 5 0 0.8 3 1 0 0 3 2 1 0.5
I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!
Advertisement
Answer
For each ID_1, you want to keep one and only one row. Thus, grouping is a good start.
First let’s construct our data :
import pandas as pd from io import StringIO csvfile = StringIO( """ID_1tID_2tPredicted LinktProbability 1t0t1t0.9 1t1t1t0.5 1t2t0t0 2t1t1t0.8 2t5t1t0.8 3t1t0t0 3t2t1t0.5""") df = pd.read_csv(csvfile, sep = 't', engine='python')
We want to a group for each value of ID_1 and then looking for the row holding the max value of Probability for that said value of ID_1. Let’s create a mask :
max_proba = df.groupby("ID_1")["Probability"].transform(lambda x : x.eq(x.max())) max_proba Out[196]: 0 True 1 False 2 False 3 True 4 True 5 False 6 True Name: Probability, dtype: bool
Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (only one max for that ID_1 value), but not the 3 and 4 rows. Let’s build a mask that consider these two conditions, True
if max value and if only one max value.
To be more accurate, for each ID_1, if a Probablity value is duplicated then it can’t be a candidate for the said max. We will then build a max that exclude duplicates Probability value for each ID_1 value
mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1) mask_unique Out[284]: 0 True 1 True 2 True 3 False 4 False 5 True 6 True Name: Probability, dtype: bool
Finally, let’s combine our two masks :
df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique) df Out[285]: ID_1 ID_2 Predicted Link Probability 0 1 0 1 0.9 1 1 1 0 0.5 2 1 2 0 0.0 3 2 1 0 0.8 4 2 5 0 0.8 5 3 1 0 0.0 6 3 2 1 0.5