Skip to content
Advertisement

Resolving conflicts in Pandas dataframe

I am performing record linkage on a dataframe such as:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 1             0.5
   1        2                 0               0
   2        1                 1             0.8
   2        5                 1             0.8
   3        1                 0               0
   3        2                 1             0.5

When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link) I want to resolve the conflicts based on the Probability-value. If one predicted link has a higher probability than the other I want to keep a 1 for that, but reverse the other prediction link values for that ID_1 to 0. If the (highest) probabilities are of equal value I want to reverse all the predicted link values to 0. If only one predicted link then the predicted values should be left as they are.

The resulting dataframe would look like this:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 0             0.5
   1        2                 0               0
   2        1                 0             0.8
   2        5                 0             0.8
   3        1                 0               0
   3        2                 1             0.5

I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!

Advertisement

Answer

For each ID_1, you want to keep one and only one row. Thus, grouping is a good start.

First let’s construct our data :

import pandas as pd
from io import StringIO

csvfile = StringIO(
"""ID_1tID_2tPredicted LinktProbability
1t0t1t0.9
1t1t1t0.5
1t2t0t0
2t1t1t0.8
2t5t1t0.8
3t1t0t0
3t2t1t0.5""")

df = pd.read_csv(csvfile, sep = 't', engine='python')

We want to a group for each value of ID_1 and then looking for the row holding the max value of Probability for that said value of ID_1. Let’s create a mask :

max_proba = df.groupby("ID_1")["Probability"].transform(lambda x : x.eq(x.max()))

max_proba
Out[196]: 
0     True
1    False
2    False
3     True
4     True
5    False
6     True
Name: Probability, dtype: bool

Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (only one max for that ID_1 value), but not the 3 and 4 rows. Let’s build a mask that consider these two conditions, True if max value and if only one max value.

To be more accurate, for each ID_1, if a Probablity value is duplicated then it can’t be a candidate for the said max. We will then build a max that exclude duplicates Probability value for each ID_1 value

mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1)

mask_unique
Out[284]: 
0     True
1     True
2     True
3    False
4    False
5     True
6     True
Name: Probability, dtype: bool

Finally, let’s combine our two masks :

df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique)

df
Out[285]: 
   ID_1  ID_2  Predicted Link  Probability
0     1     0               1          0.9
1     1     1               0          0.5
2     1     2               0          0.0
3     2     1               0          0.8
4     2     5               0          0.8
5     3     1               0          0.0
6     3     2               1          0.5
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement