How to count word similarity between two pandas dataframes

Here’s my first dataframe df1

Id   Text
1    Asoy Geboy Ngebut
2    Asoy kita Geboy 
3    Bersatu kita Teguh

Here’s my second dataframe df2

Id   Text
1    Bersatu Kita
2    Asoy Geboy Jalanan

Expected similarity matrix, where the columns are the Ids from df1 and the rows are the Ids from df2

       1      2      3
1      0   0.33      1  
2   0.66   0.66      0

Note:

The value is 0 at (1,1) and (3,2), given as (df1 Id, df2 Id), because the texts have no words in common

The value is 1 at (3,1) because Bersatu and Kita (Id 1 in df2) are both available in Id 3 in df1

0.33 is counted because 1 of 3 words is similar

0.66 is counted because 2 of 3 words are similar

Answer

IIUC, you need to compute a set intersection:

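import pandas as pd

# One set of lowercased words per row of each dataframe
l1 = [set(x.split()) for x in df1['Text'].str.lower()]
l2 = [set(x.split()) for x in df2['Text'].str.lower()]

# Number of shared words divided by the number of words in the df1 text
pd.DataFrame([[len(s1&s2)/len(s1) for s1 in l1] for s2 in l2],
             columns=df1['Id'], index=df2['Id'])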

output:

Id         1         2         3
Id                              
1   0.000000  0.333333  0.666667
2   0.666667  0.666667  0.000000

NB. The condition on the denominator is not fully clear. The code above divides by the number of words in the df1 text, so for {teguh, kita, bersatu} vs {kita, bersatu} I count 2/3 = 0.666 rather than the expected 1.
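
Since the denominator is the open question, here is a minimal sketch that makes it a parameter so the variants can be compared side by side. The helper names `word_sets` and `similarity_matrix` and the `denominator` option are hypothetical, introduced only for illustration, and the dataframes are rebuilt from the question's sample data. It assumes no text is empty (an empty text would give a zero denominator).

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"Id": [1, 2, 3],
                    "Text": ["Asoy Geboy Ngebut", "Asoy kita Geboy", "Bersatu kita Teguh"]})
df2 = pd.DataFrame({"Id": [1, 2],
                    "Text": ["Bersatu Kita", "Asoy Geboy Jalanan"]})

def word_sets(series):
    # Lowercase each text and split it on whitespace into a set of unique words
    return [set(text.split()) for text in series.str.lower()]

def similarity_matrix(df1, df2, denominator="df1"):
    # `denominator` is a hypothetical option, not a pandas feature:
    #   "df1"   -> number of words in the df1 text (as in the answer above)
    #   "df2"   -> number of words in the df2 text
    #   "union" -> number of words in either text (Jaccard similarity)
    sizes = {
        "df1": lambda s1, s2: len(s1),
        "df2": lambda s1, s2: len(s2),
        "union": lambda s1, s2: len(s1 | s2),
    }
    size = sizes[denominator]
    l1, l2 = word_sets(df1["Text"]), word_sets(df2["Text"])
    rows = [[len(s1 & s2) / size(s1, s2) for s1 in l1] for s2 in l2]
    return pd.DataFrame(rows, columns=df1["Id"], index=df2["Id"])

by_df1 = similarity_matrix(df1, df2, "df1")
by_df2 = similarity_matrix(df1, df2, "df2")
print(by_df1.round(2))
print(by_df2.round(2))
```

With "df2" as the denominator, the (3,1) cell becomes 1 as in the expected matrix, but the (2,1) cell becomes 0.5 instead of 0.33. So neither choice reproduces the expected table exactly, and the rule you actually want needs to be pinned down first.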

User contributions licensed under: CC BY-SA