How to count word similarity between two pandas dataframes

Question

Here's my first dataframe, df1.

Here's my second dataframe, df2.

Similarity matrix: the columns are the Id values from df1, the rows are the Id values from df2.

Note:
The value in (1,1) and (3,2) is 0 because no text is similar.
The value in (3,1) is 1 because Bersatu and Kita (Id 1 in df2) are available in Id 3 in df1.
0.33 is counted because o…

Accepted Answer

IIUC, you need to compute a set intersection:

```python
import pandas as pd

# lowercase each text and split it into a set of unique words
l1 = [set(x.split()) for x in df1['Text'].str.lower()]
l2 = [set(x.split()) for x in df2['Text'].str.lower()]
# shared words divided by the number of words in the df1 text
pd.DataFrame([[len(s1&s2)/len(s1) for s1 in l1] for s2 in l2],
             columns=df1['Id'], index=df2['Id'])
```

output:

```
Id         1         2         3
Id                              
1   0.000000  0.333333  0.666667
2   0.666667  0.666667  0.000000
```

NB. The condition on the denominator is not fully clear: for {teguh, kita, bersatu} vs {kita, bersatu} I count 2/3 = 0.666.
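Since the right denominator is open to interpretation, here is a minimal sketch that wraps the set-intersection approach in a function with a switchable denominator. The sample frames, the function name `word_overlap`, and the `denominator` parameter are all assumptions made up for illustration; they are not the question's actual data or part of pandas.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch (not the question's frames).
df1 = pd.DataFrame({"Id": [1, 2, 3],
                    "Text": ["Saya Suka Makan", "Kita Makan Nasi", "Teguh Kita Bersatu"]})
df2 = pd.DataFrame({"Id": [1, 2],
                    "Text": ["Kita Bersatu", "Makan Nasi Goreng"]})


def word_overlap(df1, df2, denominator="df1"):
    """Word overlap between every df2 row (index) and every df1 row (columns).

    denominator (hypothetical parameter):
      "df1"   -> shared words / words in the df1 text (as in the answer above)
      "df2"   -> shared words / words in the df2 text
      "union" -> shared words / words in either text (Jaccard index)
    """
    # Lowercase each text and reduce it to a set of unique words.
    l1 = [set(t.split()) for t in df1["Text"].str.lower()]
    l2 = [set(t.split()) for t in df2["Text"].str.lower()]

    def score(s1, s2):
        if denominator == "df1":
            size = len(s1)
        elif denominator == "df2":
            size = len(s2)
        elif denominator == "union":
            size = len(s1 | s2)
        else:
            raise ValueError(f"unknown denominator: {denominator!r}")
        # Guard against empty texts, which would otherwise divide by zero.
        return len(s1 & s2) / size if size else 0.0

    return pd.DataFrame([[score(s1, s2) for s1 in l1] for s2 in l2],
                        columns=df1["Id"], index=df2["Id"])


print(word_overlap(df1, df2))
print(word_overlap(df1, df2, denominator="df2"))
```

With this made-up data, {teguh, kita, bersatu} vs {kita, bersatu} scores 2/3 under `"df1"` and 1.0 under `"df2"`, so the second option may be closer to the value of 1 the question expects in (3,1).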

Answer