How to count word similarity between two pandas dataframes

Question

Here's my first dataframe, df1.

Here's my second dataframe, df2.

Similarity matrix: the columns are the Id values from df1, and the rows are the Id values from df2.

Note:

The 0 values at (1,1) and (3,2) are because no words are similar.
The 1 value at (3,1) is because "Bersatu" and "Kita" (Id 1 in df2) are both available in Id 3 in df1.
0.33 is counted because 1 of 3 words is similar.
0.66 is

Accepted Answer

IIUC, you need to compute a set intersection:

    # lowercase each text and split it into a set of words
    l1 = [set(x.split()) for x in df1['Text'].str.lower()]
    l2 = [set(x.split()) for x in df2['Text'].str.lower()]

    # share of each df1 word set that also appears in each df2 word set
    pd.DataFrame([[len(s1&s2)/len(s1) for s1 in l1] for s2 in l2],
                 columns=df1['Id'], index=df2['Id'])

output:

    Id         1         2         3
    Id
    1   0.000000  0.333333  0.666667
    2   0.666667  0.666667  0.000000

NB. The condition on the denominator is not fully clear: for {teguh, kita, bersatu} vs {kita, bersatu} I count 2/3 = 0.666.
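Since the denominator is the open point, here is a minimal self-contained sketch that computes the matrix both ways. The sample texts are hypothetical (the original dataframes are not reproduced in this page), and word_overlap with its denominator parameter is an illustrative helper, not part of pandas. Dividing by the df2 word count appears to match the question's (3,1) = 1 example, but that reading is an assumption.

```python
import pandas as pd

# Hypothetical sample data: the original dataframes are not shown here.
df1 = pd.DataFrame({
    "Id": [1, 2, 3],
    "Text": ["Aku Suka Kamu", "Kita Harus Maju", "Teguh Kita Bersatu"],
})
df2 = pd.DataFrame({
    "Id": [1, 2],
    "Text": ["Bersatu Kita", "Kita Maju"],
})


def word_overlap(df_cols, df_rows, denominator="cols"):
    """Hypothetical helper: shared-word ratio for every pair of texts.

    denominator="cols" divides by the word count of the df_cols text
    (what the snippet above does); denominator="rows" divides by the
    word count of the df_rows text.
    """
    col_sets = [set(t.split()) for t in df_cols["Text"].str.lower()]
    row_sets = [set(t.split()) for t in df_rows["Text"].str.lower()]

    def ratio(c, r):
        base = c if denominator == "cols" else r
        # Guard against empty texts to avoid a division by zero.
        return len(c & r) / len(base) if base else 0.0

    data = [[ratio(c, r) for c in col_sets] for r in row_sets]
    return pd.DataFrame(data, columns=df_cols["Id"], index=df_rows["Id"])


by_cols = word_overlap(df1, df2, denominator="cols")
by_rows = word_overlap(df1, df2, denominator="rows")

print(by_cols)
print(by_rows)
```

With these sample texts, the pair {teguh, kita, bersatu} vs {bersatu, kita} gives 2/3 when dividing by the df1 word count and 1.0 when dividing by the df2 word count, so the choice of denominator decides which of the two results you get.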
