Here’s my first dataframe df1
    Id  Text
    1   Asoy Geboy Ngebut
    2   Asoy kita Geboy
    3   Bersatu kita Teguh
Here’s my second dataframe df2
    Id  Text
    1   Bersatu Kita
    2   Asoy Geboy Jalanan
Similarity matrix, where the columns are the Id from df1 and the rows are the Id from df2:

         1     2     3
    1    0     0.33  1
    2    0.66  0.66  0
Note (positions are given as (df1 Id, df2 Id), i.e. (column, row)):
The 0 values at (1,1) and (3,2) are because the texts have no words in common.
The 1 at (3,1) is because Bersatu and Kita (Id 1 in df2) are both available in Id 3 in df1.
0.33 is counted because 1 of 3 words is similar.
0.66 is counted because 2 of 3 words are similar.
Answer
IIUC, you need to compute a set intersection:
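    import pandas as pd

    # Lowercase each text and split it into a set of unique words
    l1 = [set(x.split()) for x in df1['Text'].str.lower()]
    l2 = [set(x.split()) for x in df2['Text'].str.lower()]

    # Share of each df1 row's words that also appear in each df2 row
    pd.DataFrame([[len(s1&s2)/len(s1) for s1 in l1] for s2 in l2],
                 columns=df1['Id'], index=df2['Id'])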
output:
    Id         1         2         3
    Id
    1   0.000000  0.333333  0.666667
    2   0.666667  0.666667  0.000000
NB. The condition on the denominator is not fully clear: for {teguh, kita, bersatu} vs {kita, bersatu}, I count 2/3 = 0.666 rather than the expected 1.
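
To make the denominator question concrete, here is a minimal self-contained sketch that rebuilds both dataframes from the question and computes the matrix under either choice. The helper name `overlap_matrix` and its `denom` parameter are hypothetical, introduced only for illustration, and the sketch assumes no `Text` value is empty (an empty text would cause a division by zero).

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"Id": [1, 2, 3],
                    "Text": ["Asoy Geboy Ngebut", "Asoy kita Geboy", "Bersatu kita Teguh"]})
df2 = pd.DataFrame({"Id": [1, 2],
                    "Text": ["Bersatu Kita", "Asoy Geboy Jalanan"]})


def overlap_matrix(left, right, denom="left"):
    """Word-overlap ratios; columns follow `left`, rows follow `right`.

    Hypothetical helper: `denom` selects which side's word count divides
    the size of the intersection ("left" or "right").
    """
    # Lowercase each text and split it into a set of unique words
    l_sets = [set(t.split()) for t in left["Text"].str.lower()]
    r_sets = [set(t.split()) for t in right["Text"].str.lower()]
    rows = [
        [len(ls & rs) / len(ls if denom == "left" else rs) for ls in l_sets]
        for rs in r_sets
    ]
    return pd.DataFrame(rows, columns=left["Id"], index=right["Id"])


by_df1 = overlap_matrix(df1, df2, denom="left")   # divide by the df1 word count
by_df2 = overlap_matrix(df1, df2, denom="right")  # divide by the df2 word count
print(by_df1.round(2))
print(by_df2.round(2))
```

Dividing by the df1 word count gives 2/3 at (3,1), while dividing by the df2 word count gives 1 there but 1/2 instead of 0.33 at (2,1), so neither choice reproduces the expected matrix exactly.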