Here’s my first dataframe df1
Id  Text
1   dFn
2   fiqe
3   raUw
Here’s my second dataframe df2
Id  Text
1   yuw
2   dnag
Similarity matrix: the columns are the Id values from df1 and the rows are the Id values from df2.
     1    2     3
1    0    0  0.66
2  0.5    0  0.25
Note (positions are given as (column, row)):
0 in (1,1), (2,1) and (2,2) because the two strings have no letters in common.
0.25 in (3,2) because only 1 letter from raUw is available in the 4-letter dnag (1/4 equals 0.25).
0.5 because 2 of the 4 letters in dnag also appear in dFn.
0.66 because 2 of the 3 letters in yuw also appear in raUw (ignoring case).
Answer
IIUC, one option is to use set intersection (the & operator on two sets) in a nested list comprehension:
import pandas as pd
out = pd.DataFrame([[len(set(x.lower()) & set(y.lower())) / len(x) for y in df1['Text'].tolist()] for x in df2['Text'].tolist()])
Output:
     0    1         2
0  0.0  0.0  0.666667
1  0.5  0.0  0.250000
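If you want the result labelled with the Id values, as in the matrix from the question, here is a minimal self-contained sketch of the same idea. The helper name letter_overlap is hypothetical, introduced only for illustration. The sample frames are rebuilt from the question's data, and the sketch assumes every Text value is a non-empty string.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Id": [1, 2, 3], "Text": ["dFn", "fiqe", "raUw"]})
df2 = pd.DataFrame({"Id": [1, 2], "Text": ["yuw", "dnag"]})


def letter_overlap(row_text: str, col_text: str) -> float:
    """Distinct letters shared by both strings (case-insensitive),
    divided by the length of row_text. Hypothetical helper name."""
    shared = set(row_text.lower()) & set(col_text.lower())
    return len(shared) / len(row_text)


# Rows follow df2 and columns follow df1; both axes carry the Id values.
sim = pd.DataFrame(
    [[letter_overlap(x, y) for y in df1["Text"]] for x in df2["Text"]],
    index=df2["Id"].tolist(),
    columns=df1["Id"].tolist(),
)
print(sim.round(2))
```

Note that the numerator counts distinct shared letters while the denominator is the full length of the df2 string, so a string with repeated letters can never reach 1.0 under this definition. An empty string in df2 would raise ZeroDivisionError.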