Here’s my first dataframe df1
    Id  Text
    1   dFn
    2   fiqe
    3   raUw
Here’s my second dataframe df2
    Id  Text
    1   yuw
    2   dnag
Expected similarity matrix, where the columns are the Id values from df1 and the rows are the Id values from df2:
         1    2    3
    1    0    0    0.66
    2    0.5  0    0.25
Note (positions are given as (column, row)):
0 at (1,1), (2,1) and (2,2): the two strings have no letters in common.
0.25 at (3,2): only 1 letter of raUw (a) appears in the 4-letter dnag (1/4 equals 0.25).
0.5 at (1,2): 2 of the 4 letters of dnag (d, n) appear in dFn.
0.66 at (3,1): 2 of the 3 letters of yuw (u, w) appear in raUw, ignoring case.
Answer
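IIUC, one option is to use set intersection (set.intersection, written here with the & operator) in a nested list comprehension. For each pair of strings, it counts the distinct letters the two lowercased strings share and divides by the length of the df2 string: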
    out = pd.DataFrame([
        [len(set(x.lower()) & set(y.lower())) / len(x) for y in df1['Text'].tolist()]
        for x in df2['Text'].tolist()
    ])
Output:
         0    1         2
    0  0.0  0.0  0.666667
    1  0.5  0.0  0.250000
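The output above is labeled with default positions (0, 1, 2) rather than the Id values the question asks for. A minimal, self-contained sketch that rebuilds the sample data and attaches the Ids might look like the following. The helper name letter_overlap is made up for illustration and is not part of pandas; the sample frames are reconstructed from the tables in the question.

```python
import pandas as pd

# Sample data reconstructed from the tables in the question.
df1 = pd.DataFrame({"Id": [1, 2, 3], "Text": ["dFn", "fiqe", "raUw"]})
df2 = pd.DataFrame({"Id": [1, 2], "Text": ["yuw", "dnag"]})


def letter_overlap(x: str, y: str) -> float:
    """Hypothetical helper: distinct letters shared by x and y
    (case-insensitive), divided by the length of x."""
    return len(set(x.lower()) & set(y.lower())) / len(x)


# Rows follow df2, columns follow df1, as in the expected matrix.
out = pd.DataFrame(
    [[letter_overlap(x, y) for y in df1["Text"]] for x in df2["Text"]],
    index=df2["Id"].tolist(),
    columns=df1["Id"].tolist(),
)

print(out.round(2))
```

One thing to keep in mind with this approach: the numerator counts distinct letters (it comes from a set), while the denominator len(x) counts every character, so a df2 string with repeated letters can never reach a score of 1. If that matters for your data, dividing by len(set(x.lower())) instead is one possible adjustment.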