
How to compute letter-based similarity between two pandas DataFrames

Here’s my first dataframe, df1:

Id   Text
1    dFn
2    fiqe
3    raUw

Here’s my second dataframe, df2:

Id   Text
1    yuw
2    dnag

The expected similarity matrix has the Ids from df1 as columns and the Ids from df2 as rows:

       1      2      3
1      0      0   0.66  
2    0.5      0   0.25

Note:

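Cells are written as (column, row), and letters are compared case-insensitively.

The value is 0 in (1,1), (2,1) and (2,2) because the two texts share no letters.

The value 0.25 in (3,2) is because only 1 letter from raUw appears in the 4-letter dnag (1/4 equals 0.25).

The value 0.5 in (1,2) is because 2 letters from dFn appear in the 4-letter dnag (2/4).

The value 0.66 in (3,1) is because 2 letters from raUw appear in the 3-letter yuw (2/3).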
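
To make the arithmetic in the notes concrete, here is a minimal sketch of the rule for a single pair of strings. The function name letter_similarity and its parameter names are hypothetical, and it assumes that matching is case-insensitive and that the denominator is the length of the df2 text, as the 0.66 and 0.25 cells suggest.

```python
def letter_similarity(candidate: str, reference: str) -> float:
    """Distinct letters shared by both strings, divided by len(reference).

    Assumption: `reference` is the text from df2 and `candidate` the text from df1.
    """
    shared = set(candidate.lower()) & set(reference.lower())
    return len(shared) / len(reference)


print(letter_similarity("raUw", "yuw"))   # u and w are shared: 2 / 3
print(letter_similarity("dFn", "dnag"))   # d and n are shared: 2 / 4
print(letter_similarity("raUw", "dnag"))  # only a is shared: 1 / 4
print(letter_similarity("fiqe", "dnag"))  # nothing is shared: 0 / 4
```

Because the numerator counts distinct letters while the denominator counts every character, a df2 text with repeated letters can never reach 1.0 under this rule.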


Answer

IIUC, one option is to use set intersection (the & operator) in a nested list comprehension, lowercasing both strings and dividing by the length of the df2 text:

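import pandas as pd

# Rows follow df2, columns follow df1; letters are compared case-insensitively.
out = pd.DataFrame([
    [len(set(x.lower()) & set(y.lower())) / len(x) for y in df1['Text'].tolist()]
    for x in df2['Text'].tolist()
])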

Output:

     0    1         2
0  0.0  0.0  0.666667
1  0.5  0.0  0.250000
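
The output above is labelled with default positions (0, 1, 2) rather than the Ids shown in the question. A self-contained sketch that rebuilds the two sample frames and labels the result with their Id columns could look like this. Passing index= and columns= is an addition for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Id": [1, 2, 3], "Text": ["dFn", "fiqe", "raUw"]})
df2 = pd.DataFrame({"Id": [1, 2], "Text": ["yuw", "dnag"]})

# One row per df2 text, one column per df1 text.
rows = [
    [len(set(x.lower()) & set(y.lower())) / len(x) for y in df1["Text"]]
    for x in df2["Text"]
]

# Label rows with the df2 Ids and columns with the df1 Ids.
out = pd.DataFrame(rows, index=df2["Id"].tolist(), columns=df1["Id"].tolist())
print(out.round(2))
```

With these labels, out.loc[2, 3] looks up the cell for df2 Id 2 (dnag) and df1 Id 3 (raUw).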
User contributions licensed under: CC BY-SA