How to find and count common letters between words in pandas

I have a dataset with some words in it, and I want to compare two columns and count the letters they have in common.

For example, I have:

import pandas as pd

data = {'Col_1' : ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2' : ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)

| Col_1  | Col_2  |
-------------------
| Heaven | Heaven |
| Jako   | Jakob  |
| Sm     | Smart  |
| apizza | pizza  |

And I want to get something like this:

| Col_1  | Col_2  | Match                          | Count |
------------------------------------------------------------
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6     |
| Jako   | Jakob  | ['J', 'a', 'k', 'o']           | 4     |
| Sm     | Smart  | ['S', 'm']                     | 2     |
| apizza | pizza  | []                             | 0     |

Answer

You can use a list comprehension with the help of itertools.takewhile:

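from itertools import takewhile

# For each row, pair up the letters of both words and keep them until the first mismatch
df['Match'] = [[x for x, y in takewhile(lambda p: p[0] == p[1], zip(a, b))]
               for a, b in zip(df['Col_1'], df['Col_2'])]
# .str.len() also works on lists, so this gives the number of matched letters
df['Count'] = df['Match'].str.len()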

output:

    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0

NB. The logic was not fully clear, so this version stops as soon as there is a mismatch.
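
To see the "stop at the first mismatch" behaviour in isolation, here is a minimal sketch on plain strings, without pandas. The helper name common_prefix_letters is hypothetical and only used for illustration; it applies the same takewhile logic as the list comprehension above to a single pair of words.

```python
from itertools import takewhile

def common_prefix_letters(a, b):
    """Return the letters a and b share, position by position, up to the first mismatch."""
    pairs = zip(a, b)  # zip stops at the end of the shorter word
    matching = takewhile(lambda pair: pair[0] == pair[1], pairs)
    return [x for x, _ in matching]

print(common_prefix_letters("Jako", "Jakob"))    # ['J', 'a', 'k', 'o']
print(common_prefix_letters("apizza", "pizza"))  # []
```

Because "apizza" and "pizza" already differ at the first position, takewhile yields nothing for that pair, which is why the Count for that row is 0.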

If you want to continue after a mismatch (which doesn’t seem to fit the “pizza” example):

df['Match'] = [[x for x,y in zip(a,b) if x==y]
               for a,b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()

output:

    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                 [z]      1
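
For comparison, the "continue after a mismatch" variant can be sketched on plain strings as well. The helper name positional_matches is hypothetical; it keeps every letter that is identical at the same position in both words, regardless of earlier mismatches.

```python
def positional_matches(a, b):
    """Return every letter that a and b share at the same position."""
    # zip pairs letters by index and stops at the end of the shorter word
    return [x for x, y in zip(a, b) if x == y]

print(positional_matches("apizza", "pizza"))  # ['z']
print(positional_matches("Heaven", "Haven"))  # ['H']
```

Here "apizza" and "pizza" are compared index by index, so only the fourth letter ("z" in both words) matches, which gives the single-element result shown in the last row above.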
User contributions licensed under: CC BY-SA