I have a dataset with some words in it, and I want to compare two columns and count the common letters between them.
For example, I have:
import pandas as pd

data = {'Col_1' : ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2' : ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)
| Col_1 | Col_2 |
-------------------
| Heaven | Heaven |
| Jako | Jakob |
| Sm | Smart |
| apizza | pizza |
And I want to get something like this:
| Col_1 | Col_2 | Match | Count |
------------------------------------------------------------
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6 |
| Jako | Jakob | ['J', 'a', 'k', 'o'] | 4 |
| Sm | Smart | ['S', 'm'] | 2 |
| apizza | pizza | [] | 0 |
Answer
You can use a list comprehension with the help of itertools.takewhile:
from itertools import takewhile
df['Match'] = [[x for x,y in takewhile(lambda x: x[0]==x[1], zip(a,b))]
for a,b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0
NB. The logic was not fully clear, so this version stops as soon as there is a mismatch.
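To see the "stop at the first mismatch" behavior in isolation, here is a minimal sketch without pandas. The helper name `common_prefix` is hypothetical and only for illustration; it wraps the same takewhile expression used above.

```python
from itertools import takewhile

def common_prefix(a, b):
    """Return the leading characters of a that match b position by position.

    Hypothetical helper: stops at the first mismatch, or when the shorter
    string runs out (zip truncates to the shorter input).
    """
    return [x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]

pairs = [("Heaven", "Heaven"), ("Jako", "Jakob"), ("Sm", "Smart"), ("apizza", "pizza")]
for a, b in pairs:
    match = common_prefix(a, b)
    print(a, b, match, len(match))
```

For "apizza" vs "pizza" the very first pair is ('a', 'p'), so takewhile yields nothing and the count is 0.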
If you want to continue after a mismatch (which doesn’t seem to fit the “pizza” example):
df['Match'] = [[x for x,y in zip(a,b) if x==y]
for a,b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                 [z]      1
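Why "apizza" vs "pizza" gives [z] here can be checked with a small pandas-free sketch. The helper name `positional_matches` is hypothetical; it applies the same position-by-position comparison as the comprehension above.

```python
def positional_matches(a, b):
    """Return characters of a that equal the character of b at the same index.

    Hypothetical helper: unlike the takewhile version, it keeps scanning
    after a mismatch.
    """
    return [x for x, y in zip(a, b) if x == y]

# Show the aligned character pairs: only index 3 lines up ('z', 'z')
print(list(zip("apizza", "pizza")))
print(positional_matches("apizza", "pizza"))  # ['z']
```

Because the comparison is positional, the extra leading "a" shifts every letter by one, and only the doubled "z" happens to align.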