I have a dataset with some words in it, and I want to compare two columns and count the letters they have in common.
For example, I have:
data = {'Col_1': ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2': ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)

| Col_1  | Col_2  |
|--------|--------|
| Heaven | Heaven |
| Jako   | Jakob  |
| Sm     | Smart  |
| apizza | pizza  |
And I want to get something like this:
| Col_1  | Col_2  | Match                          | Count |
|--------|--------|--------------------------------|-------|
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6     |
| Jako   | Jakob  | ['J', 'a', 'k', 'o']           | 4     |
| Sm     | Smart  | ['S', 'm']                     | 2     |
| apizza | pizza  | []                             | 0     |
Answer
You can use a list comprehension with the help of itertools.takewhile:
from itertools import takewhile

df['Match'] = [[x for x, y in takewhile(lambda x: x[0] == x[1], zip(a, b))]
               for a, b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0
NB. The logic was not fully clear, so this version stops as soon as there is a mismatch.
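To see what takewhile does on a single pair of words, here is a minimal plain-Python sketch without pandas. The helper name `common_prefix` is made up for illustration; it is not part of itertools or pandas.

```python
from itertools import takewhile


def common_prefix(a, b):
    """Collect letters that match position by position, stopping at the first mismatch."""
    # zip pairs up the letters; takewhile keeps pairs only while both letters are equal
    return [x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]


print(common_prefix("Jako", "Jakob"))    # ['J', 'a', 'k', 'o']
print(common_prefix("apizza", "pizza"))  # []
```

Because the first pair for "apizza" and "pizza" is ('a', 'p'), takewhile stops immediately, which is why that row yields an empty list.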
If you want to continue after a mismatch (which doesn't seem to fit the "pizza" example):
df['Match'] = [[x for x, y in zip(a, b) if x == y]
               for a, b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                 [z]      1
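The single 'z' in the last row comes from comparing letters at the same index only. A small sketch on plain strings makes this visible; `positional_matches` is a hypothetical helper name used only for this illustration.

```python
def positional_matches(a, b):
    """Collect every letter that is equal at the same index in both words."""
    # Unlike the takewhile variant, a mismatch does not stop the scan
    return [x for x, y in zip(a, b) if x == y]


# The pairs compared for the last row, index by index
print(list(zip("apizza", "pizza")))
# [('a', 'p'), ('p', 'i'), ('i', 'z'), ('z', 'z'), ('z', 'a')]

print(positional_matches("apizza", "pizza"))  # ['z']
```

Only the fourth pair matches, so the result is one letter. Note that zip stops at the shorter word, so the trailing 'a' of "apizza" is never compared.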