I have a dataset containing words, and I want to compare two columns and count the letters they have in common.
For example, I have:
import pandas as pd

data = {'Col_1': ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2': ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)

| Col_1  | Col_2  |
-------------------
| Heaven | Heaven |
| Jako   | Jakob  |
| Sm     | Smart  |
| apizza | pizza  |
And I want to get something like this:
| Col_1  | Col_2  | Match                          | Count |
------------------------------------------------------------
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6     |
| Jako   | Jakob  | ['J', 'a', 'k', 'o']           | 4     |
| Sm     | Smart  | ['S', 'm']                     | 2     |
| apizza | pizza  | []                             | 0     |
Answer
You can use a list comprehension with the help of itertools.takewhile:
from itertools import takewhile

# for each pair of words, keep the letters up to the first mismatch
df['Match'] = [[x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]
               for a, b in zip(df['Col_1'], df['Col_2'])]
# number of matched letters per row
df['Count'] = df['Match'].str.len()
Output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0
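To see what takewhile does on a single pair of words, here is a minimal sketch in plain Python, without pandas. The helper name prefix_matches is made up for illustration and is not part of any library.

```python
from itertools import takewhile

def prefix_matches(a, b):
    """Letters shared by a and b, position by position, up to the first mismatch."""
    pairs = zip(a, b)  # zip stops at the end of the shorter word
    # takewhile yields pairs only while the two letters are equal
    return [x for x, y in takewhile(lambda pair: pair[0] == pair[1], pairs)]

print(prefix_matches('Jako', 'Jakob'))    # ['J', 'a', 'k', 'o']
print(prefix_matches('apizza', 'pizza'))  # []
```

The second call returns an empty list because the very first pair ('a', 'p') already differs, so takewhile yields nothing.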
NB. The logic was not fully clear, so this version stops as soon as there is a mismatch.
If you want to continue after a mismatch (which doesn’t seem to fit the “pizza” example):
# keep every letter that is identical at the same position in both words
df['Match'] = [[x for x, y in zip(a, b) if x == y]
               for a, b in zip(df['Col_1'], df['Col_2'])]
# number of matched letters per row
df['Count'] = df['Match'].str.len()
Output:
    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                 [z]      1
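The [z] result for the last row can look surprising, so here is a small sketch, again in plain Python, that shows the position-by-position comparison behind it. The helper name positional_matches is hypothetical and used only for this illustration.

```python
def positional_matches(a, b):
    """Letters that are identical at the same index in both words."""
    return [x for x, y in zip(a, b) if x == y]

# Walk through the pairs that zip produces for the last row
for i, (x, y) in enumerate(zip('apizza', 'pizza')):
    print(i, x, y, x == y)

print(positional_matches('apizza', 'pizza'))  # ['z']
```

Only index 3 lines up ('z' against 'z'), because the extra leading 'a' shifts every other letter of 'apizza' by one position.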