I have dataframe that looks like this:
JavaScript
x
5
1
email account_name
2
0 NaN weichert, realtors mnsota
3
1 jhawkins sterling group com sterling group
4
2 lbaltz baltzchevy com baltz chevrolet
5
and I have this code that works as a solution but it takes forever on larger datasets and I know there has to be an easier way to solve it so just looking to see if anyone knows of a more concise/elegant way to do find a count of matching words between corresponding rows of both columns. Thanks
JavaScript
1
23
23
1
test = prod_nb_wcomps_2.sample(3, random_state=10).reset_index(drop = True)
2
test = test[['email','account_name']]
3
print(test)
4
lst = []
5
6
for i in test.index:
7
if not isinstance(test['email'].iloc[i], float):
8
for word in test['email'].iloc[i].split(' '):
9
if not isinstance(test['account_name'].iloc[i], float):
10
for word2 in test['account_name'].iloc[i].split(' '):
11
if word in word2:
12
lst.append({'index':i, 'bool_col': True})
13
else: lst.append({'index':i, 'bool_col': False})
14
15
df_dct = pd.DataFrame(lst)
16
df_dct = df_dct.loc[df_dct['bool_col'] == True]
17
df_dct['number of matches_per_row'] = df_dct.groupby('index')['bool_col'].transform('size')
18
df_dct.set_index('index', inplace=True, drop=True)
19
df_dct.drop(['bool_col'], inplace=True, axis =1)
20
test_ = pd.merge(test, df_dct, left_index=True, right_index=True)
21
test_
22
23
the resulting dataframe test_
looks like this
Advertisement
Answer
This solves your query.
JavaScript
1
8
1
import pandas as pd
2
3
df = pd.DataFrame({'email': ['', 'jhawkins sterling group com', 'lbaltz baltzchevy com'], 'name': ['John', 'sterling group', 'Linda']})
4
5
for index, row in df.iterrows():
6
matches = sum([1 for x in row['email'].split() if x in row['name'].split()])
7
df.loc[index, 'matches'] = matches
8
Output:
JavaScript
1
5
1
email name matches
2
0 John 0.0
3
1 jhawkins sterling group com sterling group 2.0
4
2 lbaltz baltzchevy com Linda 0.0
5