I have dataframe that looks like this:
email account_name 0 NaN weichert, realtors mnsota 1 jhawkins sterling group com sterling group 2 lbaltz baltzchevy com baltz chevrolet
and I have this code that works as a solution but it takes forever on larger datasets and I know there has to be an easier way to solve it so just looking to see if anyone knows of a more concise/elegant way to do find a count of matching words between corresponding rows of both columns. Thanks
test = prod_nb_wcomps_2.sample(3, random_state=10).reset_index(drop = True)
test = test[['email','account_name']]
print(test)
lst = []
for i in test.index:
if not isinstance(test['email'].iloc[i], float):
for word in test['email'].iloc[i].split(' '):
if not isinstance(test['account_name'].iloc[i], float):
for word2 in test['account_name'].iloc[i].split(' '):
if word in word2:
lst.append({'index':i, 'bool_col': True})
else: lst.append({'index':i, 'bool_col': False})
df_dct = pd.DataFrame(lst)
df_dct = df_dct.loc[df_dct['bool_col'] == True]
df_dct['number of matches_per_row'] = df_dct.groupby('index')['bool_col'].transform('size')
df_dct.set_index('index', inplace=True, drop=True)
df_dct.drop(['bool_col'], inplace=True, axis =1)
test_ = pd.merge(test, df_dct, left_index=True, right_index=True)
test_
the resulting dataframe test_ looks like this
Advertisement
Answer
This solves your query.
import pandas as pd
df = pd.DataFrame({'email': ['', 'jhawkins sterling group com', 'lbaltz baltzchevy com'], 'name': ['John', 'sterling group', 'Linda']})
for index, row in df.iterrows():
matches = sum([1 for x in row['email'].split() if x in row['name'].split()])
df.loc[index, 'matches'] = matches
Output:
email name matches 0 John 0.0 1 jhawkins sterling group com sterling group 2.0 2 lbaltz baltzchevy com Linda 0.0
