I’m trying to replace/ablate terms within a DataFrame if they appear within another DataFrame.
For example, below is the replace DataFrame, which has an ablate column and a replace column. I’m looking to replace any word that appears in the ablate column with whatever is in the replace column of the same row.
import pandas as pd

replace_df = pd.DataFrame({
    'ablate': ['her', 'him', 'she', ' he ', 'woman', 'man'],
    'replace': ['', '', 'foo', '', '', 'bar'],
})
For example, if given the below DataFrame with comment text…
comment_df = pd.DataFrame({
    'comment_text': [
        'She is going to work',
        'The man said to the woman, hello',
        'Another way of viewing this article is to ask:',
    ],
})
applying replace_df would result in the below:
return_df = pd.DataFrame({
    'comment_text': [
        'foo is going to work',
        'The bar said to the, hello',
        'Another way of viewing this article is to ask:',
    ],
})
Many thanks in advance!
Answer
A dict would be a more natural data structure for the replacements. With that in mind, how about:
# Map each term to ablate onto its replacement.
replacements = dict(zip(replace_df['ablate'].tolist(), replace_df['replace'].tolist()))

def replace_substrings(string):
    string_list = string.split(" ")
    for idx, substring in enumerate(string_list):
        for k in replacements.keys():
            # Prefix match on the lowercased token, so 'woman,' matches 'woman'.
            if substring.lower()[:len(k)] == k:
                string_list[idx] = substring.lower().replace(k, replacements[k])
    # Rejoin and tidy the space left in front of a comma by a removed word.
    return " ".join(string_list).replace(" ,", ",")

comment_df['comment_text'].apply(replace_substrings)
which returns:
0                              foo is going to work
1                        The bar said to the, hello
2    Another way of viewing this article is to ask:
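Note that the function above matches on prefixes and lowercases any token it changes, so a word like "manual" would become "barual". If you only want whole words replaced, a regex with word boundaries is one alternative. The following is a minimal sketch, not a drop-in answer: it assumes the padding in ' he ' was only meant to force whole-word matching (so the keys are stripped), and `replace_words` and `pattern` are names made up for illustration.

```python
import re
import pandas as pd

replace_df = pd.DataFrame({
    'ablate': ['her', 'him', 'she', ' he ', 'woman', 'man'],
    'replace': ['', '', 'foo', '', '', 'bar'],
})
comment_df = pd.DataFrame({
    'comment_text': [
        'She is going to work',
        'The man said to the woman, hello',
        'Another way of viewing this article is to ask:',
    ],
})

# Assumption: strip padding such as ' he ' so every key is a bare word.
replacements = {
    k.strip().lower(): v
    for k, v in zip(replace_df['ablate'], replace_df['replace'])
}

# One alternation of all keys (longest first), wrapped in word boundaries.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b',
    flags=re.IGNORECASE,
)

def replace_words(text):
    # Look up each matched word case-insensitively in the dict.
    replaced = pattern.sub(lambda m: replacements[m.group(0).lower()], text)
    # Collapse whitespace left by removed words, then drop spaces before punctuation.
    replaced = re.sub(r'\s+', ' ', replaced).strip()
    return re.sub(r'\s+([,.;:!?])', r'\1', replaced)

result = comment_df['comment_text'].apply(replace_words)
print(result.tolist())
```

On the sample comments this gives the same three strings as return_df, while leaving words such as "manual" or "hello" untouched and preserving the case of everything that is not replaced.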