I’m trying to replace or ablate terms within a DataFrame if they appear within another DataFrame.
For example, below is the replacement DataFrame, which has an ablate column and a replace column. I’m looking to replace any word that appears in the ablate column with whatever is in the replace column of the same row.
    import pandas as pd

    replace_df = pd.DataFrame({
        'ablate': ['her', 'him', 'she', ' he ', 'woman', 'man'],
        'replace': ['', '', 'foo', '', '', 'bar']
    })
For example, if given the below DataFrame with comment text…
    comment_df = pd.DataFrame({
        'comment_text': [
            'She is going to work',
            'The man said to the woman, hello',
            'Another way of viewing this article is to ask:',
        ],
    })
applying replace_df would result in the below:
    return_df = pd.DataFrame({
        'comment_text': [
            'foo is going to work',
            'The bar said to the, hello',
            'Another way of viewing this article is to ask:',
        ],
    })
Many thanks in advance!
Answer
A dict would be a more natural data structure for the replacements. With that in mind, how about:
    # Map each term to ablate onto its replacement.
    replacements = dict(zip(replace_df['ablate'], replace_df['replace']))

    def replace_substrings(string):
        string_list = string.split(" ")
        for idx, substring in enumerate(string_list):
            lowered = substring.lower()
            for k, v in replacements.items():
                # Prefix match, so trailing punctuation ("woman,") still matches;
                # note it also matches longer words that start with k.
                if lowered.startswith(k):
                    string_list[idx] = lowered.replace(k, v)
        # Tidy the space left before a comma when a word is removed.
        return " ".join(string_list).replace(" ,", ",")

    comment_df['comment_text'].apply(replace_substrings)
which returns:
    0    foo is going to work
    1    The bar said to the, hello
    2    Another way of viewing this article is to ask:
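If whole-word matching is what you are after, a regular expression with word boundaries might be a tighter fit than the prefix check above. The sketch below is only one possible alternative, under a few assumptions that are not in the question: matching should be case-insensitive, the padded key ' he ' is meant as the whole word "he", and leftover spaces before punctuation should be removed. The names `pattern` and `replace_words` are made up for illustration, and the mapping is written out by hand so the snippet runs without pandas.

```python
import re

# Assumed mapping, mirroring replace_df from the question. The padded
# key ' he ' is stripped to 'he' because \b handles word boundaries.
replacements = {
    'her': '', 'him': '', 'she': 'foo', 'he': '', 'woman': '', 'man': 'bar',
}

# Try longer keys first so the alternation prefers them where keys overlap.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b',
    flags=re.IGNORECASE,
)

def replace_words(text):
    """Replace whole-word, case-insensitive matches using the mapping."""
    replaced = pattern.sub(lambda m: replacements[m.group(1).lower()], text)
    # Remove whitespace left before punctuation, e.g. "the , hello".
    replaced = re.sub(r'\s+([,.;:!?])', r'\1', replaced)
    # Collapse any doubled spaces left by removed words.
    return re.sub(r'\s{2,}', ' ', replaced).strip()

comments = [
    'She is going to work',
    'The man said to the woman, hello',
    'Another way of viewing this article is to ask:',
]
results = [replace_words(c) for c in comments]
```

With the DataFrames from the question, the same function could be passed to `comment_df['comment_text'].apply(replace_words)`. Unlike the prefix check, this leaves words such as "Here" or "Manchester" untouched and does not lowercase the surrounding text.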