Replace text between two DataFrames in Pandas

I’m trying to replace or ablate terms within a DataFrame if they appear within another DataFrame.

For example, below is the replacement DataFrame, which has an ablate column and a replace column. I’m looking to replace any word that appears in the ablate column with the value in the same row of the replace column.

import pandas as pd

replace_df = pd.DataFrame({
    'ablate': ['her', 'him', 'she', ' he ', 'woman', 'man'],
    'replace': ['', '', 'foo', '', '', 'bar']
})

For example, if given the below DataFrame with comment text…

comment_df = pd.DataFrame({
    'comment_text': [
         'She is going to work',
         'The man said to the woman, hello',
         'Another way of viewing this article is to ask:',
    ],
})

applying replace_df would result in the below:

return_df = pd.DataFrame({
    'comment_text': [
         'foo is going to work',
         'The bar said to the, hello',
         'Another way of viewing this article is to ask:',
    ],
})

Many thanks in advance!

Answer

A dict would be a more natural data structure for the replacements. With that in mind, how about:

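# Map each term to ablate onto its replacement.
replacements = dict(zip(replace_df['ablate'], replace_df['replace']))

def replace_substrings(string):
    # Split on spaces so each word can be checked on its own.
    string_list = string.split(" ")
    for idx, substring in enumerate(string_list):
        for k, v in replacements.items():
            # Prefix match, so a word with trailing punctuation ("woman,") still matches.
            # Note: this also matches longer words that start with a key (e.g. "mandate")
            # and lowercases the matched word. Keys padded with spaces (' he ') never match.
            if substring.lower().startswith(k):
                string_list[idx] = substring.lower().replace(k, v)
    # Rejoin and drop the space left in front of a comma by an empty replacement.
    return " ".join(string_list).replace(" ,", ",")


comment_df['comment_text'].apply(replace_substrings)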

which returns:

0                              foo is going to work
1                        The bar said to the, hello
2    Another way of viewing this article is to ask:
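
If prefix matching turns out to be too loose for your data (it would also change a word such as "mandate"), one possible alternative is a single case-insensitive regular expression with word boundaries. The following is only a sketch, not part of the answer above: the name replace_words is made up for illustration, and it assumes that padded keys like ' he ' are meant as whole words and that cleaning up leftover spaces before punctuation is acceptable.

```python
import re

import pandas as pd

replace_df = pd.DataFrame({
    'ablate': ['her', 'him', 'she', ' he ', 'woman', 'man'],
    'replace': ['', '', 'foo', '', '', 'bar'],
})
comment_df = pd.DataFrame({
    'comment_text': [
        'She is going to work',
        'The man said to the woman, hello',
        'Another way of viewing this article is to ask:',
    ],
})

# Assumption: padding such as ' he ' only signals "whole word",
# so strip it and normalise every key to lowercase.
replacements = {
    key.strip().lower(): value
    for key, value in zip(replace_df['ablate'], replace_df['replace'])
}

# One alternation wrapped in \b...\b so that only whole words match.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in replacements) + r')\b',
    flags=re.IGNORECASE,
)

def replace_words(text):  # hypothetical helper name
    # Look up each matched word in the mapping, ignoring case.
    replaced = pattern.sub(lambda m: replacements[m.group(0).lower()], text)
    # Collapse doubled whitespace and drop the space left before punctuation.
    replaced = re.sub(r'\s{2,}', ' ', replaced)
    return re.sub(r'\s+([,.;:!?])', r'\1', replaced).strip()

result = comment_df['comment_text'].apply(replace_words)
```

On the three example comments this gives the same text as return_df in the question, and it leaves the capitalisation of untouched words alone.
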
User contributions licensed under: CC BY-SA