
Remove duplicate substring at the start of the string

I would like to remove a duplicated substring from the start of a string wherever such a duplicate exists. I more or less have the logic working for the first row (see below), but I am quite new to Python and am struggling to write code that applies the same logic to all rows in a larger dataset.

Below is an example of:

  • Input: raw data I’ve created
  • Output: what I’d like to end up with
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Input' : ['john only played once john only played once yesterday. he may try again today', 'she didnt like eggs', np.nan, 'george found a dog lying in george found a dog lying in front of his car'],
    'Output' : ['john only played once yesterday. he may try again today', 'she didnt like eggs', '', 'george found a dog lying in front of his car'],
})

c = 20
df["Input_adj"] = df["Input"].str[0:c]

Input_1 = df["Input"][0]
Input_adj_1 = df["Input_adj"][0]
print(Input_1)
print(Input_adj_1)

Input_1_cut = Input_1.find(Input_adj_1, Input_1.find(Input_adj_1) + 1)
print(Input_1[Input_1_cut:])

I understand there are likely other ways of doing this, and I’m not particular about the method used as long as the output is as desired.

How would I go about transforming the input values into the output values using simpler code?

Edit: One of the comments solves this problem, but it doesn’t seem to work for the input value below (the actual text has no line breaks, but I’ve included some to better illustrate the duplicated text):

“Reed Brennan arrived at Easton Academy expecting to find an idyllic private school experience — challenging classes, adorably preppy boys, and a chance to create a new life for herself. Instead, she discovered lies, deception, blackmail, and…murder. But, thankfully, the killers were caught and the nightmare is finally over. Now, with a new school year ahead of her, Re

Reed Brennan arrived at Easton Academy expecting to find an idyllic private school experience — challenging classes, adorably preppy boys, and a chance to create a new life for herself. Instead, she discovered lies, deception, blackmail, and…murder. But, thankfully, the killers were caught and the nightmare is finally over. Now, with a new school year ahead of her, Re

ed steps back on Easton’s ivy-covered campus ready to start over. So when the headmaster announces that billings is forbidden from holding their traditional, secretive initiation, Reed is relieved. She champions the new rules and the six new girls the administration has picked to live in Billings Hall: Constance, Missy, Lorna, Kiki, Astrid, and newcomer Sabine. But Reed’s fellow Billings resident and new nemesis, Cheyenne Martin, believes the changes are a mockery of Billings history. Despite the new rules, Cheyenne vows to keep the old ways alive, no matter what — or — stands in her way…”

Does anyone know how to get it working for this example?

Answer

You can use str.replace on the input column with a regex:

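import re
# ^(.*)\1 matches the longest prefix that is immediately repeated; \1 keeps a single copy of it.
# re.DOTALL lets "." match newlines as well.
df['Output'] = df['Input'].str.replace(r'^(.*)\1', r'\1', regex=True, flags=re.DOTALL)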
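
To see what the backreference pattern does on individual strings, here is a minimal sketch using only the standard `re` module. The helper name `strip_leading_duplicate` and the last sample string are made up for illustration; the first three samples are the non-null inputs from the question.

```python
import re

# Longest prefix that is immediately followed by a copy of itself.
# re.DOTALL lets "." match newline characters too.
LEADING_DUPLICATE = re.compile(r'^(.*)\1', flags=re.DOTALL)


def strip_leading_duplicate(text):
    """Hypothetical helper: collapse a doubled prefix into a single copy."""
    return LEADING_DUPLICATE.sub(r'\1', text)


samples = [
    'john only played once john only played once yesterday. he may try again today',
    'she didnt like eggs',
    'george found a dog lying in george found a dog lying in front of his car',
    # Made-up sample: the duplicate ends in the middle of a word.
    'new year, Renew year, Reed steps back',
]

for sample in samples:
    print(strip_leading_duplicate(sample))
```

A string with no repeated prefix is returned unchanged, because the group can match the empty string. The pattern only fires when the second copy starts exactly where the first one ends, so any extra character before or between the copies prevents a match. In the pandas version, missing values pass through `.str.replace` as NaN, so a `.fillna('')` would presumably be needed to get the empty string shown in the desired output.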
User contributions licensed under: CC BY-SA