I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np
d = {'col1': ['I called the c. i. a', 'the house is e. m',
'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
I have removed the punctuation and the spaces between the abbreviated letters:
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
The output is, e.g., 'I called the cia'. What I would like instead is 'I called the CIA', so I essentially want the abbreviations to be upper-cased. I tried the following, but got no results:
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(), '', regex=True)
or
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', ''.upper(), regex=True)
Answer
pandas.Series.str.replace accepts a callable as its second argument, subject to the same requirements as the second argument of re.sub: it is passed the re.Match object and must return the replacement string. Using that, you can first uppercase your abbreviations as follows:
import pandas as pd
def make_upper(m):  # m is an re.Match object
    return m.group(0).upper()
d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)  # a callable replacement requires regex=True
print(df)
Output:
                        col1
0       I called the C. I. A
1          the house is E. M
2     this is an E. U. call!
3  how is the P. O. R going?
which you can then process further using the code you already had:
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)
Output:
                   col1
0      I called the CIA
1       the house is EM
2    this is an EU call
3  how is the POR going
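As for why the two attempts in the question had no visible effect: .upper() is applied to the argument before replace ever sees it. The sketch below uses plain re rather than pandas to keep it self-contained, and the sample strings are only illustrations.

```python
import re

pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'

# Second attempt: ''.upper() is evaluated first and is still the empty string,
# so the call is identical to the original one.
empty_upper = ''.upper()

# First attempt: .upper() rewrites the pattern text itself, so \b, \w, \s
# become \B, \W, \S, which mean the opposite of what was intended.
upper_pattern = pattern.upper()
print(upper_pattern)

# The upper-cased pattern no longer matches the spaces between the letters.
unchanged = re.sub(upper_pattern, '', 'I called the c i a')
print(unchanged)

def make_upper(m):
    # m is an re.Match object; the return value replaces the matched text
    return m.group(0).upper()

# A callable, by contrast, runs once per match and sees the matched text.
fixed = re.sub(r'\b\w\.?\b', make_upper, 'the house is e. m')
print(fixed)
```

In other words, the uppercasing has to happen inside the replacement step, which is what the callable provides.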
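You might want to improve the pattern I used (r'\b\w\.?\b') if you encounter cases it does not cover. I used word boundaries (\b) and a literal dot (\.), so as it stands it finds any single word character (\w) optionally (?) followed by a dot. Note that this also matches standalone one-letter words, such as the article "a", which would then be upper-cased too.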
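One possible refinement, as a minimal sketch rather than a definitive fix: require at least one letter followed by a dot before treating a run as an abbreviation, and collapse it in the same step. The names ABBREVIATION, collapse_abbreviation and fix_abbreviations are hypothetical, and the pattern assumes abbreviations are written as single ASCII letters separated by dots and optional spaces.

```python
import re

# Assumed shape of an abbreviation: a letter plus dot, any number of further
# "letter plus dot" items, and optionally one final single letter without a dot.
ABBREVIATION = re.compile(r'\b[A-Za-z]\.(?:\s*[A-Za-z]\.)*(?:\s*[A-Za-z]\b)?')

def collapse_abbreviation(m):
    # Drop dots and spaces inside the match, then upper-case what is left.
    return re.sub(r'[^A-Za-z]', '', m.group(0)).upper()

def fix_abbreviations(text):
    return ABBREVIATION.sub(collapse_abbreviation, text)

samples = [
    'I called the c. i. a',
    'the house is e. m',
    'this is an e. u. call!',
    'how is the p. o. r going?',
    'he is a m. d',  # the article "a" should stay lower-case
]
for s in samples:
    print(fix_abbreviations(s))
```

The same callable can be passed to Series.str.replace with ABBREVIATION.pattern as the pattern and regex=True. Remaining punctuation such as "!" or "?" is left alone here, so the punctuation-stripping step from the question would still be needed if you want it removed.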