I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np

d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!', 'how is the p. o. r going?']}
df = pd.DataFrame(data=d)
I have removed the punctuation and the spaces between the abbreviated letters:
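# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)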
The output is, e.g., ‘I called the cia’. What I would like instead is ‘I called the CIA’, so essentially I would like the abbreviations to be upper-cased. I tried the following, but it had no effect:
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(), '', regex=True)
or
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', ''.upper(), regex=True)
Answer
pandas.Series.str.replace allows its 2nd argument to be a callable that meets the requirements of the 2nd argument of re.sub: it receives an re.Match object and returns the replacement string. Using that, you might first uppercase your abbreviations as follows:
import pandas as pd

def make_upper(m):
    # m is an re.Match object; return the matched text uppercased
    return m.group(0).upper()

d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!', 'how is the p. o. r going?']}
df = pd.DataFrame(data=d)
# a callable replacement requires regex=True, which is no longer the default in pandas 2.0+
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)
print(df)
output
                        col1
0       I called the C. I. A
1          the house is E. M
2     this is an E. U. call!
3  how is the P. O. R going?
which you can then process further using the code you already had
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)
output
                   col1
0      I called the CIA
1       the house is EM
2    this is an EU call
3  how is the POR going
You might elect to improve the pattern I used (r'\b\w\.?\b') if you encounter cases which it does not cover. I used word boundaries (\b) and a literal dot (\.), so as is it finds any single word character (\w) optionally (?) followed by a dot. Note that this also uppercases genuine one-letter words such as 'a'.