Remove space between abbreviated letters in a string column

I have a pandas DataFrame as follows:

import pandas as pd
import numpy as np

d = {'col1': ['I called the c. i. a', 'the house is e. m',
 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)

I have removed the punctuation and the spaces between abbreviated letters:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)

The output is, for example, ‘I called the cia’, but what I would like is ‘I called the CIA’. In other words, I would like the abbreviations to be uppercased. I tried the following, but neither attempt changed the result:

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(), '', regex=True)

or

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', ''.upper(), regex=True)

Answer

pandas.Series.str.replace accepts a callable as its second argument, with the same contract as the repl argument of re.sub: the callable receives an re.Match object and returns the replacement string. Using that, you can first uppercase the abbreviations as follows:

import pandas as pd
def make_upper(m):  # where m is re.Match object
    return m.group(0).upper()
d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)  # a callable repl requires regex=True
print(df)

output

                        col1
0       I called the C. I. A
1          the house is E. M
2     this is an E. U. call!
3  how is the P. O. R going?

which you can then process further using the code you already had:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)

output

                   col1
0      I called the CIA
1       the house is EM
2    this is an EU call
3  how is the POR going
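
As a side note on why the two attempts in the question had no effect: in both, .upper() runs on a plain Python string before any matching happens. The sketch below uses plain re.sub (which follows the same replacement contract) to illustrate this; the variable names are only for illustration.

```python
import re

space_pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'
text = 'I called the c i a'  # punctuation already removed

# Attempt 2: ''.upper() is evaluated first and is still '', so this is
# identical to replacing with the empty string.
attempt_empty_upper = re.sub(space_pattern, ''.upper(), text)
plain_removal = re.sub(space_pattern, '', text)

# Attempt 1: uppercasing the pattern string changes the regex itself,
# turning \b, \w, \s into their negated classes \B, \W, \S.
upper_pattern = space_pattern.upper()

# A callable is invoked once per match, so it can uppercase the matched text.
uppercased = re.sub(r'\b\w\.?\b', lambda m: m.group(0).upper(), 'I called the c. i. a')

print(attempt_empty_upper == plain_removal)
print(upper_pattern)
print(uppercased)
```

Only the callable form sees the matched text, which is why it is the one that can change its case.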

You may want to refine the pattern I used (r'\b\w\.?\b') if you encounter cases it does not cover. It uses word boundaries (\b) and a literal dot (\.), so as written it finds any single word character (\w) that stands alone as a word, optionally (?) followed by a dot. Note that this also uppercases ordinary one-letter words such as "a".
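
One possible refinement, offered only as a sketch and not as part of the original answer: match a whole run of dotted single letters at once, then uppercase and collapse it inside the callable. The pattern and the helper name collapse_abbreviation below are assumptions made for illustration. The pattern requires at least two letters and a dot after the first one, so a lone "a" or "I" is left untouched.

```python
import pandas as pd

# Hypothetical pattern: a single letter followed by a dot, then one or more
# further single letters (each optionally followed by a dot).
ABBREVIATION = r'\b\w\.(?:\s*\w\b\.?)+'

def collapse_abbreviation(m):
    # Keep only the letters/digits of the match and uppercase them.
    return ''.join(ch for ch in m.group(0) if ch.isalnum()).upper()

s = pd.Series([
    'I called the c. i. a',
    'the house is e. m',
    'this is an e. u. call!',
    'how is the p. o. r going?',
    'this is a test',  # the one-letter word "a" should stay lowercase
])

out = s.str.replace(ABBREVIATION, collapse_abbreviation, regex=True)
print(out.tolist())
```

Other punctuation is left in place here, so the existing `[^\w\s]` replacement can still be applied afterwards. This sketch can still misfire, for example when a sentence ends in a one-letter word and the next sentence starts with another one, so it is worth checking against real data.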

User contributions licensed under: CC BY-SA