I have a pandas df with mixed formatting for a specific column. It contains the qtr and year. I’m hoping to split this column into separate columns. But the formatting contains a space or a second dash between qtr and year.
I’m hoping to include a function that splits the column by a blank space or a second dash.
df = pd.DataFrame({ 'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'], })
out:
Qtr 0 APR-JUN 2019 # blank 1 JAN-MAR 2019 # blank 2 JAN-MAR 2015 # blank 3 JUL-SEP-2020 # second dash 4 OCT-DEC 2014 # blank 5 JUL-SEP-2015 # second dash
split by blank
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', 1, expand=True)
split by second dash
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', 1, expand=True)
intended output:
Qtr Year 0 APR-JUN 2019 1 JAN-MAR 2019 2 JAN-MAR 2015 3 JUL-SEP 2020 4 OCT-DEC 2014 5 JUL-SEP 2015
Advertisement
Answer
You can use a regular expression with the extract
function of the string accessor.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(w{3}-w{3}).(d{4})') print(df)
Result
Qtr Year 0 APR-JUN 2019 1 JAN-MAR 2019 2 JAN-MAR 2015 3 JUL-SEP 2020 4 OCT-DEC 2014 5 JUL-SEP 2015