I have a pandas DataFrame with mixed formatting in one column. The column holds the quarter and the year, and I'd like to split it into two separate columns. The problem is that the quarter and the year are separated by either a space or a second dash.
I'm hoping for a function that splits the column on a blank space or on the second dash.
    import pandas as pd

    df = pd.DataFrame({
        'Qtr': ['APR-JUN 2019', 'JAN-MAR 2019', 'JAN-MAR 2015', 'JUL-SEP-2020', 'OCT-DEC 2014', 'JUL-SEP-2015'],
    })
out:
                Qtr
    0  APR-JUN 2019  # blank
    1  JAN-MAR 2019  # blank
    2  JAN-MAR 2015  # blank
    3  JUL-SEP-2020  # second dash
    4  OCT-DEC 2014  # blank
    5  JUL-SEP-2015  # second dash
split by blank
    # n must be passed by keyword in pandas 2.0 and later
    df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', n=1, expand=True)
split by second dash
    # n=1 splits at the first dash, not the second
    df[['Qtr', 'Year']] = df['Qtr'].str.split('-', n=1, expand=True)
intended output:
           Qtr  Year
    0  APR-JUN  2019
    1  JAN-MAR  2019
    2  JAN-MAR  2015
    3  JUL-SEP  2020
    4  OCT-DEC  2014
    5  JUL-SEP  2015
Answer
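You can use a regular expression with the str.extract method of the string accessor. The first group captures the two three-letter months joined by a dash, the dot matches the single separator character (a space or a dash), and the second group captures the four-digit year.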
    df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).(\d{4})')
    print(df)
Result
           Qtr  Year
    0  APR-JUN  2019
    1  JAN-MAR  2019
    2  JAN-MAR  2015
    3  JUL-SEP  2020
    4  OCT-DEC  2014
    5  JUL-SEP  2015
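If you want the pattern to be stricter, a possible variation is sketched below. It is not part of the answer above; it assumes the separator is always exactly one space or one dash and that the year should end up as an integer. The names pattern, out, and unmatched are only illustrative. Named groups give the extracted columns their names, and rows that fail to match come back as NaN, so they can be checked before converting.

```python
import pandas as pd

df = pd.DataFrame({
    'Qtr': ['APR-JUN 2019', 'JAN-MAR 2019', 'JAN-MAR 2015',
            'JUL-SEP-2020', 'OCT-DEC 2014', 'JUL-SEP-2015'],
})

# Named groups become the column names of the extracted frame.
# [ -] accepts only a space or a dash as the separator (an assumption
# about the data), and ^...$ rejects strings with extra text.
pattern = r'^(?P<Qtr>\w{3}-\w{3})[ -](?P<Year>\d{4})$'
out = df['Qtr'].str.extract(pattern)

# Rows that do not match the pattern are returned as NaN,
# so fail loudly before converting the year to an integer.
unmatched = out['Year'].isna()
if unmatched.any():
    raise ValueError(f"unparsed rows: {df.loc[unmatched, 'Qtr'].tolist()}")

out['Year'] = out['Year'].astype(int)
print(out)
```

Assigning the result back with df[['Qtr', 'Year']] = out would overwrite the original column in the same way as the answer does.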