My data –
import pandas as pd
data = [['1', 'Electrical and Instrumentation -DI (Supervisor)'], ['2', 'CCM - DI'],
['3', 'Business Development and Marketing - DI'], ['4', 'MDs Office'],
['5', 'Administration - PI']]
df = pd.DataFrame(data, columns = ['sl_no', 'dept'])
df
sl_no dept
1 Electrical and Instrumentation -DI (Supervisor)
2 CCM - DI
3 Business Development and Marketing - DI
4 MDs Office
5 Administration - PI
So in this data, I need to split dept column, and need to create another column where my values should be only PI or DI. Is it possible to create a code where it accurately takes out only PI or DI from dept column
My output –
sl_no dept division
1 Electrical and -DI (Supervisor) DI
2 CCM - DI DI
3 Business Dev . - DI DI
4 MDs Office None
5 Administration - PI PI
I have tried –
new = df['dept'].str.split(' - ', expand = True)
df['division'] = new[1]
but this doesn’t serve the purpose, because ‘ – ‘ this dash is not same, somewhere it is occurring like ‘ -‘, so I have to rework on it, and also I get a few results like – PI (Supervisor) so I have to remove (Supervisor) part. The data is not same, it varies, I only need PI or DI. So I need a code in which, wherever it finds PI or DI , it will extract it and then create a new column where it only stores PI and DI only. Please help.
Advertisement
Answer
You might be able to just use str.extract
here:
df["division"] = df["dept"].str.extract(r'b(DI|PI)b')
For department values not having either DI
or PI
in them, the value of division
would be NaN
.