Skip to content
Advertisement

Chain df.str.split() in pandas dataframe

Edit: 2022NOV21

How do we chain df.col.str.split() since this returns the split columns if expand = True I am trying to split a column after performing .melt(). If I use assign I end up using the original column and the melted column actually does not even exist.

df = pd.DataFrame().from_dict({
    'id' : [1,2,3,4],
    '2022_amt' : [10.1,20.2,30.3, 40.4],
    '2022_qty' : [10,20,30,40]
})

df = ( 
    df
    .melt(
         id_vars=['id'], 
         value_vars=['2022_amt', '2022_qty'], 
         var_name='fy', 
         value_name='num' 
    )
    # can i chain any pd.Series.str.[METHOD] here
    # .assign(
    #         year=df.fy.str.split('_', expand=True)[0], 
    #         t=df.fy.str.split('_', expand=True)[1]
    # )
    )

# i can add the two columns in this way but can we use chain to expand dataframe df
df[['year', 't']] = df.fy.str.split('_', expand=True)
df = df.drop(columns = ['fy'])

Advertisement

Answer

Using expand converts it into a DataFrame, which you do not really want here; secondly with chaining, use an anonymous function to refer to the previous dataframe:

(df
.melt(id_vars='id',var_name='fy',value_name='num')
assign(year = lambda df: df.fy.str.split('_').str[0],
       t = lambda df: df.fy.str.split('_').str[1])
)

   id        fy   num  year    t
0   1  2022_amt  10.1  2022  amt
1   2  2022_amt  20.2  2022  amt
2   3  2022_amt  30.3  2022  amt
3   4  2022_amt  40.4  2022  amt
4   1  2022_qty  10.0  2022  qty
5   2  2022_qty  20.0  2022  qty
6   3  2022_qty  30.0  2022  qty
7   4  2022_qty  40.0  2022  qty

For your use case, there are simpler, more efficient ways to do this:

  • with pd.stack:
df = df.set_index('id')
df.columns = df.columns.str.split('_', expand = True)
df.columns.names = ['year', 't']
df.stack(['year', 't']).reset_index(name='num')

   id  year    t   num
0   1  2022  amt  10.1
1   1  2022  qty  10.0
2   2  2022  amt  20.2
3   2  2022  qty  20.0
4   3  2022  amt  30.3
5   3  2022  qty  30.0
6   4  2022  amt  40.4
7   4  2022  qty  40.0
  • with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.pivot_longer(index = 'id', names_to = ('year','t'), names_sep = '_')

   id  year    t  value
0   1  2022  amt   10.1
1   2  2022  amt   20.2
2   3  2022  amt   30.3
3   4  2022  amt   40.4
4   1  2022  qty   10.0
5   2  2022  qty   20.0
6   3  2022  qty   30.0
7   4  2022  qty   40.0
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement