I have a dataset like this
import pandas as pd 
df = pd.read_csv("music.csv") 
df
| name | date | singer | language | phase | |
|---|---|---|---|---|---|
| 1 | Yes or No | 02.01.20 | Benjamin Smith | en | 1 | 
| 2 | Parabens | 01.06.21 | Rafael Galvao;Simon Murphy | pt;en | 2 | 
| 3 | Love | 12.11.20 | Michaela Condell | en | 1 | 
| 4 | Paz | 11.07.19 | Ana Perez; Eduarda Pinto | es;pt | 3 | 
| 5 | Stop | 12.01.21 | Michael Conway;Gabriel Lee | en;en | 1 | 
| 6 | Shalom | 18.06.21 | Shimon Cohen | hebr | 1 | 
| 7 | Habibi | 22.12.19 | Fuad Khoury | ar | 3 | 
| 8 | viva | 01.08.21 | Veronica Barnes | en | 1 | 
| 9 | Buznanna | 23.09.20 | Kurt Azzopardi | mt | 1 | 
| 10 | Frieden | 21.05.21 | Gabriel Meier | dt | 1 | 
| 11 | Uruguay | 11.04.21 | Julio Ramirez | es | 1 | 
| 12 | Beautiful | 17.03.21 | Cameron Armstrong | en | 3 | 
| 13 | Holiday | 19.06.20 | Bianca Watson | en | 3 | 
| 14 | Kiwi | 21.10.20 | Lachlan McNamara | en | 1 | 
| 15 | Amore | 01.12.20 | Vasco Grimaldi | it | 1 | 
| 16 | La vie | 28.04.20 | Victor Dubois | fr | 3 | 
| 17 | Yom | 21.02.20 | Ori Azerad; Naeem al-Hindi | hebr;ar | 2 | 
| 18 | Elefthería | 15.06.19 | Nikolaos Gekas | gr | 1 | 
I convert it to 1NF.
import pandas as pd
import numpy as np
 
df = pd.read_csv("music.csv") 
df['language']=df['language'].str.split(';')
df['singer']=df['singer'].str.split(";")
df.explode(['language','singer'])
d= pd.DataFrame(df)
d
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this
df= df.group.by('singer')
df['phase']. value_counts(). idxmax()
But I could not get a solution
The dataframe has 42 observations, so some singers occur again
Source: convert data to 1NF
Advertisement
Answer
You do not need to split/explode, you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)
output:
phase 1 12 2 4 3 6 Name: singer, dtype: int64