Skip to content
Advertisement

get the number of involved singer in a phase

I have a dataset like this

import pandas as pd 
df = pd.read_csv("music.csv") 
df
name date singer language phase
1 Yes or No 02.01.20 Benjamin Smith en 1
2 Parabens 01.06.21 Rafael Galvao;Simon Murphy pt;en 2
3 Love 12.11.20 Michaela Condell en 1
4 Paz 11.07.19 Ana Perez; Eduarda Pinto es;pt 3
5 Stop 12.01.21 Michael Conway;Gabriel Lee en;en 1
6 Shalom 18.06.21 Shimon Cohen hebr 1
7 Habibi 22.12.19 Fuad Khoury ar 3
8 viva 01.08.21 Veronica Barnes en 1
9 Buznanna 23.09.20 Kurt Azzopardi mt 1
10 Frieden 21.05.21 Gabriel Meier dt 1
11 Uruguay 11.04.21 Julio Ramirez es 1
12 Beautiful 17.03.21 Cameron Armstrong en 3
13 Holiday 19.06.20 Bianca Watson en 3
14 Kiwi 21.10.20 Lachlan McNamara en 1
15 Amore 01.12.20 Vasco Grimaldi it 1
16 La vie 28.04.20 Victor Dubois fr 3
17 Yom 21.02.20 Ori Azerad; Naeem al-Hindi hebr;ar 2
18 EleftherĂ­a 15.06.19 Nikolaos Gekas gr 1

I convert it to 1NF.

import pandas as pd
import numpy as np
 
df = pd.read_csv("music.csv") 
df['language']=df['language'].str.split(';')
df['singer']=df['singer'].str.split(";")
df.explode(['language','singer'])
d= pd.DataFrame(df)
d

And I create a dataframe. Now I would like to find out which phase has the most singers involved.

I used this

df= df.group.by('singer')
df['phase']. value_counts(). idxmax()

But I could not get a solution

The dataframe has 42 observations, so some singers occur again

Source: convert data to 1NF

Advertisement

Answer

You do not need to split/explode, you can directly count the number of ; per row and add 1:

df['singer'].str.count(';').add(1).groupby(df['phase']).sum()

If you want the classical split/explode:

(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)

output:

phase
1    12
2     4
3     6
Name: singer, dtype: int64
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement