I have a dataset like this
JavaScript
x
4
1
import pandas as pd
2
df = pd.read_csv("music.csv")
3
df
4
name | date | singer | language | phase | |
---|---|---|---|---|---|
1 | Yes or No | 02.01.20 | Benjamin Smith | en | 1 |
2 | Parabens | 01.06.21 | Rafael Galvao;Simon Murphy | pt;en | 2 |
3 | Love | 12.11.20 | Michaela Condell | en | 1 |
4 | Paz | 11.07.19 | Ana Perez; Eduarda Pinto | es;pt | 3 |
5 | Stop | 12.01.21 | Michael Conway;Gabriel Lee | en;en | 1 |
6 | Shalom | 18.06.21 | Shimon Cohen | hebr | 1 |
7 | Habibi | 22.12.19 | Fuad Khoury | ar | 3 |
8 | viva | 01.08.21 | Veronica Barnes | en | 1 |
9 | Buznanna | 23.09.20 | Kurt Azzopardi | mt | 1 |
10 | Frieden | 21.05.21 | Gabriel Meier | dt | 1 |
11 | Uruguay | 11.04.21 | Julio Ramirez | es | 1 |
12 | Beautiful | 17.03.21 | Cameron Armstrong | en | 3 |
13 | Holiday | 19.06.20 | Bianca Watson | en | 3 |
14 | Kiwi | 21.10.20 | Lachlan McNamara | en | 1 |
15 | Amore | 01.12.20 | Vasco Grimaldi | it | 1 |
16 | La vie | 28.04.20 | Victor Dubois | fr | 3 |
17 | Yom | 21.02.20 | Ori Azerad; Naeem al-Hindi | hebr;ar | 2 |
18 | Elefthería | 15.06.19 | Nikolaos Gekas | gr | 1 |
I convert it to 1NF.
JavaScript
1
10
10
1
import pandas as pd
2
import numpy as np
3
4
df = pd.read_csv("music.csv")
5
df['language']=df['language'].str.split(';')
6
df['singer']=df['singer'].str.split(";")
7
df.explode(['language','singer'])
8
d= pd.DataFrame(df)
9
d
10
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this
JavaScript
1
3
1
df= df.group.by('singer')
2
df['phase']. value_counts(). idxmax()
3
But I could not get a solution
The dataframe has 42 observations, so some singers occur again
Source: convert data to 1NF
Advertisement
Answer
You do not need to split/explode, you can directly count the number of ;
per row and add 1:
JavaScript
1
2
1
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
2
If you want the classical split/explode:
JavaScript
1
5
1
(df.assign(singer=df['singer'].str.split(';'))
2
.explode('singer')
3
.groupby('phase')['singer'].count()
4
)
5
output:
JavaScript
1
6
1
phase
2
1 12
3
2 4
4
3 6
5
Name: singer, dtype: int64
6