I have the following dataframe as the output of my Python script. I would like to add another column with the row count per PMID, and show that count on the first row of each PMID while keeping the other rows.
The dataframe looks like this:
df
```
       PMID gene_symbol    gene_label  gene_mentions
0  33377242       MTHFR  Matched Gene              2
1  33414971       CSF3R  Matched Gene             13
2  33414971         BCR    Other Gene              2
3  33414971        ABL1  Matched Gene              1
4  33414971        ESR1  Matched Gene              1
5  33414971      NDUFB3    Other Gene              1
6  33414971        CSF3    Other Gene              1
7  33414971        TP53  Matched Gene              2
8  33414971         SRC  Matched Gene              1
9  33414971        JAK1  Matched Gene              1
```
The expected output is:
```
        PMID gene_symbol    gene_label  gene_mentions  count
0   33377242       MTHFR  Matched Gene              2      1
1   33414971       CSF3R  Matched Gene             13      9
2   33414971         BCR    Other Gene              2      9
3   33414971        ABL1  Matched Gene              1      9
4   33414971        ESR1  Matched Gene              1      9
5   33414971      NDUFB3    Other Gene              1      9
6   33414971        CSF3    Other Gene              1      9
7   33414971        TP53  Matched Gene              2      9
8   33414971         SRC  Matched Gene              1      9
9   33414971        JAK1  Matched Gene              1      9
10  33414972        MAK2  Matched Gene              1      1
```
How can I achieve this output?
Thanks
Answer
You can add the count to each row with groupby().transform:
```python
df['count'] = df.groupby('PMID')['PMID'].transform('size')
```
Output:
```
       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2      1
1  33414971       CSF3R  Matched Gene             13      9
2  33414971         BCR    Other Gene              2      9
3  33414971        ABL1  Matched Gene              1      9
4  33414971        ESR1  Matched Gene              1      9
5  33414971      NDUFB3    Other Gene              1      9
6  33414971        CSF3    Other Gene              1      9
7  33414971        TP53  Matched Gene              2      9
8  33414971         SRC  Matched Gene              1      9
9  33414971        JAK1  Matched Gene              1      9
```
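If you want to try the per-group count in isolation, here is a minimal, self-contained sketch. It assumes a shortened copy of the sample data (first five rows only), and the `alt` variable is just an illustrative name for an equivalent approach using `value_counts`; it is not part of the original answer.

```python
import pandas as pd

# Shortened copy of the sample data from the question (first five rows only).
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971, 33414971, 33414971],
    "gene_symbol": ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1"],
    "gene_label": ["Matched Gene", "Matched Gene", "Other Gene",
                   "Matched Gene", "Matched Gene"],
    "gene_mentions": [2, 13, 2, 1, 1],
})

# transform('size') returns one value per original row (the size of that
# row's group), so the result lines up with df's index and can be assigned.
df["count"] = df.groupby("PMID")["PMID"].transform("size")

# Illustrative alternative: look up each PMID in its value_counts table.
alt = df["PMID"].map(df["PMID"].value_counts())

print(df)
```

With this shortened frame, the first PMID gets a count of 1 and the second a count of 4 on every one of its rows; both approaches should give the same column.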
If you really want the count only on the first row of each PMID, you can use mask together with duplicated(), which flags every row after the first occurrence of a PMID:
```python
df['count'] = df['count'].mask(df['PMID'].duplicated())
```
Then you would have:
```
       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2    1.0
1  33414971       CSF3R  Matched Gene             13    9.0
2  33414971         BCR    Other Gene              2    NaN
3  33414971        ABL1  Matched Gene              1    NaN
4  33414971        ESR1  Matched Gene              1    NaN
5  33414971      NDUFB3    Other Gene              1    NaN
6  33414971        CSF3    Other Gene              1    NaN
7  33414971        TP53  Matched Gene              2    NaN
8  33414971         SRC  Matched Gene              1    NaN
9  33414971        JAK1  Matched Gene              1    NaN
```
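Note that the masked column is shown as float (`1.0`, `9.0`) because `NaN` cannot be stored in a plain integer column. If whole-number counts matter, one option is to cast to pandas' nullable `Int64` dtype after masking. This is a sketch under that assumption, using a small made-up subset of the sample data rather than the full frame:

```python
import pandas as pd

# Small subset of the sample data, for illustration only.
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971, 33414971],
    "gene_symbol": ["MTHFR", "CSF3R", "BCR", "ABL1"],
})

# Per-row group size, as in the answer above.
counts = df.groupby("PMID")["PMID"].transform("size")

# Blank out every repeated PMID (all rows after the first occurrence),
# then cast to the nullable integer dtype so the kept counts stay integers
# and the blanked rows become <NA>.
df["count"] = counts.mask(df["PMID"].duplicated()).astype("Int64")

print(df)
```

The blanked rows then display as `<NA>` instead of `NaN`, and the remaining counts print without a trailing `.0`.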