I have following dataframe as an output of my python script. I would like to add another column with count per pmid and add the counter to the first row, keeping the other rows.
The dataframe looks like this:
df
PMID gene_symbol gene_label gene_mentions 0 33377242 MTHFR Matched Gene 2 1 33414971 CSF3R Matched Gene 13 2 33414971 BCR Other Gene 2 3 33414971 ABL1 Matched Gene 1 4 33414971 ESR1 Matched Gene 1 5 33414971 NDUFB3 Other Gene 1 6 33414971 CSF3 Other Gene 1 7 33414971 TP53 Matched Gene 2 8 33414971 SRC Matched Gene 1 9 33414971 JAK1 Matched Gene 1
Expected out is:
PMID gene_symbol gene_label gene_mentions count 0 33377242 MTHFR Matched Gene 2 1 1 33414971 CSF3R Matched Gene 13 9 2 33414971 BCR Other Gene 2 9 3 33414971 ABL1 Matched Gene 1 9 4 33414971 ESR1 Matched Gene 1 9 5 33414971 NDUFB3 Other Gene 1 9 6 33414971 CSF3 Other Gene 1 9 7 33414971 TP53 Matched Gene 2 9 8 33414971 SRC Matched Gene 1 9 9 33414971 JAK1 Matched Gene 1 9 10 33414972 MAK2 Matched Gene 1 1
How can I achieve this output?
Thanks
Advertisement
Answer
You can add count for each row with groupby().transform
:
df['count'] = df.groupby('PMID')['PMID'].transform('size')
Output:
PMID gene_symbol gene_label gene_mentions count 0 33377242 MTHFR Matched Gene 2 1 1 33414971 CSF3R Matched Gene 13 9 2 33414971 BCR Other Gene 2 9 3 33414971 ABL1 Matched Gene 1 9 4 33414971 ESR1 Matched Gene 1 9 5 33414971 NDUFB3 Other Gene 1 9 6 33414971 CSF3 Other Gene 1 9 7 33414971 TP53 Matched Gene 2 9 8 33414971 SRC Matched Gene 1 9 9 33414971 JAK1 Matched Gene 1 9
Now if you really want only count at the first row for each PMID
, you can use mask
:
df['count'] = df['count'].mask(df['PMID'].duplicated())
Then you would have:
PMID gene_symbol gene_label gene_mentions count 0 33377242 MTHFR Matched Gene 2 1.0 1 33414971 CSF3R Matched Gene 13 9.0 2 33414971 BCR Other Gene 2 NaN 3 33414971 ABL1 Matched Gene 1 NaN 4 33414971 ESR1 Matched Gene 1 NaN 5 33414971 NDUFB3 Other Gene 1 NaN 6 33414971 CSF3 Other Gene 1 NaN 7 33414971 TP53 Matched Gene 2 NaN 8 33414971 SRC Matched Gene 1 NaN 9 33414971 JAK1 Matched Gene 1 NaN