I have the following data frame:
df = { 'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C'], 'name_ID' : [1, 1, 2, 2, 2, 3, 3], 'score' : [400, 500, 3000, 1000, 4000, 600, 750], 'score_number' : [1, 2, 1, 2, 3, 1, 2] } df = pd.DataFrame(df)
Note that the df is grouped by name
/ name_ID
. names
can have n scores, e.g. A
has 2 scores, whereas B
has 3 scores. I want an additional column, that indicates the first score per name
/ name_ID
. The reference_score
for the first scores of a name should be NaN. Like this:
I have tried:
df_v2['first_fund'] =df_v2['fund_size'].groupby(df_v2['firm_ID']).first()
,
also with .nth
but it didn’t work.
Thanks in advance.
Advertisement
Answer
Let’s use groupby.transform
to get first row value then mask
the first row as NaN
with condition ~df.duplicated('name', keep='first')
.
# sort the dataframe first if score number is not ascending # df = df.sort_values(['name_ID', 'score_number']) df['reference_score'] = (df.groupby('name')['score'] .transform('first') .mask(~df.duplicated('name', keep='first')))
print(df) name name_ID score score_number reference_score 0 A 1 400 1 NaN 1 A 1 500 2 400.0 2 B 2 3000 1 NaN 3 B 2 1000 2 3000.0 4 B 2 4000 3 3000.0 5 C 3 600 1 NaN 6 C 3 750 2 600.0
Or we can compare score_number
with 1
to define the first row in each group.
df['reference_score'] = (df.groupby('name')['score'] .transform('first') .mask(df['score_number'].eq(1))