df=(pd.DataFrame({'user_id':[1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4],'survey_id':[1,1,1,1,2,2,3,4,4,4,5,5,6,6,7,8,8,9,9,9,9,10,10,11,11,12,12],
              'answer':["no","yes","no","no","yes","no","no","yes","no","yes","no","no","yes","no","no","no","yes","yes","yes","no","no","no","yes","no","yes","no","yes"]}))
df
    user_id     survey_id   answer
0   1   1   no
1   1   1   yes
2   1   1   no
3   1   1   no
4   1   2   yes
5   1   2   no
6   1   3   no
7   2   4   yes
8   2   4   no
9   2   4   yes
10  2   5   no
11  2   5   no
12  2   6   yes
13  2   6   no
14  3   7   no
15  3   8   no
16  3   8   yes
17  3   9   yes
18  3   9   yes
19  3   9   no
20  3   9   no
21  4   10  no
22  4   10  yes
23  4   11  no
24  4   11  yes
25  4   12  no
26  4   12  yes
I want to group by user_id, then get the first element of survey_id, and get all elements related to this selection
df_head=
    user_id     survey_id   answer
0   1   1   no
1   1   1   yes
2   1   1   no
3   1   1   no
4   2   4   yes
5   2   4   no
6   2   4   yes
7   3   7   no
8   4   10  no
9   4   10  yes
In the same way I want to group by user_id, then get the last element of survey_id, and get all elements related to this selection
df_tail=
    user_id     survey_id   answer
0   1   3   no
1   2   6   yes
2   2   6   no
3   3   9   yes
4   3   9   yes
5   3   9   no
6   3   9   no
7   4   12  no
8   4   12  yes
Is there a quick groupby command to get this? I can do this by merging dataframes but I think there is some better way to do this in less command lines. Thank you in advance
Advertisement
Answer
Solution with no merging:
df_head = df[df.survey_id.eq(df.groupby('user_id').transform('min').survey_id)]
result:
user_id survey_id answer 0 1 1 no 1 1 1 yes 2 1 1 no 3 1 1 no 7 2 4 yes 8 2 4 no 9 2 4 yes 14 3 7 no 21 4 10 no 22 4 10 yes
df_tail = df[df.survey_id.eq(df.groupby('user_id').transform('max').survey_id)]
result:
user_id survey_id answer 6 1 3 no 12 2 6 yes 13 2 6 no 17 3 9 yes 18 3 9 yes 19 3 9 no 20 3 9 no 25 4 12 no 26 4 12 yes
Idea is to calculate min / max of survey_id per user_id and compare it to survey_id at row level of df. Please note that original index of dataframe is preserved. If You need new index just add at the end:
df_head = df_head.reset_index(drop = True) df_tail = df_tail.reset_index(drop = True)