I have data that looks like this
JavaScript
x
5
1
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
2
'DATE': ['1/1/2015','1/2/2015', '1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015','1/8/2015',
3
'1/9/2016','1/2/2015','1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015'],
4
'CD': ['A','A','A','A','B','B','A','A','C','A','A','A','A','A','A']})
5
What I would like to do is group by ID and CD and get the start and stop change for each change. I tried using groupby and agg function but it will group all A together even though they needs to be separated since there is B in between 2 A.
JavaScript
1
6
1
df1 = df.groupby(['ID','CD'])
2
df1 = df1.agg(
3
Start_Date = ('Date',np.min),
4
End_Date=('Date', np.min)
5
).reset_index()
6
What I get is :
I was hoping if some one could help me get the result I need. What I am looking for is :
Advertisement
Answer
make grouper
for grouping
JavaScript
1
2
1
grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()
2
grouper
:
JavaScript
1
17
17
1
0 1
2
1 1
3
2 1
4
3 1
5
4 2
6
5 2
7
6 3
8
7 3
9
8 4
10
9 5
11
10 5
12
11 5
13
12 5
14
13 5
15
14 5
16
Name: CD, dtype: int32
17
then use groupby with grouper
JavaScript
1
2
1
df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
2
output:
JavaScript
1
8
1
min max
2
ID CD
3
1 A 1/1/2015 1/4/2015
4
B 1/5/2015 1/6/2015
5
A 1/7/2015 1/8/2015
6
C 1/9/2016 1/9/2016
7
2 A 1/2/2015 1/7/2015
8
change column name and use reset_index and so on..for your desired output
JavaScript
1
5
1
(df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
2
.set_axis(['Start_Date', 'End_Date'], axis=1)
3
.reset_index()
4
.assign(CD=lambda x: x.pop('CD')))
5
result
JavaScript
1
7
1
ID Start_Date End_Date CD
2
0 1 1/1/2015 1/4/2015 A
3
1 1 1/5/2015 1/6/2015 B
4
2 1 1/7/2015 1/8/2015 A
5
3 1 1/9/2016 1/9/2016 C
6
4 2 1/2/2015 1/7/2015 A
7