I have a data frame like as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21,21,31,31,31,31,31],
'time' :[-1,5,17,11,25,39,46,4,100,150,1],
'value':[101,102,121,120,143,153,160,170,96,97,99]})
What I would like to do is
a) FIll in the missing time
by generating a sequence number (ex:1,2,3,4) and copy the value (for all other columns) from the previous row
I was trying something like below
df.groupby(['person_id']).cumcount() + 1
df['sequence'] = g.cumcount() + 1
But this doesn’t help me get the expected output
I expect my output to be like as shown below (SAMPLE OF 1 SUBJECT IS SHOWN BELOW)
Advertisement
Answer
Let’s set the time
column as the index of dataframe then groupby
the dataframe on person_id
then for each group classified by person_id
reindex
the group to conform its index
with the range of values specified in time
column, finally concat
all the groups to get the desired dataframe:
grp = df.set_index('time').groupby('person_id')
groups = [g.reindex(range(g.index.min(), g.index.max() + 1)).ffill().reset_index() for _, g in grp]
out = pd.concat(groups, ignore_index=True).reindex(df.columns, axis=1)
Alternatively you can first create tuple pairs for each person_id
and corresponding range of values specified in time
column then reindex
the dataframe:
grp = df.groupby('person_id')['time']
idx = [(k, n) for k, t in grp for n in range(t.min(), t.max() + 1)]
out = df.set_index(['person_id', 'time']).reindex(idx).ffill().reset_index()
Result (for person_id 11
):
person_id time value
0 11.0 -1 101.0
1 11.0 0 101.0
2 11.0 1 101.0
3 11.0 2 101.0
4 11.0 3 101.0
5 11.0 4 101.0
6 11.0 5 102.0
7 11.0 6 102.0
8 11.0 7 102.0
9 11.0 8 102.0
10 11.0 9 102.0
11 11.0 10 102.0
12 11.0 11 102.0
13 11.0 12 102.0
14 11.0 13 102.0
15 11.0 14 102.0
16 11.0 15 102.0
17 11.0 16 102.0
18 11.0 17 121.0