Skip to content
Advertisement

Compare two DataFrames and find missing timestamps

I have the following two dataframes:

df1=

   date                col1
0  2023-01-01 16:00:00 100
1  2023-01-01 16:15:00 120
2  2023-01-01 16:30:00 140
3  2023-01-01 16:45:00 160
4  2023-01-01 17:00:00 200
5  2023-01-01 17:15:00 430
6  2023-01-01 17:30:00 890

df2 =

   date                col2 col3 
0  2023-01-01 16:00:00 100  200
1  2023-01-01 16:15:00 120  400
2  2023-01-01 17:00:00 200  500

and in df2 I have some missing timestamps compared to df1. I am able to find those timestamps using the following code:

df1[~df1['date'].isin(df2['date'])]

I want to populate those missing timestamps in df2 and fill in the values of the columns with the average value of the two previous rows.

So the new df2 should look like this:

df2 =

   date                col2    col3 
0  2023-01-01 16:00:00 100     200
1  2023-01-01 16:15:00 120     400
2  2023-01-01 16:30:00 110     300
3  2023-01-01 16:45:00 115     350
4  2023-01-01 17:00:00 200     500
5  2023-01-01 17:15:00 257.5   425
6  2023-01-01 17:30:00 228.75  462.5

Advertisement

Answer

Not ideal solution via iteration:

df1 = [
    ['2023-01-01 16:00:00', 100],
    ['2023-01-01 16:15:00', 120],
    ['2023-01-01 16:30:00', 140],
    ['2023-01-01 16:45:00', 160],
    ['2023-01-01 17:00:00', 200],
    ['2023-01-01 17:15:00', 430],
    ['2023-01-01 17:30:00', 890],
]

df2 = [
    ['2023-01-01 16:00:00', 100,  200],
    ['2023-01-01 16:15:00', 120,  400],
    ['2023-01-01 17:00:00', 200,  500],
]

df1= pd.DataFrame(df1, columns = ['date', 'col1'])

df2= pd.DataFrame(df2, columns = ['date', 'col2', 'col3'])

missing = df1[~df1['date'].isin(df2['date'])]
missing = missing.drop(['col1'], axis=1)

merged = pd.concat([df2, missing])
merged.sort_values('date', inplace=True, ignore_index=True)

for index, row in merged.iterrows():
    if np.isnan(row['col2']):
        merged['col2'].at[index] = merged['col2'].iloc[[index-1, index-2]].mean()
    if np.isnan(row['col3']):
        merged['col3'].at[index] = merged['col3'].iloc[[index-1, index-2]].mean()

print(merged)

Output:

date col2 col3
2023-01-01 16:00:00 100.00 200.0
2023-01-01 16:15:00 120.00 400.0
2023-01-01 16:30:00 110.00 300.0
2023-01-01 16:45:00 115.00 350.0
2023-01-01 17:00:00 200.00 500.0
2023-01-01 17:15:00 157.50 425.0
2023-01-01 17:30:00 178.75 462.5
7 People found this is helpful
Advertisement