Skip to content
Advertisement

Add missing timestamps for each different ID in dataframe

I have two dataframes (simple examples shown below):

df1                                df2                          
time column                        time column           ID column   Value
2022-01-01 00:00:00                2022-01-01 00:00:00   1           10
2022-01-01 00:15:00                2022-01-01 00:30:00   1           9
2022-01-01 00:30:00                2022-01-02 00:30:00   1           5
2022-01-01 00:45:00                2022-01-02 00:45:00   1           15      
2022-01-02 00:00:00                2022-01-01 00:00:00   2           6
2022-01-02 00:15:00                2022-01-01 00:15:00   2           2
2022-01-02 00:30:00                2022-01-02 00:45:00   2           7
2022-01-02 00:45:00

df1 shows every timestamp I am interested in. df2 shows data sorted by timestamp and ID. What I need to do is add every single timestamp from df1 that is not in df2 for each unique ID and add zero to the value column.

This is the outcome I’m interested in

df3
time column           ID column   Value
2022-01-01 00:00:00   1           10
2022-01-01 00:15:00   1           0
2022-01-01 00:30:00   1           9
2022-01-01 00:45:00   1           0
2022-01-02 00:00:00   1           0
2022-01-02 00:15:00   1           0
2022-01-02 00:30:00   1           5
2022-01-02 00:45:00   1           15
2022-01-01 00:00:00   2           6
2022-01-01 00:15:00   2           2
2022-01-01 00:30:00   2           0
2022-01-01 00:45:00   2           0
2022-01-02 00:00:00   2           0
2022-01-02 00:15:00   2           0
2022-01-02 00:30:00   2           0
2022-01-02 00:45:00   2           7

My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs) so manually doing this isn’t feasible. I’ve search for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.

I am open to other libraries and can work in python or R. Any help is greatly appreciated.

Advertisement

Answer

Try:

x = (
    df2.groupby("ID column")
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)

Prints:

    ID column         time column  Value
0           1 2022-01-01 00:00:00   10.0
4           1 2022-01-01 00:15:00    0.0
1           1 2022-01-01 00:30:00    9.0
5           1 2022-01-01 00:45:00    0.0
6           1 2022-01-02 00:00:00    0.0
7           1 2022-01-02 00:15:00    0.0
2           1 2022-01-02 00:30:00    5.0
3           1 2022-01-02 00:45:00   15.0
8           2 2022-01-01 00:00:00    6.0
9           2 2022-01-01 00:15:00    2.0
11          2 2022-01-01 00:30:00    0.0
12          2 2022-01-01 00:45:00    0.0
13          2 2022-01-02 00:00:00    0.0
14          2 2022-01-02 00:15:00    0.0
15          2 2022-01-02 00:30:00    0.0
10          2 2022-01-02 00:45:00    7.0
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement