I have two dataframes (simple examples shown below):
df1 df2 time column time column ID column Value 2022-01-01 00:00:00 2022-01-01 00:00:00 1 10 2022-01-01 00:15:00 2022-01-01 00:30:00 1 9 2022-01-01 00:30:00 2022-01-02 00:30:00 1 5 2022-01-01 00:45:00 2022-01-02 00:45:00 1 15 2022-01-02 00:00:00 2022-01-01 00:00:00 2 6 2022-01-02 00:15:00 2022-01-01 00:15:00 2 2 2022-01-02 00:30:00 2022-01-02 00:45:00 2 7 2022-01-02 00:45:00
df1 shows every timestamp I am interested in. df2 shows data sorted by timestamp and ID. What I need to do is add every single timestamp from df1 that is not in df2 for each unique ID and add zero to the value column.
This is the outcome I’m interested in
df3 time column ID column Value 2022-01-01 00:00:00 1 10 2022-01-01 00:15:00 1 0 2022-01-01 00:30:00 1 9 2022-01-01 00:45:00 1 0 2022-01-02 00:00:00 1 0 2022-01-02 00:15:00 1 0 2022-01-02 00:30:00 1 5 2022-01-02 00:45:00 1 15 2022-01-01 00:00:00 2 6 2022-01-01 00:15:00 2 2 2022-01-01 00:30:00 2 0 2022-01-01 00:45:00 2 0 2022-01-02 00:00:00 2 0 2022-01-02 00:15:00 2 0 2022-01-02 00:30:00 2 0 2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs) so manually doing this isn’t feasible. I’ve search for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R. Any help is greatly appreciated.
Advertisement
Answer
Try:
x = ( df2.groupby("ID column") .apply(lambda x: x.merge(df1, how="outer").fillna(0)) .drop(columns="ID column") .droplevel(1) .reset_index() .sort_values(by=["ID column", "time column"]) ) print(x)
Prints:
ID column time column Value 0 1 2022-01-01 00:00:00 10.0 4 1 2022-01-01 00:15:00 0.0 1 1 2022-01-01 00:30:00 9.0 5 1 2022-01-01 00:45:00 0.0 6 1 2022-01-02 00:00:00 0.0 7 1 2022-01-02 00:15:00 0.0 2 1 2022-01-02 00:30:00 5.0 3 1 2022-01-02 00:45:00 15.0 8 2 2022-01-01 00:00:00 6.0 9 2 2022-01-01 00:15:00 2.0 11 2 2022-01-01 00:30:00 0.0 12 2 2022-01-01 00:45:00 0.0 13 2 2022-01-02 00:00:00 0.0 14 2 2022-01-02 00:15:00 0.0 15 2 2022-01-02 00:30:00 0.0 10 2 2022-01-02 00:45:00 7.0