Skip to content
Advertisement

Merging two dataframes on timestamp while preserving all data

I want to merge two dataframes to create a single time-series with two variables.

I have a function that does this by iterating over each dataframe using itterows()… which is terribly slow and doesn’t take advantage of the vectorization that pandas and numpy provide…

Would you be able to help?

This code illustrates what I am trying to do:

a = pd.DataFrame(data={'timestamp':[1,2,5,6,10],'x':[2,6,3,4,2]})
b = pd.DataFrame(data={'timestamp':[2,3,4,10],'y':[3,1,2,1]})

#z = Magical line of code/function call here
#z output: {'timestamp':[1,2,3,4,5,6,10],'x':[2,6,6,6,3,4,2], 'y':[NaN,3,1,2,2,2,1] }

Advertisement

Answer

This can be broken down into 2 steps:

  • The first step is the equivalent of an outer join in SQL, where create a table containing keys of both source tables. This is done with merge(..., how="outer")
  • The second is filling the NaN with the previous non-NaN values, which can done with ffill
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement