I want to merge two dataframes to create a single time-series with two variables.
I have a function that does this by iterating over each dataframe using itterows()… which is terribly slow and doesn’t take advantage of the vectorization that pandas and numpy provide…
Would you be able to help?
This code illustrates what I am trying to do:
a = pd.DataFrame(data={'timestamp':[1,2,5,6,10],'x':[2,6,3,4,2]}) b = pd.DataFrame(data={'timestamp':[2,3,4,10],'y':[3,1,2,1]}) #z = Magical line of code/function call here #z output: {'timestamp':[1,2,3,4,5,6,10],'x':[2,6,6,6,3,4,2], 'y':[NaN,3,1,2,2,2,1] }
Advertisement
Answer
This can be broken down into 2 steps:
- The first step is the equivalent of an outer join in SQL, where create a table containing keys of both source tables. This is done with
merge(..., how="outer")
- The second is filling the NaN with the previous non-NaN values, which can done with
ffill
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()