I have two dataframes. The first one is just a column of daily datetime, whereas the second one has both dates and data. This is an example: What I want to do is to merge df1 and df2 to get a dataframe (dataset) where: when the data exist it takes the date position; when it doesn't exist, it just gets

How to merge two dataframes where the second one has different column names and length?

I have two dataframes. The first one is just a column of daily datetime, whereas the second one has both dates and data. This is an example:

df1 = pd.DataFrame({'Date': pd.date_range(start="2002-01-01",end="2021-04-14")})

df2 = pd.DataFrame.from_dict({'Date1': {0: '2002-01-01 00:00:00',
  1: '2002-01-02 00:00:00',
  2: '2002-01-03 00:00:00',
  3: '2002-01-04 00:00:00',
  4: '2002-01-07 00:00:00',
  5: '2002-01-08 00:00:00',
  6: '2002-01-09 00:00:00',
  7: '2002-01-10 00:00:00',
  8: '2002-01-11 00:00:00',
  9: '2002-01-14 00:00:00',
  10: '2002-01-15 00:00:00',
  11: '2002-01-16 00:00:00',
  12: '2002-01-17 00:00:00',
  13: '2002-01-18 00:00:00',
  14: '2002-01-21 00:00:00'},
 'GSEAFCI': {0: 98.85,
  1: 98.96,
  2: 98.88,
  3: 98.85,
  4: 98.83,
  5: 98.86,
  6: 98.87,
  7: 98.9,
  8: 98.87,
  9: 98.91,
  10: 98.84,
  11: 98.87,
  12: 98.84,
  13: 98.84,
  14: 98.87},
 'Date2': {0: '2004-03-03 00:00:00',
  1: '2004-03-04 00:00:00',
  2: '2004-03-05 00:00:00',
  3: '2004-03-08 00:00:00',
  4: '2004-03-09 00:00:00',
  5: '2004-03-10 00:00:00',
  6: '2004-03-11 00:00:00',
  7: '2004-03-12 00:00:00',
  8: '2004-03-15 00:00:00',
  9: '2004-03-16 00:00:00',
  10: '2004-03-17 00:00:00',
  11: '2004-03-18 00:00:00',
  12: '2004-03-19 00:00:00',
  13: '2004-03-22 00:00:00',
  14: '2004-03-23 00:00:00'},
 '2Y2YINF': {0: 3.9,
  1: 3.832,
  2: 3.668,
  3: 3.63,
  4: 3.649,
  5: 3.636,
  6: 3.63,
  7: 3.633,
  8: 3.628,
  9: 3.665,
  10: 3.621,
  11: 3.62,
  12: 3.616,
  13: 3.588,
  14: 3.5949999999999998},
 'Date3': {0: '2002-01-02 00:00:00',
  1: '2002-01-03 00:00:00',
  2: '2002-01-04 00:00:00',
  3: '2002-01-07 00:00:00',
  4: '2002-01-08 00:00:00',
  5: '2002-01-09 00:00:00',
  6: '2002-01-10 00:00:00',
  7: '2002-01-11 00:00:00',
  8: '2002-01-14 00:00:00',
  9: '2002-01-15 00:00:00',
  10: '2002-01-16 00:00:00',
  11: '2002-01-17 00:00:00',
  12: '2002-01-18 00:00:00',
  13: '2002-01-21 00:00:00',
  14: '2002-01-22 00:00:00'},
 '5Y5YINF': {0: 5.9274000000000004,
  1: 5.8767,
  2: 5.8977,
  3: 5.763,
  4: 5.8056,
  5: 5.825,
  6: 5.7287,
  7: 5.6922,
  8: 5.6222,
  9: 5.6145,
  10: 5.6655999999999995,
  11: 5.76,
  12: 5.7455,
  13: 5.718,
  14: 5.7589}})

JavaScript
​x
 
df1 = pd.DataFrame({'Date': pd.date_range(start="2002-01-01",end="2021-04-14")})
​
df2 = pd.DataFrame.from_dict({'Date1': {0: '2002-01-01 00:00:00',
'2002-01-02 00:00:00',
'2002-01-03 00:00:00',
'2002-01-04 00:00:00',
'2002-01-07 00:00:00',
'2002-01-08 00:00:00',
'2002-01-09 00:00:00',
'2002-01-10 00:00:00',
'2002-01-11 00:00:00',
'2002-01-14 00:00:00',
'2002-01-15 00:00:00',
'2002-01-16 00:00:00',
'2002-01-17 00:00:00',
'2002-01-18 00:00:00',
'2002-01-21 00:00:00'},
 'GSEAFCI': {0: 98.85,
98.96,
98.88,
98.85,
98.83,
98.86,
98.87,
98.9,
98.87,
98.91,
98.84,
98.87,
98.84,
98.84,
98.87},
 'Date2': {0: '2004-03-03 00:00:00',
'2004-03-04 00:00:00',
'2004-03-05 00:00:00',
'2004-03-08 00:00:00',
'2004-03-09 00:00:00',
'2004-03-10 00:00:00',
'2004-03-11 00:00:00',
'2004-03-12 00:00:00',
'2004-03-15 00:00:00',
'2004-03-16 00:00:00',
'2004-03-17 00:00:00',
'2004-03-18 00:00:00',
'2004-03-19 00:00:00',
'2004-03-22 00:00:00',
'2004-03-23 00:00:00'},
 '2Y2YINF': {0: 3.9,
3.832,
3.668,
3.63,
3.649,
3.636,
3.63,
3.633,
3.628,
3.665,
3.621,
3.62,
3.616,
3.588,
3.5949999999999998},
 'Date3': {0: '2002-01-02 00:00:00',
'2002-01-03 00:00:00',
'2002-01-04 00:00:00',
'2002-01-07 00:00:00',
'2002-01-08 00:00:00',
'2002-01-09 00:00:00',
'2002-01-10 00:00:00',
'2002-01-11 00:00:00',
'2002-01-14 00:00:00',
'2002-01-15 00:00:00',
'2002-01-16 00:00:00',
'2002-01-17 00:00:00',
'2002-01-18 00:00:00',
'2002-01-21 00:00:00',
'2002-01-22 00:00:00'},
 '5Y5YINF': {0: 5.9274000000000004,
5.8767,
5.8977,
5.763,
5.8056,
5.825,
5.7287,
5.6922,
5.6222,
5.6145,
5.6655999999999995,
5.76,
5.7455,
5.718,
5.7589}})
​
​

What I want to do is to merge df1 and df2 to get a dataframe (dataset) where: when the data exist it takes the date position; when it doesn’t exist, it just gets NaN. This is what I did:

dataset = df1.merge(df2, how = 'left', left_on= ['Date', 'Date','Date'],
                  right_on=['Date1','Date2','Date3'])

JavaScript
 
dataset = df1.merge(df2, how = 'left', left_on= ['Date', 'Date','Date'],
                  right_on=['Date1','Date2','Date3'])
​
​

Unfortunately, I get only NaNs or NaTs everywhere.

Instead, I would like to get:

          Date   GSEAFCI          2Y2YINF         5Y5YINF
0   2002-01-01 00:00:00    98.85   NaN               5.9274
1   2002-01-02 00:00:00    98.96   NaN               5.8767
2   2002-01-03 00:00:00    98.88   NaN               5.8977
3   2002-01-04 00:00:00    98.85   NaN               5.7630
4    2002-01-05 00:00:00    NaN    NaN                 NaN
5    2002-01-06 00:00:00    NaN    NaN                 NaN
6   2002-01-07 00:00:00    98.83   NaN               5.8056
7   2002-01-08 00:00:00    98.86   NaN               5.8250
8   2002-01-09 00:00:00    98.87   NaN               5.7287
...
...

JavaScript
 
          Date   GSEAFCI          2Y2YINF         5Y5YINF
0   2002-01-01 00:00:00    98.85   NaN               5.9274
1   2002-01-02 00:00:00    98.96   NaN               5.8767
2   2002-01-03 00:00:00    98.88   NaN               5.8977
3   2002-01-04 00:00:00    98.85   NaN               5.7630
4    2002-01-05 00:00:00    NaN    NaN                 NaN
5    2002-01-06 00:00:00    NaN    NaN                 NaN
6   2002-01-07 00:00:00    98.83   NaN               5.8056
7   2002-01-08 00:00:00    98.86   NaN               5.8250
8   2002-01-09 00:00:00    98.87   NaN               5.7287
...
...
​
​
​

Can anyone help me with this?

Thanks!

Answer

If there si same pattern – datetimes with data columns is possible use list comprehension with concat, then convert values to datetimes in index and DataFrame.merge by first DataFrame (by index name Date and column name Date):

a = df2.columns[::2]
b = df2.columns[1::2]
dfs = [df2.loc[:, x].set_index(x[0], drop=False)[x[1]] for x in zip(a, b)]
df = pd.concat(dfs, axis=1, sort=False).rename_axis('Date')
df.index = pd.to_datetime(df.index)

df = df1.merge(df, on='Date', how='left')
print (df.head(10))
        Date  GSEAFCI  2Y2YINF  5Y5YINF
0 2002-01-01    98.85      NaN      NaN
1 2002-01-02    98.96      NaN   5.9274
2 2002-01-03    98.88      NaN   5.8767
3 2002-01-04    98.85      NaN   5.8977
4 2002-01-05      NaN      NaN      NaN
5 2002-01-06      NaN      NaN      NaN
6 2002-01-07    98.83      NaN   5.7630
7 2002-01-08    98.86      NaN   5.8056
8 2002-01-09    98.87      NaN   5.8250
9 2002-01-10    98.90      NaN   5.7287

JavaScript
 
a = df2.columns[::2]
b = df2.columns[1::2]
dfs = [df2.loc[:, x].set_index(x[0], drop=False)[x[1]] for x in zip(a, b)]
df = pd.concat(dfs, axis=1, sort=False).rename_axis('Date')
df.index = pd.to_datetime(df.index)
​
df = df1.merge(df, on='Date', how='left')
print (df.head(10))
        Date  GSEAFCI  2Y2YINF  5Y5YINF
0 2002-01-01    98.85      NaN      NaN
1 2002-01-02    98.96      NaN   5.9274
2 2002-01-03    98.88      NaN   5.8767
3 2002-01-04    98.85      NaN   5.8977
4 2002-01-05      NaN      NaN      NaN
5 2002-01-06      NaN      NaN      NaN
6 2002-01-07    98.83      NaN   5.7630
7 2002-01-08    98.86      NaN   5.8056
8 2002-01-09    98.87      NaN   5.8250
9 2002-01-10    98.90      NaN   5.7287
​

Advertisement

Answer