I have two dataframes. The first one is just a single column of daily datetimes, whereas the second one has both dates and data. This is an example:
df1 = pd.DataFrame({'Date': pd.date_range(start="2002-01-01",end="2021-04-14")})

df2 = pd.DataFrame.from_dict({'Date1': {0: '2002-01-01 00:00:00',
  1: '2002-01-02 00:00:00',
  2: '2002-01-03 00:00:00',
  3: '2002-01-04 00:00:00',
  4: '2002-01-07 00:00:00',
  5: '2002-01-08 00:00:00',
  6: '2002-01-09 00:00:00',
  7: '2002-01-10 00:00:00',
  8: '2002-01-11 00:00:00',
  9: '2002-01-14 00:00:00',
  10: '2002-01-15 00:00:00',
  11: '2002-01-16 00:00:00',
  12: '2002-01-17 00:00:00',
  13: '2002-01-18 00:00:00',
  14: '2002-01-21 00:00:00'},
 'GSEAFCI': {0: 98.85,
  1: 98.96,
  2: 98.88,
  3: 98.85,
  4: 98.83,
  5: 98.86,
  6: 98.87,
  7: 98.9,
  8: 98.87,
  9: 98.91,
  10: 98.84,
  11: 98.87,
  12: 98.84,
  13: 98.84,
  14: 98.87},
 'Date2': {0: '2004-03-03 00:00:00',
  1: '2004-03-04 00:00:00',
  2: '2004-03-05 00:00:00',
  3: '2004-03-08 00:00:00',
  4: '2004-03-09 00:00:00',
  5: '2004-03-10 00:00:00',
  6: '2004-03-11 00:00:00',
  7: '2004-03-12 00:00:00',
  8: '2004-03-15 00:00:00',
  9: '2004-03-16 00:00:00',
  10: '2004-03-17 00:00:00',
  11: '2004-03-18 00:00:00',
  12: '2004-03-19 00:00:00',
  13: '2004-03-22 00:00:00',
  14: '2004-03-23 00:00:00'},
 '2Y2YINF': {0: 3.9,
  1: 3.832,
  2: 3.668,
  3: 3.63,
  4: 3.649,
  5: 3.636,
  6: 3.63,
  7: 3.633,
  8: 3.628,
  9: 3.665,
  10: 3.621,
  11: 3.62,
  12: 3.616,
  13: 3.588,
  14: 3.5949999999999998},
 'Date3': {0: '2002-01-02 00:00:00',
  1: '2002-01-03 00:00:00',
  2: '2002-01-04 00:00:00',
  3: '2002-01-07 00:00:00',
  4: '2002-01-08 00:00:00',
  5: '2002-01-09 00:00:00',
  6: '2002-01-10 00:00:00',
  7: '2002-01-11 00:00:00',
  8: '2002-01-14 00:00:00',
  9: '2002-01-15 00:00:00',
  10: '2002-01-16 00:00:00',
  11: '2002-01-17 00:00:00',
  12: '2002-01-18 00:00:00',
  13: '2002-01-21 00:00:00',
  14: '2002-01-22 00:00:00'},
 '5Y5YINF': {0: 5.9274000000000004,
  1: 5.8767,
  2: 5.8977,
  3: 5.763,
  4: 5.8056,
  5: 5.825,
  6: 5.7287,
  7: 5.6922,
  8: 5.6222,
  9: 5.6145,
  10: 5.6655999999999995,
  11: 5.76,
  12: 5.7455,
  13: 5.718,
  14: 5.7589}})

What I want to do is merge df1 and df2 to get a dataframe (dataset) where each value sits on the row of its matching date when the data exist, and the row just gets NaN when they don't. This is what I did:
dataset = df1.merge(df2, how = 'left', left_on= ['Date', 'Date','Date'],
                    right_on=['Date1','Date2','Date3'])

Unfortunately, I get only NaNs or NaTs everywhere.
Instead, I would like to get:
                  Date  GSEAFCI  2Y2YINF  5Y5YINF
0  2002-01-01 00:00:00    98.85      NaN   5.9274
1  2002-01-02 00:00:00    98.96      NaN   5.8767
2  2002-01-03 00:00:00    98.88      NaN   5.8977
3  2002-01-04 00:00:00    98.85      NaN   5.7630
4  2002-01-05 00:00:00      NaN      NaN      NaN
5  2002-01-06 00:00:00      NaN      NaN      NaN
6  2002-01-07 00:00:00    98.83      NaN   5.8056
7  2002-01-08 00:00:00    98.86      NaN   5.8250
8  2002-01-09 00:00:00    98.87      NaN   5.7287

Can anyone help me with this?
Thanks!
Answer
If the columns always follow the same pattern (a datetime column followed by its data column), you can use a list comprehension with concat, then convert the index values to datetimes and use DataFrame.merge with the first DataFrame (matching the index name Date against the column name Date):
a = df2.columns[::2]
b = df2.columns[1::2]
dfs = [df2.loc[:, x].set_index(x[0], drop=False)[x[1]] for x in zip(a, b)]
df = pd.concat(dfs, axis=1, sort=False).rename_axis('Date')
df.index = pd.to_datetime(df.index)

df = df1.merge(df, on='Date', how='left')
print (df.head(10))
        Date  GSEAFCI  2Y2YINF  5Y5YINF
0 2002-01-01    98.85      NaN      NaN
1 2002-01-02    98.96      NaN   5.9274
2 2002-01-03    98.88      NaN   5.8767
3 2002-01-04    98.85      NaN   5.8977
4 2002-01-05      NaN      NaN      NaN
5 2002-01-06      NaN      NaN      NaN
6 2002-01-07    98.83      NaN   5.7630
7 2002-01-08    98.86      NaN   5.8056
8 2002-01-09    98.87      NaN   5.8250
9 2002-01-10    98.90      NaN   5.7287
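For context on why the original attempt returns only NaN: a merge on three key pairs keeps a value only where Date1, Date2 and Date3 all equal Date in the same row, and the date columns in df2 are strings rather than datetimes. The same idea as above can be wrapped in a small helper. This is only a sketch: the function name stack_date_value_pairs, the frames calendar and wide, and the columns A and B are made up for illustration and are not from the question. The dropna step is an assumption, added in case the date columns have different lengths in the real data.

```python
import pandas as pd


def stack_date_value_pairs(wide: pd.DataFrame) -> pd.DataFrame:
    """Combine alternating (date column, value column) pairs into one date-indexed frame."""
    series = []
    # Even positions are assumed to hold dates, odd positions the matching values.
    for date_col, value_col in zip(wide.columns[::2], wide.columns[1::2]):
        # Assumption: rows with a missing date carry no usable value, so drop them.
        pair = wide[[date_col, value_col]].dropna(subset=[date_col])
        # Parse the date strings and use them as the index of the value column.
        s = pair.set_index(pd.to_datetime(pair[date_col]))[value_col]
        series.append(s.rename_axis('Date'))
    # Align all value columns on the union of their dates.
    return pd.concat(series, axis=1)


# Hypothetical sample data, much smaller than the frames in the question.
calendar = pd.DataFrame({'Date': pd.date_range('2002-01-01', '2002-01-04')})
wide = pd.DataFrame({
    'Date1': ['2002-01-01 00:00:00', '2002-01-02 00:00:00'],
    'A': [1.0, 2.0],
    'Date2': ['2002-01-02 00:00:00', '2002-01-04 00:00:00'],
    'B': [10.0, 20.0],
})

# Left merge keeps every calendar day; days without data get NaN.
stacked = stack_date_value_pairs(wide).reset_index()
result = calendar.merge(stacked, on='Date', how='left')
print(result)
```

Calling reset_index before the merge turns Date into an ordinary column, so the merge does not rely on matching a column against an index name.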