I want to get the last row at or before a specified time from several DataFrames that have different time intervals. My code is as follows:

    import numpy as np
    import datetime

    import pandas as pd

    np.random.seed(2022)
    durations = ['T', '5T', '15T', '30T', 'H', '2H', 'D', 'W', 'BM']
    datas = {}
    time_selected = None


    def generate_data():
        global durations, datas
        start_dt = '2018-01-01'
        end_dt = '2022-05-02'
        for duration in durations:
            datas[duration] = pd.DataFrame(index=pd.date_range(start_dt, end_dt, freq=duration))
            datas[duration]['duration'] = duration
            datas[duration]['data'] = np.random.random(len(datas[duration])) * 100

        return


    def selecte_time():
        global time_selected
        start_dt = datetime.datetime(2018, 3, 1)
        end_dt = datetime.datetime(2022, 5, 2)
        idx = pd.date_range(start_dt, end_dt, freq='T')
        time_selected = np.random.choice(idx)
        return time_selected


    def get_result_df():
        global durations, datas, time_selected
        t_df = {}
        col = ['duration', 'data']
        for duration in durations:
            df = datas[duration]
            t_df[duration] = df[df.index <= time_selected][col].iloc[-1]
        df = pd.DataFrame(t_df[duration] for duration in durations)

        return df


    def main():
        generate_data()
        selecte_time()
        df = get_result_df()
        print(df)


    if __name__ == '__main__':
        main()

On my computer, get_result_df() takes 204 ms. How can I speed it up?

    %timeit get_result_df()
    204 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I optimized it, and the running time dropped to 53 ms. Is there any room for further improvement?

    def get_result_df():
        global durations, datas, time_selected
        t_df = {}
        col = ['duration', 'data']
        for duration in durations:
            df = datas[duration]
            dt = df.index.to_numpy()
            dt1 = dt[dt <= time_selected][-1]
            t_df[duration] = df[df.index == dt1][col].iloc[-1]
        df = pd.DataFrame(t_df[duration] for duration in durations)
        return df


    %timeit get_result_df()
    53.3 ms ± 7.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with the answers to my question on Code Review SE:

    %timeit get_result_df(datas, time_selected)
    5.81 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

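The code behind that timing is not reproduced here. As a rough sketch only, a version that takes the frames and the timestamp as arguments could use a binary search on the index instead of building a boolean mask. It assumes every index is a sorted DatetimeIndex; the name last_rows_at_or_before and its parameters are hypothetical and are not taken from the Code Review answer.

```python
import pandas as pd


def last_rows_at_or_before(frames, when, columns=("duration", "data")):
    """Return one row per frame: the last row whose index is <= `when`.

    Hypothetical helper; assumes each frame has a sorted DatetimeIndex.
    """
    when = pd.Timestamp(when)
    rows = []
    for frame in frames.values():
        # Binary search: count of index entries that are <= `when`.
        pos = frame.index.searchsorted(when, side="right")
        if pos == 0:
            continue  # this frame has no row at or before `when`
        rows.append(frame.iloc[pos - 1][list(columns)])
    return pd.DataFrame(rows)
```

With the setup above it would be called as last_rows_at_or_before(datas, time_selected). Passing the data in as arguments also removes the need for the global statements.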
Answer
My timings are roughly half of yours, but I see the same relative behavior. Using argmin from NumPy is faster; see below.

    In [1]: %timeit get_result_df()
    115 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [2]: %timeit get_result_df2()
    26.2 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using argmin together with iloc directly is faster still:

    def get_result_df3():
        global durations, datas, time_selected
        t_df = {}
        col = ['duration', 'data']
        for duration in durations:
            df = datas[duration]
            dt = df.index.to_numpy()
            # argmin returns the position of the first False in the mask, i.e. the
            # first timestamp after time_selected; the row just before it is the
            # last one at or before time_selected (the index must be sorted).
            idx = np.argmin(dt <= time_selected) - 1
            t_df[duration] = df.iloc[idx][col]
        df = pd.DataFrame(t_df[duration] for duration in durations)
        return df

    In [2]: %timeit get_result_df3()
    9.62 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

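One caveat about the argmin approach: on a mask that is all True or all False, np.argmin returns 0, so argmin minus 1 evaluates to -1. With iloc that selects the last row, which is the right answer when every timestamp is at or before time_selected, but not when time_selected precedes the whole index. The sketch below makes both edge cases explicit; last_position_at_or_before is a hypothetical helper name, and it assumes a sorted, non-empty array.

```python
import numpy as np

# Sorted timestamps, standing in for df.index.to_numpy().
dt = np.array(["2022-01-01", "2022-01-02", "2022-01-03"], dtype="datetime64[ns]")


def last_position_at_or_before(dt, when):
    """Position of the last element of sorted `dt` that is <= `when`, or None."""
    mask = dt <= when
    if not mask[0]:
        return None          # every timestamp is after `when`
    if mask[-1]:
        return len(dt) - 1   # every timestamp is at or before `when`
    # Mixed case: argmin finds the first False, so step back one position.
    return int(np.argmin(mask)) - 1
```

In the question's setup time_selected is never earlier than the first timestamp of any frame, so the shorter form gives the same rows there.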