I am experimenting with polars and would like to understand why it is slower than pandas on a particular example:
import pandas as pd
import polars as pl

n = 10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())

# takes ~60 ms
df1.join(df2)

# takes ~950 ms
df1p.join(df2p, on='index')
Answer
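A pandas join uses the indexes, which are cached, so the two calls in the question are not doing the same work. A comparison where they do the same, joining on a column in both libraries: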
# pandas
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")

# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")