Joining dataframes using rust polars in Python

I am experimenting with polars and would like to understand why using polars is slower than using pandas on a particular example:

import pandas as pd
import polars as pl

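n = 10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
# polars has no index, so reset_index() turns the pandas index
# into a regular 'index' column before converting
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())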

# takes ~60 ms
df1.join(df2)

# takes ~950 ms
df1p.join(df2p, on='index')

Answer

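A pandas join uses the indexes, which are cached. polars has no index, so the polars call in the question joins on an ordinary column, and the two timings measure different operations.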

A comparison where both libraries do the same work, joining on regular columns:

# pandas 
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")

# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")
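
To see the difference between the two pandas operations without polars installed, here is a minimal sketch that times the index join against the column merge on the same frames. It assumes a smaller n than the question (100_000 instead of 10_000_000) so it runs quickly, and the variable names by_index, by_index_merge, by_column, t_index and t_column are only illustrative. Absolute timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumption: a smaller n than in the question, so the sketch runs quickly.
n = 100_000
df1 = pd.DataFrame(range(n), columns=["a"])
df2 = pd.DataFrame(range(n), columns=["b"])

# DataFrame.join aligns the two frames on their indexes by default.
by_index = df1.join(df2)

# The same index-on-index join, spelled out with merge.
by_index_merge = df1.merge(df2, left_index=True, right_index=True)

# Joining on the value columns: this matches what the polars call does.
by_column = df1.merge(df2, left_on="a", right_on="b")

# Time both variants; number=5 repeats each call five times.
t_index = timeit.timeit(lambda: df1.join(df2), number=5)
t_column = timeit.timeit(
    lambda: df1.merge(df2, left_on="a", right_on="b"), number=5
)

print(f"index join:   {t_index / 5 * 1e3:.2f} ms per call")
print(f"column merge: {t_column / 5 * 1e3:.2f} ms per call")
```

On this data all three results contain the same rows, because a, b and the index all hold the same values. Only the way the rows are matched differs, and the column merge is the operation to compare against the polars join.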

Answer