I run a query from AWS Athena console and takes 10s. The same query run from Sagemaker using PyAthena takes 155s. Is PyAthena slowing it down or is the data transfer from Athena to sagemaker so time consuming?
What could I do to speed this up?
Advertisement
Answer
Just figure out a way of boosting the queries:
Before I was trying:
import pandas as pd from pyathena import connect conn = connect(s3_staging_dir=STAGIN_DIR, region_name=REGION) pd.read_sql(QUERY, conn) # takes 160s
Figured out that using a PandasCursor instead of a connection is way faster
import pandas as pd pyathena import connect from pyathena.pandas.cursor import PandasCursor cursor = connect(s3_staging_dir=STAGIN_DIR, region_name=REGION, cursor_class=PandasCursor).cursor() df = cursor.execute(QUERY).as_pandas() # takes 12s