I have a pandas DataFrame with 10 columns and 10 million rows.
I have created an empty table in pgAdmin 4 (a management tool for PostgreSQL, analogous to SSMS for MSSQL Server) for this data to be stored in.
However, when running the following command:
my_dataframe.to_sql('name_of_sql_table', connection, index=False, if_exists='append', method='multi')
It takes a very long time to run and often crashes my Jupyter kernel, presumably because the process runs for so long and runs out of memory.
Are there any advisable methods for speeding up the “sending pandas to SQL table” step?
One thing I can think of would be to split the data into, say, 1 million row chunks and send them one at a time – appending the rows with each to_sql() call (roughly as in the sketch at the end of this question).
I do not have the option of loading the data directly through pgAdmin 4 – my only route is to send the data from Python to the database.
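Roughly what I mean by the chunking idea (assuming connection is a SQLAlchemy engine or connection that is already set up for the target database):

# Sketch of manual chunking: send the DataFrame in 1-million-row slices,
# appending each slice with its own to_sql() call.
chunk_size = 1_000_000
for start in range(0, len(my_dataframe), chunk_size):
    chunk = my_dataframe.iloc[start:start + chunk_size]
    chunk.to_sql('name_of_sql_table', connection,
                 index=False, if_exists='append')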
Answer
Have a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
If this applies to your version of pandas, use:
df.to_sql("table_name", connection, index=False, if_exists='append', chunksize=25000, method=None)
Your query might be crashing because you’re using method='multi', as this option does the following:
method : {None, ‘multi’, callable}, default None
Controls the SQL insertion clause used:
None : Uses standard SQL INSERT clause (one per row).
‘multi’ : Pass multiple values in a single INSERT clause.
callable with signature (pd_table, conn, keys, data_iter).
Details and a sample callable implementation can be found in the section insert method.
This means that pandas constructs the INSERT statement in memory for all of the rows at once. Using chunksize together with the default of one INSERT statement per row (method=None) allows pandas to chunk up the save to the database.
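If plain row-by-row INSERTs are still too slow, the “insert method” section of the pandas docs shows a callable that loads each chunk with PostgreSQL’s COPY instead. A sketch of that approach, assuming a psycopg2-backed SQLAlchemy connection:

import csv
from io import StringIO

def psql_insert_copy(pd_table, conn, keys, data_iter):
    # Write the chunk to an in-memory CSV buffer and bulk-load it with
    # PostgreSQL's COPY, which is far faster than individual INSERTs.
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if pd_table.schema:
            table_name = '{}.{}'.format(pd_table.schema, pd_table.name)
        else:
            table_name = pd_table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

df.to_sql("table_name", connection, index=False, if_exists='append',
          chunksize=25000, method=psql_insert_copy)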