
Best method for sending large pandas dataframe to SQL database?

I have a pandas dataframe which has 10 columns and 10 million rows.

I have created an empty table in pgAdmin 4 (a management tool for PostgreSQL databases, similar to SQL Server Management Studio for MSSQL) for this data to be stored.

However, when running the following command:

my_dataframe.to_sql('name_of_sql_table', connection, index=False, if_exists='append', method='multi')

It takes a very long time to run and often crashes my Jupyter kernel, since the process takes so long and/or runs out of memory.

Are there any advisable methods for speeding up the “send pandas dataframe to SQL table” step?

One thing I can think of would be to split the data into, say, 1-million-row chunks and send them one at a time, appending the rows on each to_sql() call (sketched just below).
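
Roughly, the chunking idea I have in mind would look something like this (same dataframe and table names as above):

chunk_size = 1_000_000  # roughly 1 million rows per batch
for start in range(0, len(my_dataframe), chunk_size):
    chunk = my_dataframe.iloc[start:start + chunk_size]
    chunk.to_sql('name_of_sql_table', connection, index=False, if_exists='append')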

I do not have the option of loading the data directly through pgAdmin 4; my only route is to send it from Python to the database.


Answer

Have a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

If this applies to your version of pandas, use

df.to_sql("table_name", 
          connection, 
          index=False, 
          if_exists='append',
          chunksize=25000,
          method=None)
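
Here connection is assumed to be a SQLAlchemy engine. A minimal sketch of creating one, assuming a PostgreSQL database behind pgAdmin 4, the psycopg2 driver, and placeholder credentials:

from sqlalchemy import create_engine

# Placeholder connection string; substitute your own user, password, host, port and database name
connection = create_engine("postgresql+psycopg2://user:password@localhost:5432/dbname")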

Your query might be crashing because you’re using method='multi', as this does the following:

method : {None, 'multi', callable}, default None

Controls the SQL insertion clause used:

- None: Uses standard SQL INSERT clause (one per row).
- 'multi': Pass multiple values in a single INSERT clause.
- callable with signature (pd_table, conn, keys, data_iter). Details and a sample callable implementation can be found in the "insert method" section of the docs.

This means that with method='multi', pandas builds a single INSERT statement in memory covering all of the rows. Using chunksize together with the default method=None (one INSERT statement per row) lets pandas break the save into smaller batches instead of holding everything in memory at once.
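
If the target database is PostgreSQL, the "insert method" section of the pandas I/O docs also shows a callable that loads each batch via COPY, which is usually much faster than plain INSERTs. A lightly adapted sketch, assuming psycopg2 as the driver:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Callable for to_sql(method=...): streams each chunk through
    # PostgreSQL's COPY instead of row-by-row INSERTs. Adapted from the
    # example in the pandas "insertion method" docs; assumes psycopg2.
    dbapi_conn = conn.connection  # raw DBAPI connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

You would then pass it as method=psql_insert_copy in the to_sql() call above.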
