Why does matplotlib.pyplot.savefig() mess up image outputs for very large pandas.plotting.scatter_matrix()?

Question

I am trying to compute pandas.plotting.scatter_matrix() for a very large pandas.DataFrame. ("Very large" is relative to this specific operation: most libraries either run out of memory most of the time or enforce a row-count limit of 50000; see vaex-scatter.) My time-series DataFrame has shape (10000000, 41), and every value is either a float or an integer. Q1: So the first

Accepted Answer

For future readers: I ended up using datashader (datashader.org) with pandas.DataFrame, as @JodyKlymak suggested in his comment (thanks). Please bear in mind that this approach answers both questions.

First, convert your modin.pandas.DataFrame to a pandas.DataFrame with the private method modin.pandas.DataFrame._to_pandas().

Then plot the graphs to an xarray image first, like so (see xarray-imshow):

    import datashader as ds
    import datashader.transfer_functions as tf

    cvs = ds.Canvas()  # not defined in the original snippet; assumed here to be a default Canvas
    cols = dataset_1.columns
    plots = {}
    for idx in range(1, 41):  # generating 40 plots on the fly
        x = cols[idx]
        y = cols[idx - 1]
        # replace 'some_unique_key' with a key that is unique to each column pair
        plots['some_unique_key'] = tf.shade(cvs.points(dataset_1[[x, y]], x, y))

    # traverse the dictionary to use xarray.plot.imshow()
    plots['some_unique_key'].plot.imshow()

Time

    CPU times: user 723 ms, sys: 43 ms, total: 766 ms
    Wall time: 757 ms
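To illustrate why rasterizing first keeps the output small, here is a minimal sketch of the underlying idea: bin the points of each adjacent column pair into a fixed-size grid, so the size of the result depends on the grid and not on the row count. It uses numpy.histogram2d as a stand-in for datashader, purely so the example runs without extra dependencies; it is not the code used above. The function name rasterize_pairs, the synthetic data, the grid size, and the key format are all assumptions made for illustration.

```python
import numpy as np


def rasterize_pairs(data, names, width=200, height=200):
    """Bin each adjacent column pair into a (height, width) grid of point counts.

    Hypothetical helper: a numpy stand-in for the aggregation step,
    not part of datashader or pandas.
    """
    grids = {}
    for idx in range(1, len(names)):
        x_name, y_name = names[idx], names[idx - 1]
        counts, _, _ = np.histogram2d(
            data[:, idx], data[:, idx - 1], bins=(width, height)
        )
        # histogram2d returns shape (width, height); transpose so rows follow y
        grids[f"{x_name}_vs_{y_name}"] = counts.T
    return grids


# Synthetic stand-in for the real DataFrame (far fewer rows and columns)
rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 5
data = rng.normal(size=(n_rows, n_cols))
names = [f"col_{i}" for i in range(n_cols)]

grids = rasterize_pairs(data, names)
print(len(grids), grids["col_1_vs_col_0"].shape)
```

Each grid can then be displayed or saved as an ordinary image, for example with matplotlib.pyplot.imshow. Giving every pair its own key also keeps later plots from overwriting earlier ones in the dictionary.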
