I'm currently trying to write code to check the data quality of a 7 GB data file. I tried googling for exactly this but to no avail. Initially, the purpose of the code is to check how many values are nulls/NaNs; later on I want to join it with another data file and compare the quality of each. We expect the second file to be the more reliable one, but I would like to eventually automate the whole process. I was wondering if someone here is willing to share their data quality Python code using Dask. Thank you
Answer
I would suggest the following approach:
- try to define how you would check quality on a small dataset and implement it in Pandas
- try to generalize the process so that if each "part of the file" (i.e. partition) is of good quality, then the whole dataset can be considered of good quality
- use Dask's map_partitions to parallelize this processing over your dataset's partitions (see the sketch below)
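A minimal sketch of that approach, assuming a CSV input; the file path, the blocksize, and the `null_counts` helper are placeholders you would adapt to your own data and quality rules:

```python
import dask.dataframe as dd
import pandas as pd

def null_counts(partition: pd.DataFrame) -> pd.DataFrame:
    """Quality check for a single partition: nulls/NaNs and row count per column."""
    return pd.DataFrame({
        "null_count": partition.isna().sum(),
        "row_count": len(partition),
    })

# Read the 7 GB file lazily; Dask splits it into partitions of ~256 MB each.
ddf = dd.read_csv("big_file.csv", blocksize="256MB")  # path is a placeholder

# Run the per-partition check in parallel, then collect the results.
per_partition = ddf.map_partitions(null_counts).compute()

# Aggregate partition-level counts into dataset-level totals per column.
totals = per_partition.groupby(level=0).sum()
totals["null_fraction"] = totals["null_count"] / totals["row_count"]
print(totals)
```

Running the same summary on the second file gives you a like-for-like table to compare quality between the two datasets, and the whole script can then be scheduled as part of an automated pipeline.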