Sort the digits of a 1GB file containing a single number efficiently

Question

I'm trying to print in ascending order a 1GB file containing a randomly generated big number. This is the code that I'm using to generate the random number for my test (found it here). The following Python code works OK and takes a bit less than 4 minutes. But I was told this can be accomplished i…

Accepted Answer

As everyone indicates, the expected answer is a counting sort.

It takes a little extra effort, though, to make a counting sort implemented in Python beat the built-in sort (sorted() or list.sort()), which CPython implements in C. It's especially important to avoid creating a new Python string object for each character of data.

One solution is to sort each chunk with the built-in sort, then make 10 calls to str.index() to find where each digit's run starts and derive the counts for that chunk from those positions.

I decided to use 10 calls to str.count() instead. Here's the implementation:

    from collections import defaultdict

    counts = defaultdict(int)

    # Read the input in 1,000,000-character chunks so memory use stays small.
    with open("./Random.txt") as infile:
        while True:
            data = infile.read(1000000)
            if not data:
                break
            # str.count() scans the chunk in C, once per digit.
            for digit in "0123456789":
                counts[digit] = counts[digit] + data.count(digit)

    # Write each digit's run in order, at most 1,000,000 characters per write.
    with open("./fastout.txt", mode="w") as outfile:
        for digit in "0123456789":
            count = counts[digit]
            while count > 1000000:
                outfile.write(digit*1000000)
                count -= 1000000
            if count > 0:
                outfile.write(digit*count)

Your original results:

    $ time python3 original.py
    real    3m22.689s
    user    3m10.143s
    sys     0m9.797s

My results:

    $ time python3 new.py
    real    0m14.001s
    user    0m13.297s
    sys     0m0.471s

I also noticed that your output file is a little longer than the input file, so you have a bug in there somewhere that I didn't bother finding.
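For comparison, here is a minimal sketch of the other approach mentioned above: sort a chunk with the built-in sort, then locate where each digit's run begins. The function name `digit_counts_via_sort` is hypothetical, and the sketch assumes the chunk contains only the characters 0 through 9. It uses str.find() rather than str.index() so that a digit missing from the chunk returns -1 instead of raising ValueError. It has not been timed against the str.count() version.

```python
def digit_counts_via_sort(data):
    """Count each digit in `data` by sorting once and finding run boundaries.

    Assumes `data` holds only the characters "0" through "9".
    """
    # sorted() does the ordering in C, but it builds a list of
    # one-character strings, so keep chunks reasonably small.
    ordered = "".join(sorted(data))

    counts = {}
    end = len(ordered)
    # Walk from "9" down to "0": the start of one digit's run is the
    # end of the run belonging to the next smaller digit that is present.
    for digit in reversed("0123456789"):
        start = ordered.find(digit)  # -1 when the digit is absent
        if start == -1:
            counts[digit] = 0
        else:
            counts[digit] = end - start
            end = start
    return counts


if __name__ == "__main__":
    sample = "3141592653"
    print(digit_counts_via_sort(sample))
```

The per-chunk results would be summed into a running total in the same way as the `counts` dictionary in the answer's code.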

Answer