Sort the digits of a 1GB file containing a single number efficiently

I’m trying to print, in ascending order, the digits of a 1GB file that contains a single randomly generated big number. This is the code I’m using to generate the random number for my test (found it here).

import random
import math

digits = 1000000000

# Build the number in 16-digit blocks, zero-padding each block so it
# always contributes exactly 16 digits.
finalNumber = ""
for i in range(digits // 16):
  finalNumber = finalNumber + str(math.floor(random.random() * 10000000000000000)).zfill(16)
if digits % 16:
  finalNumber = finalNumber + str(math.floor(random.random() * (10 ** (digits % 16)))).zfill(digits % 16)

with open("./Random.txt", "w") as afile:
  afile.write(finalNumber)

The following Python code works and takes a bit less than 4 minutes. But I was told this can be done in about 15 seconds, and that I might not even need to sort the number at all to print it in order, which is puzzling me.

This is not homework but a question from a job interview that I didn’t manage to solve, and not knowing the solution is killing me. I wasn’t asked to use any particular language, so I chose Python because I’m familiar with it. I gave bash a quick try, but my script was already struggling with a 10MB file.

# Sort the digits of the input file in ascending order
import sys
import os
import shutil

# Counter for naming the chunk files
count = 0

# Read the input in 100MB chunks
def read_in_chunks(file_object, chunk_size=1024*102400):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# Do a mergesort of the chunks.
def merge_files(outfile):
    words = []
    for f in os.listdir('./tmp/'):
        if os.path.isfile('./tmp/' + f):
            file_ = open('./tmp/' + f)
            words.append(file_.readlines())

    # Sort in-memory         
    words.sort()  

    with open(outfile, 'w') as out:
        out.write(str(words))

with open(sys.argv[1]) as line:
    # Recreate the tmp folder for the sorted chunks
    if os.path.exists('./tmp/'):
        shutil.rmtree('./tmp/')
    os.mkdir('./tmp/')
    for chunk in read_in_chunks(line):
        # Sort each chunk
        ascending = "".join(sorted(chunk))
        # Write the sorted chunk to disk
        with open(f"./tmp/out_{count}.txt", mode="w") as fw:
            fw.writelines(ascending)
        count += 1
    #merge all chunks into a single file with mergesort    
    merge_files('finalout.txt')
    shutil.rmtree('./tmp/')

This basically splits the file into 100MB temp files, sorts each chunk, and then does a merge sort to combine them into the final output. Just running a sort on the whole file ends in a MemoryError.
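
For reference, the merge step of an external sort can be streamed with heapq.merge, which interleaves already-sorted runs without loading them all back into memory. A rough sketch of just that step (the names are made up, and each ./tmp/ file is assumed to hold one sorted run of digits):

import heapq
import os

def char_stream(path, bufsize=1 << 20):
    # Yield the digits of one chunk file, one character at a time.
    # Each chunk was written already sorted, so the stream is non-decreasing.
    with open(path) as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            yield from block

def merge_sorted_chunks(chunk_dir, outfile, bufsize=1 << 20):
    streams = [char_stream(os.path.join(chunk_dir, name))
               for name in os.listdir(chunk_dir)]
    with open(outfile, "w") as out:
        buf = []
        # heapq.merge lazily interleaves the already-sorted streams,
        # so memory use stays bounded regardless of total size.
        for ch in heapq.merge(*streams):
            buf.append(ch)
            if len(buf) >= bufsize:
                out.write("".join(buf))
                buf.clear()
        out.write("".join(buf))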

I also tried reading the file just once and using a chain of if/else checks to append each digit to one of 10 different variables, then printing them in order from 0 to 9, but this is very inefficient and even slower than my initial approach.

Clearly there needs to be a “trick” to solve this problem.

Answer

As everyone indicates, the expected answer is a counting sort: there are only 10 distinct symbols, so you just count how many of each digit appear and write the digits back out in order, with no comparison sort at all.

It takes a little extra effort, though, to make a counting sort implemented in Python beat the built-in sorted(), which runs at C speed. It’s especially important to avoid handling the data one character at a time in interpreted Python.
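
For example, the obvious per-character loop does all of its work in the interpreter, one character at a time, and is the kind of thing to avoid here (an illustrative anti-pattern, not something I timed):

# Anti-pattern: counting the digits one character at a time in pure Python.
from collections import defaultdict

counts = defaultdict(int)
with open("./Random.txt") as infile:
  while True:
    data = infile.read(1000000)
    if not data:
      break
    for ch in data:     # every character is handled individually
      counts[ch] += 1   # interpreted loop, far slower than data.count()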

One solution is to use the built-in sorted() on each chunk, followed by 10 calls to str.index() to find where each digit’s run starts; the counts fall out as the differences between consecutive start positions.
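
Per chunk, that variant would look something like this (a rough sketch; digit_counts_from_sorted is a made-up helper, and it has to cope with digits that happen not to appear in a chunk):

def digit_counts_from_sorted(chunk):
  # Sort once at C speed, then find where each digit's run begins.
  s = "".join(sorted(chunk))
  starts = [s.index(d) if d in s else None for d in "0123456789"]
  starts.append(len(s))
  # A missing digit gets the next run's start, so its count comes out as 0.
  for i in range(9, -1, -1):
    if starts[i] is None:
      starts[i] = starts[i + 1]
  # Each count is the gap between consecutive run starts.
  return {d: starts[i + 1] - starts[i] for i, d in enumerate("0123456789")}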

I decided instead to use 10 calls to str.count() per chunk. Here’s the implementation:

from collections import defaultdict
counts = defaultdict(int)

# Count each digit, reading the file in 1,000,000-character chunks
with open("./Random.txt") as infile:
  while True:
    data = infile.read(1000000)
    if not data:
      break
    for digit in "0123456789":
      counts[digit] = counts[digit] + data.count(digit)

# Write out each digit's run in 1,000,000-character pieces to bound memory use
with open("./fastout.txt", mode="w") as outfile:
  for digit in "0123456789":
    count = counts[digit]
    while count > 1000000:
      outfile.write(digit*1000000)
      count -= 1000000
    if count > 0:
      outfile.write(digit*count)
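
A quick sanity check on the result: the per-digit counts should add up to the size of the input, and the output should be exactly as long as the input (both files are pure ASCII digits, so byte sizes equal digit counts):

import os

# Every input digit should appear in the output exactly once.
assert sum(counts.values()) == os.path.getsize("./Random.txt")
assert os.path.getsize("./fastout.txt") == os.path.getsize("./Random.txt")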

Your original results:

$ time python3 original.py

real    3m22.689s
user    3m10.143s
sys 0m9.797s

My results:

$ time python3 new.py

real    0m14.001s
user    0m13.297s
sys 0m0.471s

I also noticed that your output file is a little longer than the input file, so you have a bug in there somewhere that I didn’t bother finding.
