I’m trying to print the digits of a 1GB file, containing one randomly generated big number, in ascending order. This is the code I’m using to generate the random number for my test (found it here).
import random
import math

afile = open("./Random.txt", "w")
digits = 1000000000
finalNumber = ""
for i in range(digits // 16):
    finalNumber = finalNumber + str(math.floor(random.random() * 10000000000000000))
finalNumber = finalNumber + str(math.floor(random.random() * (10 ** (digits % 16))))
afile.write(finalNumber)
afile.close()
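As an aside, here is a sketch of a variant of that generator that streams fixed-width blocks straight to disk instead of accumulating the whole number in one Python string; it zero-pads each 16-digit block (unlike the original, whose blocks can come out shorter) and skips the final write when `digits` is a multiple of 16. The small `digits` value is just for illustration.

```python
import random

digits = 160  # small value for illustration; the original uses 1000000000

with open("./Random.txt", "w") as afile:
    for _ in range(digits // 16):
        # zero-padded so every block contributes exactly 16 digits
        afile.write(f"{random.randrange(10**16):016d}")
    # only write a remainder block if one is actually needed
    if digits % 16:
        afile.write(str(random.randrange(10 ** (digits % 16))))
```

With this guard the file length matches `digits` exactly.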
The following Python code works OK and takes a bit less than 4 minutes. But I was told this can be accomplished in about 15 seconds, and that I might not even need to sort the number at all to print it in order, which is kind of puzzling.
This is not homework but a question I was asked in a job interview; I didn’t manage to solve it, and not knowing the solution is killing me. I wasn’t asked to do this in any specific language, but I decided to use Python because I’m familiar with the language. I gave bash a quick try, but my script was already struggling with a 10MB file.
# Sort digits in ascending order
import sys
import os
import shutil

# Counter for the chunk files
count = 0

# Generate 100MB chunks
def read_in_chunks(file_object, chunk_size=1024*102400):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# Do a mergesort of the chunks
def merge_files(outfile):
    words = []
    for f in os.listdir('./tmp/'):
        if os.path.isfile('./tmp/' + f):
            file_ = open('./tmp/' + f)
            words.append(file_.readlines())
    # Sort in-memory
    words.sort()
    with open(outfile, 'w') as out:
        out.write(str(words))

with open(sys.argv[1]) as line:
    # If the tmp folder does not exist, create it
    if os.path.exists('./tmp/'):
        shutil.rmtree('./tmp/')
    os.mkdir('./tmp/')
    for chunk in read_in_chunks(line):
        # Sort each chunk
        ascending = "".join(sorted(str(chunk)))
        # Write the chunk to disk
        with open(f"./tmp/out_{count}.txt", mode="w") as fw:
            fw.writelines(ascending)
        count += 1
    # Merge all chunks into a single file with mergesort
    merge_files('finalout.txt')
shutil.rmtree('./tmp/')
This basically chunks the file into 100MB temp files, sorts each chunk, and then does a merge to append them into the output. Just running a sort on the whole file ends in a “MemoryError”.
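For reference, once each chunk is sorted, the merge step can be done lazily with `heapq.merge`, which never loads more than one element per chunk at a time. A toy sketch, with short in-memory strings standing in for the sorted 100MB temp files:

```python
import heapq

# Each chunk is an already-sorted string of digits,
# standing in for a sorted temp file on disk.
chunks = ["1359", "0248", "677"]

# heapq.merge lazily yields the smallest remaining character
# across all chunks, producing one fully sorted stream.
merged = "".join(heapq.merge(*chunks))
assert merged == "01234567789"
```

With real files, each chunk would be a lazy character (or block) iterator over its temp file rather than an in-memory string.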
I also tried reading the file once with a for loop and using a series of if/else to append each digit to 10 different variables, then printing them in order from 0 to 9, but this was very inefficient and slower than my initial method.
Clearly there needs to be a “trick” to solve this problem.
Answer
As everyone indicates, the expected answer is a counting sort.
It takes a little extra effort, though, to make a Python-implemented counting sort beat the built-in sort, which is implemented in C. It’s especially important to avoid creating a new Python string object for each character of data.
One solution is to sort the data with the built-in sorted() and then use str.index() to find where each digit’s run starts, which yields the counts.
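A minimal sketch of that sort-then-locate-boundaries alternative; it uses bisect to find each digit’s run boundaries (rather than str.index(), so digits that never occur are handled without special-casing):

```python
import bisect

def digit_counts(data):
    """Count each digit by sorting once and measuring run boundaries."""
    s = "".join(sorted(data))  # equal digits end up in one contiguous run
    return {d: bisect.bisect_right(s, d) - bisect.bisect_left(s, d)
            for d in "0123456789"}
```

For example, `digit_counts("3141592653")` reports two 1s, two 3s, two 5s, and zero 0s.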
I decided to use 10 calls to str.count() instead. Here’s the implementation:
from collections import defaultdict

counts = defaultdict(int)

# Count the digits, reading the input in 1MB blocks
with open("./Random.txt") as infile:
    while True:
        data = infile.read(1000000)
        if not data:
            break
        for digit in "0123456789":
            counts[digit] = counts[digit] + data.count(digit)

# Emit each digit's run, writing in 1MB blocks
with open("./fastout.txt", mode="w") as outfile:
    for digit in "0123456789":
        count = counts[digit]
        while count > 1000000:
            outfile.write(digit*1000000)
            count -= 1000000
        if count > 0:
            outfile.write(digit*count)
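On a small input, the same count-then-emit idea can be sanity-checked against the built-in sort (an illustrative helper, not part of the timed run above):

```python
def counting_sort_digits(data):
    # count each digit, then emit the digits in ascending order
    return "".join(d * data.count(d) for d in "0123456789")

# the result must match a plain character sort
sample = "9083451267120394"
assert counting_sort_digits(sample) == "".join(sorted(sample))
```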
Your original results:
$ time python3 original.py
real 3m22.689s
user 3m10.143s
sys 0m9.797s
My results:
$ time python3 new.py
real 0m14.001s
user 0m13.297s
sys 0m0.471s
I also noticed that your output file is a little longer than the input file, so you have a bug in there somewhere that I didn’t bother finding.