I’m trying to print, in ascending order, the contents of a 1GB file containing a randomly generated big number. This is the code I’m using to generate the random number for my test (found it here).
```python
import random
import math

afile = open("./Random.txt", "w")
digits = 1000000000
finalNumber = ""
for i in range(digits // 16):
    finalNumber = finalNumber + str(math.floor(random.random() * 10000000000000000))
finalNumber = finalNumber + str(math.floor(random.random() * (10 ** (digits % 16))))
afile.write(finalNumber)
afile.close()
```
The following Python code works OK and takes a bit less than 4 minutes. But I was told this can be accomplished in about 15 seconds, and that I might not even need to sort the number at all to print it in order, which is kind of puzzling.
This is not homework, but a question I was asked in a job interview that I didn’t manage to solve, and not knowing the solution is killing me. I wasn’t asked to do this in any specific language, but I decided to use Python because I’m familiar with it. I gave bash a quick try, but my script was already struggling with a 10MB file.
```python
# Sort integers in ascending order
import sys
import os
import shutil

# Count for the chunks
count = 0

# Generate 100MB chunks
def read_in_chunks(file_object, chunk_size=1024 * 102400):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# Do a mergesort of the chunks.
def merge_files(outfile):
    words = []
    for f in os.listdir('./tmp/'):
        if os.path.isfile('./tmp/' + f):
            file_ = open('./tmp/' + f)
            words.append(file_.readlines())
    # Sort in-memory
    words.sort()
    with open(outfile, 'w') as out:
        out.write(str(words))

with open(sys.argv[1]) as line:
    # If tmp folder exists, recreate it empty
    if os.path.exists('./tmp/'):
        shutil.rmtree('./tmp/')
    os.mkdir('./tmp/')
    for chunk in read_in_chunks(line):
        # Sort each chunk ascending
        ascending = "".join(sorted(str(chunk)))
        # Write chunk to disk
        with open(f"./tmp/out_{count}.txt", mode="w") as fw:
            fw.writelines(ascending)
        count += 1
    # Merge all chunks into a single file with mergesort
    merge_files('finalout.txt')
    shutil.rmtree('./tmp/')
```
This basically splits the file into 100MB temp files, sorts each chunk, and then merges them into a single output. Just running a sort on the whole file ends in a “MemoryError”.
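For reference, the merge step above can be done as a streaming k-way merge with `heapq.merge`, which never loads all chunks into memory. Here is a hypothetical sketch of the whole chunk-sort-merge pipeline (not my original code; `external_sort` and its file layout are illustrative assumptions):

```python
import heapq
import os
import tempfile

def external_sort(infile, outfile, chunk_size=100 * 1024 * 1024):
    # Split the input into individually sorted chunk files.
    tmpdir = tempfile.mkdtemp()
    chunk_paths = []
    with open(infile) as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            path = os.path.join(tmpdir, f"chunk_{len(chunk_paths)}.txt")
            with open(path, "w") as fw:
                fw.write("".join(sorted(data)))  # sort this chunk's digits
            chunk_paths.append(path)
    # heapq.merge lazily merges already-sorted streams, so memory use
    # stays constant regardless of the input size.
    files = [open(p) for p in chunk_paths]
    try:
        streams = [iter(lambda fobj=fobj: fobj.read(1), "") for fobj in files]
        with open(outfile, "w") as out:
            for ch in heapq.merge(*streams):
                out.write(ch)
    finally:
        for fobj in files:
            fobj.close()
```

Reading one character at a time keeps the sketch short but is slow in practice; a real implementation would buffer reads per chunk.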
I also tried reading the file once, using a series of if/else statements to append each digit to one of 10 different variables, and then printing them in order from 0 to 9, but this was very inefficient and slower than my initial method.
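For illustration, that per-digit bucket approach might look something like this (a hypothetical sketch; the pure-Python loop over every single character is exactly what makes it slow):

```python
def per_char_buckets(data):
    # One bucket per digit; every character is dispatched individually.
    buckets = {d: [] for d in "0123456789"}
    for ch in data:
        buckets[ch].append(ch)
    # Concatenating the buckets in digit order yields the sorted string.
    return "".join("".join(buckets[d]) for d in "0123456789")
```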
Clearly there needs to be a “trick” to solve this problem.
Answer
As everyone indicates, the expected answer is a counting sort.
It takes a little extra effort, though, for a Python-implemented counting sort to beat the built-in `sorted()`, which is written in C. It’s especially important to avoid creating a new Python string object for each character of data.

One solution is to use the built-in `sorted()`, followed by 10 calls to `str.index()` to get the count for each digit. I decided to use 10 calls to `str.count()` instead. Here’s the implementation:
```python
from collections import defaultdict

counts = defaultdict(int)

with open("./Random.txt") as infile:
    while True:
        data = infile.read(1000000)
        if not data:
            break
        for digit in "0123456789":
            counts[digit] = counts[digit] + data.count(digit)

with open("./fastout.txt", mode="w") as outfile:
    for digit in "0123456789":
        count = counts[digit]
        while count > 1000000:
            outfile.write(digit * 1000000)
            count -= 1000000
        if count > 0:
            outfile.write(digit * count)
```
Your original results:
```
$ time python3 original.py

real    3m22.689s
user    3m10.143s
sys     0m9.797s
```
My results:
```
$ time python3 new.py

real    0m14.001s
user    0m13.297s
sys     0m0.471s
```
I also noticed that your output file is a little longer than the input file, so you have a bug in there somewhere that I didn’t bother finding.
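One way to hunt for that bug would be to verify that the output is an exact permutation of the input by comparing per-digit counts. This is a hypothetical checker, not part of either script above; it streams both files, so it works on 1GB inputs:

```python
from collections import Counter

def same_digit_counts(infile, outfile, chunk=1_000_000):
    # The sorted output must contain exactly the same digits as the
    # input, so their per-digit tallies should match.
    def digit_counts(path):
        counts = Counter()
        with open(path) as f:
            while True:
                data = f.read(chunk)
                if not data:
                    break
                counts.update(data)
        return counts
    return digit_counts(infile) == digit_counts(outfile)
```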