I have the following code for producing a big text file:
import random

d = 3
n = 100000
f = open("input.txt", 'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0, 1000), 3)) + " "
s += str(round(random.uniform(0, 1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0, 1000), 3)) + " "
    s += str(round(random.uniform(0, 1000), 3))
    f.write("\n" + s)
f.close()
But it seems to be pretty slow even for generating 5 GB of this.
How can I make it faster? I want the output to look like this:
796.802 691.462 803.664 849.483 201.948 452.155 144.174 526.745 826.565 986.685 238.462 49.885 137.617 416.243 515.474 366.199 687.629 423.929
Answer
Well, of course, the whole thing is I/O bound. You can’t output the file faster than the storage device can write it. Leaving that aside, there are some optimizations that could be made.
Your method of building up a long string from several shorter strings is
suboptimal. You’re saying, essentially, s = s1 + s2. When you tell
Python to do this, it concatenates two string objects to make a new
string object. This is slow, especially when repeated.
A much better way is to collect the individual string objects in a list
or other iterable, then use the join
method to run them together. For
example:
>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'
Instead of n-1 string concatenations to join n strings, this does the whole thing in one step.
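To make that concrete for the kind of line you’re building, here’s a small sketch with made-up values showing that the loop of concatenations and the single join produce the same string:

pieces = ['796.802', '691.462', '803.664']

# n-1 concatenations, creating a new intermediate string at each step:
s = pieces[0]
for p in pieces[1:]:
    s = s + ' ' + p

# one step:
assert s == ' '.join(pieces)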
There’s also a lot of repeated code that could be combined. Here’s a cleaner design, still using the loops.
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')
f.close()
A cleaner, briefer, more Pythonic way is to use a list comprehension:
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')
f.close()
Note that in both cases, I wrote the newline separately. That should be faster than concatenating it to the string, since I/O is buffered anyway. If I were joining a list of strings without separators, I’d just tack on a newline as the last string before joining.
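As an illustration of that last point (not a change to the code above), here is roughly what the two options look like, with made-up values:

import random

d = 3
nums = [str(round(random.uniform(0, 1000), 3)) for _ in range(d)]

# With a space separator, append the newline after joining (or write it separately):
line = ' '.join(nums) + '\n'

# With no separator, just make the newline the last piece before joining:
line = ''.join(nums + ['\n'])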
As Daniel’s answer says, numpy is probably faster, but maybe you don’t want to get into numpy yet; it sounds like you’re kind of a beginner at this point.
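If you do decide to try numpy later, the rough shape of that approach looks something like the sketch below. This isn’t Daniel’s exact code, and the format string is an assumption (it writes a fixed three decimal places, which is close to but not identical to round(..., 3)):

import numpy as np

d = 3
n = 1000

# Generate the whole n-by-d table of uniform floats in one call,
# then let savetxt handle formatting, separators, and newlines.
data = np.random.uniform(0, 1000, size=(n, d))
np.savetxt('input.txt', data, fmt='%.3f', delimiter=' ')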