
How to create a big file quickly with Python

I have the following code for producing a big text file:

import random

d = 3
n = 100000
f = open("input.txt",'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0,1000), 3))+" "
s += str(round(random.uniform(0,1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0,1000), 3))+" "
    s += str(round(random.uniform(0,1000), 3))
    f.write("\n"+s)
f.close()

But it seems pretty slow, even for generating around 5 GB of this.

How can I make it faster? I want the output to look like:

796.802 691.462 803.664
849.483 201.948 452.155
144.174 526.745 826.565
986.685 238.462 49.885
137.617 416.243 515.474
366.199 687.629 423.929


Answer

Well, of course, the whole thing is I/O bound. You can’t output the file faster than the storage device can write it. Leaving that aside, there are some optimizations that could be made.

Your method of building up a long string from several shorter strings is suboptimal. You’re saying, essentially, s = s1 + s2. When you tell Python to do this, it allocates a brand-new string object and copies both operands into it. Repeated in a loop, that copying grows with the length of the string so far, so the total work is roughly quadratic in the final length.
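A quick way to see both the equivalence and the cost difference is to time repeated concatenation against a single join (absolute timings will vary by machine; the gap widens as the number of pieces grows):

```python
from timeit import timeit

pieces = [f"{x:.3f}" for x in range(200)]

def by_concat():
    s = ""
    for p in pieces:
        s = s + " " + p   # allocates a new, ever-longer string each iteration
    return s.lstrip()

def by_join():
    return " ".join(pieces)   # one pass, one allocation

# Both build the same string; join just does it in one step.
assert by_concat() == by_join()

print(timeit(by_concat, number=1000))
print(timeit(by_join, number=1000))
```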

A much better way is to collect the individual string objects in a list or other iterable, then use the join method to run them together. For example:

>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'

Instead of n-1 string concatenations to join n strings, this does the whole thing in one step.

There’s also a lot of repeated code that could be combined. Here’s a cleaner design, still using the loops.

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')

f.close()

A cleaner, briefer, more Pythonic way is to use a list comprehension:

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')

f.close()

Note that in both cases, I wrote the newline separately. That should be faster than concatenating it to the string, since I/O is buffered anyway. If I were joining a list of strings without separators, I’d just tack on a newline as the last string before joining.
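Spelled out, that trick is just making the newline the final element before joining:

```python
parts = ['796.802', ' ', '691.462', ' ', '803.664']
line = ''.join(parts + ['\n'])   # newline goes in as the last piece
print(repr(line))  # '796.802 691.462 803.664\n'
```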

As Daniel’s answer says, numpy is probably faster, but maybe you don’t want to get into numpy yet; it sounds like you’re kind of a beginner at this point.
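Daniel’s answer isn’t reproduced here, but a rough numpy sketch of the same task (assuming numpy is installed; the `fmt` argument to `np.savetxt` controls the decimal places) might look like:

```python
import numpy as np

d = 3
n = 1000

# One vectorized call builds the whole (n, d) matrix in memory;
# savetxt then writes each row with 3 decimals, space-separated.
data = np.random.uniform(0, 1000, size=(n, d))
np.savetxt('input.txt', data, fmt='%.3f', delimiter=' ')
```

Note that this holds the whole matrix in memory, so for a file of several gigabytes you would generate and write it in chunks rather than all at once.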

User contributions licensed under: CC BY-SA