I have the following code for producing a big text file:
import random

d = 3
n = 100000
f = open("input.txt", 'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0, 1000), 3)) + " "
s += str(round(random.uniform(0, 1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0, 1000), 3)) + " "
    s += str(round(random.uniform(0, 1000), 3))
    f.write("\n" + s)
f.close()
But it seems to be pretty slow even for generating 5 GB of this.
How can I make it faster? I want the output to look like this:
796.802 691.462 803.664 849.483 201.948 452.155 144.174 526.745 826.565 986.685 238.462 49.885 137.617 416.243 515.474 366.199 687.629 423.929
Answer
Well, of course, the whole thing is I/O bound. You can’t output the file faster than the storage device can write it. Leaving that aside, there are some optimizations that could be made.
Your method of building up a long string from several shorter strings is
suboptimal. You’re saying, essentially, s = s1 + s2. When you tell
Python to do this, it concatenates two string objects to make a new
string object. This is slow, especially when repeated.
A much better way is to collect the individual string objects in a list
or other iterable, then use the join
method to run them together. For
example:
>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'
Instead of n-1 string concatenations to join n strings, this does the whole thing in one step.
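To make that concrete for the kind of line you’re building, here’s a small sketch with made-up values showing that the loop of concatenations and the single join produce the same string:

pieces = ['796.802', '691.462', '803.664']

# n-1 concatenations, creating a new intermediate string at each step:
s = pieces[0]
for p in pieces[1:]:
    s = s + ' ' + p

# one step:
assert s == ' '.join(pieces)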
There’s also a lot of repeated code that could be combined. Here’s a cleaner design, still using the loops.
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')
f.close()
A cleaner, briefer, more Pythonic way is to use a list comprehension:
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')
f.close()
Note that in both cases, I wrote the newline separately. That should be faster than concatenating it to the string, since I/O is buffered anyway. If I were joining a list of strings without separators, I’d just tack on a newline as the last string before joining.
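As an illustration of that last point (not a change to the code above), here is roughly what the two options look like, with made-up values:

import random

d = 3
nums = [str(round(random.uniform(0, 1000), 3)) for _ in range(d)]

# With a space separator, append the newline after joining (or write it separately):
line = ' '.join(nums) + '\n'

# With no separator, just make the newline the last piece before joining:
line = ''.join(nums + ['\n'])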
As Daniel’s answer says, numpy is probably faster, but maybe you don’t want to get into numpy yet; it sounds like you’re kind of a beginner at this point.
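If you do decide to try numpy later, the rough shape of that approach looks something like the sketch below. This isn’t Daniel’s exact code, and the format string is an assumption (it writes a fixed three decimal places, which is close to but not identical to round(..., 3)):

import numpy as np

d = 3
n = 1000

# Generate the whole n-by-d table of uniform floats in one call,
# then let savetxt handle formatting, separators, and newlines.
data = np.random.uniform(0, 1000, size=(n, d))
np.savetxt('input.txt', data, fmt='%.3f', delimiter=' ')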