I am building a CSV chunk by chunk using the csv module from the standard library.
This means that I am adding rows one by one in a loop. Each row that I add contains information for each column of my dataframe.
So, I have this CSV:
A B C D
And I am adding rows one by one:
A B C D
aaaaa bbb ccccc ddddd
a1a1a b1b1 c1c1c1 d1d1d1
a2a2a b2b2 c2c2c2 d2d2d2
And so on.
My problem is that sometimes, the row that I am adding contains MORE information (that is, information that does not have a column). For example:
A B C D
aaaaa bbb ccccc ddddd
a1a1a b1b1 c1c1c1 d1d1d1
a2a2a b2b2 c2c2c2 d2d2d2
a3a3a b3b3 c3c3c3 d3d3d3 e3e3e3 # this row has extra information
My question is: Is there any way to make the CSV grow (during runtime) when that happens? (By ‘grow’ I mean adding the “extra” columns.)
So basically I want this to happen:
A     B    C      D      E       # this column was added because
aaaaa bbb  ccccc  ddddd          # of the extra column found
a1a1a b1b1 c1c1c1 d1d1d1         # in the new row
a2a2a b2b2 c2c2c2 d2d2d2
a3a3a b3b3 c3c3c3 d3d3d3 e3e3e3
I am adding the rows using the csv module from the standard library, the with statement and a dictionary:
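import csv

addThis = {'A': 'a3a3a', 'B': 'b3b3', 'C': 'c3c3c3', 'D': 'd3d3d3', 'E': 'e3e3e3'}

with open('csvFile', 'a', newline='') as f:
    # A DictWriter is needed to write dictionaries; it only knows columns A-D here
    writer = csv.DictWriter(f, fieldnames=['A', 'B', 'C', 'D'])
    writer.writerow(addThis)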
As you can see, in the dictionary that I’m adding, I specify the name of the new column. What happens when I try that is that I get this exception:
ValueError: dict contains fields not in fieldnames: 'E'
I have tried adding the “extra” fieldname to the csv writer before adding the row like this:
fields = writer.__getattribute__('fieldnames')
writer.fieldnames = fields + ['E']
Note: It seems from this example that I already know that E will be added, but that is not the case. I showed it like this just for the example. I don’t know what the “extra” data will be until I get the “extra” rows (which I get over a period of time from a web scrape).
Changing fieldnames that way avoids the exception, but it does not add the extra column to the header, so I end up with something like this:
A B C D
aaaaa bbb ccccc ddddd
a1a1a b1b1 c1c1c1 d1d1d1
a2a2a b2b2 c2c2c2 d2d2d2
a3a3a b3b3 c3c3c3 d3d3d3 e3e3e3 # value is added but the column name is not there
I am not using Pandas because I understand that Pandas is designed to load fully populated DataFrames, but I am open to using something besides the csv module if you suggest it. Any ideas regarding that?
Thanks for your help and sorry for the long question, I tried to be as clear as possible.
Answer
I think you would need to rewrite the entire file when that happens. Currently you are opening the file in a (append) mode, so you can only add data at the end of the file. The header row is the first line, so a new column name cannot be added to it that way, and I don’t think there is an easy way to insert something in the middle of a file.
The easiest solution would then be to read the entire file into memory, add the new column to the header row and then rewrite the complete file.
See this question for an example of how you could do that.
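The steps above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: append_row_growing is a hypothetical helper name, and it assumes the file is small enough to read into memory, that its first line is the header, and that older rows should get an empty string in any new column.

```python
import csv
import os


def append_row_growing(path, row, restval=''):
    """Append `row` (a dict) to the CSV at `path`, adding new columns if needed.

    Hypothetical helper: returns the list of column names after the write.
    """
    # Read the current header and rows if the file already exists and is non-empty.
    fieldnames, rows = [], []
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, newline='') as f:
            reader = csv.DictReader(f)
            fieldnames = list(reader.fieldnames or [])
            rows = list(reader)

    # Any key that is not yet in the header becomes a new column.
    new_fields = [key for key in row if key not in fieldnames]

    if fieldnames and not new_fields:
        # Header unchanged: a plain append is enough.
        with open(path, 'a', newline='') as f:
            csv.DictWriter(f, fieldnames=fieldnames, restval=restval).writerow(row)
        return fieldnames

    # Header changed (or the file is new): rewrite the whole file.
    fieldnames = fieldnames + new_fields
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval=restval)
        writer.writeheader()
        writer.writerows(rows)  # older rows get `restval` in the new columns
        writer.writerow(row)
    return fieldnames
```

Calling append_row_growing('csvFile', addThis) for every scraped row would then append normally in the common case and only rewrite the file when a row brings a column that is not in the header yet. If new columns show up often and the file gets large, the repeated rewrites become expensive, and collecting the rows in memory and writing the file once at the end may be the better trade-off.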