I seem to be banging my head against a newbie error, and I am not a newbie.
I have a 1.2G known-good zipfile ‘train.zip’ containing a 3.5G file ‘train.csv’.
I open the zipfile and file itself without any exceptions (no LargeZipFile), but the resulting filestream appears to be empty. (UNIX ‘unzip -c …’ confirms it is good)
The file objects returned by Python's ZipFile.open() are not seek’able or tell’able, so I can’t check that.
The Python distribution is 2.7.3, EPD-free 7.3-1 (32-bit), which should be OK for large zips. The OS is MacOS 10.6.6.
    import csv
    import os
    import zipfile as zf

    zip_pathname = os.path.join('/my/data/path/.../', 'train.zip')
    file_name = 'train.csv'

    #with zf.ZipFile(zip_pathname).open('train.csv') as z:
    z = zf.ZipFile(zip_pathname, 'r', zf.ZIP_DEFLATED, allowZip64=True)  # I tried all permutations
    z.debug = 1
    z.testzip()  # zipfile integrity is ok

    z1 = z.open('train.csv', 'r')  # our file keeps coming up empty?

    # Check the info to confirm z1 is indeed a valid 3.5Gb file...
    z1i = z.getinfo(file_name)
    for att in ('filename', 'file_size', 'compress_size', 'compress_type', 'date_time', 'CRC', 'comment'):
        print '%s:\t' % att, getattr(z1i, att)
    # ... and it looks ok. compress_type = 9 ok?
    #filename: train.csv
    #file_size: 3729150126
    #compress_size: 1284613649
    #compress_type: 9
    #date_time: (2012, 8, 20, 15, 30, 4)
    #CRC: 1679210291

    # All attempts to read z1 come up empty?!
    # z1.readline() gives ''
    # z1.readlines() gives []
    # z1.read() takes ~60sec but also returns '' ?

    # code I would want to run is:
    reader = csv.reader(z1)
    header = reader.next()
    return reader  # snippet is taken from inside a function
Answer
The cause is the combination of:
- this file’s compression type is type 9: Deflate64/Enhanced Deflate (PKWare’s proprietary format, as opposed to the more common type 8: Deflate) [ZIP format specification, Section 4.4.5 compression method]
- and a zipfile bug: it did not throw an exception for unsupported compression types. It used to just silently return a bad file object. Aargh. How bogus. UPDATE: I filed bug 14313 and it was fixed back in 2012, so zipfile now raises NotImplementedError when the compression type is unknown.
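On an interpreter that still has the silent-failure behaviour, one way to catch this early is to inspect each member's compress_type before reading it. The following is only a sketch: the helper name unsupported_members and the SUPPORTED_METHODS set are made up for illustration, and it assumes zlib is available so that type 8 really is readable.

```python
import zipfile

# Methods this interpreter's zipfile module knows how to decompress.
# ZIP_BZIP2 and ZIP_LZMA only exist on Python 3.3+, hence the hasattr() filter.
SUPPORTED_METHODS = set(
    getattr(zipfile, name)
    for name in ("ZIP_STORED", "ZIP_DEFLATED", "ZIP_BZIP2", "ZIP_LZMA")
    if hasattr(zipfile, name)
)


def unsupported_members(archive):
    """Return (filename, compress_type) pairs that zipfile cannot decompress.

    `archive` is an already opened zipfile.ZipFile (hypothetical helper).
    """
    return [
        (info.filename, info.compress_type)
        for info in archive.infolist()
        if info.compress_type not in SUPPORTED_METHODS
    ]
```

For the archive in the question, this would be expected to report train.csv with method 9, since that is what getinfo() printed.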
A command-line workaround is to unzip, then rezip, to get a plain type 8: Deflate.
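If you would rather script that workaround, here is a rough sketch. It assumes the external Info-ZIP unzip tool is on the PATH (the question already used it to confirm the archive is good), and the function names extract_with_unzip and rezip_as_deflate are invented for this example.

```python
import os
import subprocess
import zipfile


def extract_with_unzip(zip_path, dest_dir):
    """Extract with the external `unzip` tool, since zipfile cannot read type 9."""
    subprocess.check_call(["unzip", "-q", zip_path, "-d", dest_dir])


def rezip_as_deflate(src_dir, dest_zip):
    """Write every file under src_dir into dest_zip using plain Deflate (type 8).

    dest_zip should live outside src_dir so the new archive does not include itself.
    """
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as out:
        for root, _dirs, files in os.walk(src_dir):
            for fname in files:
                full = os.path.join(root, fname)
                # Store paths relative to src_dir so the archive layout is preserved.
                out.write(full, os.path.relpath(full, src_dir))
```

Calling extract_with_unzip() on the original archive and then rezip_as_deflate() on the extracted directory should give a file that zipfile can open normally. allowZip64=True matters here because train.csv is larger than 2 GB uncompressed.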
zipfile will throw an exception in 2.7 and 3.2+. I guess zipfile will never be able to actually handle type 9, for legal reasons. The Python doc makes no mention whatsoever that zipfile cannot handle other compression types :(
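On a version that includes the fix, the NotImplementedError can be caught and turned into a message that says what to do about it. This is a minimal sketch; open_member is a hypothetical wrapper, not part of zipfile, and on older interpreters the exception is simply never raised.

```python
import zipfile


def open_member(archive, name):
    """Open `name` from an opened ZipFile, explaining unsupported compression."""
    try:
        return archive.open(name)
    except NotImplementedError:
        # compress_type comes from the archive's own directory entry.
        method = archive.getinfo(name).compress_type
        raise RuntimeError(
            "%s uses compression method %d, which zipfile cannot decompress; "
            "unzip it with an external tool and re-zip it as Deflate (type 8)"
            % (name, method)
        )
```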