I am trying to stream data through a subprocess, gzip it, and write it to a file. The following works, but I wonder whether it is possible to use Python's native gzip library instead.
    fid = gzip.open(self.ipFile, 'rb')      # input data
    oFid = open(filtSortFile, 'wb')         # output file
    sort = subprocess.Popen(args="sort | gzip -c ", shell=True, stdin=subprocess.PIPE, stdout=oFid)  # set up the pipe
    processlines(fid, sort.stdin, filtFid)  # pump data into the pipe
THE QUESTION: How do I do this with Python's gzip package instead? I'm mostly curious why the following gives me a text file (instead of a compressed binary one), which seems very odd.
    fid = gzip.open(self.ipFile, 'rb')
    oFid = gzip.open(filtSortFile, 'wb')
    sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=oFid)
    processlines(fid, sort.stdin, filtFid)
Answer
subprocess writes directly to the file descriptor oFid.fileno(), but gzip returns the fd of the underlying file object, so the child's output never passes through the compressor:
    def fileno(self):
        """Invoke the underlying file object's fileno() method."""
        return self.fileobj.fileno()
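The effect can be reproduced in isolation. The following is a minimal sketch, not part of the original answer: the helper name `write_via_subprocess` and the file name `demo.gz` are made up for illustration, and the child process is simply another Python interpreter that prints one line. It passes a gzip file object as `stdout` and then checks whether the child's text ends up uncompressed in the file on disk.

```python
import gzip
import os
import subprocess
import sys
import tempfile


def write_via_subprocess(path):
    """Pass a gzip file object as a child's stdout; return the raw bytes on disk.

    Hypothetical helper for illustration only.
    """
    with gzip.open(path, "wb") as gz:
        # subprocess asks gz for fileno(), which is the descriptor of the
        # plain underlying file, so the child writes around the compressor.
        subprocess.check_call(
            [sys.executable, "-c", "print('plain text from child')"],
            stdout=gz,
        )
    with open(path, "rb") as raw:
        return raw.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = write_via_subprocess(os.path.join(tmp, "demo.gz"))
        # The child's text is present verbatim, i.e. it was never compressed.
        print(b"plain text from child" in data)
```

The file also receives whatever gzip header and trailer the parent writes when the gzip object is closed, which is why the result looks like text with a little binary noise rather than a valid compressed stream.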
To enable compression, use gzip methods directly:
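    import gzip
    from subprocess import Popen, PIPE
    from threading import Thread

    def f(input, output):
        # Copy sort's output into the gzip file line by line, then close it.
        with output:
            for line in iter(input.readline, b''):
                output.write(line)

    p = Popen(["sort"], bufsize=-1, stdin=PIPE, stdout=PIPE)
    t = Thread(target=f, args=(p.stdout, gzip.open('out.gz', 'wb')))
    t.start()
    for s in "cafebabe":
        p.stdin.write((s + "\n").encode())  # one character per line, as bytes
    p.stdin.close()  # signal EOF so sort emits its output
    t.join()
    p.wait()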
Example
    $ python gzip_subprocess.py && od -c out.gz && zcat out.gz
    0000000 037 213 \b \b 251 E t N 002 377 o u t \0 K 344
    0000020 J 344 J 002 302 d 256 T L 343 002 j 017 j
    0000040 k 020 \0 \0 \0
    0000045
    a
    a
    b
    b
    c
    e
    e
    f
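For the specific case of sort, a thread may not be strictly necessary. The sketch below is one possible variant, assuming Python 3 and a `sort` executable on the PATH; the function name `sort_lines_to_gzip` is hypothetical. It relies on the fact that sort has to read all of its input before it can emit anything, so stdin can be fed first and stdout drained afterwards. For a filter that produces output while still reading input, this ordering could deadlock and the threaded approach above would be the safer choice.

```python
import gzip
import os
import shutil
import subprocess
import tempfile


def sort_lines_to_gzip(lines, out_path):
    """Pipe byte lines through sort and gzip the sorted output in Python.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # sort emits nothing until it has seen EOF on stdin, so writing
    # everything first and reading afterwards is safe for this command.
    for line in lines:
        proc.stdin.write(line)
    proc.stdin.close()
    with gzip.open(out_path, "wb") as gz:
        # The data passes through GzipFile.write(), so it is compressed.
        shutil.copyfileobj(proc.stdout, gz)
    proc.stdout.close()
    if proc.wait() != 0:
        raise subprocess.CalledProcessError(proc.returncode, "sort")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.gz")
        sort_lines_to_gzip(((c + "\n").encode() for c in "cafebabe"), path)
        with gzip.open(path, "rb") as f:
            print(f.read().decode())
```

The key difference from the failing attempt in the question is that the parent process reads the pipe and calls the gzip object's own write path, instead of handing the gzip object's file descriptor to the child.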