I am trying to stream data through a subprocess, gzip it, and write it to a file. The following works, but I wonder whether it is possible to use Python’s native gzip library instead.
fid = gzip.open(self.ipFile, 'rb') # input data
oFid = open(filtSortFile, 'wb') # output file
sort = subprocess.Popen(args="sort | gzip -c ", shell=True, stdin=subprocess.PIPE, stdout=oFid) # set up the pipe
processlines(fid, sort.stdin, filtFid) # pump data into the pipe
THE QUESTION: How do I do this with Python’s gzip package instead? I’m mostly curious to know why the following gives me a text file (instead of a compressed binary version) … very odd.
fid = gzip.open(self.ipFile, 'rb')
oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=oFid)
processlines(fid, sort.stdin, filtFid)
Answer
subprocess writes to oFid.fileno(), but gzip returns the fd of the underlying file object, so the output of sort goes straight into the file without being compressed:
def fileno(self):
    """Invoke the underlying file object's fileno() method."""
    return self.fileobj.fileno()
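The effect of that fileno() behaviour can be reproduced without a subprocess. The following is a minimal sketch, not taken from the original answer: the helper name `bytes_on_disk_after_fd_write` and the file name `demo.gz` are made up for illustration, and `os.write` stands in for what a child process does when it is handed the descriptor.

```python
import gzip
import os
import tempfile

def bytes_on_disk_after_fd_write(payload):
    """Write payload to the descriptor from GzipFile.fileno(); return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.gz")  # hypothetical file name
        with gzip.open(path, "wb") as gz:
            # Stand-in for a child process started with stdout=gz:
            # the bytes go to the raw descriptor and never reach zlib.
            os.write(gz.fileno(), payload)
        with open(path, "rb") as raw:
            return raw.read()

payload = b"b\na\n"
on_disk = bytes_on_disk_after_fd_write(payload)
print(payload in on_disk)  # whether the payload sits in the file uncompressed
```

If the payload shows up verbatim in the file, the data bypassed compression, which matches the "text file" seen in the question.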
To enable compression, use gzip methods directly:
import gzip
from subprocess import Popen, PIPE
from threading import Thread

def f(input, output):
    # Copy sort's output into the gzip file line by line.
    for line in iter(input.readline, b''):
        output.write(line)
    output.close()  # flush the compressed data and write the gzip trailer

p = Popen(["sort"], bufsize=-1, stdin=PIPE, stdout=PIPE)
Thread(target=f, args=(p.stdout, gzip.open('out.gz', 'wb'))).start()

for s in "cafebabe":
    p.stdin.write((s + "\n").encode())  # the pipe is binary, so send bytes
p.stdin.close()
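When the data fits in memory, the reader thread can be avoided. This is a hedged alternative sketch rather than the approach in the answer: it lets `subprocess.run` collect all of sort's output and then writes it through `gzip.open`. The function name `sort_and_gzip` is hypothetical, and setting `LC_ALL=C` is an assumption made here to get byte-wise ordering regardless of locale.

```python
import gzip
import os
import subprocess
import tempfile

def sort_and_gzip(lines, out_path):
    """Sort byte lines with the external `sort` and store the result gzip-compressed."""
    env = dict(os.environ, LC_ALL="C")  # assumption: byte-wise ordering is wanted
    result = subprocess.run(
        ["sort"], input=b"".join(lines), stdout=subprocess.PIPE, env=env, check=True
    )
    # Compression happens here, in Python, because we call write() on the GzipFile.
    with gzip.open(out_path, "wb") as gz:
        gz.write(result.stdout)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.gz")
    sort_and_gzip([c.encode() + b"\n" for c in "cafebabe"], out_path)
    with gzip.open(out_path, "rb") as gz:
        roundtrip = gz.read()
print(roundtrip)
```

The trade-off is memory: everything `sort` emits is held in `result.stdout` before it is compressed, so the threaded version is the better fit for truly streaming input.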
Example
$ python gzip_subprocess.py && od -c out.gz && zcat out.gz
0000000 037 213 b b 251 E t N 002 377 o u t K 344
0000020 J 344 J 002 302 d 256 T L 343 002 j 017 j
0000040 k 020
0000045
a
a
b
b
c
e
e
f
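The same check can be done from Python instead of with od and zcat. A small sketch under stated assumptions: `looks_gzipped` is a hypothetical helper, and `sample.gz` is a temporary stand-in for out.gz so that the snippet runs on its own. The two magic bytes it tests for are the ones od -c prints as 037 213.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # octal 037 213, the start of every gzip file

def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as raw:
        return raw.read(2) == GZIP_MAGIC

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.gz")  # hypothetical stand-in for out.gz
    with gzip.open(sample, "wb") as gz:
        gz.write(b"a\na\nb\n")
    is_gzip = looks_gzipped(sample)
    # Reading back through gzip plays the role of zcat.
    with gzip.open(sample, "rb") as gz:
        lines = gz.read().splitlines()
print(is_gzip, lines)
```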