Wow, the gzip module kinda sucks

I needed to scan some pretty massive gzipped text files, so my first try was the obvious "for line in gzip.open(...)." This worked but seemed way slower than expected. So I wrote "pyzcat" as a test and ran it against a file with 100k lines:

#!/usr/bin/python

import sys, gzip

for fname in sys.argv[1:]:
  for line in gzip.open(fname):
      print line,

Results:

$ time zcat testzcat.gz > /dev/null
real    0m0.329s

$ time ./pyzcat testzcat.gz > /dev/null
real    0m3.792s

More than 10x slower (3.79s vs. 0.33s) -- ouch! Well, if zcat is so much better, let's try using zcat to do the reads for us:

def gziplines(fname):
  from subprocess import Popen, PIPE
  f = Popen(['zcat', fname], stdout=PIPE)
  for line in f.stdout:
      yield line

for fname in sys.argv[1:]:
  for line in gziplines(fname):
      print line,

Results:

$ time ./pyzcat2 testzcat.gz |wc
real    0m0.750s

So, reading from a zcat subprocess is 5x faster than using the gzip module. cGzipFile anyone?
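
One caveat with the generator above: it never waits on the zcat process or checks its exit status, so a failed decompression can go unnoticed by the script. Here is a minimal sketch of a more careful variant, assuming Python 3 and a zcat binary on the PATH. The `command` parameter is a hypothetical addition (not in the original script) so that a different decompressor can be swapped in; lines come back as bytes.

```python
import subprocess


def gziplines(fname, command=("zcat",)):
    """Yield lines (as bytes) of a gzipped file via an external decompressor.

    `command` is a hypothetical parameter, not part of the original script.
    """
    proc = subprocess.Popen(list(command) + [fname], stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Runs on normal exhaustion and when the caller stops iterating early.
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, proc.args)
```

The exit status is only checked when the output is consumed completely; if the caller breaks out of the loop early, the pipe is closed and the child is reaped without raising.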

Comments

junklight said…
Nice to know someone else has found the same issue. I got someone to write me a C library to deal with the gzip files I am working with, but it is targeted at arc files (as used by the Internet Archive) and is not general enough for a cGzipFile. It would be nice to see this fixed in the core library, though.
Anonymous said…
gzip was made faster for python 2.5 (something about 30%-40% speed improvements in gzip.readline)
Anonymous said…
Note that the gzip module is already implemented in C and it calls libz for the actual work.

Just increase the block size in which you do I/O to 1 MB and the performance will be close.

#pyzcat-large-block:
#!/usr/bin/env python

import sys
import gzip

BLOCK_SIZE = 2**20

f = gzip.open(sys.argv[1])
for i in iter(lambda: f.read(BLOCK_SIZE), ''):
    sys.stdout.write(i)

# timing results on a 150 MB gzip file
$ /usr/bin/time zcat t.gz > /dev/null
2.12user 0.11system 0:02.68elapsed 83%CPU (0avgtext+0avgdata 0maxresident)k

$ /usr/bin/time ./pyzcat t.gz > /dev/null
2.40user 0.57system 0:03.02elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+90189minor)pagefaults 0swaps
Jonathan Ellis said…
libz is C, but the gzip module is python.

feeding your 1MB chunks to a cStringIO for turning into individual lines might speed things up vs the python code in GzipFile, but it's going to take some work when lines span your chunks.
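
A minimal sketch of that chunk-and-split idea, assuming Python 3 and using a plain bytes `split` in place of cStringIO. The name `chunked_lines` is hypothetical, and whether this beats GzipFile's own line iteration is not measured here. The partial line at the end of each block is carried over and prepended to the next one, which handles lines that span chunks.

```python
import gzip

BLOCK_SIZE = 2**20  # 1 MB, matching the block size suggested above


def chunked_lines(fname, block_size=BLOCK_SIZE):
    """Yield lines (as bytes) by reading large blocks and splitting them."""
    tail = b""  # partial line carried over from the previous block
    with gzip.open(fname, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            pieces = (tail + block).split(b"\n")
            # The last piece has no newline yet (it is empty if the block
            # ended exactly on a line boundary), so hold it back.
            tail = pieces.pop()
            for piece in pieces:
                yield piece + b"\n"
    if tail:
        yield tail  # final line without a trailing newline
```

A line much longer than the block size makes the `tail + block` concatenation repeat for every block, so this sketch suits ordinary text files rather than pathological inputs.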
Anonymous said…
Your comparison of zcat and pyzcat is testing completely different things. If you make pyzcat do what zcat does, you'll find the timing isn't that different.

zcat doesn't give you a line-oriented interface but your pyzcat is doing buffering and scanning to make that happen.

#!/usr/bin/python
import sys, gzip, shutil
for fname in sys.argv[1:]:
    shutil.copyfileobj(gzip.open(fname), sys.stdout)

That gives me 0.369s for zcat and 0.530s for a more similar pyzcat.
Jonathan Ellis said…
Still, the point that GzipFile's readline is hella slow vs zcat via subprocess remains.
Anonymous said…
Of course, zcat does NOT search for newlines and split lines. You can't really compare apples and oranges.
Jonathan Ellis said…
I'm comparing code to generate lines via GzipFile to code to generate lines via subprocess + zcat. That is certainly apples to apples.

I guess most people commenting here didn't read past the first block of code.
