Gzipping data directly from Perl

March 27, 2015 by brian d foy

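Perl can read and write gzipped streams through its I/O layers. Nicholas Clark recently updated PerlIO::gzip (with patches from Zefram), nine years after its previous release. It now supports Perl v5.20 and the upcoming v5.22, although it still has problems on Windows. But, as we've come to expect, there's more than one way to do it.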

The pipe way

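Perl is as powerful as Unix duct tape, and it's easy to read from or write to standard filehandles. You probably know about the three-argument open, but I can give it as many arguments as I like. For a pipe open, I set the mode as the second argument and pass the command as a list, just as I would with system (see the "Secure Programming" chapter of Mastering Perl). I remember which side of the | gets the - by thinking about where the command would sit: in '-|' the command is on the left and its output flows to me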

$ENV{PATH} = '';

open my $z, '-|', '/usr/bin/gunzip', '-c', 'moby_dick.txt.gz'
    or die "Could not start gunzip: $!";

while( <$z> ) {
    print;
    }

close $z 
    or die "There was a problem with the pipe open!";
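That die message doesn't say what went wrong. Here's a minimal sketch of how I might report more, assuming the behavior that perlfunc documents for close on a piped open: it returns false when the command exits with a non-zero status, and in that case $! is 0 and the status is in $?. The read_from_command name is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command with a list-form pipe open
# (so no shell is involved) and return the lines it printed.
sub read_from_command {
    my @command = @_;

    open my $pipe, '-|', @command
        or die "Could not start $command[0]: $!";

    my @lines = <$pipe>;

    unless( close $pipe ) {
        # A true $! means a real I/O error. Otherwise the command
        # itself failed and its exit status is in the high byte of $?.
        die $!
            ? "Error closing pipe: $!"
            : "Command exited with status " . ( $? >> 8 );
        }

    return @lines;
    }
```

I could call it as read_from_command( '/usr/bin/gunzip', '-c', 'moby_dick.txt.gz' ) to get the same lines as the loop above.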

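I can go the other way too, printing through a pipe to a command that gzips the data for me. The - flips to the other side of the |, and I use shell redirection to send gzip's output to a file. I don't use the list form here because I want the > in the command to be special to the shell (it would be nice if gzip had a switch to set the output filename)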

$ENV{PATH} = '';

open my $z, '|-', '/usr/bin/gzip > data.gz'
    or die "Could not start gzip: $!";

while( <STDIN> ) {
    print { $z } $_;
    }

close $z 
    or die "There was a problem with the pipe open!";

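This is a general form that I can use with any command. Its drawbacks are the extra processes and the dependency on external commands. If I could do the work directly in the Perl process, I wouldn't have those problems. Luckily I can, because Perl is like that.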

Reading gzipped data

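To read a gzipped file from within Perl, I can use the gzip I/O layer (see perlopen). Once I open the file, I can read its lines just as I would from a "normal" text file (assuming it is text)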

use PerlIO::gzip;
open my $fh, '<:gzip', $filename 
    or die "Could not read from $filename: $!";

while( <$fh> ) {
    print;
    }

Or, if the data aren't text, I can read bytes

use PerlIO::gzip;
open my $fh, '<:gzip', $filename 
    or die "Could not read from $filename: $!";

my $buffer;
while( read( $fh, $buffer, 1024 ) ) {
    ...; # do something with $buffer (... is a v5.12 feature!)
    }

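If I can't use the I/O layer, perhaps because the operating system doesn't support it or because it's broken on my version of Perl, I can use the IO::Compress family of modules instead. Decompression lives in IO::Uncompress::Gunzip. This example uses its object interface to create the read filehandle

use IO::Uncompress::Gunzip qw($GunzipError);

my $z = IO::Uncompress::Gunzip->new( $filename )
    or die "Could not read from $filename: $GunzipError";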

while( <$z> ) {
    print;
    }
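If I don't know ahead of time whether the layer is available, I can decide at run time. This is only a sketch: open_gzip_reader is a name I made up, and it assumes that a PerlIO::gzip that loads cleanly also works. It tries the layer first and falls back to IO::Uncompress::Gunzip, which comes with recent perls.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return a filehandle that yields the
# decompressed contents of a gzipped file.
sub open_gzip_reader {
    my( $filename ) = @_;

    # Prefer the I/O layer if PerlIO::gzip is installed and loads
    if( eval { require PerlIO::gzip; 1 } ) {
        open my $fh, '<:gzip', $filename
            or die "Could not read from $filename: $!";
        return $fh;
        }

    # Otherwise use the module's object interface
    my $z = IO::Uncompress::Gunzip->new( $filename )
        or die "Could not read from $filename: $GunzipError";

    return $z;
    }
```

Either way, I read from what comes back with the same while( <$fh> ) loop.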

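The I/O layer is faster than the module, but the PerlIO::gzip documentation warns that we shouldn't trust it with our data. People have been using it without major problems, but you might be the one who loses everything. Sinan Ünür wrote about the performance in "Large gzipped files, long lines, extracting columns etc".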

Writing gzipped data

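I can also write gzipped data directly to a file. It's similar to my earlier examples, except that the direction of the filehandle changes. This example uses the I/O layer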

use PerlIO::gzip;
open my $fh, '>:gzip', $filename 
    or die "Could not write to $filename: $!";

while( <STDIN> ) {
    print { $fh } $_;
    }

This example uses IO::Compress::Gzip

use IO::Compress::Gzip qw($GzipError);

my $z = IO::Compress::Gzip->new( $filename )
    or die "Could not write to $filename: $GzipError";

while( <STDIN> ) {
    print { $z } $_;
    }
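To convince myself that the writer and the reader agree, I can round-trip some data in memory. This is a sketch rather than anything from the module's documentation: gzip_lines and gunzip_string are names I made up, and I pass scalar references where the earlier examples passed filenames. I close the compressor explicitly so it flushes its data and writes the gzip trailer before I use the result.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: gzip a list of lines into a string
sub gzip_lines {
    my @lines = @_;

    my $compressed = '';
    my $z = IO::Compress::Gzip->new( \$compressed )
        or die "Could not make compress object: $GzipError";

    print { $z } $_ for @lines;

    # Closing flushes the remaining data and writes the trailer
    $z->close
        or die "Could not finish the gzip stream: $GzipError";

    return $compressed;
    }

# Hypothetical helper: decompress a gzipped string in one call
sub gunzip_string {
    my( $compressed ) = @_;

    gunzip( \$compressed => \my $plain )
        or die "Could not decompress: $GunzipError";

    return $plain;
    }
```

If everything works, gunzip_string( gzip_lines( @lines ) ) gives me back the lines joined together.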

An advanced trick

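I can read multiple streams of gzipped data through a single filehandle. The MultiStream option in IO::Uncompress::Gunzip lets the decompressor reset itself when it thinks it has detected a new stream, and carry on giving me output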

use IO::Uncompress::Gunzip qw($GunzipError);

my $z = IO::Uncompress::Gunzip->new( *STDIN, MultiStream => 1 )
    or die "Could not make uncompress object: $GunzipError";
    
while( <$z> ) {
    print;
    }
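To see what the option changes, I can glue two gzip streams together in memory and read them back with and without it. This is a small sketch: read_gzip_streams is a name I made up, and it reads from a scalar reference instead of STDIN so that it stands alone.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: decompress a string, passing any extra
# options through to the IO::Uncompress::Gunzip constructor.
sub read_gzip_streams {
    my( $data, %options ) = @_;

    my $z = IO::Uncompress::Gunzip->new( \$data, %options )
        or die "Could not make uncompress object: $GunzipError";

    my $text = '';
    while( defined( my $line = <$z> ) ) {
        $text .= $line;
        }

    return $text;
    }

# Two separate gzip streams, one after the other
my( $one, $two ) = ( "first stream\n", "second stream\n" );
gzip( \$one => \my $first  ) or die "gzip failed: $GzipError";
gzip( \$two => \my $second ) or die "gzip failed: $GzipError";
my $both = $first . $second;

print "Default:\n",     read_gzip_streams( $both );
print "MultiStream:\n", read_gzip_streams( $both, MultiStream => 1 );
```

With the default settings the reader stops at the end of the first stream. With MultiStream it keeps going and I get the lines from both.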