I was writing data to Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I only see two compression types being used most of the time: Snappy and Gzip. Does Parquet also support other codecs such as Deflate and LZO?


Best Answer


The supported compression types for Apache Parquet are specified in the parquet-format repository:

/**
 * Supported compression algorithms.
 *
 * Codecs added in 2.4 can be read by readers based on 2.4 and later.
 * Codec support may vary between readers based on the format version and
 * libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
 * widely available, while Zstd and Brotli require additional libraries.
 */
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4; // Added in 2.4
  LZ4 = 5;    // Added in 2.4
  ZSTD = 6;   // Added in 2.4
}

https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L461

Snappy and Gzip are the most commonly used codecs and are supported by all implementations. LZ4 and ZSTD yield better results than the former two, but they are a rather new addition to the format, so they are only supported in newer versions of some implementations.
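A quick way to see which codecs your particular build supports is to try writing with each one and let Spark fail fast on the missing ones. A minimal sketch for spark-shell, assuming a running SparkSession named spark and a hypothetical scratch path /tmp/parquet-codec-test:

import scala.util.Try

val df = spark.range(0, 100000).toDF("id")

// Codecs that are unknown to this Spark version or missing from the
// classpath (often lzo and brotli) will throw on the first write.
for (codec <- Seq("snappy", "gzip", "lzo", "brotli", "lz4", "zstd")) {
  val attempt = Try {
    df.write
      .mode("overwrite")
      .option("compression", codec)
      .parquet(s"/tmp/parquet-codec-test/$codec")
  }
  println(s"$codec -> ${if (attempt.isSuccess) "ok" else "unsupported"}")
}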

In Spark 2.1

From the Spark source code, branch 2.1:

You can set the following Parquet-specific option(s) for writing Parquet files:

compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
...
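In practice that gives you two knobs: a per-write option and a session-wide config, with the per-write option winning when both are set. A minimal sketch, assuming a SparkSession named spark and a hypothetical output path /user/hive/warehouse/events:

val df = spark.range(0, 1000).toDF("id")

// Per-write option; this overrides spark.sql.parquet.compression.codec.
df.write
  .mode("overwrite")
  .option("compression", "gzip")
  .parquet("/user/hive/warehouse/events")

// Session-wide default for all Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/user/hive/warehouse/events")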

In Spark 2.4 / 3.0

Overall, the supported compression codecs are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd.
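The same two mechanisms accept the newer codecs there. A minimal sketch, again assuming a SparkSession named spark and a hypothetical path /tmp/events_zstd; note that, depending on your Spark/Hadoop build, zstd and brotli may need extra native libraries on the classpath:

val df = spark.range(0, 1000).toDF("id")

// zstd per write (Spark 2.4+).
df.write
  .mode("overwrite")
  .option("compression", "zstd")
  .parquet("/tmp/events_zstd")

// Or set it as the session default.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")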