I was writing data to Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I only see two compression types being used most of the time: Snappy and Gzip. Does Parquet also support other codecs such as Deflate and LZO?


Best Answer


The supported compression types for Apache Parquet are specified in the parquet-format repository:

/**
 * Supported compression algorithms.
 *
 * Codecs added in 2.4 can be read by readers based on 2.4 and later.
 * Codec support may vary between readers based on the format version and
 * libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
 * widely available, while Zstd and Brotli require additional libraries.
 */
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4; // Added in 2.4
  LZ4 = 5;    // Added in 2.4
  ZSTD = 6;   // Added in 2.4
}

https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L461

Snappy and Gzip are the most commonly used codecs and are supported by all implementations. LZ4 and ZSTD yield better results than the former two, but they are a rather new addition to the format, so they are only supported in newer versions of some implementations.
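A quick way to see which codecs your particular build supports is to try writing with each one and let Spark fail fast on the missing ones. A minimal sketch for spark-shell, assuming a running SparkSession named spark and a hypothetical scratch path /tmp/parquet-codec-test:

import scala.util.Try

val df = spark.range(0, 100000).toDF("id")

// Codecs that are unknown to this Spark version or missing from the
// classpath (often lzo and brotli) will throw on the first write.
for (codec <- Seq("snappy", "gzip", "lzo", "brotli", "lz4", "zstd")) {
  val attempt = Try {
    df.write
      .mode("overwrite")
      .option("compression", codec)
      .parquet(s"/tmp/parquet-codec-test/$codec")
  }
  println(s"$codec -> ${if (attempt.isSuccess) "ok" else "unsupported"}")
}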

In Spark 2.1

From the Spark source code, branch 2.1:

You can set the following Parquet-specific option(s) for writing Parquet files:

compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
...
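In practice that gives you two knobs: a per-write option and a session-wide config, with the per-write option winning when both are set. A minimal sketch, assuming a SparkSession named spark and a hypothetical output path /user/hive/warehouse/events:

val df = spark.range(0, 1000).toDF("id")

// Per-write option; this overrides spark.sql.parquet.compression.codec.
df.write
  .mode("overwrite")
  .option("compression", "gzip")
  .parquet("/user/hive/warehouse/events")

// Session-wide default for all Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/user/hive/warehouse/events")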

In Spark 2.4 / 3.0

Overall, the supported compression codecs are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd.
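The same two mechanisms accept the newer codecs there. A minimal sketch, again assuming a SparkSession named spark and a hypothetical path /tmp/events_zstd; note that, depending on your Spark/Hadoop build, zstd and brotli may need extra native libraries on the classpath:

val df = spark.range(0, 1000).toDF("id")

// zstd per write (Spark 2.4+).
df.write
  .mode("overwrite")
  .option("compression", "zstd")
  .parquet("/tmp/events_zstd")

// Or set it as the session default.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")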