
How Apache Parquet Uses Dictionary Encoding and Snappy Compression for Efficient Storage


Apache Parquet is a columnar data storage format. It was originally developed by Cloudera and Twitter and was donated to the Apache Software Foundation in 2013. While it began as part of the Hadoop ecosystem, it has since become a core building block of modern data infrastructure.

Today Parquet is heavily used in data warehousing, data lakes, cloud storage, and machine learning pipelines.

One of the key reasons for its popularity is its efficiency: its columnar layout makes data highly compressible and enables fast, selective queries. Often, the data written to Parquet has already been denormalized (that is, flattened to reduce the need for costly joins), which further amplifies the benefits of compression and query efficiency.

I wanted to understand more deeply how Parquet stores and compresses data to achieve these efficiencies, but I struggled to find the right level of detail. Most explanations either stopped at “columnar is more efficient!” or dove straight into very technical documentation; finding the middle ground took some work.

As I explored, I spent a fair amount of time comparing and contrasting two of Parquet’s key methods for reducing file size: Dictionary Encoding and Snappy Compression.

On the surface, Dictionary Encoding and Snappy Compression seem similar. Both look for repeated patterns within the data and replace them with compact references in order to reduce file size. However, they operate at different layers of the process.

Dictionary Encoding typically happens first, reducing structural repetition of data within columns. Snappy is then applied to further compress the remaining data at the byte level.

The techniques are complementary and are often used together.

An Example

Dictionary Encoding allows Parquet (and other columnar formats like ORC or Apache Arrow) to replace repeated values with dictionary keys. It essentially builds a lookup table for values that appear repeatedly in a column.

For example, if you have a table that appends the episode name and season each time your kids watch a Bluey episode, then dictionary encoding could replace each episode name with a numeric key everywhere that name appears.

[Figure: dictionary encoding with Bluey episodes. Left: the original table of episode names and seasons (e.g., “Fairytale – 3”, “Sleepytime – 2”, “The Sign – 3”). Center: a dictionary mapping each unique episode name to a numeric key (e.g., 1 = Fairytale, 2 = Sleepytime, 5 = The Sign). Right: the encoded table, with episode names replaced by their keys and season values preserved.]
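
To make the mechanics concrete, here’s a toy sketch in Python. This is my own illustration, not Parquet’s actual implementation (the real encoder works on raw bytes and stores the dictionary in a dedicated dictionary page):

```python
# Toy dictionary encoding: build a lookup table in one pass over the column.
episodes = ["Fairytale", "Sleepytime", "The Sign", "The Sign", "Fairytale"]

dictionary = {}  # value -> key
encoded = []     # the column, stored as compact integer keys
for name in episodes:
    if name not in dictionary:
        dictionary[name] = len(dictionary)
    encoded.append(dictionary[name])

print(dictionary)  # {'Fairytale': 0, 'Sleepytime': 1, 'The Sign': 2}
print(encoded)     # [0, 1, 2, 2, 0]
```

Repeated strings are stored once in the dictionary, and the column itself shrinks to a list of small integers.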

This example is probably overkill for home viewing (though my kids really do watch a lot of Bluey) but imagine you’re doing analytics for the show at large. In 2024 in the United States alone people watched 35 billion (with a b) minutes of Bluey. At that scale, compressing title names makes a lot of sense. (Caveat: I don’t know how the Disney team actually does analytics; I used Bluey as my example purely because my children love it. However, this combination of semantic encoding followed by fast byte-level compression is a common pattern in big data pipelines where columnar storage formats aim to balance efficiency with scalability.)

Snappy Compression is lower-level. It finds repeated byte sequences within the data and replaces them with references. Snappy “aims for very high speeds and reasonable compression.”

In Parquet, Snappy is typically applied to pages, which are the smallest units of encoded column data. Each column is stored as a column chunk, and that chunk is broken into pages. While each page belongs to a single column chunk, the byte stream inside it can be more complex than a simple list of values: for nested structures, what appear to be multiple logical fields can be packed into a single stream. Snappy itself is unaware of these semantics; it just compresses bytes.
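
To see both layers switched on in practice, here’s a minimal sketch using PyArrow. The `use_dictionary` and `compression` arguments are real `pyarrow.parquet.write_table` options; the table itself is just our Bluey example:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "episode": ["Fairytale", "Sleepytime", "The Sign", "The Sign"],
    "season":  [3, 2, 3, 3],
})

# Dictionary-encode the columns first, then Snappy-compress each page.
pq.write_table(table, "bluey.parquet", use_dictionary=True, compression="snappy")
```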

In our example, Dictionary Encoding has already replaced The Sign with, say, the number 5. On the day the new 30-minute episode was released, Snappy would likely have encountered the number 5 over and over in the dictionary-encoded column. Say 1 million households all watched the new episode when it dropped. Instead of storing each of those 5s separately, Snappy would store a short back-reference that repeats the number 5 one million times, much like run-length encoding. Conceptually that’s Snappy compression, except all of this happens at the byte level, not the numeric level; none of it is designed to be human readable. Snappy doesn’t care that its underlying data refers to The Sign at all; to Snappy, everything is just a byte.
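
To make the byte-level point concrete, here’s a sketch assuming the python-snappy package (any Snappy binding would work the same way), compressing a page’s worth of identical dictionary keys:

```python
import snappy  # assumes the python-snappy package (pip install python-snappy)

# A stand-in for a page of dictionary indices that all point to "The Sign":
# one million copies of the byte 0x05.
page = bytes([5]) * 1_000_000

compressed = snappy.compress(page)
print(f"{len(page):,} bytes -> {len(compressed):,} bytes")
# Snappy's back-references cap out at 64 bytes per copy, so a run this long
# shrinks to a few tens of kilobytes rather than a handful of bytes.
```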

So: Dictionary Encoding is semantic. It finds meaningfully identical values within a column and replaces them with a compact key. Snappy is not semantic. It does not care what the data means; its only objective is compressing sequences of bytes.
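
You can see both layers recorded side by side in a file’s metadata. Continuing the PyArrow sketch from above (the attributes come from PyArrow’s column chunk metadata; the exact encodings listed will vary by writer version):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("bluey.parquet").metadata
chunk = md.row_group(0).column(0)  # the "episode" column chunk

print(chunk.encodings)    # the semantic layer, e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
print(chunk.compression)  # the byte layer: 'SNAPPY'
```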

I thought this was interesting and wanted to share what I learned. This is part of a larger piece of forthcoming research about data storage and table format specifications.
