Trimming Parquet files to a single row

When writing tests, it can be useful to take data files from external systems and use them in your tests. These files can be quite large, so it makes sense to trim them down to a single record before checking them into Git. This is simple to do by hand for human-readable file formats like CSV and JSON, but not for binary file formats like Parquet. To my knowledge, there are no pre-built CLIs that do this.

In this post, we’ll install a command line tool and register it in your shell. After that, all you have to do is open a terminal and run the following command:

trimparquet file.parquet

This generates a new file, file.parquet_trimmed, with the same schema as the input file but only its first record.

Requirements:

  • Python 3 (which is pre-installed on most systems)

Run this command to download and install trimparquet:

curl https://gitlab.com/snippets/1918012/raw | bash

The output of this command will contain something like this:

Add this to ~/.bashrc or ~/.zshrc:
alias trimparquet="/home/ruurtjan/trimparquet/trim_parquet_env/bin/python /home/ruurtjan/p/trimparquet/trim.py"
Then open a new terminal and use it like this:
trimparquet file.parquet

Copy the line starting with ‘alias’ from the output and add it to your shell’s rc file, which is ~/.bashrc if you use Bash and ~/.zshrc if you use Zsh.
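If you would rather not edit the rc file by hand, the step above can be sketched as a small shell function. This is only an illustration: add_rc_line is a hypothetical helper that is not part of trimparquet, and the alias line you pass to it should be the one printed by your own installation.

```shell
# Hypothetical helper (not part of trimparquet): append a line to a shell
# rc file, unless that exact line is already present.
add_rc_line() {
  rc_file="$1"   # e.g. "$HOME/.bashrc" or "$HOME/.zshrc"
  line="$2"      # the full alias line from the installer output

  # Create the rc file if it does not exist yet, so grep has a file to read.
  touch "$rc_file"

  # -F: treat the line as a fixed string, -x: match whole lines only,
  # -q: print nothing and report the result through the exit status.
  if ! grep -Fxq -- "$line" "$rc_file"; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Example call (paste the alias line from your own installer output):
# add_rc_line "$HOME/.bashrc" 'alias trimparquet="..."'
```

Because the function skips lines that are already present, running it a second time does not leave a duplicate alias in the rc file.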

That’s it! When you open a new terminal, you can now trim Parquet files like this:

trimparquet file.parquet

If you already have pandas and fastparquet installed in some Python environment, you can of course reuse that environment to save some disk space. In that case, after installation, remove the trim_parquet_env directory and change the alias so that it uses that environment’s Python interpreter instead.
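As a rough sketch of that change, the two functions below check whether an interpreter can import both packages and print a matching alias line. The function names has_trim_deps and make_trimparquet_alias are made up for this example, and both paths are assumptions that you need to replace with the ones on your own machine.

```shell
# Hypothetical helpers for reusing an existing Python environment.

# Succeeds only if the given interpreter can import pandas and fastparquet.
has_trim_deps() {
  "$1" -c 'import pandas, fastparquet' >/dev/null 2>&1
}

# Prints an alias line that runs trim.py with the given interpreter.
make_trimparquet_alias() {
  python_bin="$1"    # interpreter of the environment you want to reuse
  trim_script="$2"   # location of trim.py from the installation
  printf 'alias trimparquet="%s %s"\n' "$python_bin" "$trim_script"
}

# Example (both paths are placeholders):
# has_trim_deps "$HOME/envs/data/bin/python" \
#   && make_trimparquet_alias "$HOME/envs/data/bin/python" "$HOME/trimparquet/trim.py"
```

The printed line replaces the original alias in your ~/.bashrc or ~/.zshrc. Checking the imports first avoids pointing the alias at an environment that is missing one of the two packages.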

Data engineer at BigData Republic