When writing tests, it can be useful to take data files from external systems and use those in your tests. These files can be quite large, so it makes sense to trim them down to a single record before checking them into git. This is simple to do by hand for human-readable file formats like CSV and JSON, but not for binary file formats like Parquet. To my knowledge, there are no pre-built CLIs that do this.
In this post, we’ll install a command line tool and register it with your shell. This way, all you have to do is open a terminal and run a single command.
This will generate a new file, file.parquet_trimmed, containing the same schema and just the first record of the input file.
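For a sense of what a tool like this does under the hood, here is a minimal sketch. It is not the actual trim.py from the installer, which isn't shown in this post. It assumes pandas and fastparquet are installed, that the input file fits in memory, and that the output name is simply the input name with _trimmed appended (inferred from the example file name above). The function names trimmed_path and trim_parquet are made up for illustration.

```python
from pathlib import Path


def trimmed_path(path):
    """Return the output path: the input path with '_trimmed' appended (assumed naming)."""
    return Path(str(path) + "_trimmed")


def trim_parquet(path, n_records=1):
    """Write the first n_records of a Parquet file to <path>_trimmed and return that path."""
    # Imported here so trimmed_path() stays usable without pandas installed.
    import pandas as pd

    # Assumption: the whole file is read into memory before it is cut down.
    df = pd.read_parquet(path, engine="fastparquet")
    out = trimmed_path(path)
    # head() keeps the column names and dtypes, so the schema should largely carry over.
    df.head(n_records).to_parquet(out, engine="fastparquet")
    return out
```

With this sketch, calling trim_parquet("file.parquet") would write file.parquet_trimmed next to the input file.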
- Python 3 (which is pre-installed on most systems)
Run this command to download and install the tool:
curl https://gitlab.com/snippets/1918012/raw | bash
The output of this command will contain something like this:
Add this to ~/.bashrc or ~/.zshrc:
alias trimparquet="/home/ruurtjan/trimparquet/trim_parquet_env/bin/python /home/ruurtjan/p/trimparquet/trim.py"
Then open a new terminal and use it like this:
Copy the line starting with ‘alias’ from the output and add it to your shell rc file, which is ~/.bashrc if you use Bash and ~/.zshrc if you use Z shell.
That’s it! When you open a new terminal, you can now trim Parquet files using the trimparquet alias.
If you already have pandas and fastparquet installed in some Python environment, you can of course use that environment to save some disk space. In that case, after installation, remove the trim_parquet_env directory and change the alias to point to that environment’s Python interpreter.
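Before pointing the alias at another environment, it may be worth checking that its interpreter can actually import both packages. The following is a minimal sketch, meant to be run with that environment's Python. The function name missing_packages is made up for this example, and the check only confirms that the packages can be found, not that their versions work with the tool.

```python
import importlib.util
import sys


def missing_packages(names=("pandas", "fastparquet")):
    """Return the packages from names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print(f"{sys.executable} is missing: {', '.join(missing)}")
    else:
        print(f"{sys.executable} can import pandas and fastparquet")
```

If the script reports nothing missing, the interpreter path it prints is the one the alias would need to point to.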