I'm excited to release what I've been cooking up the past few months at @arcinstitute! BINSEQ is a family of binary file formats for sequencing data, built with paired records and parallel processing in mind. It shows big performance gains (2x-40x) over gzip-fastq with similar storage.
BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences biorxiv.org/content/10.1101/… #biorxiv_bioinfo
BINSEQ and VBINSEQ are the only family members (for now). Both are built around two-bit encoded nucleotides, and both support paired records natively. We provide Rust libraries for IO, C and C++ bindings to BINSEQ, and a CLI tool to easily manipulate them.
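For readers new to two-bit encoding, here is a minimal sketch of the idea in Python. The bit assignment (A=00, C=01, G=10, T=11), the packing order, and the function names are assumptions for illustration, not the BINSEQ specification or its library API.

```python
# Hypothetical bit assignment; the mapping BINSEQ actually uses may differ.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack_2bit(seq: str) -> int:
    """Pack a nucleotide string into an integer, two bits per base.

    The first base lands in the lowest two bits (an arbitrary choice here).
    Raises KeyError for anything outside A/C/G/T, since two bits can only
    distinguish four symbols.
    """
    packed = 0
    for i, base in enumerate(seq):
        packed |= ENCODE[base] << (2 * i)
    return packed


def unpack_2bit(packed: int, length: int) -> str:
    """Recover the nucleotide string; the length must be stored separately."""
    return "".join(DECODE[(packed >> (2 * i)) & 0b11] for i in range(length))
```

Four bases fit in one byte this way, versus one byte per base in plain text, and the sequence length has to be known separately to decode.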
Where these formats really shine is in their ability to scale. Because they don't need to be parsed sequentially, they stay performant at thread counts where sequential formats (FASTQ) plateau. Simple tasks like k-mer counting show gains as early as 8 threads.
Interestingly, we also observe the same behavior for more complex tasks like sequence alignment. We demonstrate that BINSEQ continues to scale linearly with thread usage with both minimap2 and STAR.

Apr 15, 2025 · 2:35 PM UTC

BINSEQ is very simple and is built for fixed-length records (think single-cell sequencing data). Each record or record pair has true random access, and the file can be memory-mapped and processed in parallel. github.com/arcinstitute/bins…
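To see why fixed-length records give random access, here is a minimal Python sketch: the byte offset of record i is plain arithmetic, so any thread can jump straight to its share of a memory-mapped file. The header size, record size, and all function names below are hypothetical and do not reflect the real BINSEQ layout.

```python
import mmap
import tempfile

HEADER_SIZE = 16  # hypothetical header length in bytes
RECORD_SIZE = 8   # hypothetical fixed record length in bytes


def record_offset(index: int) -> int:
    """Byte offset of a record: no scanning, just arithmetic."""
    return HEADER_SIZE + index * RECORD_SIZE


def read_record(mm: mmap.mmap, index: int) -> bytes:
    """Slice one record out of the memory-mapped file."""
    start = record_offset(index)
    return mm[start:start + RECORD_SIZE]


def partition(n_records: int, n_workers: int) -> list[range]:
    """Split record indices into contiguous ranges, one per worker."""
    if n_records == 0:
        return []
    step = -(-n_records // n_workers)  # ceiling division
    return [range(s, min(s + step, n_records)) for s in range(0, n_records, step)]


def demo() -> int:
    """Write a toy file of 4 records, then fetch record 2 directly."""
    with tempfile.TemporaryFile() as fh:
        fh.write(b"\x00" * HEADER_SIZE)
        for i in range(4):
            fh.write(i.to_bytes(RECORD_SIZE, "little"))
        fh.flush()
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return int.from_bytes(read_record(mm, 2), "little")
```

Each range returned by `partition` can be handed to a separate worker, which reads its records independently of the others.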
VBINSEQ is a flavor of BINSEQ with support for variable-length records and quality scores. It is built around independent record blocks. Each block stores two-bit encoded records and has random access (via an associated index). github.com/arcinstitute/vbin…
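A rough Python sketch of how a block index can give random access over variable-length records: the index maps each block to its first record number and byte offset, so a lookup is a binary search followed by a seek. The index layout and the `locate` function are assumptions for illustration only, not the actual VBINSEQ index format.

```python
from bisect import bisect_right

# Hypothetical index: one (first_record_number, block_byte_offset) entry
# per block, sorted by first_record_number.
BlockIndex = list[tuple[int, int]]


def locate(index: BlockIndex, record_number: int) -> tuple[int, int]:
    """Return (block byte offset, position of the record within that block).

    Does not check the upper bound: a record number past the end of the
    file resolves to the last block.
    """
    starts = [first for first, _ in index]
    pos = bisect_right(starts, record_number) - 1
    if pos < 0:
        raise IndexError(f"record {record_number} precedes the first block")
    first, offset = index[pos]
    return offset, record_number - first
```

Because blocks are independent, a reader only has to decode the one block that holds the record it wants, and different blocks can go to different threads.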
To make working with BINSEQ files easy, we’ve also built a command-line tool to convert to and from the format - bqtools. This tool uses both of the libraries above and can handle various operations such as encoding, decoding, grepping, subsampling, etc. github.com/arcinstitute/bqto…
We also show how easy it is to integrate BINSEQ and VBINSEQ into existing tools. Here is a CLI built on the minimap2 Rust bindings that accepts FASTQ, BINSEQ, and VBINSEQ inputs. github.com/arcinstitute/mmr
This work is the first step in a list of upcoming projects that @arcinstitute has for high-throughput genomics tools, so be on the lookout over the next few months for some more exciting bioinformatics tools!