I'm excited to release what I've been cooking up the past few months at @arcinstitute
BINSEQ is a family of binary file formats for sequencing data, built with paired records and parallel processing in mind. It delivers big performance gains (2x-40x) over gzipped FASTQ at similar storage size.
BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences biorxiv.org/content/10.1101/… #biorxiv_bioinfo
BINSEQ and VBINSEQ are the only family members (for now). Both are built around two-bit encoded nucleotides, and each supports paired records natively.
We provide Rust libraries for IO, C and C++ bindings to BINSEQ, and a CLI tool to easily manipulate the files.
Interestingly, we observe the same scaling behavior for more complex tasks like sequence alignment. We demonstrate that BINSEQ continues to scale linearly with thread count with both minimap2 and STAR.
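As a rough illustration of what two-bit encoding means, here is a minimal Rust sketch. It is not the BINSEQ library's API: the function names, the A=0/C=1/G=2/T=3 mapping, and the bit order within each byte are all assumptions made for the example.

```rust
/// Map one nucleotide to a two-bit code.
/// Assumed mapping for this sketch: A=0, C=1, G=2, T=3.
fn encode_base(base: u8) -> Option<u8> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None, // N and other ambiguity codes do not fit in two bits
    }
}

/// Pack a sequence into bytes, four bases per byte,
/// with the first base in the lowest two bits (an assumed bit order).
fn pack(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let code = encode_base(base)?;
        out[i / 4] |= code << ((i % 4) * 2);
    }
    Some(out)
}

/// Unpack `len` bases from a packed buffer produced by `pack`.
fn unpack(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
        .collect()
}
```

Four bases fit in one byte, so a packed sequence takes roughly a quarter of its ASCII size. The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from `A`.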
Apr 15, 2025 · 2:35 PM UTC
BINSEQ is very simple and is built for fixed length records (think single-cell sequencing data). Each record or record pair has true random access and the file can be memory mapped and processed in parallel.
github.com/arcinstitute/bins…
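To see why fixed-length records give random access and easy parallelism, consider this Rust sketch. The layout is hypothetical (a fixed-size header followed by equally sized records padded to whole 8-byte words), and `Layout` and `partition` are illustrative names, not part of the BINSEQ libraries.

```rust
/// Hypothetical layout: a fixed-size header followed by equally sized records.
struct Layout {
    header_bytes: usize,
    record_bytes: usize,
}

impl Layout {
    /// Size of one record holding `seq_len` bases at two bits per base,
    /// rounded up to whole 8-byte words (an assumed padding rule).
    fn for_seq_len(header_bytes: usize, seq_len: usize) -> Self {
        let words = (seq_len + 31) / 32; // 32 bases fit in one u64
        Layout { header_bytes, record_bytes: words * 8 }
    }

    /// Byte offset of record `i`: plain arithmetic, no scanning of earlier records.
    fn offset(&self, i: usize) -> usize {
        self.header_bytes + i * self.record_bytes
    }
}

/// Split `n_records` into `n_threads` contiguous half-open ranges of
/// near-equal size. `n_threads` must be nonzero.
fn partition(n_records: usize, n_threads: usize) -> Vec<(usize, usize)> {
    let base = n_records / n_threads;
    let extra = n_records % n_threads;
    let mut start = 0;
    (0..n_threads)
        .map(|t| {
            // The first `extra` threads take one additional record each.
            let len = base + usize::from(t < extra);
            let range = (start, start + len);
            start += len;
            range
        })
        .collect()
}
```

Each thread can then jump straight to `offset(start)` in a memory-mapped file and read its own range without coordinating with the others.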
VBINSEQ is a flavor of BINSEQ with support for variable length records and quality scores. It is built around independent record blocks. Each block stores two-bit encoded records and has random access (via an associated index).
github.com/arcinstitute/vbin…
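One way to picture block-level random access through an index is the Rust sketch below. The index structure here is an assumption made for illustration (one entry per block, recording its byte offset and the number of the first record it holds); the real VBINSEQ index may be organized differently.

```rust
/// Hypothetical index entry: where a block starts in the file
/// and the global number of the first record it contains.
struct BlockEntry {
    byte_offset: u64,
    first_record: u64,
}

/// Find the block holding global record `i` and the record's position
/// inside that block. Returns `None` if `i` is out of range.
/// `index` must be sorted by `first_record`.
fn locate(index: &[BlockEntry], i: u64, total_records: u64) -> Option<(usize, u64)> {
    if i >= total_records {
        return None;
    }
    // Count the blocks whose first record is <= i; the last of them holds record i.
    let n = index.partition_point(|b| b.first_record <= i);
    let block = n.checked_sub(1)?;
    Some((block, i - index[block].first_record))
}
```

Because blocks are independent, a reader can seek to `index[block].byte_offset`, decode only that block, and skip to the record within it. The same index lets separate threads work on separate blocks.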
To make working with BINSEQ files easy, we’ve also built a command-line tool to convert to and from the format - bqtools.
This tool uses both of the libraries above and can handle various operations such as encoding, decoding, grepping, subsampling, etc.
github.com/arcinstitute/bqto…
We also show how easy it is to integrate BINSEQ and VBINSEQ into existing tools. Here is a CLI built on the minimap2 Rust bindings that accepts FASTQ, BINSEQ, and VBINSEQ inputs.
github.com/arcinstitute/mmr
We’ve also built C and C++ bindings to BINSEQ (VBINSEQ upcoming!) so that you can use the libraries without needing to be a Rustacean (yet) 🦀.
github.com/arcinstitute/bins…
This work is the first in a series of upcoming high-throughput genomics projects from @arcinstitute, so be on the lookout over the next few months for more exciting bioinformatics tools!
Here is a link to all the tools and a description of the work:
arcinstitute.org/tools/binse…



