blini

module

v1.0.0 Latest Latest Go to latest Published: Dec 15, 2025 License: MIT

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/fluhus/blini

Links

Open Source Insights

README ¶

Blini

Lightweight nucleotide sequence searching and dereplication.

Requirements

None. Download and get started!

Usage (basic)

Searching

With both -q and -r set, Blini looks up the query entries in the reference entries. The reference may either be a fasta or a pre-sketched index.

blini -q query.fasta -r reference.fasta -o output.csv
# Or
blini -q query.fasta -r reference.blini -o output.csv

Sketching

With only -r set, Blini pre-sketches the given reference for use in search operations. This makes lookup operations quicker.

blini -r reference.fasta -o reference.blini

Clustering

With only -q set, Blini dereplicates (clusters) the query set.

blini -q input.fasta -o output_prefix

The outputs are a fasta file with the representatives, and a JSON file with the cluster assignments.

Other options

-h display help on the available flags.
-c for searching, calculate containment of query in the reference rather than full match.
-m for searching and clustering, minimal similarity for a match.
-s scale; use 1/s of kmers for similarity.
-u for search, include unmatched queries in the output.

Usage (advanced)

Choosing the scale value (`-s`)

Scale should be at most 1/25 the length of the sequnces analyzed.

Scale is the k-mer subsampling ratio. A scale of 100 means that 1/100 of the k-mers are used for distance calculations. Doubling the scale halves RAM and CPU usage, but also loses some accuracy. For accurate results, sketches of size 25 and above are needed. This means that the scale needs to be up to sequence length / 25.

The default scale of 100 is effective for sequences of length 2500 and above. For sequences of length 1000, for example, the scale needs to be at most 40.

Parallelizing reference sketching

Sketch files (.blini) can be concatenated if they were created using the same scale. This is equivalent to having the different original datasets sketched together in one run. Therefore, big reference datasets can be broken down and sketched in parallel.

Limitations

Blini supports nucleotide sequences only. Amino-acids are currently not supported.
Blini runs on a single file with sequences, where each sequence is a separate species. Support for multiple files and multiple sequences per species will be added in the future.
No multi-threading at the moment. Still fast, innit?

Testing

The testdata directory contains mock data for testing Blini. The tests run automatically with each push to this repository, but you can also run them locally. See the directory's README for further instructions.

Join the conversation

Got a question? Feedback? Found a bug?

Feel free to open an issue or start a discussion.

Directories ¶

Path	Synopsis
blini
paper
gentestdata command Picks out random genomes from an input reference.	Picks out random genomes from an input reference.
gentestdatabig command Generates the big test dataset from the bacterial reference.	Generates the big test dataset from the bacterial reference.
simul Package simul provides sequence simulation functions.	Package simul provides sequence simulation functions.
testclust command Tests clustering results of blini and mmseqs.	Tests clustering results of blini and mmseqs.
testsearch command Tests search results of blini and sourmash.	Tests search results of blini and sourmash.
sketching Package sketching provides sequence sketching and distance calculation functionality.	Package sketching provides sequence sketching and distance calculation functionality.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL