sa-check
sa-check is a Crystal / hts.cr prototype for checking whether SA:Z tags in a coordinate-sorted BAM can reconstruct the primary + supplementary alignment set observed in a samtools sort -N BAM.
It is designed for large ONT/Dorado-style BAMs. A 90 GB BAM is expected to be handled by streaming the name-sorted BAM and retaining only sampled QNAME-group signatures in memory.
Inputs
You need two BAMs derived from the same alignment set:
reads.name.bam # samtools sort -N BAM, used as QNAME-group truth
reads.coord.bam # coordinate-sorted BAM, with .bai or .csi
The name-sorted BAM is treated as the oracle for each QNAME group. The coordinate BAM is queried by regions extracted from SA tags.
Install
Requirements:
- Crystal
- HTSlib development package available to pkg-config
- shards
shards install
shards build --release
This creates:
bin/sa-check
Quick use
Default: reservoir-sample 10,000 supplementary-containing QNAME groups while streaming the entire name-sorted BAM. This avoids diluting the SA audit with primary-only reads.
bin/sa-check \
--name-bam reads.name.bam \
--coord-bam reads.coord.bam \
-@ 8 \
--json sa-check.json \
--tsv sa-check.tsv
Fast smoke test on the first 1,000 QNAME groups:
bin/sa-check \
--name-bam reads.name.bam \
--coord-bam reads.coord.bam \
--first 1000 \
-@ 8 \
-v
Check specific read names:
bin/sa-check \
--name-bam reads.name.bam \
--coord-bam reads.coord.bam \
--qnames read_ids.txt \
--json selected.sa-check.json
What it checks
For each sampled QNAME group:
- Read all records for that QNAME from the name-sorted BAM.
- Define the truth set as mapped canonical records: primary + supplementary.
- Choose a seed record, preferring primary mapped alignment.
- Find the seed in the coordinate-sorted BAM.
- Traverse the SA graph in the coordinate BAM:
  - parse SA:Z:rname,pos,strand,CIGAR,mapQ,NM;...
  - query the coordinate BAM for each SA region
  - keep records with matching QNAME, position, strand and CIGAR
- Compare reachable canonical records with the truth set.
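The SA-parsing step above can be sketched in a few lines. This is an illustrative Python sketch, not the tool's Crystal code; the names `SAEntry` and `parse_sa` are hypothetical. It assumes the tag value follows the layout shown above (six comma-separated fields per entry, entries separated by `;`).

```python
from typing import List, NamedTuple


class SAEntry(NamedTuple):
    """One alignment listed in an SA:Z tag (hypothetical helper type)."""
    rname: str
    pos: int      # 1-based leftmost position, as written in the tag
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int


def parse_sa(value: str) -> List[SAEntry]:
    """Parse an SA tag value (the text after 'SA:Z:') into entries."""
    entries = []
    for chunk in value.split(";"):
        if not chunk:
            continue  # skip the empty piece left by a trailing ';'
        rname, pos, strand, cigar, mapq, nm = chunk.split(",")
        entries.append(SAEntry(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return entries


if __name__ == "__main__":
    for entry in parse_sa("chr1,100,+,50M50S,60,0;chr2,200,-,50S50M,30,2;"):
        print(entry)
```

Each parsed entry gives a reference name and start position, which is enough to build a region query against the indexed coordinate BAM.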
The reported scope is deliberately:
primary + supplementary only
SA tags are not expected to recover all secondary or unmapped records.
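The "primary + supplementary only" scope can be expressed with the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The sketch below is a Python illustration with a hypothetical record layout and hypothetical function names. How sa-check picks a seed when no primary alignment exists is not stated above, so the fallback here is an assumption.

```python
from typing import Iterable, List, Optional, Tuple

FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800

# Hypothetical record layout for illustration: (flag, rname, pos, cigar)
Record = Tuple[int, str, int, str]


def is_canonical(flag: int) -> bool:
    """True for mapped records that are primary or supplementary."""
    return not (flag & (FLAG_UNMAPPED | FLAG_SECONDARY))


def canonical_records(group: Iterable[Record]) -> List[Record]:
    """Truth set for one QNAME group: drop secondary and unmapped records."""
    return [r for r in group if is_canonical(r[0])]


def pick_seed(group: Iterable[Record]) -> Optional[Record]:
    """Prefer the mapped primary record; otherwise fall back to the first
    canonical record (assumed fallback, not confirmed by the README)."""
    canonical = canonical_records(group)
    for record in canonical:
        if not (record[0] & FLAG_SUPPLEMENTARY):
            return record
    return canonical[0] if canonical else None
```

A group holding one primary, one supplementary, one secondary and one unmapped record would therefore have a truth set of two records, with the primary chosen as the seed.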
Important options for 90 GB BAMs
Sampling modes
--sample N Reservoir sample N informative QNAME groups; scans name BAM to EOF.
--first N Check first N informative groups only; quick smoke test.
--qnames FILE Check specific QNAMEs; one per line; does not apply the supplementary-only filter.
--max-groups N Stop name-BAM scan after N groups.
--all-groups Include primary-only groups in sampling.
For a serious run on a 90 GB BAM, start with:
--first 1000
then use:
--sample 10000
or larger.
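For intuition on why --sample must scan the name BAM to EOF while keeping memory bounded, here is a minimal Python sketch of classic reservoir sampling (Algorithm R). Whether sa-check uses exactly this variant is an assumption; the function name and the seeded generator are illustrative only.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                reservoir[j] = item   # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    groups = (f"read_{n}" for n in range(100_000))  # stand-in for QNAME groups
    sample = reservoir_sample(groups, 10, random.Random(42))
    print(len(sample))
```

Only k items are held at any time, but every item in the stream has to be visited for the sample to be uniform. That is why --first is the quick smoke test and --sample is the full pass.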
Strictness
By default, record identity is:
QNAME, reference name, 1-based position, strand, CIGAR, flags for supplementary/secondary/unmapped
Optional stricter modes:
--strict-nm Also require NM to match SA tag.
--strict-mapq Also require mapQ to match SA tag.
These are useful for auditing but may create false negatives after downstream filtering or transformation.
Output metrics
Key metrics:
canonical_recall reachable_truth_hits / truth_canonical
canonical_precision reachable_truth_hits / reachable_records
sa_resolution_rate resolved SA entries / total SA entries
groups_full QNAME groups perfectly reconstructed
groups_partial QNAME groups partly reconstructed
groups_failed QNAME groups not reconstructed
verdict usable-for-sa-expansion | partial | unsafe | not-applicable
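The ratio metrics above follow directly from set arithmetic. The Python sketch below aggregates them over sampled groups. The function name is hypothetical, and the full/partial/failed rules are an assumed reading of "perfectly", "partly" and "not" reconstructed. The verdict is left out because its thresholds are not documented here.

```python
from typing import Dict, Iterable, Set, Tuple


def summarize(groups: Iterable[Tuple[Set, Set]]) -> Dict[str, object]:
    """Aggregate recall/precision over (truth, reachable) pairs of record keys."""
    hits = truth_total = reachable_total = 0
    counts = {"groups_full": 0, "groups_partial": 0, "groups_failed": 0}
    for truth, reachable in groups:
        group_hits = len(truth & reachable)
        hits += group_hits
        truth_total += len(truth)
        reachable_total += len(reachable)
        if truth and group_hits == len(truth):
            counts["groups_full"] += 1      # every truth record was reached
        elif group_hits == 0:
            counts["groups_failed"] += 1    # nothing recovered (or empty truth)
        else:
            counts["groups_partial"] += 1
    return {
        "canonical_recall": hits / truth_total if truth_total else None,
        "canonical_precision": hits / reachable_total if reachable_total else None,
        **counts,
    }


if __name__ == "__main__":
    demo = [({"a", "b"}, {"a", "b"}), ({"a", "b"}, {"a"}), ({"a"}, set())]
    print(summarize(demo))
```

Recall falls when SA traversal misses truth records. Precision falls when traversal reaches records that the name-sorted BAM does not list for that QNAME.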
Why two BAM handles?
The tool streams reads.name.bam and performs coordinate queries against reads.coord.bam. These are separate HTS::Bam handles so coordinate queries do not disturb the streaming iterator.
Limitations
- This is a validation/audit tool, not a replacement for BNI/BRI.
- It checks whether SA-assisted expansion can recover primary+supplementary records, not secondary or unmapped records.
- Coordinate BAM must have a valid BAI/CSI index.
- If the coordinate BAM was filtered after alignment, SA entries can point to records no longer present in the file.
- If --sample is used, the result is a sample estimate, not a full proof.
Relation to bni
bni remains the robust path for read-name lookup in samtools sort -N BAM. This tool answers a different question:
Given a coordinate-sorted BAM, are SA tags complete enough to recover the same primary+supplementary record group that the name-sorted BAM shows?
- June 13, 2026
MIT License