sa-check

sa-check is a Crystal / hts.cr prototype for checking whether SA:Z tags in a coordinate-sorted BAM can reconstruct the primary + supplementary alignment set observed in a samtools sort -N BAM.

It is designed for large ONT/Dorado-style BAMs. To handle a file of around 90 GB, it streams the name-sorted BAM and keeps only the signatures of sampled QNAME groups in memory.

Inputs

You need two BAMs derived from the same alignment set:

reads.name.bam   # samtools sort -N BAM, used as QNAME-group truth
reads.coord.bam  # coordinate-sorted BAM, with .bai or .csi

The name-sorted BAM is treated as the oracle for each QNAME group. The coordinate BAM is queried by regions extracted from SA tags.

Install

Requirements:

  • Crystal
  • HTSlib development package available to pkg-config
  • shards

shards install
shards build --release

This creates:

bin/sa-check

Quick use

Default: reservoir-sample 10,000 supplementary-containing QNAME groups while streaming the entire name-sorted BAM. This avoids diluting the SA audit with primary-only reads.

bin/sa-check \
  --name-bam reads.name.bam \
  --coord-bam reads.coord.bam \
  -@ 8 \
  --json sa-check.json \
  --tsv sa-check.tsv

Fast smoke test on the first 1,000 informative (supplementary-containing) QNAME groups:

bin/sa-check \
  --name-bam reads.name.bam \
  --coord-bam reads.coord.bam \
  --first 1000 \
  -@ 8 \
  -v

Check specific read names:

bin/sa-check \
  --name-bam reads.name.bam \
  --coord-bam reads.coord.bam \
  --qnames read_ids.txt \
  --json selected.sa-check.json

What it checks

For each sampled QNAME group:

  1. Read all records for that QNAME from the name-sorted BAM.
  2. Define the truth set as mapped canonical records: primary + supplementary.
  3. Choose a seed record, preferring the mapped primary alignment.
  4. Find the seed in the coordinate-sorted BAM.
  5. Traverse the SA graph in the coordinate BAM:
    • parse SA:Z:rname,pos,strand,CIGAR,mapQ,NM;...
    • query the coordinate BAM for each SA region
    • keep records with matching QNAME, position, strand and CIGAR
  6. Compare reachable canonical records with the truth set.
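
The traversal in step 5 can be sketched in a few lines of Python. This is an illustration only, not the tool's Crystal source: the names `SAEntry`, `parse_sa` and `reachable` are hypothetical, and a plain dict stands in for the indexed coordinate BAM. The SA field order follows the `rname,pos,strand,CIGAR,mapQ,NM;` layout listed above.

```python
from collections import deque
from typing import NamedTuple


class SAEntry(NamedTuple):
    rname: str
    pos: int      # 1-based position, as written in the SA tag
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int


def parse_sa(value: str) -> list[SAEntry]:
    """Parse an SA tag value (the part after 'SA:Z:').

    Assumes reference names contain no ';' or ','.
    """
    entries = []
    for chunk in value.split(";"):
        if not chunk:
            continue  # SA values end with a trailing ';'
        rname, pos, strand, cigar, mapq, nm = chunk.split(",")
        entries.append(SAEntry(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return entries


def reachable(seed_key, records):
    """Breadth-first walk over SA links starting from the seed record.

    `records` is a hypothetical in-memory stand-in for the coordinate BAM:
    it maps (rname, pos, strand, cigar) to that record's SA value for one QNAME.
    Returns the set of reachable keys and the number of unresolved SA entries.
    """
    seen = {seed_key}
    queue = deque([seed_key])
    unresolved = 0
    while queue:
        key = queue.popleft()
        for entry in parse_sa(records[key]):
            target = (entry.rname, entry.pos, entry.strand, entry.cigar)
            if target not in records:
                unresolved += 1       # SA entry points at a missing record
            elif target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, unresolved


# Toy group: two records that reference each other, plus one dangling SA entry.
records = {
    ("chr1", 100, "+", "50M50S"): "chr2,500,-,50S50M,60,0;",
    ("chr2", 500, "-", "50S50M"): "chr1,100,+,50M50S,60,1;chr3,900,+,80S20M,12,0;",
}
seen, unresolved = reachable(("chr1", 100, "+", "50M50S"), records)
print(sorted(seen), unresolved)
```

In the real tool each lookup would be a region query against the indexed coordinate BAM rather than a dict access; the dict keeps the sketch self-contained.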

The reported scope is deliberately:

primary + supplementary only

SA tags are not expected to recover all secondary or unmapped records.
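
As an illustration of this scope, the following Python sketch classifies records by their SAM FLAG bits and keeps only mapped primary and supplementary records. The bit values (0x4, 0x100, 0x800) come from the SAM specification; the function names are hypothetical and are not taken from the tool.

```python
FLAG_UNMAPPED = 0x4         # read is unmapped
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment


def classify(flag: int) -> str:
    """Return the record class implied by the SAM FLAG field."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_SECONDARY:
        return "secondary"
    if flag & FLAG_SUPPLEMENTARY:
        return "supplementary"
    return "primary"


def is_canonical(flag: int) -> bool:
    """True for records in the checked scope: mapped primary or supplementary."""
    return classify(flag) in ("primary", "supplementary")


# Flags for: primary forward, primary reverse, supplementary, secondary, unmapped.
flags = [0, 16, 2048, 256, 4]
truth = [f for f in flags if is_canonical(f)]
print(truth)
```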

Important options for 90 GB BAMs

Sampling modes

--sample N       Reservoir sample N informative QNAME groups; scans name BAM to EOF.
--first N        Check first N informative groups only; quick smoke test.
--qnames FILE    Check specific QNAMEs; one per line; does not apply the supplementary-only filter.
--max-groups N   Stop name-BAM scan after N groups.
--all-groups     Include primary-only groups in sampling.

For a serious run on a 90 GB BAM, start with:

--first 1000

then use:

--sample 10000

or larger.
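
For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the classic single-pass scheme (Algorithm R). It is an assumption that the tool uses this exact variant; the sketch only shows why memory stays bounded by N while the whole name-sorted BAM is streamed. The function name and the toy stream are hypothetical.

```python
import random


def reservoir_sample(stream, n, rng=None):
    """Keep a uniform random sample of up to n items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive bounds: 0 <= j <= i
            if j < n:
                sample[j] = item     # replace with probability n / (i + 1)
    return sample


# Toy stream of (qname, has_supplementary) pairs; only informative groups are sampled.
groups = ((f"read{i}", i % 3 == 0) for i in range(1000))
informative = (qname for qname, has_supp in groups if has_supp)
picked = reservoir_sample(informative, 10, random.Random(42))
print(len(picked))
```

Because the sample is only final once the stream ends, this mode has to scan the name-sorted BAM to EOF, whereas taking the first N groups can stop early.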

Strictness

By default, record identity is:

QNAME, reference name, 1-based position, strand, CIGAR, flags for supplementary/secondary/unmapped

Optional stricter modes:

--strict-nm      Also require NM to match SA tag.
--strict-mapq    Also require mapQ to match SA tag.

These are useful for auditing, but they may report false mismatches when downstream filtering or transformation has changed NM or mapQ on the records without updating the SA tags that point to them.
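
The effect of the strict modes can be illustrated with a small Python sketch. It compares the fields an SA entry carries against a candidate record; the flag checks are left out because SA entries carry no flags. The `Aln` type and `matches` function are hypothetical names, not the tool's API.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    qname: str
    rname: str
    pos: int      # 1-based
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int


def matches(expected: Aln, found: Aln, strict_nm: bool = False, strict_mapq: bool = False) -> bool:
    """Compare what an SA entry promises (expected) with a record from the coordinate BAM."""
    same = (
        expected.qname == found.qname
        and expected.rname == found.rname
        and expected.pos == found.pos
        and expected.strand == found.strand
        and expected.cigar == found.cigar
    )
    if strict_nm:
        same = same and expected.nm == found.nm
    if strict_mapq:
        same = same and expected.mapq == found.mapq
    return same


# Same alignment, but NM differs between the SA entry and the stored record.
from_sa = Aln("read1", "chr2", 500, "-", "50S50M", 60, 0)
in_bam = Aln("read1", "chr2", 500, "-", "50S50M", 60, 2)
print(matches(from_sa, in_bam), matches(from_sa, in_bam, strict_nm=True))
```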

Output metrics

Key metrics:

canonical_recall       reachable_truth_hits / truth_canonical
canonical_precision    reachable_truth_hits / reachable_records
sa_resolution_rate     resolved SA entries / total SA entries
groups_full            QNAME groups perfectly reconstructed
groups_partial         QNAME groups partly reconstructed
groups_failed          QNAME groups not reconstructed
verdict                usable-for-sa-expansion | partial | unsafe | not-applicable
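
A worked example of the three ratio metrics, as a Python sketch. The per-group dictionary layout and the choice to sum counts across groups before dividing are assumptions made for illustration; the thresholds behind the verdict are not documented here, so they are not modeled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize(groups):
    """Aggregate per-group counts into the three ratio metrics."""
    hits = sum(len(g["truth"] & g["reachable"]) for g in groups)
    truth = sum(len(g["truth"]) for g in groups)
    reached = sum(len(g["reachable"]) for g in groups)
    sa_resolved = sum(g["sa_resolved"] for g in groups)
    sa_total = sum(g["sa_total"] for g in groups)
    return {
        "canonical_recall": safe_ratio(hits, truth),
        "canonical_precision": safe_ratio(hits, reached),
        "sa_resolution_rate": safe_ratio(sa_resolved, sa_total),
    }


# Two toy QNAME groups; letters stand for record identities.
groups = [
    {"truth": {"a", "b", "c"}, "reachable": {"a", "b", "c"}, "sa_resolved": 6, "sa_total": 6},
    {"truth": {"d", "e"}, "reachable": {"d", "x"}, "sa_resolved": 1, "sa_total": 2},
]
metrics = summarize(groups)
print(metrics)
```

Here 4 of 5 truth records are reached (recall 0.8), 4 of 5 reached records are in the truth set (precision 0.8), and 7 of 8 SA entries resolve (rate 0.875).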

Why two BAM handles?

The tool streams reads.name.bam and performs coordinate queries against reads.coord.bam. These are separate HTS::Bam handles so coordinate queries do not disturb the streaming iterator.

Limitations

  • This is a validation/audit tool, not a replacement for BNI/BRI.
  • It checks whether SA-assisted expansion can recover primary+supplementary records, not secondary or unmapped records.
  • Coordinate BAM must have a valid BAI/CSI index.
  • If the coordinate BAM was filtered after alignment, SA entries can point to records no longer present in the file.
  • If --sample is used, the result is a sample estimate, not a full proof.

Relation to bni

bni remains the robust path for read-name lookup in a samtools sort -N BAM. This tool answers a different question:

Given a coordinate-sorted BAM, are SA tags complete enough to recover the same primary+supplementary record group that the name-sorted BAM shows?

License

MIT License
