Interpreting quality and run information

When you run kb count, the pipeline writes several metadata files that record provenance, tool versions, command parameters, and basic quality metrics. These files are useful both for validating that your run executed correctly and for checking the quality of your sequencing data.

This page explains three important JSON outputs:

  • run_info.json — generated by kallisto

  • inspect.json — generated by bustools

  • kb_info.json — generated by kb-python itself

These files allow users to verify:

  • whether the pipeline executed the correct workflow and chemistry

  • whether input FASTQs are intact

  • whether alignment performed as expected

  • whether barcodes and UMIs appear valid

  • the exact commands and tool versions used for reproducibility

run_info.json — Kallisto Summary

run_info.json is generated by kallisto during pseudoalignment. It summarizes how many reads were processed, how many pseudoaligned, and which index and version were used.

Typical contents:

{
  "n_targets": 86288,
  "n_bootstraps": 0,
  "n_processed": 13456789,
  "n_pseudoaligned": 12034567,
  "p_pseudoaligned": 89.4,
  "p_unique": 24.4,
  "kallisto_version": "0.51.1",
  "index_version": 11,
  "index_kmer_length": 31,
  "start_time": "...",
  "call": "kallisto bus ..."
}

Key fields

| Field | Description | What to check |
| --- | --- | --- |
| n_targets | Number of transcript targets in the index | Matches expected transcriptome build |
| n_processed | Total reads processed | Roughly equals FASTQ read count |
| n_pseudoaligned | Reads that pseudoaligned to the reference | Much lower than expected → low quality or wrong index |
| p_pseudoaligned | Percent of reads pseudoaligned | Good data often >60–70% |
| p_unique | Percent of reads mapping uniquely to one target | < 20% may indicate low library complexity or poor quality reads |
| call | Full kallisto invocation | Confirms correct parameters were used |

Signs of potential problems

  • p_pseudoaligned < 40% (often wrong index, chemistry mismatch, or poor-quality reads)

  • n_processed far below expected FASTQ size (truncated or corrupted FASTQs)

  • index_version incompatible with kallisto version
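The checks above can be automated with a few lines of Python. The following is a minimal sketch, not part of kb-python: the function name check_run_info and the expected_reads argument are hypothetical, the 40%, 60%, and 20% cut-offs are the rules of thumb from this page, and the 10% tolerance on the read count is an assumption you should adjust.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON metadata file such as run_info.json."""
    return json.loads(Path(path).read_text())


def check_run_info(info, expected_reads=None):
    """Return a list of warning strings for a parsed run_info.json dict.

    expected_reads is an independently obtained FASTQ read count (optional).
    """
    warnings = []
    p_aligned = info["p_pseudoaligned"]
    if p_aligned < 40:
        warnings.append(
            f"p_pseudoaligned is {p_aligned}%: check index, chemistry, and read quality"
        )
    elif p_aligned < 60:
        warnings.append(
            f"p_pseudoaligned is {p_aligned}%: below the 60-70% often seen in good data"
        )
    if info["p_unique"] < 20:
        warnings.append(
            f"p_unique is {info['p_unique']}%: possible low complexity or poor reads"
        )
    # Assumption: more than 10% fewer processed reads than expected is suspicious.
    if expected_reads is not None and info["n_processed"] < 0.9 * expected_reads:
        warnings.append(
            f"n_processed is {info['n_processed']}, expected about {expected_reads}: "
            "FASTQs may be truncated or corrupted"
        )
    return warnings


# Values copied from the example run_info.json above.
example = {
    "n_processed": 13456789,
    "n_pseudoaligned": 12034567,
    "p_pseudoaligned": 89.4,
    "p_unique": 24.4,
}
print(check_run_info(example, expected_reads=13456789))
```

On a real run you would pass load_json("<output directory>/run_info.json") instead of the inline dictionary. An empty list means none of the rules of thumb fired; it does not by itself prove the run is good.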

inspect.json — bustools inspect summary

inspect.json is produced by bustools inspect and provides aggregate statistics about the BUS file: how many BUS records and reads are present, how many distinct barcodes and UMIs were observed, summaries of reads-per-barcode and UMIs-per-barcode, and how many barcodes/reads match the supplied on-list. Below is an example snippet and a field-by-field explanation.

Example (abridged) contents:

{
  "numRecords": 117354584,
  "numReads": 507909041,
  "numBarcodes": 3904019,
  "medianReadsPerBarcode": 3.000000,
  "meanReadsPerBarcode": 130.099019,
  "numUMIs": 16529783,
  "numBarcodeUMIs": 96095799,
  "medianUMIsPerBarcode": 1.000000,
  "meanUMIsPerBarcode": 24.614583,
  "gtRecords": 28752978,
  "numBarcodesOnOnlist": 1138841,
  "percentageBarcodesOnOnlist": 29.170990,
  "numReadsOnOnlist": 488498272,
  "percentageReadsOnOnlist": 96.178298
}

Field definitions and interpretation

| Field | What it reports | How to interpret / check |
| --- | --- | --- |
| numRecords | Total number of BUS records inspected by bustools inspect. | Each BUS record stores a barcode, UMI, and equivalence class together with a read count, so one record can represent several reads. Use this to sanity-check that the BUS file is not empty. |
| numReads | The total number of raw reads represented by those BUS records. | Should be in the ballpark of the number of reads kallisto pseudoaligned (n_pseudoaligned in run_info.json), after any technology-specific collapsing. Large discrepancies can indicate truncated or mispaired FASTQs or an incorrect kallisto run. |
| numBarcodes | Number of distinct barcodes observed in the BUS file. | Compare to the expected barcode space (e.g., on the order of a million for high-throughput experiments). Very low values may indicate a chemistry mismatch. |
| medianReadsPerBarcode | The median number of reads assigned to a barcode (across all barcodes). | Shows the typical sequencing depth per barcode. The median is robust to very-high-depth barcode outliers. |
| meanReadsPerBarcode | The average number of reads per barcode. | If mean ≫ median, a few barcodes have extremely high read counts; that may be expected (ambient RNA, multiplets) or indicate problems. |
| numUMIs | Count of distinct UMI sequences observed across the whole file. | Lower-than-expected values can indicate short/low-quality UMIs or wrong chemistry parameters. |
| numBarcodeUMIs | Count of distinct (barcode, UMI) pairs observed across the file. | Typically larger than numUMIs because each UMI can appear under multiple barcodes; use both fields to understand complexity. |
| medianUMIsPerBarcode | Median number of unique UMIs observed per barcode. | Typical values depend heavily on protocol and sequencing depth; single-cell libraries often have low medians (1–10) for shallow sequencing runs. |
| meanUMIsPerBarcode | Average number of unique UMIs per barcode. | As with reads, if mean ≫ median, a small set of barcodes carries most UMIs. |
| gtRecords | Good-Toulmin estimate of the number of new BUS records expected if the sequencing depth were doubled. | Compare to numRecords to gauge saturation: a value that is small relative to numRecords suggests that deeper sequencing would add few new records. |
| numBarcodesOnOnlist | Number of observed barcodes that are present in the provided on-list (the "whitelist" / official barcode list for the technology). | Indicates how many barcodes match the expected barcode set. |
| percentageBarcodesOnOnlist | Percentage of observed barcodes that are on the on-list. | For 10x-style experiments a substantial fraction of reads should be on-list, but the fraction of distinct observed barcodes on the on-list can be lower because many sequencing errors create unique off-list barcodes. |
| numReadsOnOnlist | Number of reads whose barcode is on the on-list. | Often the most informative single metric when read as a share of numReads (reported as percentageReadsOnOnlist): a high share (e.g., > 80–90%) indicates that most reads came from legitimate barcodes. |
| percentageReadsOnOnlist | Percentage of reads whose barcode is on the on-list. | High values are expected for correctly-specified chemistry and high-quality data. |
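Several of these fields can be recomputed from the raw counts, which is a quick way to confirm that you are reading them correctly. The sketch below uses the numbers from the example inspect.json above. The four formulas are an assumption inferred from the field names; they reproduce the example values, but they are not taken from the bustools source.

```python
import math

# Values copied from the example inspect.json above.
inspect_stats = {
    "numReads": 507909041,
    "numBarcodes": 3904019,
    "meanReadsPerBarcode": 130.099019,
    "numBarcodeUMIs": 96095799,
    "meanUMIsPerBarcode": 24.614583,
    "numBarcodesOnOnlist": 1138841,
    "percentageBarcodesOnOnlist": 29.170990,
    "numReadsOnOnlist": 488498272,
    "percentageReadsOnOnlist": 96.178298,
}


def derived_metrics(stats):
    """Recompute the mean and percentage fields from the raw counts."""
    return {
        "meanReadsPerBarcode": stats["numReads"] / stats["numBarcodes"],
        "meanUMIsPerBarcode": stats["numBarcodeUMIs"] / stats["numBarcodes"],
        "percentageBarcodesOnOnlist": 100 * stats["numBarcodesOnOnlist"] / stats["numBarcodes"],
        "percentageReadsOnOnlist": 100 * stats["numReadsOnOnlist"] / stats["numReads"],
    }


for name, value in derived_metrics(inspect_stats).items():
    reported = inspect_stats[name]
    status = "ok" if math.isclose(value, reported, rel_tol=1e-4) else "MISMATCH"
    print(f"{name}: reported={reported} recomputed={value:.6f} [{status}]")
```

Note that meanUMIsPerBarcode is derived from numBarcodeUMIs rather than numUMIs, which is why the per-barcode view depends on (barcode, UMI) pairs and not on the number of distinct UMI sequences.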

Practical checks and recommendations

  • Sanity-check sizes: numReads and numRecords should be large and in the ballpark of what you expect from your input FASTQs and from the kallisto run. If either is very small, check that kallisto succeeded and that FASTQs are intact.

  • on-list checks:

    • Check percentageReadsOnOnlist first: if a large majority of reads are on the on-list (e.g., >80%), the barcode on-list was likely correct and most reads are assignable to expected barcodes.

    • If percentageReadsOnOnlist is high but percentageBarcodesOnOnlist is low, that usually means many low-frequency erroneous barcodes exist (normal).

    • If both read- and barcode-level on-list percentages are low → verify the technology string (-x) and the on-list (-w) passed to kallisto/bustools.

  • Reads/UMI per barcode:

    • Compare median vs mean. If mean ≫ median this indicates heavy skew: a few barcodes hold many reads/UMIs (possible multiplets, ambient RNA, or barcode collisions).

    • Very low medians (e.g., medians near 1) indicate shallow sequencing per barcode — that could be expected for some experimental designs.

  • UMI / barcodeUMI counts:

    • numBarcodeUMIs >> numUMIs is expected: the same UMI sequence may occur across many barcodes; what matters for per-cell counting is the per-barcode UMI distribution (e.g., medianUMIsPerBarcode).

  • gtRecords:

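    • gtRecords is a Good-Toulmin estimate of how many new BUS records would be observed at twice the sequencing depth; it is not a count of records assigned to genes. Compare it to numRecords to gauge saturation: a value that is small relative to numRecords suggests that deeper sequencing would add few new records. To diagnose an index mismatch or incorrect reference, use p_pseudoaligned in run_info.json and confirm that the index matches the species/annotation used for your reads.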
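The on-list and skew checks above can be sketched as a small Python function. This is an illustration under stated assumptions rather than an official tool: check_inspect is a hypothetical name, the 80% threshold comes from this page, and the 50% cut-off for a "low" barcode percentage and the 10x mean-to-median ratio used to call "heavy skew" are arbitrary choices.

```python
def check_inspect(stats, high_pct=80.0, low_barcode_pct=50.0, skew_ratio=10.0):
    """Return (level, message) tuples for a parsed inspect.json dict."""
    findings = []
    reads_pct = stats["percentageReadsOnOnlist"]
    barcodes_pct = stats["percentageBarcodesOnOnlist"]

    if reads_pct >= high_pct:
        if barcodes_pct < low_barcode_pct:
            findings.append(("info", "reads mostly on-list, many rare off-list barcodes (usually normal)"))
    elif barcodes_pct < low_barcode_pct:
        findings.append(("warn", "reads and barcodes both mostly off-list: verify -x and -w"))
    else:
        findings.append(("warn", f"only {reads_pct:.1f}% of reads are on-list"))

    for unit in ("Reads", "UMIs"):
        mean = stats[f"mean{unit}PerBarcode"]
        median = stats[f"median{unit}PerBarcode"]
        if median <= 1:
            findings.append(("info", f"median {unit} per barcode is {median}: shallow depth per barcode"))
        # Guard against a median of 0 by comparing against at least 1.
        if mean > skew_ratio * max(median, 1.0):
            findings.append(("info", f"mean {unit} per barcode ({mean:.1f}) far exceeds the median ({median}): heavy skew"))
    return findings


# Values copied from the example inspect.json above.
example = {
    "medianReadsPerBarcode": 3.0,
    "meanReadsPerBarcode": 130.099019,
    "medianUMIsPerBarcode": 1.0,
    "meanUMIsPerBarcode": 24.614583,
    "percentageBarcodesOnOnlist": 29.170990,
    "percentageReadsOnOnlist": 96.178298,
}
for level, message in check_inspect(example):
    print(f"[{level}] {message}")
```

For the example file every finding is informational: reads are overwhelmingly on-list, while the low barcode percentage and the mean-to-median gap reflect the long tail of low-count barcodes described above.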

Troubleshooting guidance

  • Low percentageReadsOnOnlist

    • Check that you supplied the correct technology/on-list (-x, -w).

    • Verify that the on-list file provided matches the barcodes present in your experiment (custom chemistry needs a custom on-list).

  • Very low numReads or numRecords

    • Confirm kallisto completed without errors (look at run_info.json and kallisto logs).

    • Inspect input FASTQs for truncation or missing pairs.

  • Extreme skew in mean vs median

    • A small set of barcodes dominating reads may be multiplets or barcode synthesis artifacts. Consider additional filtering, ambient RNA correction, or multiplet detection in downstream analysis.

kb_info.json — kb-python run provenance and runtime

kb_info.json is produced by kb-python and records run-level provenance, tool versions, the exact commands executed, timing information, and per-step runtimes. It is the authoritative record of how the pipeline was run and is essential for reproducibility and diagnosing pipeline problems.

Example:

{
    "workdir": "/home/.../",
    "version": "0.29.3",
    "kallisto": {
        "path": "/.../kallisto",
        "version": "0.51.1"
    },
    "bustools": {
        "path": "/.../bustools",
        "version": "0.45.0"
    },
    "start_time": "2025-10-20T18:31:59.761408",
    "end_time": "2025-10-20T19:48:38.041715",
    "elapsed": 4598.280307,
    "call": "/home/.../kb count --overwrite --h5ad -i index.idx -g t2g.txt -x 10XV3 -o ... --workflow=nac -c1 cdna.txt -c2 nascent.txt ...",
    "commands": [
        "kallisto bus -i index.idx -o ... -x 10XV3 -t 16 ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools inspect -o ... -w 10x_version3_onlist.txt ...",
        "bustools correct -o ... -w 10x_version3_onlist.txt ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools count -o ... -g t2g.txt -e ... -t ... -s nascent.txt --genecounts --umi-gene ..."
    ],
    "runtimes": [
        4194.1722021102905,
        105.48708367347717,
        35.703389406204224,
        44.34646153450012,
        31.489474534988403,
        163.7225775718689
    ]
}
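To see where the time went, you can pair each entry in commands with the entry at the same position in runtimes. The sketch below assumes the two lists are parallel (same order and length, as in the example above) and that runtimes and elapsed are both in seconds; summarize_steps is a hypothetical helper, and the command strings are abridged.

```python
# Abridged from the example kb_info.json above.
kb_info = {
    "elapsed": 4598.280307,
    "commands": [
        "kallisto bus -i index.idx -o ...",
        "bustools sort -o ...",
        "bustools inspect -o ...",
        "bustools correct -o ...",
        "bustools sort -o ...",
        "bustools count -o ...",
    ],
    "runtimes": [
        4194.1722021102905,
        105.48708367347717,
        35.703389406204224,
        44.34646153450012,
        31.489474534988403,
        163.7225775718689,
    ],
}


def summarize_steps(info):
    """Return (label, seconds, percent_of_elapsed) for each pipeline step."""
    if len(info["commands"]) != len(info["runtimes"]):
        raise ValueError("commands and runtimes differ in length")
    steps = []
    for command, seconds in zip(info["commands"], info["runtimes"]):
        label = " ".join(command.split()[:2])  # e.g. "kallisto bus"
        share = 100 * seconds / info["elapsed"]
        steps.append((label, seconds, share))
    return steps


for label, seconds, share in summarize_steps(kb_info):
    print(f"{label:<18}{seconds:>10.1f} s {share:>6.1f}% of elapsed")
```

In this example the kallisto bus step accounts for the large majority of the elapsed time, so thread count and FASTQ I/O are the first places to look if the run is too slow.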

Checklist for successful runs

Use the following as a quick verification workflow:

  1. Check kallisto alignment quality using run_info.json:

    • p_pseudoaligned is within expected range

    • n_processed matches FASTQ size

  2. Check barcode/UMI integrity using inspect.json:

    • Majority of reads are on the on-list

    • Barcode/UMI lengths match the chemistry

  3. Confirm pipeline parameters using kb_info.json:

    • Correct workflow (standard / kite / nac / custom)

    • Correct technology (10xv2 / 10xv3 / custom)

    • Correct references and t2g file

    • Commands match expected configuration

  4. Archive all three JSON files for reproducibility.
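Step 4 is easy to script. The following is a minimal sketch that assumes all three JSON files sit at the top level of the kb output directory; archive_run_metadata and both directory arguments are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path

METADATA_FILES = ("run_info.json", "inspect.json", "kb_info.json")


def archive_run_metadata(out_dir, archive_dir):
    """Copy the three metadata files from out_dir into archive_dir.

    Returns (copied, missing): two lists of file names.
    """
    out_dir = Path(out_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    copied, missing = [], []
    for name in METADATA_FILES:
        source = out_dir / name
        if source.is_file():
            # copy2 also preserves timestamps, which helps when auditing runs later.
            shutil.copy2(source, archive_dir / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

A call such as archive_run_metadata("kb_out", "archive/sample1") (both paths are placeholders) returns the names that were copied and the names that were not found. A non-empty missing list is itself a sign that the run may have been interrupted.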

Troubleshooting common problems

  • Low pseudoalignment rate:

    • Usually wrong transcriptome index or poor read quality.

    • An incorrect strandedness setting can also lower the mapping rate. Many technologies default to forward strand-specific mapping, but some assays have a different strand specificity, in which case the default is inappropriate. Try each of --strand=forward, --strand=unstranded, and --strand=reverse and keep the setting that gives the best mapping rate.

  • Barcode structure mismatches: Often caused by incorrect -x or -w argument.

  • Missing or empty output files: Indicates truncated FASTQs, corrupted BUS file, or interrupted run.

  • Inconsistent reference versions: Verify that all reference files were generated together using kb ref.
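If you rerun the pipeline once per --strand setting to diagnose a low pseudoalignment rate, the resulting run_info.json files can be compared directly. The sketch below assumes a hypothetical layout in which each run was written to a subdirectory named after its strand setting (base_dir/forward, base_dir/unstranded, base_dir/reverse), and it uses p_pseudoaligned as the only criterion, which is a simplification.

```python
import json
from pathlib import Path

STRANDS = ("forward", "unstranded", "reverse")


def load_strand_runs(base_dir):
    """Read base_dir/<strand>/run_info.json for every strand run that exists."""
    runs = {}
    for strand in STRANDS:
        path = Path(base_dir) / strand / "run_info.json"
        if path.is_file():
            runs[strand] = json.loads(path.read_text())
    return runs


def rank_strand_settings(runs):
    """Return (strand, p_pseudoaligned) pairs, highest rate first."""
    rates = [(strand, info["p_pseudoaligned"]) for strand, info in runs.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)
```

Calling rank_strand_settings(load_strand_runs("strand_test")), where strand_test is a placeholder directory, lists the settings from best to worst. A clear winner suggests the correct strandedness; if all three rates are similarly low, the index or chemistry is a more likely cause.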


These output files provide you with almost everything needed to ensure that kb-python ran correctly and that your data exhibit expected structure and quality. You are strongly encouraged to inspect all three files before proceeding to downstream analysis.