Interpreting quality and run information

When you run the kb command, the pipeline generates several metadata files that record provenance, tool versions, command parameters, and basic quality metrics. These files are useful both for validating that your run executed correctly and for checking the quality of your sequencing data.

This page explains three important JSON outputs:

- run_info.json: generated by kallisto
- inspect.json: generated by bustools
- kb_info.json: generated by kb-python itself

These files allow users to verify:

- whether the pipeline executed the correct workflow and chemistry
- whether input FASTQs are intact
- whether alignment performed as expected
- whether barcodes and UMIs appear valid
- the exact commands and tool versions used for reproducibility

run_info.json — Kallisto Summary

run_info.json is generated by kallisto during pseudoalignment. It summarizes how many reads were processed, how many pseudoaligned, and which index and version were used.

Typical contents:

{
  "n_targets": 86288,
  "n_bootstraps": 0,
  "n_processed": 13456789,
  "n_pseudoaligned": 12034567,
  "p_pseudoaligned": 89.4,
  "p_unique": 24.4,
  "kallisto_version": "0.51.1",
  "index_version": 11,
  "index_kmer_length": 31,
  "start_time": "...",
  "call": "kallisto bus ..."
}

Key fields

| Field | Description | What to check |
| --- | --- | --- |
| `n_targets` | Number of transcript targets in the index | Matches expected transcriptome build |
| `n_processed` | Total reads processed | Roughly equals FASTQ read count |
| `n_pseudoaligned` | Reads that pseudoaligned to the reference | Much lower than expected → low quality or wrong index |
| `p_pseudoaligned` | Percent of reads pseudoaligned | Good data often >60–70% |
| `p_unique` | Percent of reads mapping uniquely to one target | < 20% may indicate low library complexity or poor quality reads |
| `call` | Full kallisto invocation | Confirms correct parameters were used |

Signs of potential problems

- p_pseudoaligned < 40% (often wrong index, chemistry mismatch, or poor-quality reads)
- n_processed far below expected FASTQ size (truncated or corrupted FASTQs)
- index_version incompatible with kallisto version
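
The warning signs above can be turned into a small automated check. The following is a minimal sketch, not part of kb-python: the function names, the `expected_reads` argument, and the 5% `tolerance` are assumptions made for illustration, and the thresholds are simply the rules of thumb quoted on this page.

```python
import json
from pathlib import Path

# Rule-of-thumb thresholds taken from the text above; adjust them per assay.
MIN_P_PSEUDOALIGNED = 40.0
MIN_P_UNIQUE = 20.0


def load_json(path):
    """Parse a JSON metadata file such as run_info.json into a dict."""
    return json.loads(Path(path).read_text())


def check_run_info(info, expected_reads=None, tolerance=0.05):
    """Return a list of warnings for a parsed run_info.json.

    expected_reads (hypothetical argument) is the read count you expect
    from the input FASTQs; the comparison is skipped when it is None.
    """
    warnings = []
    if info["p_pseudoaligned"] < MIN_P_PSEUDOALIGNED:
        warnings.append("low p_pseudoaligned: check index, chemistry, read quality")
    if info["p_unique"] < MIN_P_UNIQUE:
        warnings.append("low p_unique: possible low complexity or poor reads")
    if expected_reads:
        shortfall = 1 - info["n_processed"] / expected_reads
        if shortfall > tolerance:
            warnings.append("n_processed far below expected: FASTQs may be truncated")
    return warnings


# Values copied from the example run_info.json above.
example_info = {
    "n_processed": 13456789,
    "n_pseudoaligned": 12034567,
    "p_pseudoaligned": 89.4,
    "p_unique": 24.4,
}
print(check_run_info(example_info))  # prints []
```

For a real run, `check_run_info(load_json("run_info.json"))` would apply the same checks to the file in your output directory.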

inspect.json — bustools inspect summary

inspect.json is produced by bustools inspect and provides aggregate statistics about the BUS file: how many BUS records and reads are present, how many distinct barcodes and UMIs were observed, summaries of reads-per-barcode and UMIs-per-barcode, and how many barcodes/reads match the supplied on-list. Below is an example snippet and a field-by-field explanation.

Example (abridged) contents:

{
  "numRecords": 117354584,
  "numReads": 507909041,
  "numBarcodes": 3904019,
  "medianReadsPerBarcode": 3.000000,
  "meanReadsPerBarcode": 130.099019,
  "numUMIs": 16529783,
  "numBarcodeUMIs": 96095799,
  "medianUMIsPerBarcode": 1.000000,
  "meanUMIsPerBarcode": 24.614583,
  "gtRecords": 28752978,
  "numBarcodesOnOnlist": 1138841,
  "percentageBarcodesOnOnlist": 29.170990,
  "numReadsOnOnlist": 488498272,
  "percentageReadsOnOnlist": 96.178298
}

Field definitions and interpretation

| Field | What it reports | How to interpret / check |
| --- | --- | --- |
| `numRecords` | Total number of BUS records inspected by `bustools inspect`. | Each BUS record holds a barcode, a UMI, an equivalence class, and a count of the reads collapsed into it. Use this to sanity-check that the BUS file is not empty. |
| `numReads` | Total number of reads represented by those BUS records. | Should be in the same range as the read counts in run_info.json (the BUS file normally holds the pseudoaligned reads, so `n_pseudoaligned` is the closest comparison). Large discrepancies can indicate truncated or mispaired FASTQs or an incorrect kallisto run. |
| `numBarcodes` | Number of distinct barcodes observed in the BUS file. | Compare to the expected cell barcode space (e.g., on the order of a million for high-throughput experiments). Very low values may indicate a chemistry mismatch. |
| `medianReadsPerBarcode` | Median number of reads assigned to a barcode (across all barcodes). | Shows the typical sequencing depth per barcode. The median is robust to very-high-depth barcode outliers. |
| `meanReadsPerBarcode` | Average number of reads per barcode. | If mean ≫ median, a few barcodes have extremely high read counts; that may be expected (ambient RNA, multiplets) or indicate problems. |
| `numUMIs` | Number of distinct UMI sequences observed across the whole file. | Lower-than-expected values can indicate short or low-quality UMIs or wrong chemistry parameters. |
| `numBarcodeUMIs` | Number of distinct (barcode, UMI) pairs observed across the file. | Typically larger than `numUMIs` because each UMI sequence can appear under multiple barcodes; use both fields to understand complexity. |
| `medianUMIsPerBarcode` | Median number of unique UMIs observed per barcode. | Typical values depend heavily on protocol and sequencing depth; single-cell libraries often have low medians (1–10) for shallow sequencing runs. |
| `meanUMIsPerBarcode` | Average number of unique UMIs per barcode. | As with reads, if mean ≫ median, a small set of barcodes carries most UMIs. |
| `gtRecords` | Good-Toulmin estimate of the number of new BUS records expected if sequencing depth were doubled. | A value that is small relative to `numRecords` suggests the library is approaching saturation; a large value suggests deeper sequencing would still yield new records. |
| `numBarcodesOnOnlist` | Number of observed barcodes that are present in the provided on-list (the “whitelist” / official barcode list for the technology). | Indicates how many barcodes match the expected barcode set. |
| `percentageBarcodesOnOnlist` | Percentage of observed barcodes that are on the on-list. | For 10x-style experiments a substantial fraction of reads should be on-list, but the percentage of distinct observed barcodes on the on-list can be lower because many sequencing errors create unique off-list barcodes. |
| `numReadsOnOnlist` | Number of reads whose barcode is on the on-list. | Compare with `numReads`; the resulting percentage is reported as `percentageReadsOnOnlist`. |
| `percentageReadsOnOnlist` | Percentage of reads whose barcode is on the on-list. | Often the most informative single metric: a high value (e.g., > 80–90%) indicates that most reads came from legitimate barcodes, as expected for correctly specified chemistry and high-quality data. |

Practical checks and recommendations

Sanity-check sizes: numReads and numRecords should be large and in the ballpark of what you expect from your input FASTQs and from the kallisto run. If either is very small, check that kallisto succeeded and that FASTQs are intact.
On-list checks:
- Check percentageReadsOnOnlist first: if a large majority of reads are on the on-list (e.g., >80%), the barcode on-list was likely correct and most reads are assignable to expected barcodes.
- If percentageReadsOnOnlist is high but percentageBarcodesOnOnlist is low, that usually means many low-frequency erroneous barcodes exist (normal).
- If both the read-level and barcode-level on-list percentages are low, verify the technology (-x) and on-list (-w) used for the run.
Reads/UMIs per barcode:
- Compare median vs mean. If mean ≫ median this indicates heavy skew: a few barcodes hold many reads/UMIs (possible multiplets, ambient RNA, or barcode collisions).
- Very low medians (e.g., medians near 1) indicate shallow sequencing per barcode — that could be expected for some experimental designs.
UMI / barcodeUMI counts:
- numBarcodeUMIs >> numUMIs is expected: the same UMI sequence may occur across many barcodes; what matters for per-cell counting is the per-barcode UMI distribution (e.g., medianUMIsPerBarcode).
gtRecords:
- gtRecords is a Good-Toulmin estimate of how many new BUS records a doubling of sequencing depth would add. If it is small relative to numRecords, the library is close to saturation; if it is large, deeper sequencing would likely recover more distinct records.
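
The on-list and skew checks above can be sketched as a short function. This is an illustrative sketch only: `check_inspect` and its parameters are hypothetical names, and the cut-offs for a "low" barcode percentage (50%) and for "mean ≫ median" (a 10-fold ratio) are assumptions chosen for the example, not values defined by bustools.

```python
def check_inspect(stats, min_reads_onlist=80.0, low_barcodes_onlist=50.0,
                  skew_ratio=10.0):
    """Return review notes for a parsed inspect.json (a dict).

    All three thresholds are illustrative assumptions; tune them to
    your protocol and sequencing depth.
    """
    if stats["numRecords"] == 0 or stats["numReads"] == 0:
        return ["BUS file looks empty: confirm that kallisto succeeded"]

    notes = []
    reads_pct = stats["percentageReadsOnOnlist"]
    barcodes_pct = stats["percentageBarcodesOnOnlist"]
    if reads_pct < min_reads_onlist:
        notes.append(f"only {reads_pct:.1f}% of reads on the on-list: "
                     "verify the technology and on-list")
    elif barcodes_pct < low_barcodes_onlist:
        notes.append(f"{barcodes_pct:.1f}% of barcodes on the on-list but "
                     f"{reads_pct:.1f}% of reads: many low-frequency "
                     "erroneous barcodes (usually normal)")

    # Compare mean and median for both reads and UMIs per barcode.
    for unit in ("Reads", "UMIs"):
        median = stats[f"median{unit}PerBarcode"]
        mean = stats[f"mean{unit}PerBarcode"]
        if median > 0 and mean / median >= skew_ratio:
            notes.append(f"{unit} per barcode are heavily skewed "
                         f"(mean {mean:.1f}, median {median:.1f})")
    return notes


# Values copied from the example inspect.json above.
example_stats = {
    "numRecords": 117354584,
    "numReads": 507909041,
    "medianReadsPerBarcode": 3.0,
    "meanReadsPerBarcode": 130.099019,
    "medianUMIsPerBarcode": 1.0,
    "meanUMIsPerBarcode": 24.614583,
    "percentageBarcodesOnOnlist": 29.170990,
    "percentageReadsOnOnlist": 96.178298,
}
for note in check_inspect(example_stats):
    print(note)
```

With the example values, the function reports the benign "many off-list barcodes" pattern plus skew in both reads and UMIs per barcode, which matches the interpretation given in the list above.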

Troubleshooting guidance

Low percentageReadsOnOnlist
- Check that you supplied the correct technology/on-list (-x, -w).
- Verify that the on-list file provided matches the barcodes present in your experiment (custom chemistry needs a custom on-list).
Very low numReads or numRecords
- Confirm kallisto completed without errors (look at run_info.json and kallisto logs).
- Inspect input FASTQs for truncation or missing pairs.
Extreme skew in mean vs median
- A small set of barcodes dominating reads may be multiplets or barcode synthesis artifacts. Consider additional filtering, ambient RNA correction, or multiplet detection in downstream analysis.

kb_info.json — kb-python run provenance and runtime

kb_info.json is produced by kb-python and records run-level provenance, tool versions, the exact commands executed, timing information, and per-step runtimes. It is the authoritative record of how the pipeline was run and is essential for reproducibility and diagnosing pipeline problems.

Example:

{
    "workdir": "/home/.../",
    "version": "0.29.3",
    "kallisto": {
        "path": "/.../kallisto",
        "version": "0.51.1"
    },
    "bustools": {
        "path": "/.../bustools",
        "version": "0.45.0"
    },
    "start_time": "2025-10-20T18:31:59.761408",
    "end_time": "2025-10-20T19:48:38.041715",
    "elapsed": 4598.280307,
    "call": "/home/.../kb count --overwrite --h5ad -i index.idx -g t2g.txt -x 10XV3 -o ... --workflow=nac -c1 cdna.txt -c2 nascent.txt ...",
    "commands": [
        "kallisto bus -i index.idx -o ... -x 10XV3 -t 16 ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools inspect -o ... -w 10x_version3_onlist.txt ...",
        "bustools correct -o ... -w 10x_version3_onlist.txt ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools count -o ... -g t2g.txt -e ... -t ... -s nascent.txt --genecounts --umi-gene ..."
    ],
    "runtimes": [
        4194.1722021102905,
        105.48708367347717,
        35.703389406204224,
        44.34646153450012,
        31.489474534988403,
        163.7225775718689
    ]
}
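
To see where the time went, the `commands` and `runtimes` arrays can be paired up. The sketch below assumes that the two arrays are aligned by position, which is consistent with the example (six commands, six runtimes) but is an assumption rather than something stated on this page; `summarize_steps` is a hypothetical helper name.

```python
def summarize_steps(kb_info):
    """Pair each command with its runtime in seconds.

    Assumes kb_info["commands"] and kb_info["runtimes"] are aligned by
    position; raises ValueError if their lengths differ.
    """
    commands = kb_info["commands"]
    runtimes = kb_info["runtimes"]
    if len(commands) != len(runtimes):
        raise ValueError("commands and runtimes have different lengths")
    steps = []
    for command, seconds in zip(commands, runtimes):
        # Label each step by its first two tokens, e.g. "kallisto bus".
        label = " ".join(command.split()[:2])
        steps.append((label, seconds))
    return steps


# Abbreviated copy of the example kb_info.json above.
kb_info = {
    "elapsed": 4598.280307,
    "commands": [
        "kallisto bus -i index.idx -o ... -x 10XV3 -t 16 ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools inspect -o ... -w 10x_version3_onlist.txt ...",
        "bustools correct -o ... -w 10x_version3_onlist.txt ...",
        "bustools sort -o ... -T ... -t 16 -m 4G ...",
        "bustools count -o ... -g t2g.txt -e ... -t ... -s nascent.txt ...",
    ],
    "runtimes": [
        4194.1722021102905,
        105.48708367347717,
        35.703389406204224,
        44.34646153450012,
        31.489474534988403,
        163.7225775718689,
    ],
}

steps = summarize_steps(kb_info)
slowest = max(steps, key=lambda step: step[1])
# Time not attributed to any listed command (setup, file handling, etc.).
overhead = kb_info["elapsed"] - sum(seconds for _, seconds in steps)

for label, seconds in steps:
    print(f"{label:<18} {seconds:10.1f} s")
print(f"slowest step: {slowest[0]}; unattributed time: {overhead:.1f} s")
```

In the example, `kallisto bus` accounts for most of the elapsed time, and roughly 23 seconds are not attributed to any listed command.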

Checklist for successful runs

Use the following as a quick verification workflow:

Check kallisto alignment quality using run_info.json:
- p_pseudoaligned is within expected range
- n_processed matches FASTQ size
Check barcode/UMI integrity using inspect.json:
- Majority of reads are on the on-list
- Barcode/UMI lengths match the chemistry
Confirm pipeline parameters using kb_info.json:
- Correct workflow (standard / kite / nac / custom)
- Correct technology (10xv2 / 10xv3 / custom)
- Correct references and t2g file
- Commands match expected configuration
Archive all three JSON files for reproducibility.
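
The archiving step can be scripted. This is a minimal sketch under one assumption: that all three JSON files sit at the top level of the kb output directory. The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

METADATA_FILES = ("run_info.json", "inspect.json", "kb_info.json")


def archive_metadata(out_dir, archive_dir):
    """Copy the three metadata files from out_dir into archive_dir.

    Returns (copied, missing) so that a missing file is reported
    rather than silently ignored.
    """
    out_dir = Path(out_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in METADATA_FILES:
        source = out_dir / name
        if source.is_file():
            # copy2 also preserves timestamps, which helps provenance.
            shutil.copy2(source, archive_dir / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

A call such as `archive_metadata("kb_out", "archive/sample1")` (both paths are placeholders) returns the names that were copied and the names that were not found; a non-empty `missing` list is itself a sign that the run did not complete.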

Troubleshooting common problems

Low pseudoalignment rate:
- Usually wrong transcriptome index or poor read quality.
- An incorrect strandedness setting can also lower the pseudoalignment rate. Many technologies default to forward strand-specific mapping, but some assays have a different strand specificity, in which case the default is not appropriate. Try --strand=forward, --strand=unstranded, and --strand=reverse, and compare p_pseudoaligned across the runs to choose the best setting.
Barcode structure mismatches: Often caused by incorrect -x or -w argument.
Missing or empty output files: Indicates truncated FASTQs, corrupted BUS file, or interrupted run.
Inconsistent reference versions: Verify that all reference files were generated together using kb ref.
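
For the strandedness check above, once the pipeline has been run with each --strand value, the pseudoalignment rates can be compared programmatically. The sketch below is a heuristic, not a kb-python feature: it assumes one output directory per setting, and it prefers a stranded setting when its rate is within a hypothetical `margin` (in percentage points) of the unstranded rate, on the reasoning that unstranded mapping accepts reads from either strand and so tends to score at least as high. The rates in the example are made-up illustrative numbers, not real results.

```python
import json
from pathlib import Path


def read_rate(out_dir):
    """Read p_pseudoaligned from the run_info.json in one output directory."""
    info = json.loads((Path(out_dir) / "run_info.json").read_text())
    return info["p_pseudoaligned"]


def pick_strand(rates, margin=5.0):
    """Choose a --strand value from per-setting pseudoalignment rates.

    rates maps "forward", "unstranded", and "reverse" to p_pseudoaligned.
    A stranded setting is preferred when it comes within `margin`
    percentage points of the unstranded rate (an illustrative heuristic).
    """
    best_stranded = max(("forward", "reverse"), key=lambda s: rates[s])
    if rates["unstranded"] - rates[best_stranded] <= margin:
        return best_stranded
    return "unstranded"


# Made-up rates for illustration only.
rates = {"forward": 12.3, "unstranded": 71.8, "reverse": 70.9}
print(pick_strand(rates))  # prints reverse
```

In practice `rates` would be filled with `read_rate(...)` for each of the three output directories. Treat the result as a starting point and confirm it against the assay's documented strand specificity.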

These output files provide you with almost everything needed to ensure that kb-python ran correctly and that your data exhibit expected structure and quality. You are strongly encouraged to inspect all three files before proceeding to downstream analysis.