Interpreting quality and run information
When you run the kb command, the pipeline generates several metadata
files that record provenance, tool versions, command parameters, and
basic quality metrics. These files are useful both for validating
that your run executed correctly and for checking the quality of your
sequencing data.
This page explains three important JSON outputs:
run_info.json — generated by kallisto
inspect.json — generated by bustools
kb_info.json — generated by kb-python itself
These files allow users to verify:
whether the pipeline executed the correct workflow and chemistry
whether input FASTQs are intact
whether alignment performed as expected
whether barcodes and UMIs appear valid
the exact commands and tool versions used for reproducibility
run_info.json — Kallisto Summary
run_info.json is generated by kallisto during pseudoalignment. It summarizes how many reads were processed, how many pseudoaligned, and which index and version were used.
Typical contents:
{
"n_targets": 86288,
"n_bootstraps": 0,
"n_processed": 13456789,
"n_pseudoaligned": 12034567,
"p_pseudoaligned": 89.4,
"p_unique": 24.4,
"kallisto_version": "0.51.1",
"index_version": 11,
"index_kmer_length": 31,
"start_time": "...",
"call": "kallisto bus ..."
}
Key fields
| Field | Description | What to check |
|---|---|---|
| n_targets | Number of transcript targets in the index | Matches expected transcriptome build |
| n_processed | Total reads processed | Roughly equals FASTQ read count |
| n_pseudoaligned | Reads that pseudoaligned to the reference | Much lower than expected → low quality or wrong index |
| p_pseudoaligned | Percent of reads pseudoaligned | Good data often >60–70% |
| p_unique | Percent of reads mapping uniquely to one target | < 20% may indicate low library complexity or poor quality reads |
| call | Full kallisto invocation | Confirms correct parameters were used |
Signs of potential problems
p_pseudoaligned < 40% (often wrong index, chemistry mismatch, or poor-quality reads)
n_processed far below expected FASTQ size (truncated or corrupted FASTQs)
index_version incompatible with kallisto version
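The warning signs above can be turned into a small automated check. The following is a minimal sketch, not part of kb-python: the function names, the 5% tolerance on read counts, and the idea of passing in an independently obtained `expected_reads` are assumptions made for illustration. The 40% and 20% thresholds come from the guidance above and may need adjusting for your assay.

```python
import json
from pathlib import Path

# Thresholds taken from the guidance above; adjust them for your assay.
MIN_PSEUDOALIGNED_PCT = 40.0
MIN_UNIQUE_PCT = 20.0


def load_run_info(path):
    """Parse a run_info.json file into a dict."""
    return json.loads(Path(path).read_text())


def check_run_info(info, expected_reads=None, tolerance=0.05):
    """Return a list of warning strings for a parsed run_info.json dict.

    expected_reads is a hypothetical, independently obtained FASTQ read
    count; tolerance is the allowed fractional shortfall (an assumption).
    """
    warnings = []
    if info["p_pseudoaligned"] < MIN_PSEUDOALIGNED_PCT:
        warnings.append(
            f"p_pseudoaligned is {info['p_pseudoaligned']}%: "
            "check the index, the chemistry, and read quality"
        )
    if info["p_unique"] < MIN_UNIQUE_PCT:
        warnings.append(
            f"p_unique is {info['p_unique']}%: "
            "possible low library complexity or poor-quality reads"
        )
    if expected_reads is not None:
        shortfall = 1 - info["n_processed"] / expected_reads
        if shortfall > tolerance:
            warnings.append(
                f"n_processed is {shortfall:.0%} below the expected read count: "
                "FASTQs may be truncated or corrupted"
            )
    return warnings
```

For a real run you might call `check_run_info(load_run_info("out/run_info.json"))`, where `out` is a placeholder for your kb output directory. An empty list means none of these particular warning signs were triggered; it does not by itself prove the run is good.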
inspect.json — bustools inspect summary
inspect.json is produced by bustools inspect and provides aggregate
statistics about the BUS file: how many BUS records and reads are present,
how many distinct barcodes and UMIs were observed, summaries of reads-per-barcode
and UMIs-per-barcode, and how many barcodes/reads match the supplied on-list.
Below is an example snippet and a field-by-field explanation.
Example (abridged) contents:
{
"numRecords": 117354584,
"numReads": 507909041,
"numBarcodes": 3904019,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 130.099019,
"numUMIs": 16529783,
"numBarcodeUMIs": 96095799,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 24.614583,
"gtRecords": 28752978,
"numBarcodesOnOnlist": 1138841,
"percentageBarcodesOnOnlist": 29.170990,
"numReadsOnOnlist": 488498272,
"percentageReadsOnOnlist": 96.178298
}
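The mean and percentage fields in the example above can be recomputed from the raw counts, which is a quick way to confirm that an inspect.json file is internally consistent. The sketch below is illustrative only: the relationships between fields are inferred from the example numbers rather than taken from bustools documentation, and the function names and tolerance are assumptions.

```python
import math

# Values copied from the example inspect.json above.
example = {
    "numReads": 507909041,
    "numBarcodes": 3904019,
    "meanReadsPerBarcode": 130.099019,
    "numBarcodeUMIs": 96095799,
    "meanUMIsPerBarcode": 24.614583,
    "numBarcodesOnOnlist": 1138841,
    "percentageBarcodesOnOnlist": 29.170990,
    "numReadsOnOnlist": 488498272,
    "percentageReadsOnOnlist": 96.178298,
}


def derived_fields(stats):
    """Recompute the ratio fields from the raw counts (inferred relationships)."""
    n_barcodes = stats["numBarcodes"]
    n_reads = stats["numReads"]
    return {
        "meanReadsPerBarcode": n_reads / n_barcodes,
        "meanUMIsPerBarcode": stats["numBarcodeUMIs"] / n_barcodes,
        "percentageBarcodesOnOnlist": 100 * stats["numBarcodesOnOnlist"] / n_barcodes,
        "percentageReadsOnOnlist": 100 * stats["numReadsOnOnlist"] / n_reads,
    }


def inconsistent_fields(stats, rel_tol=1e-6):
    """Return names of reported fields that disagree with the recomputed values."""
    return [
        name
        for name, value in derived_fields(stats).items()
        if not math.isclose(value, stats[name], rel_tol=rel_tol)
    ]
```

Calling `inconsistent_fields(example)` should return an empty list for the numbers shown above. A non-empty result on a real file would suggest the file was edited or truncated, or that the inferred relationships do not hold for your bustools version.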
Field definitions and interpretation
| Field | What it reports | How to interpret / check |
|---|---|---|
| numRecords | Total number of BUS records inspected by bustools inspect. | A BUS record typically corresponds to a pseudoaligned record in the BUS file. Use this to sanity-check that the BUS file is not empty. |
| numReads | The total number of raw reads represented by those BUS records. | Should be close to the total number of input reads (or read pairs) after any technology-specific collapsing. Large discrepancies can indicate truncated or mispaired FASTQs or an incorrect kallisto run. |
| numBarcodes | Number of distinct barcodes observed in the BUS file. | Compare to the expected cell barcode space (e.g., around a million for high-throughput experiments). Very low values may indicate a chemistry mismatch. |
| medianReadsPerBarcode | The median number of reads assigned to a barcode (across all barcodes). | Useful to see the typical sequencing depth per barcode. The median is robust to very-high-depth barcode outliers. |
| meanReadsPerBarcode | The average number of reads per barcode. | If mean ≫ median, a few barcodes have extremely high read counts; that may be expected (ambient RNA, multiplets) or indicate problems. |
| numUMIs | Count of distinct UMI sequences observed (unique UMIs aggregated across the file). | Lower-than-expected values can indicate short/low-quality UMIs or wrong chemistry parameters. |
| numBarcodeUMIs | Count of distinct barcode+UMI pairs observed (i.e., unique (barcode, UMI) combinations across the file). | This is typically larger than numUMIs, since the same UMI sequence can occur across many barcodes. |
| medianUMIsPerBarcode | Median number of unique UMIs observed per barcode. | Typical values depend heavily on protocol and sequencing depth; single-cell libraries often have low medians (1–10) for shallow sequencing runs. |
| meanUMIsPerBarcode | Average number of unique UMIs per barcode. | As with reads, if mean ≫ median, a small set of barcodes carries most UMIs. |
| gtRecords | Number of BUS records with a valid gene/transcript assignment (i.e., the record maps to a transcript/gene in the index); these records contribute to gene-level signals. | Use to estimate how many records will be useful for downstream counting. |
| numBarcodesOnOnlist | Number of observed barcodes that are present in the provided on-list (the “whitelist” / official barcode list for the technology). | This indicates how many barcodes match the expected barcode set. |
| percentageBarcodesOnOnlist | Percentage of observed barcodes that are on-list. | For 10x-style experiments a substantial fraction of reads should be on-list, but the fraction of distinct observed barcodes on the on-list can be lower because many sequencing errors create unique off-list barcodes. |
| numReadsOnOnlist | Number of reads whose barcode is on the on-list. | This is often the most informative single metric: a high percentage (e.g., > 80–90%) indicates that most reads came from legitimate barcodes. |
| percentageReadsOnOnlist | Percentage of reads whose barcode is on the on-list. | High values are expected for correctly specified chemistry and high-quality data. |
Practical checks and recommendations
Sanity-check sizes:
numReads and numRecords should be large and in the ballpark of what you expect from your input FASTQs and from the kallisto run. If either is very small, check that kallisto succeeded and that FASTQs are intact.
On-list checks:
Check percentageReadsOnOnlist first: if a large majority of reads are on the on-list (e.g., >80%), the barcode on-list was likely correct and most reads are assignable to expected barcodes.
If percentageReadsOnOnlist is high but percentageBarcodesOnOnlist is low, that usually means many low-frequency erroneous barcodes exist (normal).
If both read- and barcode-level on-list percentages are low, verify the --technology setting and the on-list used with kallisto/bustools.
Reads/UMIs per barcode:
Compare median vs mean. If mean ≫ median, this indicates heavy skew: a few barcodes hold many reads/UMIs (possible multiplets, ambient RNA, or barcode collisions).
Very low medians (e.g., medians near 1) indicate shallow sequencing per barcode; that could be expected for some experimental designs.
UMI / barcode-UMI counts:
numBarcodeUMIs >> numUMIs is expected: the same UMI sequence may occur across many barcodes; what matters for per-cell counting is the per-barcode UMI distribution (e.g., medianUMIsPerBarcode).
gtRecords:
If gtRecords is much smaller than numRecords (i.e., most records do not map to transcripts/genes), this may indicate an index mismatch or incorrect reference (kallisto index). Confirm that the index matches the species/annotation used for your reads.
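The on-list and skew checks above can be sketched as a single function over a parsed inspect.json. This is an illustrative helper, not a bustools or kb-python feature: the function name, the note wording, and in particular the `skew_ratio` of 10 used to approximate "mean ≫ median" are assumptions. The 80% on-list threshold comes from the text above.

```python
def check_inspect(stats, min_on_list_pct=80.0, skew_ratio=10.0):
    """Return a list of notes for a parsed inspect.json dict.

    skew_ratio is an arbitrary stand-in for "mean much greater than median";
    pick a value that suits your protocol.
    """
    notes = []
    reads_pct = stats["percentageReadsOnOnlist"]
    barcodes_pct = stats["percentageBarcodesOnOnlist"]

    if reads_pct < min_on_list_pct:
        notes.append(
            f"only {reads_pct:.1f}% of reads are on the on-list: "
            "verify the technology and on-list"
        )
    elif barcodes_pct < min_on_list_pct:
        # High read-level but low barcode-level match: usually many
        # low-frequency erroneous barcodes, which is normal.
        notes.append(
            f"{reads_pct:.1f}% of reads but only {barcodes_pct:.1f}% of "
            "barcodes are on the on-list: likely sequencing-error barcodes"
        )

    for unit in ("Reads", "UMIs"):
        median = stats[f"median{unit}PerBarcode"]
        mean = stats[f"mean{unit}PerBarcode"]
        if mean > skew_ratio * median:
            notes.append(
                f"{unit} per barcode are heavily skewed "
                f"(mean {mean:.1f}, median {median:.1f})"
            )
    return notes
```

Applied to the example inspect.json shown earlier, this would report the benign barcode-level mismatch and flag skew for both reads and UMIs per barcode, which is common for unfiltered droplet data. The notes are prompts for a closer look, not verdicts.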
Troubleshooting guidance
Low percentageReadsOnOnlist:
Check that you supplied the correct technology/on-list (-x, -w).
Verify that the on-list file provided matches the barcodes present in your experiment (custom chemistry needs a custom on-list).
Very low numReads or numRecords:
Confirm kallisto completed without errors (look at run_info.json and kallisto logs).
Inspect input FASTQs for truncation or missing pairs.
Extreme skew in mean vs median:
A small set of barcodes dominating reads may be multiplets or barcode synthesis artifacts. Consider additional filtering, ambient RNA correction, or multiplet detection in downstream analysis.
kb_info.json — kb-python run provenance and runtime
kb_info.json is produced by kb-python and records run-level provenance, tool versions, the exact commands executed, timing information, and per-step runtimes. It is the authoritative record of how the pipeline was run and is essential for reproducibility and diagnosing pipeline problems.
Example:
{
"workdir": "/home/.../",
"version": "0.29.3",
"kallisto": {
"path": "/.../kallisto",
"version": "0.51.1"
},
"bustools": {
"path": "/.../bustools",
"version": "0.45.0"
},
"start_time": "2025-10-20T18:31:59.761408",
"end_time": "2025-10-20T19:48:38.041715",
"elapsed": 4598.280307,
"call": "/home/.../kb count --overwrite --h5ad -i index.idx -g t2g.txt -x 10XV3 -o ... --workflow=nac -c1 cdna.txt -c2 nascent.txt ...",
"commands": [
"kallisto bus -i index.idx -o ... -x 10XV3 -t 16 ...",
"bustools sort -o ... -T ... -t 16 -m 4G ...",
"bustools inspect -o ... -w 10x_version_onlist.txt ...",
"bustools correct -o ... -w 10x_version3_onlist.txt ...",
"bustools sort -o ... -T ... -t 16 -m 4G ...",
"bustools count -o ... -g t2g.txt -e ... -t ... -s nascent.txt --genecounts --umi-gene ..."
],
"runtimes": [
4194.1722021102905,
105.48708367347717,
35.703389406204224,
44.34646153450012,
31.489474534988403,
163.7225775718689
]
}
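The timing fields in the example lend themselves to a quick per-step breakdown. The sketch below assumes that `commands` and `runtimes` are parallel lists in the same order, which the example suggests (six entries each) but which is not stated explicitly; the function names are illustrative.

```python
from datetime import datetime


def wall_clock_seconds(kb_info):
    """Seconds between start_time and end_time (ISO 8601 strings)."""
    start = datetime.fromisoformat(kb_info["start_time"])
    end = datetime.fromisoformat(kb_info["end_time"])
    return (end - start).total_seconds()


def summarize_runtimes(kb_info):
    """Pair each command with its runtime, slowest first.

    Assumes commands[i] corresponds to runtimes[i]. Each command is
    shortened to its first two words, e.g. "kallisto bus".
    """
    steps = [
        (" ".join(command.split()[:2]), seconds)
        for command, seconds in zip(kb_info["commands"], kb_info["runtimes"])
    ]
    return sorted(steps, key=lambda step: step[1], reverse=True)
```

With the example above, `wall_clock_seconds` reproduces the reported `elapsed` value, and the first entry returned by `summarize_runtimes` is the `kallisto bus` step, which accounts for most of the run. The gap between `elapsed` and the sum of `runtimes` is time spent outside the listed commands.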
Checklist for successful runs
Use the following as a quick verification workflow:
Check kallisto alignment quality using run_info.json:
p_pseudoaligned is within the expected range
n_processed matches the FASTQ read count
Check barcode/UMI integrity using inspect.json:
Majority of reads are on the on-list
Barcode/UMI lengths match the chemistry
Confirm pipeline parameters using kb_info.json:
Correct workflow (standard / kite / nac / custom)
Correct technology (10xv2 / 10xv3 / custom)
Correct references and t2g file
Commands match expected configuration
Archive all three JSON files for reproducibility.
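Archiving can be as simple as copying the three files into a per-sample folder. The following sketch assumes all three JSON files sit directly in the kb output directory; `archive_qc_files` and the directory names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path

QC_FILES = ("run_info.json", "inspect.json", "kb_info.json")


def archive_qc_files(run_dir, archive_dir):
    """Copy the three QC JSON files from run_dir into archive_dir.

    Raises FileNotFoundError if any of the files is missing, so an
    incomplete run is not archived silently.
    """
    run_dir, archive_dir = Path(run_dir), Path(archive_dir)
    missing = [name for name in QC_FILES if not (run_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {run_dir}: {', '.join(missing)}")
    archive_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves file timestamps along with the contents.
    return [Path(shutil.copy2(run_dir / name, archive_dir / name)) for name in QC_FILES]
```

A call such as `archive_qc_files("out", "qc_archive/sample1")` would copy the files and return their new paths. Keeping one archive folder per sample avoids overwriting files from earlier runs.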
Troubleshooting Common Problems
Low pseudoalignment rate:
Usually wrong transcriptome index or poor read quality.
An incorrect strandedness setting may also cause low mapping rates. By default, many technologies are run in forward strand-specific mapping mode. However, some assays do not have the same strand specificity, in which case the default is not appropriate. Try each of --strand=forward, --strand=unstranded, and --strand=reverse to determine the best option.
Barcode structure mismatches: Often caused by an incorrect -x or -w argument.
Missing or empty output files: Indicates truncated FASTQs, a corrupted BUS file, or an interrupted run.
Inconsistent reference versions: Verify that all reference files were generated together using kb ref.
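When testing strand settings, it helps to line up the pseudoalignment rates from the separate runs. The sketch below assumes you ran kb once per --strand value into separate output directories, each containing a run_info.json; the function name and the directory names in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def pseudoalignment_by_strand(run_dirs):
    """Map each strand label to the p_pseudoaligned value of its run.

    run_dirs is a dict such as {"forward": "out_forward", ...}, where each
    value is an output directory containing a run_info.json.
    """
    rates = {}
    for strand, run_dir in run_dirs.items():
        info = json.loads((Path(run_dir) / "run_info.json").read_text())
        rates[strand] = info["p_pseudoaligned"]
    return rates


# Example usage (directory names are placeholders):
# rates = pseudoalignment_by_strand({
#     "forward": "out_forward",
#     "unstranded": "out_unstranded",
#     "reverse": "out_reverse",
# })
# for strand, pct in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
#     print(f"{strand}: {pct}% pseudoaligned")
```

The function only tabulates the rates; deciding which setting is appropriate still depends on what you know about the assay's strand specificity.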
These output files provide you with almost everything needed to ensure that kb-python ran correctly and that your data exhibit expected structure and quality. You are strongly encouraged to inspect all three files before proceeding to downstream analysis.