kb-python

Running kb-python usually involves two steps:

  1. Indexing a FASTA file of target sequences via kb ref

  2. Mapping sequencing reads to the kallisto index using kb count

kb-python supports several different sequencing technologies. Run kb --list to view technology information.
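The two steps can be scripted. The following is a minimal sketch that only assembles the two command lines from the flags documented below. The file names and the technology string 10XV3 are placeholders chosen for illustration (run kb --list for the names your installation supports). Nothing is executed unless the lists are passed to something like subprocess.run.

```python
import shlex


def kb_ref_command(index, t2g, cdna_fasta, genome_fasta, gtf):
    """Assemble a `kb ref` command for the standard workflow."""
    return ["kb", "ref", "-i", index, "-g", t2g, "-f1", cdna_fasta,
            genome_fasta, gtf]


def kb_count_command(index, t2g, technology, fastqs, out_dir="."):
    """Assemble a `kb count` command; FASTQs go last, each R1 before its R2."""
    return ["kb", "count", "-i", index, "-g", t2g, "-x", technology,
            "-o", out_dir, *fastqs]


# Placeholder file names and technology, for illustration only.
ref_cmd = kb_ref_command("index.idx", "t2g.txt", "cdna.fasta",
                         "genome.fasta", "genome.gtf")
count_cmd = kb_count_command("index.idx", "t2g.txt", "10XV3",
                             ["sample_R1.fastq.gz", "sample_R2.fastq.gz"],
                             out_dir="out")

print(shlex.join(ref_cmd))
print(shlex.join(count_cmd))
```

Keeping the commands as lists rather than strings avoids quoting problems when paths contain spaces.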

kb ref

Build a kallisto index and transcript-to-gene mapping.

Usage:

kb ref -i index.idx -g t2g.txt -f1 cdna.fasta [arguments] genome.fasta genome.gtf

Required Arguments:

-i INDEX

Path to the kallisto index to be constructed.

-g T2G

Path to transcript-to-gene mapping to be generated

-f1 FASTA

Path to the cDNA FASTA (standard, nac) or mismatch FASTA (kite) to be generated. Optional with -d, or with --aa when no GTF file(s) are provided. Not used with --workflow=custom.

Positional Arguments (required only if `-d` is not used):

  • fasta: Genomic FASTA file(s), comma-delimited

  • gtf: Reference GTF file(s), comma-delimited. Not required with --aa.

  • feature: Path to TSV containing barcodes and feature names. kite workflow only.

required arguments for `nac` workflow:

-f2 FASTA

Path to the unprocessed transcripts FASTA to be generated

-c1 T2C

Path to generate cDNA transcripts-to-capture

-c2 T2C

Path to generate unprocessed transcripts-to-capture
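For the nac workflow, six output paths are passed in one call. As a rough sketch, the helper below places all of them in one directory and assembles the corresponding kb ref command. The function name and the output file names are hypothetical, and only flags documented in this section are used.

```python
from pathlib import Path


def kb_ref_nac_command(out_dir, genome_fasta, gtf):
    """Assemble a `kb ref --workflow=nac` command with all outputs in out_dir."""
    out = Path(out_dir)
    return [
        "kb", "ref", "--workflow=nac",
        "-i", str(out / "index.idx"),             # kallisto index
        "-g", str(out / "t2g.txt"),               # transcript-to-gene mapping
        "-f1", str(out / "cdna.fasta"),           # cDNA FASTA
        "-f2", str(out / "unprocessed.fasta"),    # unprocessed transcripts FASTA
        "-c1", str(out / "cdna.t2c.txt"),         # cDNA transcripts-to-capture
        "-c2", str(out / "unprocessed.t2c.txt"),  # unprocessed transcripts-to-capture
        genome_fasta, gtf,
    ]


nac_cmd = kb_ref_nac_command("ref", "genome.fasta", "genome.gtf")
print(" ".join(nac_cmd))
```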

optional arguments:

-h, --help

Show a help message and exit

--tmp TMP

Override default temporary directory

--keep-tmp

Keep temporary files

--verbose

Print debugging information

--include-attribute KEY:VALUE

Only process GTF entries that have the provided KEY:VALUE attribute. May be specified multiple times.

--exclude-attribute KEY:VALUE

Only process GTF entries that do not have the provided KEY:VALUE attribute. May be specified multiple times.

-k K

Use this option to override the k-mer length of the index (max value: 31). Usually, the k-mer length automatically calculated by kb provides the best results (typically k=31, which is also the default).

-t THREADS

Number of threads to use (default: 8)

--d-list FASTA

D-list file(s) (default: the Genomic FASTA file(s) for standard/nac workflow)

--aa

Generate index from a FASTA-file containing amino acid sequences

--workflow {standard,nac,kite,custom}

The type of index to create. Use nac for an index type that can quantify nascent and mature RNA. Use custom for indexing targets directly. Use kite for feature barcoding. (default: standard)

-d NAME

Download a pre-built kallisto index (along with all necessary files) instead of building it locally

--make-unique

Replace repeated target names with unique names

--overwrite

Overwrite existing kallisto index

--kallisto KALLISTO

Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)

--bustools BUSTOOLS

Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)

--opt-off

Disable performance optimizations
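To make the KEY:VALUE filters (--include-attribute and --exclude-attribute) above concrete, here is an illustrative sketch of how a single filter could be applied to the attribute column of one GTF line. It is not kb's implementation. It assumes the usual `key "value";` attribute syntax, and the helper names are hypothetical.

```python
import re

# Matches attribute pairs such as: gene_biotype "protein_coding";
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(column):
    """Parse the ninth GTF column into a set of (key, value) pairs."""
    return set(ATTRIBUTE_RE.findall(column))


def keep_entry(column, include=None, exclude=None):
    """Return True if the entry passes one optional include/exclude filter."""
    attributes = parse_attributes(column)
    if include is not None and tuple(include.split(":", 1)) not in attributes:
        return False
    if exclude is not None and tuple(exclude.split(":", 1)) in attributes:
        return False
    return True


# Made-up attribute column for illustration.
column = 'gene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";'
print(keep_entry(column, include="gene_biotype:protein_coding"))
print(keep_entry(column, exclude="gene_biotype:protein_coding"))
```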

Output:

  • index.idx: A binary kallisto index file (path specified by -i)

  • t2g.txt: A two-column TSV mapping transcripts to genes (path specified by -g)

  • f1: A FASTA file of cDNA sequences (standard, nac) or mismatch sequences (kite) [path specified by -f1]

  • f2: A FASTA file of unprocessed transcript sequences (nac only) (path specified by -f2)

  • c1: A two-column TSV mapping cDNA transcripts to capture sequences (standard, nac) [path specified by -c1]

  • c2: A two-column TSV mapping unprocessed transcripts to capture sequences (nac only) [path specified by -c2]
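The transcript-to-gene mapping is plain tab-separated text, so it can be inspected with the standard library. The sketch below reads the first two columns into a dictionary. The transcript and gene IDs are invented for the example, and any further columns are ignored.

```python
import csv
import io


def read_t2g(handle):
    """Map transcript ID -> gene ID from the first two tab-separated columns."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


# Stand-in for open("t2g.txt"); the IDs below are made up.
example = io.StringIO("tx1\tgeneA\ntx2\tgeneA\ntx3\tgeneB\n")
t2g = read_t2g(example)
genes = sorted(set(t2g.values()))
print(len(t2g), "transcripts,", len(genes), "genes")
```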

kb count

Generate count matrices from a set of single-cell FASTQ files

Usage:

kb count -i INDEX -g T2G -x TECHNOLOGY [arguments] [FASTQs]

required arguments:

-i INDEX

Path to kallisto index

-g T2G

Path to transcript-to-gene mapping

-x TECHNOLOGY

Single-cell technology used (kb --list to view). If x=BULK, bulk RNA-seq quantification is performed instead.

positional arguments:

  • FASTQs:

    FASTQ files. For paired-end data, list each R2 file immediately after its corresponding R1 file.

    For technology SMARTSEQ, all input FASTQs are sorted alphabetically by path and paired in order, and cell IDs are assigned as incrementing integers starting from zero. A single batch TSV with cell ID, read 1, and read 2 as columns can be provided to override this behavior.

    In place of listing FASTQ files, a batch TSV file with three columns (cell ID, read 1, read 2) can be provided to specify multiple samples (for single-end reads the batch file is formatted as (cell ID, read1)).
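The default SMARTSEQ pairing and the batch TSV layout described above can be sketched as follows. This mirrors the described behavior (alphabetical sort, pair in order, cell IDs counted from zero) and writes the three-column batch format. It is an illustration with made-up file names, not kb's own code.

```python
import csv
import io


def pair_fastqs(paths):
    """Sort paths alphabetically, pair consecutive files, number cells from 0."""
    ordered = sorted(paths)
    if len(ordered) % 2:
        raise ValueError("paired-end input needs an even number of FASTQs")
    return [(str(i // 2), ordered[i], ordered[i + 1])
            for i in range(0, len(ordered), 2)]


def write_batch_tsv(rows, handle):
    """Write (cell ID, read 1, read 2) rows as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


fastqs = ["cellB_R2.fastq.gz", "cellA_R1.fastq.gz",
          "cellB_R1.fastq.gz", "cellA_R2.fastq.gz"]
rows = pair_fastqs(fastqs)

buffer = io.StringIO()  # stand-in for open("batch.tsv", "w", newline="")
write_batch_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```

Writing the batch file explicitly is a way to keep control of the cell IDs when file names do not sort into the intended pairs.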

required arguments for `nac` workflow:

-c1 T2C

Path to mature transcripts-to-capture

-c2 T2C

Path to nascent transcripts-to-capture

optional arguments:

-h, --help

Show a help message and exit

--tmp TMP

Override default temporary directory

--keep-tmp

Do not delete the tmp directory

--verbose

Print debugging information

-o OUT

Path to output directory (default: current directory)

--num

Store read numbers in BUS file

--parity {single, paired}

If both paired-end reads contain biological sequence, specify paired. Otherwise, specify single. (default: see kb --list)

-w ONLIST

Path to file of on-listed barcodes to correct to. If not provided and bustools supports the technology, a pre-packaged on-list is used. Otherwise, the bustools allowlist command is used. Specify NONE to bypass barcode error correction. (kb --list to view on-lists)

--exact-barcodes

Only exact matches are used for matching barcodes to on-list.

-r REPLACEMENT

Path to file of a replacement list to correct to. In the file, the first column is the original barcode and second is the replacement sequence.

-t THREADS

Number of threads to use (default: 8)

--strand {unstranded,forward,reverse}

Strandedness (default: see kb --list)

-m MEMORY

Maximum memory used (default: 2G for standard, 4G for others)

--inleaved

Specifies that input is an interleaved FASTQ file

--aa

Map to index generated from FASTA-file containing amino acid sequences

--workflow {standard,nac,kite,kite:10xFB}

Type of workflow. Use nac to specify a nac index for producing mature/nascent/ambiguous matrices. Use kite for feature barcoding. Use kite:10xFB for 10x Genomics Feature Barcoding technology. (default: standard)

--mm

Include reads that pseudoalign to multiple genes. Automatically enabled when generating a TCC matrix.

--h5ad

Generate h5ad file from count matrix

--tcc

Generate a TCC matrix instead of a gene count matrix.

--filter {bustools}

Produce a filtered gene count matrix (default: bustools)

--filter-threshold THRESH

Barcode filter threshold (default: auto)

--overwrite

Overwrite existing output.bus file

--dry-run

Do a dry run (no kallisto or bustools commands executed)

--batch-barcodes

When a batch file is supplied, store sample identifiers in barcodes

--loom

Generate loom file from count matrix

--loom-names col_attrs/{name},row_attrs/{name}

Names for col_attrs and row_attrs in loom file (default: barcode, target_name). Use --loom-names=velocyto for velocyto-compatible loom files

--sum TYPE

Produce summed count matrices (Options: none, cell, nucleus, total). Use cell to add ambiguous and processed transcript matrices. Use nucleus to add ambiguous and unprocessed transcript matrices. Use total to add all three matrices together. (Default: none)

--cellranger

Convert count matrices to cellranger-compatible format

--union

Take the union of all k-mer alignments (default: intersection)

--gene-names

Group counts by gene names instead of gene IDs when generating the loom or h5ad file

-N NUMREADS

Maximum number of reads to process from supplied input

--report

Generate an HTML report containing run statistics and basic plots. Using this option may cause kb to use more memory than specified with the -m option, and may cause it to crash if memory runs out.

--long

Use lr-kallisto for long-read mapping

--threshold THRESH

Set threshold for lr-kallisto read mapping (default: 0.8)

--platform {PacBio, ONT}

Set platform for lr-kallisto (default: ONT)

--kallisto KALLISTO

Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)

--bustools BUSTOOLS

Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)

--opt-off

Disable performance optimizations
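The arithmetic behind the --sum option listed above can be illustrated on toy data. This sketch assumes that "processed" corresponds to the mature matrix and "unprocessed" to the nascent matrix. It uses small dense lists rather than the sparse matrices kb writes, and the function names are hypothetical.

```python
def add_matrices(*matrices):
    """Element-wise sum of equally shaped dense matrices (lists of rows)."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*matrices)]


def summed_matrix(kind, mature, nascent, ambiguous):
    """Combine nac matrices according to the --sum TYPE semantics."""
    if kind == "cell":
        return add_matrices(ambiguous, mature)    # ambiguous + processed
    if kind == "nucleus":
        return add_matrices(ambiguous, nascent)   # ambiguous + unprocessed
    if kind == "total":
        return add_matrices(mature, nascent, ambiguous)
    if kind == "none":
        return None
    raise ValueError(f"unknown sum type: {kind}")


# Toy 2 cells x 2 genes count matrices.
mature = [[1, 0], [2, 3]]
nascent = [[0, 1], [1, 0]]
ambiguous = [[1, 1], [0, 2]]

for kind in ("cell", "nucleus", "total"):
    print(kind, summed_matrix(kind, mature, nascent, ambiguous))
```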

optional arguments for BULK and SMARTSEQ2 technologies:

--fragment-l L

Mean length of fragments. (single-end only)

--fragment-s S

Standard deviation of fragment lengths. (single-end only)

--bootstraps B

Number of bootstraps to perform

--matrix-to-files

Reorganize matrix output into abundance tsv files

--matrix-to-directories

Reorganize matrix output into abundance tsv files across multiple directories

Output:

In the output directory specified by -o, the following files are made:

  • kb_info.json : A JSON file containing information about the kb run

  • kallisto bus Output Files: output.unfiltered.bus, transcripts.txt, matrix.ec, run_info.json

  • bustools Output Files: output.bus, inspect.json

  • counts_unfiltered folder with:
    • Single-Cell Count Matrices: in Matrix Market format, cells_x_genes.mtx (or cells_x_tcc.mtx if --tcc is specified). For --workflow nac: cells_x_genes.mature.mtx, cells_x_genes.nascent.mtx, cells_x_genes.ambiguous.mtx

    • Barcode and Gene/Transcript ID Files: cells_x_genes.barcodes.txt, cells_x_genes.genes.txt, cells_x_genes.genes.names.txt (or cells_x_tcc.barcodes.txt, cells_x_tcc.ec.txt if --tcc is specified)

    • If --h5ad is specified, an h5ad file (adata.h5ad) is created from the count matrix. If --loom is specified, a loom file (adata.loom) is created from the count matrix. The resulting anndata object will contain:

      • the full count matrix in adata.X. If --workflow nac is used, adata.X will contain the sum of the mature, nascent, and ambiguous count matrices. If --tcc is specified, adata.X will contain the TCC matrix.

      • cell barcodes in adata.obs. If --batch-barcodes is specified, artificial sample barcodes will be prepended to each cell barcode.

      • gene/transcript IDs as the adata.var index. If --gene-names is specified, gene names will be used in adata.var instead of gene IDs. If --tcc is specified, equivalence classes (composed of semicolon-delimited transcript IDs) will be used instead.

      • If --workflow nac is used, the mature, nascent, and ambiguous count matrices will be stored in adata.layers as mature, nascent, and ambiguous, respectively.

    • If --batch-barcodes is specified, a file with a 16 bp pseudobarcode for each cell (cells_x_genes.barcodes.prefix.txt) is created.

  • If --batch-barcodes is specified, two additional files are created:
    • `matrix.cells`: A file listing the sample IDs for each cell in the count matrix

    • `matrix.sample.barcodes`: A file listing the 16 bp pseudobarcodes for each sample

  • If -r is specified, the corrected count matrix, barcodes, and genes files are placed in the counts_unfiltered_modified folder.

  • If --report is specified, an HTML report (report.html) and Jupyter notebook (report.ipynb) are created containing run statistics and basic plots.

  • If --matrix-to-files is specified, transcript- and gene-level abundance files in TSV and H5 format are created in the quant_unfiltered folder.

  • If --matrix-to-directories is specified, transcript- and gene-level abundance files in TSV and H5 format are created in the quant_unfiltered folder. The directories inside quant_unfiltered have the form abundance_1, abundance_2, etc., corresponding to the samples in the order they were provided.
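The count matrix, barcode, and gene files in counts_unfiltered can be combined by position: rows follow the barcode file and columns follow the gene file. The following is a minimal standard-library sketch of reading a Matrix Market coordinate file and totalling counts per cell. The in-memory data, barcodes, and gene names are invented. In practice scipy.io.mmread or the --h5ad output is likely more convenient.

```python
import io


def read_mtx(handle):
    """Read a Matrix Market coordinate file.

    Returns (n_rows, n_cols, {(row, col): value}) with zero-based indices.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("expected a Matrix Market coordinate file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for _ in range(n_entries):
        row, col, value = handle.readline().split()
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    return n_rows, n_cols, entries


# Stand-ins for the matrix, barcode, and gene files (made-up content).
mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n%\n"
    "2 3 3\n1 1 4\n1 3 1\n2 2 7\n"
)
barcodes = ["AAAC", "TTTG"]
genes = ["geneA", "geneB", "geneC"]

n_cells, n_genes, counts = read_mtx(mtx)
per_cell = {barcode: 0.0 for barcode in barcodes}
for (row, _col), value in counts.items():
    per_cell[barcodes[row]] += value
print(per_cell)
```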

kb extract

Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)

Usage:

kb extract [arguments] FASTQ

required arguments:

-i INDEX

Path to kallisto index

-ts, --targets TARGETS [TARGETS ...]

Gene or transcript names for which to extract the raw reads that align to the index

positional arguments:

  • FASTQ: Single FASTQ file containing the sequencing reads (e.g. in case of 10x data, provide the R2 file). Sequencing technology will be treated as bulk here since barcode and UMI tracking is not necessary to extract reads.

optional arguments:

-h, --help

Show a help message and exit

--tmp TMP

Override default temporary directory

--keep-tmp

Do not delete the tmp directory

--verbose

Print debugging information

-ttype, --target_type={gene, transcript}

Defines whether targets are gene or transcript names. (default: gene)

--extract_all

Extracts all reads that pseudo-aligned to any gene or transcript (as defined by target_type) (breaks down output by gene/transcript). Using extract_all might take a long time to run when there are a large number of genes/transcripts in the index.

--extract_all_fast

Extracts all reads that pseudo-aligned (does not break down output by gene/transcript; output saved in the all folder).

--extract_all_unmapped

Extracts all unmapped reads (output saved in the all_unmapped folder).

--mm

Also extract reads that multi-mapped to more than one gene.

-g T2G

Path to transcript-to-gene mapping file. Required when --extract_all is set, or when --target_type=gene is used without --mm, --extract_all_fast, or --extract_all_unmapped.

-o OUT

Path to output directory (default: current directory)

-t THREADS

Number of threads to use (default: 8)

-s, --strand {unstranded,forward,reverse}

Strandedness (default: unstranded)

--aa

Map to index generated from FASTA-file containing amino acid sequences

-N NUMREADS

Maximum number of reads to process from supplied FASTQ

--kallisto KALLISTO

Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)

--bustools BUSTOOLS

Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)

--opt-off

Disable performance optimizations

Output:

In the output directory specified by -o, FASTQ files are created for each target specified with -ts containing the reads that pseudoaligned to that target. If --extract_all or --extract_all_fast is specified, FASTQ files are created containing all reads that pseudoaligned to any target in the index. If --extract_all_unmapped is specified, FASTQ files are created containing all unmapped reads.
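The condition under which -g is needed for kb extract is easy to misread, so here is one reading of it expressed as code, together with a helper that assembles a command from the flags documented above. Both functions are hypothetical illustrations. The rule encoded is an interpretation of the option description rather than kb's source, and the file and gene names are placeholders.

```python
def t2g_required(mm=False, target_type="gene", extract_all=False,
                 extract_all_fast=False, extract_all_unmapped=False):
    """Assumed rule: -g is needed for --extract_all, or for gene targets when
    none of --mm, --extract_all_fast, --extract_all_unmapped is set."""
    if extract_all:
        return True
    return (target_type == "gene" and not mm
            and not extract_all_fast and not extract_all_unmapped)


def kb_extract_command(index, fastq, targets, t2g=None, mm=False,
                       target_type="gene", out_dir="."):
    """Assemble a `kb extract` command for an explicit list of targets."""
    if t2g is None and t2g_required(mm=mm, target_type=target_type):
        raise ValueError("-g T2G is needed for this combination of options")
    cmd = ["kb", "extract", "-i", index,
           f"--target_type={target_type}", "-ts", *targets]
    if t2g is not None:
        cmd += ["-g", t2g]
    if mm:
        cmd.append("--mm")
    # -o sits right before FASTQ so the multi-value -ts list is closed first.
    cmd += ["-o", out_dir, fastq]
    return cmd


cmd = kb_extract_command("index.idx", "reads_R2.fastq.gz",
                         ["GENE1", "GENE2"], t2g="t2g.txt",
                         out_dir="extracted")
print(" ".join(cmd))
```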

kb info

Display package and citation information

Usage:

kb info

kb compile

Compile kallisto and bustools binaries from source

Usage:

kb compile [arguments] [target]

positional arguments:

  • target: Which binaries to compile. May be one of kallisto, bustools or all.

optional arguments:

--tmp TMP

Override default temporary directory

--keep-tmp

Do not delete the tmp directory

--verbose

Print debugging information

--view

See information about the current binaries, which are what will be used for kb ref and kb count.

--remove

Remove the existing compiled binaries. Binaries that are provided with kb are never removed.

--overwrite

Overwrite the existing compiled binaries, if they exist.

-o OUT

Save the compiled binaries to a different directory. Note that if this option is specified, the binaries will have to be manually specified with --kallisto or --bustools when running kb ref or kb count.

--url URL

Use a custom URL to a ZIP or tarball file containing the source code of the specified binary. May only be used with a single target.

--ref REF

Repository commit hash or tag to fetch the source code from. May only be used with a single target.

--cmake-arguments URL

Additional arguments to pass to the cmake command. For example, to pass additional include directories, --cmake-arguments="-DCMAKE_CXX_FLAGS='-I /usr/include'"