kb-python
Running kb-python usually involves two steps:
- Indexing a FASTA file of target sequences via kb ref
- Mapping sequencing reads to the kallisto index using kb count
kb-python supports several different sequencing technologies. Run kb --list to view technology information.
kb ref
Build a kallisto index and transcript-to-gene mapping.
Usage:
kb ref -i index.idx -g t2g.txt -f1 cdna.fasta [arguments] genome.fasta genome.gtf
Required Arguments:
- -i INDEX
Path to the kallisto index to be constructed.
- -g T2G
Path to transcript-to-gene mapping to be generated
- -f1 FASTA
Path to the cDNA FASTA (standard, nac) or mismatch FASTA (kite) to be generated. Optional with -d, or with --aa when no GTF file(s) are provided. Not used with --workflow=custom.
Positional Arguments (required only if `-d` is not used):
fasta: Genomic FASTA file(s), comma-delimited
gtf: Reference GTF file(s), comma-delimited. Not required with --aa.
feature: Path to TSV containing barcodes and feature names. kite workflow only.
required arguments for `nac` workflow:
- -f2 FASTA
Path to the unprocessed transcripts FASTA to be generated
- -c1 T2C
Path to generate cDNA transcripts-to-capture
- -c2 T2C
Path to generate unprocessed transcripts-to-capture
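As a minimal sketch of how the required arguments fit together, the helper below assembles a kb ref command line as a list of strings, using only the flags documented above. The function name build_ref_command and all file names are hypothetical, only the standard and nac workflows are covered, and nothing is executed.

```python
import shlex


def build_ref_command(index, t2g, f1, fasta, gtf, workflow="standard",
                      f2=None, c1=None, c2=None):
    """Assemble (but do not run) a `kb ref` command as an argument list.

    Hypothetical helper for illustration; it is not part of kb-python.
    """
    if workflow not in ("standard", "nac"):
        raise ValueError("this sketch only covers the standard and nac workflows")
    cmd = ["kb", "ref", "-i", index, "-g", t2g, "-f1", f1]
    if workflow == "nac":
        # The nac workflow additionally requires -f2, -c1 and -c2.
        extra = (("-f2", f2), ("-c1", c1), ("-c2", c2))
        missing = [flag for flag, value in extra if value is None]
        if missing:
            raise ValueError("nac workflow requires: " + ", ".join(missing))
        for flag, value in extra:
            cmd += [flag, value]
        cmd += ["--workflow", "nac"]
    # Positional arguments: genomic FASTA first, then the GTF.
    cmd += [fasta, gtf]
    return cmd


if __name__ == "__main__":
    command = build_ref_command("index.idx", "t2g.txt", "cdna.fasta",
                                "genome.fasta", "genome.gtf")
    # Print a shell-ready version of the command for inspection.
    print(shlex.join(command))
```

The returned list can be passed to subprocess.run once the paths point at real files; keeping it as a list avoids shell-quoting problems with unusual file names.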
optional arguments:
- -h, --help
Show a help message and exit
- --tmp TMP
Override default temporary directory
- --keep-tmp
Keep temporary files
- --verbose
Print debugging information
- --include-attribute KEY:VALUE
Only process GTF entries that have the provided KEY:VALUE attribute. May be specified multiple times.
- --exclude-attribute KEY:VALUE
Only process GTF entries that do not have the provided KEY:VALUE attribute. May be specified multiple times.
- -k K
Use this option to override the k-mer length of the index (max value: 31). Usually, the k-mer length automatically calculated by kb provides the best results (typically k=31, which is also the default).
- -t THREADS
Number of threads to use (default: 8)
- --d-list FASTA
D-list file(s) (default: the Genomic FASTA file(s) for standard/nac workflow)
- --aa
Generate index from a FASTA-file containing amino acid sequences
- --workflow {standard,nac,kite,custom}
The type of index to create. Use nac for an index type that can quantify nascent and mature RNA. Use custom for indexing targets directly. Use kite for feature barcoding. (default: standard)
- -d NAME
Download a pre-built kallisto index (along with all necessary files) instead of building it locally
- --make-unique
Replace repeated target names with unique names
- --overwrite
Overwrite existing kallisto index
- --kallisto KALLISTO
Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)
- --bustools BUSTOOLS
Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)
- --opt-off
Disable performance optimizations
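To illustrate what the --include-attribute and --exclude-attribute options described above express, here is a rough sketch of KEY:VALUE filtering applied to the attribute column of a GTF line. It is not kb-python's implementation: the function names are hypothetical, and the way several filters combine (match at least one include pair and none of the exclude pairs) is an assumption made for this example.

```python
def parse_key_value(option):
    """Split a 'KEY:VALUE' string into a (key, value) tuple."""
    key, sep, value = option.partition(":")
    if not sep or not key:
        raise ValueError(f"expected KEY:VALUE, got {option!r}")
    return key, value


def parse_gtf_attributes(column):
    """Parse a GTF attribute column such as 'gene_id "G1"; gene_biotype "lncRNA";'."""
    attributes = {}
    for field in column.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(" ")
        # Simplification: a repeated key keeps only its last value.
        attributes[key] = value.strip().strip('"')
    return attributes


def keep_entry(attributes, include=(), exclude=()):
    """Decide whether an entry passes the filters.

    ASSUMPTION: an entry must match at least one include pair (when any
    are given) and none of the exclude pairs.
    """
    if include and not any(attributes.get(k) == v for k, v in include):
        return False
    return not any(attributes.get(k) == v for k, v in exclude)


if __name__ == "__main__":
    attrs = parse_gtf_attributes('gene_id "G1"; gene_biotype "lncRNA";')
    include = [parse_key_value("gene_biotype:lncRNA")]
    print(keep_entry(attrs, include=include))
```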
Output:
- index.idx: A binary kallisto index file (path specified by -i)
- t2g.txt: A two-column TSV mapping transcripts to genes (path specified by -g)
- f1: A FASTA file of cDNA sequences (standard, nac) or mismatch sequences (kite) (path specified by -f1)
- f2: A FASTA file of unprocessed transcript sequences (nac only) (path specified by -f2)
- c1: A two-column TSV mapping cDNA transcripts to capture sequences (standard, nac) (path specified by -c1)
- c2: A two-column TSV mapping unprocessed transcripts to capture sequences (nac only) (path specified by -c2)
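The transcript-to-gene mapping written by kb ref can be inspected with a few lines of standard-library Python. The sketch below assumes the file is tab-separated with the transcript ID in the first column and the gene ID in the second, as described above; any further columns are ignored. The function name read_t2g is hypothetical.

```python
import csv


def read_t2g(path):
    """Return {transcript_id: gene_id} from the first two tab-separated columns."""
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        t2g = read_t2g(sys.argv[1])
        # Report how many transcripts map to how many distinct genes.
        print(len(t2g), "transcripts,", len(set(t2g.values())), "genes")
```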
kb count
Generate count matrices from a set of single-cell FASTQ files
Usage:
kb count -i INDEX -g T2G -x TECHNOLOGY [arguments] [FASTQs]
required arguments:
- -i INDEX
Path to kallisto index
- -g T2G
Path to transcript-to-gene mapping
- -x TECHNOLOGY
Single-cell technology used (kb --list to view). If x=BULK, bulk RNA-seq quantification is performed instead.
positional arguments
- FASTQs:
FASTQ files. For paired-end data, list each R2 file immediately after its corresponding R1 file.
For technology SMARTSEQ, all input FASTQs are sorted alphabetically by path and paired in order, and cell IDs are assigned as incrementing integers starting from zero. A single batch TSV with cell ID, read 1, and read 2 as columns can be provided to override this behavior.
In place of listing FASTQ files, a batch TSV file with three columns (cell ID, read 1, read 2) can be provided to specify multiple samples (for single-end reads the batch file is formatted as (cell ID, read1)).
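A batch TSV of the shape described above can be generated with a short script. The sketch below mirrors the default pairing rule stated for SMARTSEQ (sort paths alphabetically, pair them in order, number cells from zero) and writes one row per cell. The function name write_batch_tsv and the example file names are hypothetical; real cell IDs can be any labels you prefer.

```python
import csv


def write_batch_tsv(fastqs, path, paired=True):
    """Write a batch TSV (cell ID, read 1[, read 2]) and return its rows."""
    ordered = sorted(fastqs)  # alphabetical by path
    step = 2 if paired else 1
    if paired and len(ordered) % 2:
        raise ValueError("paired-end input needs an even number of FASTQ files")
    rows = []
    for cell_id, start in enumerate(range(0, len(ordered), step)):
        # Cell IDs are incrementing integers starting from zero.
        rows.append([str(cell_id)] + ordered[start:start + step])
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows


if __name__ == "__main__":
    example = ["b_R2.fastq.gz", "a_R1.fastq.gz", "b_R1.fastq.gz", "a_R2.fastq.gz"]
    for row in write_batch_tsv(example, "batch.tsv"):
        print("\t".join(row))
```

Alphabetical pairing only works when the R1 and R2 files of a cell sort next to each other, so it is worth checking the printed rows before passing the file to kb count.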
required arguments for `nac` workflow:
- -c1 T2C
Path to mature transcripts-to-capture
- -c2 T2C
Path to nascent transcripts-to-capture
optional arguments
- -h, --help
Show a help message and exit
- --tmp TMP
Override default temporary directory
- --keep-tmp
Do not delete the tmp directory
- --verbose
Print debugging information
- -o OUT
Path to output directory (default: current directory)
- --num
Store read numbers in BUS file
- --parity {single, paired}
If both paired-end reads contain biological sequence, specify paired. Otherwise, specify single. (default: see kb --list)
- -w ONLIST
Path to file of on-listed barcodes to correct to. If not provided and bustools supports the technology, a pre-packaged on-list is used. Otherwise, the bustools allowlist command is used. Specify NONE to bypass barcode error correction. (kb --list to view on-lists)
- --exact-barcodes
Only exact matches are used for matching barcodes to on-list.
- -r REPLACEMENT
Path to file of a replacement list to correct to. In the file, the first column is the original barcode and second is the replacement sequence.
- -t THREADS
Number of threads to use (default: 8)
- --strand {unstranded,forward,reverse}
Strandedness (default: see kb --list)
- -m MEMORY
Maximum memory used (default: 2G for standard, 4G for others)
- --inleaved
Specifies that input is an interleaved FASTQ file
- --aa
Map to index generated from FASTA-file containing amino acid sequences
- --workflow {standard,nac,kite,kite:10xFB}
Type of workflow. Use nac to specify a nac index for producing mature/nascent/ambiguous matrices. Use kite for feature barcoding. Use kite:10xFB for 10x Genomics Feature Barcoding technology. (default: standard)
- --mm
Include reads that pseudoalign to multiple genes. Automatically enabled when generating a TCC matrix.
- --h5ad
Generate h5ad file from count matrix
- --tcc
Generate a TCC matrix instead of a gene count matrix.
- --filter {bustools}
Produce a filtered gene count matrix (default: bustools)
- --filter-threshold THRESH
Barcode filter threshold (default: auto)
- --overwrite
Overwrite existing output.bus file
- --dry-run
Do a dry run (no kallisto or bustools commands executed)
- --batch-barcodes
When a batch file is supplied, store sample identifiers in barcodes
- --loom
Generate loom file from count matrix
- --loom-names col_attrs/{name},row_attrs/{name}
Names for col_attrs and row_attrs in loom file (default: barcode, target_name). Use --loom-names=velocyto for velocyto-compatible loom files
- --sum TYPE
Produce summed count matrices (Options: none, cell, nucleus, total). Use cell to add ambiguous and processed transcript matrices. Use nucleus to add ambiguous and unprocessed transcript matrices. Use total to add all three matrices together. (Default: none)
- --cellranger
Convert count matrices to cellranger-compatible format
- --union
Take the union of all k-mer alignments (default: intersection)
- --gene-names
Group counts by gene names instead of gene IDs when generating the loom or h5ad file
- -N NUMREADS
Maximum number of reads to process from supplied input
- --report
Generate an HTML report containing run statistics and basic plots. Using this option may cause kb to use more memory than specified with the -m option. It may also cause kb to crash by running out of memory.
- --long
Use lr-kallisto for long-read mapping
- --threshold THRESH
Set threshold for lr-kallisto read mapping (default: 0.8)
- --platform {PacBio, ONT}
Set platform for lr-kallisto (default: ONT)
- --kallisto KALLISTO
Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)
- --bustools BUSTOOLS
Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)
- --opt-off
Disable performance optimizations
optional arguments for BULK and SMARTSEQ2 technologies:
- --fragment-l L
Mean length of fragments. (single-end only)
- --fragment-s S
Standard deviation of fragment lengths. (single-end only)
- --bootstraps B
Number of bootstraps to perform
- --matrix-to-files
Reorganize matrix output into abundance tsv files
- --matrix-to-directories
Reorganize matrix output into abundance tsv files across multiple directories
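The --sum options listed above amount to element-wise additions of the nac matrices. The sketch below spells that out on toy sparse matrices stored as {(row, column): count} dictionaries. It assumes that "processed" corresponds to the mature matrix and "unprocessed" to the nascent matrix; the function names are hypothetical and this is not how kb itself stores matrices.

```python
from collections import Counter


def add_matrices(*matrices):
    """Element-wise sum of sparse matrices given as {(row, col): count} dicts."""
    total = Counter()
    for matrix in matrices:
        total.update(matrix)  # Counter.update adds counts key by key
    return dict(total)


def summed_matrix(sum_type, mature, nascent, ambiguous):
    """Return the matrix implied by a --sum option, or None for 'none'."""
    if sum_type == "none":
        return None
    if sum_type == "cell":
        # ASSUMPTION: processed transcripts == mature matrix.
        return add_matrices(ambiguous, mature)
    if sum_type == "nucleus":
        # ASSUMPTION: unprocessed transcripts == nascent matrix.
        return add_matrices(ambiguous, nascent)
    if sum_type == "total":
        return add_matrices(mature, nascent, ambiguous)
    raise ValueError(f"unknown sum type: {sum_type!r}")


if __name__ == "__main__":
    mature = {(0, 0): 2}
    nascent = {(0, 0): 1, (1, 1): 3}
    ambiguous = {(0, 1): 4}
    for option in ("cell", "nucleus", "total"):
        print(option, summed_matrix(option, mature, nascent, ambiguous))
```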
Output:
In the output directory specified by -o, the following files are made:
- kb_info.json: A JSON file containing information about the kb run
- kallisto bus Output Files: output.unfiltered.bus, transcripts.txt, matrix.ec, run_info.json
- bustools Output Files: output.bus, inspect.json
- counts_unfiltered folder with:
  - Single-Cell Count Matrices: in Matrix Market format, cells_x_genes.mtx, or cells_x_tcc.mtx if --tcc. For --workflow nac, cells_x_genes.mature.mtx, cells_x_genes.nascent.mtx, cells_x_genes.ambiguous.mtx
  - Barcode and Gene/Transcript ID Files: cells_x_genes.barcodes.txt, cells_x_genes.genes.txt, cells_x_genes.genes.names.txt (or cells_x_tcc.barcodes.txt, cells_x_tcc.ec.txt if --tcc)
  - If --h5ad is specified, an h5ad file (adata.h5ad) is created from the count matrix. If --loom is specified, a loom file (adata.loom) is created from the count matrix. The resulting anndata object will contain:
    - the full count matrix in adata.X. If --workflow nac, adata.X will contain the sum of the mature, nascent, and ambiguous count matrices. If --tcc, adata.X will contain the TCC matrix.
    - cell barcodes in adata.obs. If --batch-barcodes, artificial sample barcodes will be prepended to each cell barcode.
    - gene/transcript IDs as the adata.var index. If --gene-names is specified, gene names will be used in adata.var instead of gene IDs. If --tcc, equivalence classes (composed of semicolon-delimited transcript IDs) will be used instead.
    - If --workflow nac is used, the mature, nascent, and ambiguous count matrices will be stored in adata.layers as mature, nascent, and ambiguous, respectively.
  - If --batch-barcodes is specified, a file with a 16 bp pseudobarcode for each cell (cells_x_genes.barcodes.prefix.txt) is created.
- If --batch-barcodes is specified, two additional files are created:
  - `matrix.cells`: A file listing the sample IDs for each cell in the count matrix
  - `matrix.sample.barcodes`: A file listing the 16 bp pseudobarcodes for each sample
- If -r is specified, the corrected count matrix, barcodes, and genes files are placed in the counts_unfiltered_modified folder.
- If --report is specified, an HTML report (report.html) and Jupyter notebook (report.ipynb) are created containing run statistics and basic plots.
- If --matrix-to-files, transcript- and gene-level abundance files in TSV and H5 format are created in the quant_unfiltered folder.
- If --matrix-to-directories, transcript- and gene-level abundance files in TSV and H5 format are created in the quant_unfiltered folder. The directories inside quant_unfiltered have the form abundance_1, abundance_2, etc., corresponding to the samples in the order they were provided.
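For a quick look at a count matrix without extra dependencies, the Matrix Market file and its barcode and gene lists can be read with the standard library alone. The sketch below assumes the coordinate (sparse) Matrix Market layout with 1-based indices, and that rows are cells and columns are genes; the function names are hypothetical and the three paths are supplied by the caller. For real analyses, the h5ad or loom outputs are usually more convenient.

```python
def read_mtx(path):
    """Read a coordinate Matrix Market file into ((rows, cols), {(r, c): value})."""
    shape, entries = None, {}
    with open(path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # header and comment lines start with '%'
            fields = line.split()
            if shape is None:
                # First data line: number of rows, columns and non-zero entries.
                shape = (int(fields[0]), int(fields[1]))
            else:
                # Indices in the file are 1-based; convert to 0-based.
                entries[(int(fields[0]) - 1, int(fields[1]) - 1)] = float(fields[2])
    return shape, entries


def read_names(path):
    """Read one identifier per line, skipping empty lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def load_counts(mtx_path, barcodes_path, genes_path):
    """Return {(barcode, gene): count}, assuming rows are cells and columns genes."""
    shape, entries = read_mtx(mtx_path)
    barcodes, genes = read_names(barcodes_path), read_names(genes_path)
    if shape != (len(barcodes), len(genes)):
        raise ValueError("matrix shape does not match the barcode/gene lists")
    return {(barcodes[r], genes[c]): v for (r, c), v in entries.items()}
```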
kb extract
Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)
Usage:
kb extract [arguments] FASTQ
required arguments:
- -i INDEX
Path to kallisto index
- -ts, --targets TARGETS [TARGETS ...]
Gene or transcript names for which to extract the raw reads that align to the index
positional arguments
FASTQ: Single FASTQ file containing the sequencing reads (e.g. in case of 10x data, provide the R2 file). Sequencing technology will be treated as bulk here since barcode and UMI tracking is not necessary to extract reads.
optional arguments
- -h, --help
Show a help message and exit
- --tmp TMP
Override default temporary directory
- --keep-tmp
Do not delete the tmp directory
- --verbose
Print debugging information
- -ttype, --target_type={gene, transcript}
Defines whether targets are gene or transcript names. (default: gene)
- --extract_all
Extracts all reads that pseudo-aligned to any gene or transcript (as defined by target_type) (breaks down output by gene/transcript). Using extract_all might take a long time to run when there are a large number of genes/transcripts in the index.
- --extract_all_fast
Extracts all reads that pseudo-aligned (does not break down output by gene/transcript; output saved in the all folder).
- --extract_all_unmapped
Extracts all unmapped reads (output saved in the all_unmapped folder).
- --mm
Also extract reads that multi-mapped to more than one gene.
- -g T2G
Path to transcript-to-gene mapping file (required when mm is false, --target_type=gene (and extract_all_fast and extract_all_unmapped are false), OR extract_all is true).
- -o OUT
Path to output directory (default: current directory)
- -t THREADS
Number of threads to use (default: 8)
- -s, --strand {unstranded,forward,reverse}
Strandedness (default: unstranded)
- --aa
Map to index generated from FASTA-file containing amino acid sequences
- -N NUMREADS
Maximum number of reads to process from supplied FASTQ
- --kallisto KALLISTO
Path to kallisto binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/kallisto/kallisto)
- --bustools BUSTOOLS
Path to bustools binary to use (default: /opt/anaconda3/lib/python3.13/site-packages/kb_python/bins/darwin/m1/bustools/bustools)
- --opt-off
Disable performance optimizations
Output:
In the output directory specified by -o, FASTQ files are created for each target specified with -ts containing the reads that pseudoaligned to that target. If --extract_all or --extract_all_fast is specified, FASTQ files are created containing all reads that pseudoaligned to any target in the index. If --extract_all_unmapped is specified, FASTQ files are created containing all unmapped reads.
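A simple way to sanity-check the extracted files is to count the reads in each one. The sketch below assumes standard four-line FASTQ records and handles both plain and gzip-compressed files; the function name count_fastq_reads is hypothetical, and the path to an output file has to be supplied by the caller since the exact file names are not specified here.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    import sys

    # Usage: python count_reads.py extracted_1.fastq.gz extracted_2.fastq.gz ...
    for fastq in sys.argv[1:]:
        print(fastq, count_fastq_reads(fastq))
```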
kb info
Display package and citation information
Usage:
kb info
kb compile
Compile kallisto and bustools binaries from source
Usage:
kb compile [arguments] [target]
positional arguments:
target: Which binaries to compile. May be one of kallisto, bustools or all.
optional arguments:
- --tmp TMP
Override default temporary directory
- --keep-tmp
Do not delete the tmp directory
- --verbose
Print debugging information
- --view
See information about the current binaries, which are what will be used for kb ref and kb count.
- --remove
Remove the existing compiled binaries. Binaries that are provided with kb are never removed.
- --overwrite
Overwrite the existing compiled binaries, if they exist.
- -o OUT
Save the compiled binaries to a different directory. Note that if this option is specified, the binaries will have to be manually specified with --kallisto or --bustools when running kb ref or kb count.
- --url URL
Use a custom URL to a ZIP or tarball file containing the source code of the specified binary. May only be used with a single target.
- --ref REF
Repository commit hash or tag to fetch the source code from. May only be used with a single target.
- --cmake-arguments URL
Additional arguments to pass to the cmake command. For example, to pass additional include directories, --cmake-arguments="-DCMAKE_CXX_FLAGS='-I /usr/include'"