Pseudoalignment of bulk RNA seq data

To view a comprehensive bulk RNA-seq analysis, be sure to check out the bulk RNA-seq tutorial.

Note

To align bulk RNA-seq data against a protein or amino acid reference, see: Translated Pseudoalignment.

Quantification

To quantify bulk RNA-seq reads, run kb count with the -x BULK option.

Let's say our reads, which are paired-end, came from a knockdown experiment and we have three control samples (C_1, C_2, C_3) and three knockdown samples (KD_1, KD_2, KD_3). Run the following, assuming your index is stored in human_index.idx and transcript-to-gene mapping file in human_t2g.txt:

kb count -x BULK -i human_index.idx -g human_t2g.txt \
--parity=paired --tcc --matrix-to-directories -o output_dir \
C_1_R1.fastq.gz C_1_R2.fastq.gz \
C_2_R1.fastq.gz C_2_R2.fastq.gz \
C_3_R1.fastq.gz C_3_R2.fastq.gz \
KD_1_R1.fastq.gz KD_1_R2.fastq.gz \
KD_2_R1.fastq.gz KD_2_R2.fastq.gz \
KD_3_R1.fastq.gz KD_3_R2.fastq.gz

Note

The order in which the input read files are supplied determines the sample identities after read quantification.

Other options that might be relevant are setting the --strand option to either forward or reverse if you have stranded RNA-seq reads, and setting --bootstraps if performing differential expression analysis.

The quantification output will be stored in output_dir/quant_unfiltered/, which contains the directories abundance_1, abundance_2, abundance_3, abundance_4, abundance_5, abundance_6 corresponding to the six samples (in the order they were provided). For example output_dir/quant_unfiltered/abundance_1/abundance.tsv will contain the transcript abundances for the C_1 control sample.

Differential transcript expression

See the bulk RNA-seq tutorial for a full example of using kallisto with edgeR for differential transcript expression.

Differential gene expression

For differential gene expression, use sleuth (which can also perform differential transcript expression) or import the kallisto results into tximport and then use a program such as DESeq2.

For tximport:

# Install tximport
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("tximport")

# Include tximport
require(tximport)

# Load transcript-to-gene mapping
t2g <- read.delim("human_t2g.txt", header = FALSE, sep = "\t")
t2g <- t2g[, 1:2]

# Prepare file paths and sample info
base_dir <- "output_dir/quant_unfiltered"
dirs <- list.dirs(base_dir, full.names = FALSE, recursive = FALSE)
abundance_dirs <- dirs[grepl("^abundance_\\d+$", dirs)]
files <- file.path(base_dir, abundance_dirs, "abundance.h5")
names(files) <- paste0("sample", 1:length(files))

# Run tximport
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)

# Summarize to gene-level
txi.kallisto.sum <- summarizeToGene(txi.kallisto, t2g)

Here is a sleuth tutorial (https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html); the first few steps are to create a sleuth object which we can do as follows:

 # Install sleuth and rdhf5
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rhdf5")
install.packages("devtools")
devtools::install_github('pachterlab/sleuth')

# Include sleuth
require(sleuth)

# Load transcript-to-gene mapping
t2g <- read.delim("human_t2g.txt", header = FALSE, sep = "\t")
t2g <- t2g[, 1:2]
colnames(t2g) <- c("target_id", "gene_id")

# Prepare file paths and sample info
base_dir <- "output_dir/quant_unfiltered"
dirs <- list.dirs(base_dir, full.names = FALSE, recursive = FALSE)
abundance_dirs <- dirs[grepl("^abundance_\\d+$", dirs)]
files <- file.path(base_dir, abundance_dirs)
samples <- paste0("sample", 1:length(files))
conditions <- c("control", "control", "control", "knockdown", "knockdown", "knockdown")
s2c <- data.frame(sample=samples, condition=conditions, path=files, stringsAsFactors=FALSE)

# Prepare a sleuth orject
so <- sleuth_prep(s2c, target_mapping = t2g)