Pseudoalignment of RNA-seq data against a protein reference

Note

Reference: Luebbert L, Sullivan DK, Carilli M, Eldjárn Hjörleifsson K, Viloria Winnett A, Chari T, Pachter L. Efficient and accurate detection of viral sequences at single-cell resolution reveals putative novel viruses perturbing host gene expression. bioRxiv 2023.12.11.571168 https://doi.org/10.1101/2023.12.11.571168

kallisto can perform translated pseudoalignment of nucleotide sequences against an amino acid reference while retaining single-cell resolution (for single-cell RNA sequencing data) or sample resolution (for bulk RNA sequencing data).

To perform translated alignment, simply add the --aa flag to the kb ref and kb count commands.

The workflow can be executed in three lines of code, and computational requirements do not exceed those of a standard laptop. Building on kallisto’s versatility, the workflow is compatible with all state-of-the-art single-cell and bulk RNA sequencing methods, including but not limited to 10x Genomics, Drop-Seq, SMART-Seq, SPLiT-Seq (including Parse Biosciences), and spatial methods such as Visium.

The translated alignment workflow can be used to align RNA sequencing data to any protein reference. We first described its use in combination with the PalmDB viral protein database for the detection of viral sequences in RNA sequencing data, and the steps below follow that example:

Install kb-python (optional: install gget to fetch the host genome and transcriptome):

pip install kb-python gget

Download optimized PalmDB viral protein reference files:

wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
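
Before building the index, it can be worth checking that both downloads completed and agree with each other. The sketch below is not part of the published workflow: the helper names are hypothetical, and it assumes that the t2g file is tab-separated with the sequence ID in its first column and that each FASTA header starts with that same ID.

```shell
# Hypothetical helpers for a quick sanity check of the reference files.

# Number of sequences in a FASTA file (header lines start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Number of t2g lines with fewer than two tab-separated fields (assumed layout).
count_short_t2g_lines() {
    awk -F'\t' 'NF < 2 { bad++ } END { print bad + 0 }' "$1"
}

# Number of FASTA IDs that do not appear in the first t2g column.
# Usage: count_ids_missing_from_t2g <fasta> <t2g>
count_ids_missing_from_t2g() {
    awk -F'\t' '
        NR == FNR { seen[$1] = 1; next }          # first file: the t2g mapping
        /^>/ {
            id = substr($0, 2)                    # drop the leading ">"
            sub(/[ \t].*/, "", id)                # keep the ID, drop any description
            if (!(id in seen)) missing++
        }
        END { print missing + 0 }
    ' "$2" "$1"
}

# Only runs when the downloaded files are present in the working directory.
if [ -f palmdb_rdrp_seqs.fa ] && [ -f palmdb_clustered_t2g.txt ]; then
    echo "FASTA records:        $(count_fasta_records palmdb_rdrp_seqs.fa)"
    echo "Short t2g lines:      $(count_short_t2g_lines palmdb_clustered_t2g.txt)"
    echo "IDs missing from t2g: $(count_ids_missing_from_t2g palmdb_rdrp_seqs.fa palmdb_clustered_t2g.txt)"
fi
```

If the last two numbers are not zero, the download may be truncated or the assumed file layout may not hold, so it is worth inspecting the files by hand before continuing.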

Create reference index (optional masking of the host, here human, genome using the D-list):

# With D-list (human genome): Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
    --aa \
    --d-list $(gget ref --ftp -w dna homo_sapiens) \
    -i index.idx \
    --workflow custom \
    palmdb_rdrp_seqs.fa

Align sequencing reads:

# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
    --aa \
    -i index.idx \
    -g palmdb_clustered_t2g.txt \
    --parity single \
    -x default \
    "${USER_DATA}.fastq.gz"
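
Once kb count finishes, the reference IDs that received reads can be listed from the count matrix. The following is a minimal sketch and not part of the original workflow: summarize_counts is a hypothetical helper, and the paths assume that kb count wrote its unfiltered matrix (counts_unfiltered/cells_x_genes.mtx with a matching cells_x_genes.genes.txt) into the current directory, with cells or samples as rows and reference IDs as columns.

```shell
# Hypothetical helper: total counts per reference ID, summed over all cells/samples.
# Usage: summarize_counts <genes.txt> <matrix.mtx>
summarize_counts() {
    awk '
        NR == FNR { id[FNR] = $1; next }     # first file: column index -> reference ID
        /^%/ { next }                        # skip Matrix Market comment lines
        !dims_seen { dims_seen = 1; next }   # skip the "rows cols entries" line
        { total[$2] += $3 }                  # entries are: row column value
        END {
            for (j in total)
                if (total[j] > 0) print id[j] "\t" total[j]
        }
    ' "$1" "$2" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Only runs when the assumed output files exist; prints the ten most abundant IDs.
if [ -f counts_unfiltered/cells_x_genes.mtx ]; then
    summarize_counts \
        counts_unfiltered/cells_x_genes.genes.txt \
        counts_unfiltered/cells_x_genes.mtx | head -n 10
fi
```

The output is a two-column, tab-separated list (ID, total count) sorted by decreasing count. Per-cell analysis would normally be done by loading the matrix into a dedicated single-cell toolkit rather than with awk.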