Open In Colab

Viral proteins in single-cell RNA seq data

Click “Runtime” -> “Run all” to run this notebook.

In this tutorial, we will align single-cell RNA sequencing data collected from Dengue virus (DENV)-infected CEM.NK\(^{R}\) cells as well as peripheral blood mononuclear cells (PBMC) data collected from a patient suffering from DENV (during and after infection) to viral RdRP protein sequences. Let’s see if we can detect DENV RdRP sequences as expected.

Workflow reference:
Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression. bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168

Written by: Laura Luebbert (last updated: January 7th, 2025)

[ ]:
# SRR numbers of the sequencing dataset(s) to analyze
srr_numbers = [
    "SRR11321526",
    "SRR11321528",
    "SRR11321027",
    "SRR11321026"
    ]

# Genus and species of the host (separate by "_")
host_species = "homo_sapiens"

# Set this to False to download the complete RNA seq dataset (note: this might exceed the Google Colab disk space)
# With subset_of_data=True, only a subset of the DENV dataset is aligned in this example notebook to decrease runtime and required disk space
subset_of_data = True

# Number of threads to use during alignments
threads = 2

# k-mer length: Increase k for increased specificity (at a potential trade-off with sensitivity)
# k should be an odd integer and should be >=31 for translated alignment
# Since k relates to nucleotides and not amino acids, k corresponds to matching ⌊k/3⌋ amino acids (for k=31, this corresponds to ⌊31/3⌋=10 amino acids)
k = 31

Install software

[ ]:
!pip install -q ffq gget kb_python==0.29.1 anndata==0.10.9
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.6/49.6 kB 2.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.5/36.5 MB 9.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.0/129.0 kB 8.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.1/43.1 MB 13.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.4/50.4 kB 2.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 35.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 12.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.9/50.9 MB 11.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 30.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 33.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 3.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.0/22.0 MB 33.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.8/88.8 kB 6.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.6/40.6 kB 2.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.6/83.6 kB 5.4 MB/s eta 0:00:00
  Building wheel for loompy (setup.py) ... done
  Building wheel for session-info (setup.py) ... done
[ ]:
import anndata
import gget
import glob
import gzip
import json
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import platform
from Bio import SeqIO

%config InlineBackend.figure_format='retina'
[ ]:
# This step will not be necessary after the next kb release
if k<32:
  # Install kallisto v0.50.0 from source
  # This version performs translated search faster while providing the same results as later versions, but it does not support k>31
  # (the features to increase runtime and support k>31 will be incorporated into the next version of kb, so this step will not be necessary after the next release)
  !git clone https://github.com/pachterlab/kallisto.git --branch v0.50.0
  !cd kallisto && mkdir build && cd build && cmake .. && make
  kallisto = "kallisto/build/src/kallisto"
else:
  # Find the kallisto binary installed automatically with kb (or simply remove the --kallisto arguments below)
  kallisto = glob.glob(f"/usr/local/lib/python*/dist-packages/kb_python/bins/{platform.system().lower()}/kallisto/kallisto_k64")[0]

Download raw sequencing data

Here, we will use ffq to download the sequencing data from DENV1 infected CEM.NK^R cells (SRR11321027) and the uninfected control (SRR11321026) as well as PBMC data collected from a patient experiencing a DENV infection at fever day -2 (SRR11321526) and 180 (baseline post-infection) (SRR11321528):

[ ]:
%%time

## Get the FTP download links for raw data using ffq and store the results in a json file
srr_numbers = " ".join(srr_numbers)
!ffq $srr_numbers \
    --ftp \
    -o ffq.json

## Load ffq output
f = open("ffq.json")
data_json = json.load(f)
f.close()

## Download raw data using FTP links fetched by ffq
# NOTE: To decrease the runtime and required disk space of this example notebook,
# we will only download the first [gb_to_download] GB of each fastq file when subset_of_data=True
gb_to_download = 3

# Convert GB to bytes
max_bytes = gb_to_download * 1073741824
# Download data
for dataset in data_json:
    url = dataset["url"]

    if subset_of_data:
      !curl -r 0-$max_bytes -O $url
    else:
      !curl -O $url

# Remove unnecessary file to save disk space
!rm SRR11321528.fastq.gz
[2025-01-10 15:00:38,840]    INFO Parsing run SRR11321526
[2025-01-10 15:00:40,240]    INFO Parsing run SRR11321528
[2025-01-10 15:00:41,497]    INFO Parsing run SRR11321027
[2025-01-10 15:00:42,746]    INFO Parsing run SRR11321026
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  44.8M      0  0:01:08  0:01:08 --:--:-- 22.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.8M      0  0:01:01  0:01:01 --:--:-- 51.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.1M      0  0:01:02  0:01:02 --:--:-- 51.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  50.4M      0  0:01:00  0:01:00 --:--:-- 51.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  40.0M      0  0:01:16  0:01:16 --:--:-- 49.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.3M      0  0:01:02  0:01:02 --:--:-- 50.8M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.6M      0  0:01:01  0:01:01 --:--:-- 51.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.7M      0  0:01:01  0:01:01 --:--:-- 50.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  50.3M      0  0:01:00  0:01:00 --:--:-- 51.7M
CPU times: user 4.67 s, sys: 768 ms, total: 5.44 s
Wall time: 9min 44s

Download the viral protein reference sequences

In this case, we are using a version of the PalmDB database v1 that was optimized for viral sequence detection with kallisto as described in this manuscript. These files are stored in the GitHub repository accompanying the manuscript.

[ ]:
# Download the virus ID to taxonomy mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
--2025-01-10 15:10:22--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19705497 (19M) [text/plain]
Saving to: ‘ID_to_taxonomy_mapping.csv’

ID_to_taxonomy_mapp 100%[===================>]  18.79M  --.-KB/s    in 0.06s

2025-01-10 15:10:23 (318 MB/s) - ‘ID_to_taxonomy_mapping.csv’ saved [19705497/19705497]

--2025-01-10 15:10:23--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4561689 (4.3M) [text/plain]
Saving to: ‘palmdb_clustered_t2g.txt’

palmdb_clustered_t2 100%[===================>]   4.35M  --.-KB/s    in 0.02s

2025-01-10 15:10:23 (216 MB/s) - ‘palmdb_clustered_t2g.txt’ saved [4561689/4561689]

--2025-01-10 15:10:23--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35361991 (34M) [text/plain]
Saving to: ‘palmdb_rdrp_seqs.fa’

palmdb_rdrp_seqs.fa 100%[===================>]  33.72M  --.-KB/s    in 0.1s

2025-01-10 15:10:24 (280 MB/s) - ‘palmdb_rdrp_seqs.fa’ saved [35361991/35361991]

Build a reference index from the viral protein sequences and mask host (here, human) sequences

In addition to building a reference index from the protein sequences, we also want to mask host (in this case, human) sequences. For a more conservative host masking workflow that incorporates an additional step for the removal of host sequences, seethis notebook.

The `--aa argument <https://kallisto.readthedocs.io/en/latest/translated/pseudoalignment.html>`__ tells kb that this is an amino acid reference.

The `--d-list argument <https://kallisto.readthedocs.io/en/latest/index/index_generation.html#the-d-list>`__ is the path to the host genome/transcriptome. These sequences will be masked in the index. Here, we are using `gget ref <https://pachterlab.github.io/gget/en/ref.html>`__ to fetch the latest human genome and transcriptome from Ensembl.

We are using --workflow custom here since we do not have a .gtf file for the PalmDB fasta file.

--kallisto provides the path to the kallisto binary v0.50.0 (if k ≤ 31), which we installed above. This argument can be omitted to use the kallisto binary included in kb.

Building the index can take some time, depending on the number of threads used and the size of the D-list (here, ~20 min), but it only needs to be done once.

[ ]:
# Download the host reference genome and transcriptome from Ensembl
!gget ref -w cdna,dna -d $host_species

# Concatenate host genome and transcriptome into one file
host_genome = glob.glob("*.dna*.fa.gz")[0]
host_transcriptome = glob.glob("*.cdna.all.fa.gz")[0]
host_genome_transciptome = f"{host_species}.cdna_dna.fa.gz"
!cat $host_genome $host_transcriptome > $host_genome_transciptome
15:10:30 - INFO - Fetching reference information for homo_sapiens from Ensembl release: 113.
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 113,
            "release_date": "2024-08-15",
            "release_time": "14:44",
            "bytes": "76M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 113,
            "release_date": "2024-08-15",
            "release_time": "20:18",
            "bytes": "841M"
        }
    }
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75.9M  100 75.9M    0     0  34.4M      0  0:00:02  0:00:02 --:--:-- 34.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  841M  100  841M    0     0  47.0M      0  0:00:17  0:00:17 --:--:-- 51.0M
[ ]:
%%time
# Generate the viral protein reference index with masked host sequences using kallisto
reference_index = f"palmdb_{host_species}_dlist_cdna_dna.idx"

!kb ref \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    --aa \
    -k $k \
    --workflow custom \
    --d-list $host_genome_transciptome \
    -i $reference_index \
    palmdb_rdrp_seqs.fa
[2025-01-10 15:11:01,539]   DEBUG [main] Printing verbose output
[2025-01-10 15:11:03,746]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 15:11:03,746]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 15:11:03,746]   DEBUG [main] Creating `tmp` directory
[2025-01-10 15:11:03,746]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='palmdb_homo_sapiens_dlist_cdna_dna.idx', g=None, f1=None, include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=31, t=2, d_list='homo_sapiens.cdna_dna.fa.gz', d_list_overhang=1, aa=True, workflow='custom', distinguish=False, make_unique=False, overwrite=False, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, fasta='palmdb_rdrp_seqs.fa', gtf=None, feature=None, no_mismatches=False, ec_max_size=None, flank=None)
[2025-01-10 15:11:03,747]    INFO [ref_custom] Indexing palmdb_rdrp_seqs.fa to palmdb_homo_sapiens_dlist_cdna_dna.idx
[2025-01-10 15:11:03,747]   DEBUG [ref_custom] kallisto index -i palmdb_homo_sapiens_dlist_cdna_dna.idx -k 31 -t 2 -d homo_sapiens.cdna_dna.fa.gz --aa palmdb_rdrp_seqs.fa
[2025-01-10 15:11:03,848]   DEBUG [ref_custom]
[2025-01-10 15:11:03,849]   DEBUG [ref_custom] [build] loading fasta file palmdb_rdrp_seqs.fa
[2025-01-10 15:11:03,849]   DEBUG [ref_custom] [build] k-mer length: 31
[2025-01-10 15:11:07,562]   DEBUG [ref_custom] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 15:11:07,563]   DEBUG [ref_custom] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 15:11:14,888]   DEBUG [ref_custom] KmerStream::KmerStream(): Finished
[2025-01-10 15:11:15,289]   DEBUG [ref_custom] CompactedDBG::build(): Estimated number of k-mers occurring at least once: 37641510
[2025-01-10 15:11:15,289]   DEBUG [ref_custom] CompactedDBG::build(): Estimated number of minimizer occurring at least once: 7877811
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Processed 87630084 k-mers in 296561 reads
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Found 37508762 unique k-mers
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Number of blocks in Bloom filter is 257317
[2025-01-10 15:11:29,048]   DEBUG [ref_custom] CompactedDBG::construct(): Extract approximate unitigs (1/2)
[2025-01-10 15:12:54,026]   DEBUG [ref_custom] CompactedDBG::construct(): Extract approximate unitigs (2/2)
[2025-01-10 15:13:17,004]   DEBUG [ref_custom] CompactedDBG::construct(): Closed all input files
[2025-01-10 15:13:17,005]   DEBUG [ref_custom]
[2025-01-10 15:13:17,005]   DEBUG [ref_custom] CompactedDBG::construct(): Splitting unitigs (1/2)
[2025-01-10 15:13:17,206]   DEBUG [ref_custom]
[2025-01-10 15:13:17,206]   DEBUG [ref_custom] CompactedDBG::construct(): Splitting unitigs (2/2)
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Before split: 2040523 unitigs
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): After split (1/1): 2040523 unitigs
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Unitigs split: 1534
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Unitigs deleted: 0
[2025-01-10 15:13:18,313]   DEBUG [ref_custom]
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Joining unitigs
[2025-01-10 15:13:27,757]   DEBUG [ref_custom] CompactedDBG::construct(): After join: 2021701 unitigs
[2025-01-10 15:13:27,757]   DEBUG [ref_custom] CompactedDBG::construct(): Joined 19205 unitigs
[2025-01-10 15:13:28,659]   DEBUG [ref_custom] [build] extracting distinguishing flanking k-mers from "homo_sapiens.cdna_dna.fa.gz"
[2025-01-10 15:43:17,553]   DEBUG [ref_custom] [build] identified 130 distinguishing flanking k-mers
[2025-01-10 15:43:26,580]   DEBUG [ref_custom] [build] building MPHF
[2025-01-10 15:43:35,998]   DEBUG [ref_custom] [build] creating equivalence classes ...
[2025-01-10 15:44:36,850]   DEBUG [ref_custom] [build] target de Bruijn graph has k-mer length 31 and minimizer length 23
[2025-01-10 15:44:36,850]   DEBUG [ref_custom] [build] target de Bruijn graph has 2021821 contigs and contains 37541835 k-mers
[2025-01-10 15:44:53,398]   DEBUG [ref_custom]
[2025-01-10 15:44:54,701]    INFO [ref_custom] Finished creating custom index
[2025-01-10 15:44:54,701]   DEBUG [main] Removing `tmp` directory
CPU times: user 12.3 s, sys: 1.98 s, total: 14.3 s
Wall time: 34min 1s
Alternatively, you can download a precomputed PalmDB reference index for use with kb in which the human (or mouse) genome and transcriptome were masked. You can find links to all available precomputed reference indices here.
Note: These indices were computed with k=31.
[ ]:
# # Download precomputed index with masked human genome and transcriptome
# k=31
# !wget https://data.caltech.edu/records/sh33z-hrx98/files/palmdb_human_dlist_cdna_dna.idx?download=1
# !mv palmdb_human_dlist_cdna_dna.idx?download=1 palmdb_human_dlist_cdna_dna.idx

Load the generated count matrix

[ ]:
# Open count matrix (AnnData object in h5ad format)
adata = anndata.read_h5ad(f"{out_folder}/counts_unfiltered/adata.h5ad")
adata
AnnData object with n_obs × n_vars = 15369 × 99228
[ ]:
# 'barcode' here includes batch/sample barcode (first 16 characters) + cell barcode
adata.obs
barcode
AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG
AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG
AAAAAAAAAAAAAAAAAAACCTGTCAATACCG
AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA
AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC
...
AAAAAAAAAAAAAAATTTTGCGCCAACGATGG
AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT
AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA
AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC
AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT

15369 rows × 0 columns

Use the sample barcodes to add a column with the original SRR file for each cell:

[ ]:
# Load the sample/batch barcodes
b_file = open("kb_output/matrix.sample.barcodes")
barcodes = b_file.read().splitlines()
b_file.close()

# Load the sample/batch names
s_file = open("kb_output/matrix.cells")
samples = s_file.read().splitlines()
s_file.close()

# Create dataframe that maps sample barcodes to sample names
bc2sample_df = pd.DataFrame()
bc2sample_df["sample_barcode"] = barcodes
bc2sample_df["srr"] = samples

# Add sample name to AnnData object
adata.obs["barcode"] = adata.obs.index.values
adata.obs["sample_barcode"] = [bc[:-16] for bc in adata.obs.index.values]
adata.obs["cell_barcode"] = [bc[-16:] for bc in adata.obs.index.values]
adata.obs = adata.obs.merge(bc2sample_df, on="sample_barcode", how="left").set_index("cell_barcode", drop=False)
adata.obs
barcode sample_barcode cell_barcode srr
cell_barcode
AAACCTGAGATGTCGG AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG AAAAAAAAAAAAAAAA AAACCTGAGATGTCGG SRR11321526
AAACCTGAGTGAATTG AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG AAAAAAAAAAAAAAAA AAACCTGAGTGAATTG SRR11321526
AAACCTGTCAATACCG AAAAAAAAAAAAAAAAAAACCTGTCAATACCG AAAAAAAAAAAAAAAA AAACCTGTCAATACCG SRR11321526
AAACCTGTCCACTCCA AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA AAAAAAAAAAAAAAAA AAACCTGTCCACTCCA SRR11321526
AAACCTGTCCATGCTC AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC AAAAAAAAAAAAAAAA AAACCTGTCCATGCTC SRR11321526
... ... ... ... ...
TTTGCGCCAACGATGG AAAAAAAAAAAAAAATTTTGCGCCAACGATGG AAAAAAAAAAAAAAAT TTTGCGCCAACGATGG SRR11321026
TTTGGTTAGGTAAACT AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT AAAAAAAAAAAAAAAT TTTGGTTAGGTAAACT SRR11321026
TTTGGTTTCTCTGAGA AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA AAAAAAAAAAAAAAAT TTTGGTTTCTCTGAGA SRR11321026
TTTGTCACAGCTCGAC AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC AAAAAAAAAAAAAAAT TTTGTCACAGCTCGAC SRR11321026
TTTGTCAGTACTTCTT AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT AAAAAAAAAAAAAAAT TTTGTCAGTACTTCTT SRR11321026

15369 rows × 4 columns

[ ]:
# Create dataframe with a label for each SRR library
samples = [
    'SRR11321526',
    'SRR11321528',
    'SRR11321027',
    'SRR11321026'
    ]

# Label corresponding to each SRR library (samples)
labels = [
    'Patient PBMC\n(fever day -2)',
    'Patient PBMC\n(post-infection control)',
    'DENV-infected CEM.NK$^R$',
    'Uninfected\nControl (CEM.NK$^R$)'
    ]

metadata_df = pd.DataFrame()
metadata_df["SRR"] = samples
metadata_df["label"] = labels
metadata_df
SRR label
0 SRR11321526 Patient PBMC\n(fever day -2)
1 SRR11321528 Patient PBMC\n(post-infection control)
2 SRR11321027 DENV-infected CEM.NK$^R$
3 SRR11321026 Uninfected\nControl (CEM.NK$^R$)

Plot counts for Dengue virus (DENV) in each sample

Find the virus IDs for Dengue virus in the PalmDB viral protein reference:

[ ]:
# Load the PalmDB virus ID to virus taxonomy mapping
u_tax_csv = "ID_to_taxonomy_mapping.csv"
tax_df = pd.read_csv(u_tax_csv)

# Find the reference IDs for Dengue virus RdRPs
# Note: Only the rep_ID (representative ID) will occur in the count matrix
tax_df[tax_df["species"].str.contains("Dengue virus")]
ID rep_ID phylum class order family genus species strandedness
7298 u1010 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
7299 u1014 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
7300 u1015 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
7301 u1025 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
7302 u10313 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
... ... ... ... ... ... ... ... ... ...
8360 u9796 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
8361 u9846 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
8362 u9911 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
8363 u9935 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
8364 u9989 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA

1067 rows × 9 columns

[ ]:
# Target virus IDs to plot (here, Dengue virus)
target_virus = "Dengue virus"
target_ids = tax_df[tax_df["species"].str.contains(target_virus)]["rep_ID"].unique()
target_ids
array(['u1010'], dtype=object)

Create bar plot showing raw counts for DENV:

[ ]:
fig, ax = plt.subplots(figsize=(5, 5))
fontsize = 14
width = 0.75

counts = []
for sample in metadata_df["SRR"].values:
    counts.append(adata.X[adata.obs["srr"].str.contains(sample), adata.var.index.isin(target_ids)].sum())

x = np.arange(len(metadata_df["SRR"].values))

ax.bar(x, counts, width=width, color="#003049", edgecolor="black")

# Adjust figure labels and axes
ax.set_title(f"Counts for Dengue virus observed in\nCENV-infected patient and cell line", fontsize=fontsize+2)
ax.set_yscale("symlog")
ax.set_ylabel(f"Total counts for {target_ids[0]} (Dengue virus))", fontsize=fontsize)
# ax.set_xlabel("Sample", fontsize=fontsize)
ax.set_xticks(x, metadata_df["label"].values, rotation=45, ha="right")
ax.tick_params(axis="both", labelsize=fontsize-2)
ax.grid(True, which="both", color="lightgray", ls="--", lw=1)
ax.set_axisbelow(True)

# Save figure
plt.savefig("denv_rdrp_count.png", dpi=300, bbox_inches="tight")

fig.show()
../../_images/translated_notebooks_virus_detection_sc_32_0.png
[ ]:
# NOTE: These counts are low since we only aligned a small subset of the data in this example notebook
counts
[6.0, 0.0, 219.0, 0.0]

Visualize other virus-like sequences observed in the dataset

When interpreting the presence of other virus- or RdRP-like sequences (each identified by its own virus ID), keep in mind that there will likely be many RdRP-like sequences introduced by contamination of laboratory reagents. A (non-comprehensive) list of virus IDs observed in blank sequencing data is available here and we will download it below for comparison to the virus IDs observed in our dataset. Another indication that a sequence may have originated from sample contamination is its uniform presence across all sequenced samples.

[ ]:
# Download list of virus IDs seen in blank sequencing reagents
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
blank_df = pd.read_csv("total_raw_count_per_virus_id_in_laboratory_reagents.csv")
blank_df
--2025-01-10 20:22:00--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54642 (53K) [text/plain]
Saving to: ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’

total_raw_count_per 100%[===================>]  53.36K  --.-KB/s    in 0.002s

2025-01-10 20:22:01 (33.5 MB/s) - ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’ saved [54642/54642]

count virus
0 5374210.0 u172514
1 4301784.0 u226460
2 752771.0 u237705
3 456424.0 u202260
4 359783.0 u223701
... ... ...
4573 1.0 u169611
4574 1.0 u41840
4575 1.0 u169761
4576 1.0 u169999
4577 1.0 u158491

4578 rows × 2 columns

Plot the top 20 virus IDs observed in this dataset (raw counts are normalized to the total number of viral reads in each sample, and virus IDs also observed in blank sequencing reagents are marked in red):

[ ]:
# Plot the top 20 virus IDs (counts will be normalized to total counts in each SRR library)
def plot_top_virs_stacked_bar(adata, top_n=20):
    # Convert sparse matrix to dense if needed
    counts_matrix = adata.X
    if not isinstance(counts_matrix, np.ndarray):
        counts_matrix = counts_matrix.toarray()

    # Create a DataFrame with counts and group by SRR library
    df_counts = pd.DataFrame(counts_matrix, columns=adata.var_names, index=adata.obs["srr"])
    grouped_counts = df_counts.groupby(level=0).sum()

    # Normalize counts by total read depth per SRR library
    grouped_counts_normalized = grouped_counts.div(grouped_counts.sum(axis=1), axis=0)

    # Sum normalized counts across all SRR libraries for each virus
    vir_totals = grouped_counts_normalized.sum(axis=0)

    # Get the top N viruses across all SRR libraries
    top_vir_indices = np.argsort(vir_totals)[-top_n:][::-1]
    top_virs = grouped_counts_normalized.columns[top_vir_indices]
    top_counts = grouped_counts_normalized[top_virs]

    # Create a stacked bar plot
    fig, ax = plt.subplots(figsize=(7, 5))
    colors = ['#e34a33', '#fdbb84', '#2b8cbe', '#a6bddb'] * (len(top_counts) // 6 + 1)
    fontsize = 14

    bar_bottom = np.zeros(len(top_virs))
    for i, (srr_id, row_counts) in enumerate(top_counts.iterrows()):
        ax.bar(
            top_virs,
            row_counts,
            bottom=bar_bottom,
            color=colors[i],
            label=metadata_df[metadata_df["SRR"] == srr_id]["label"].values[0].replace("\n", " ")
        )
        bar_bottom += row_counts

    # Adjust figure labels and axes
    ax.set_title(f"Top {top_n} Virus IDs", fontsize=fontsize+2)
    ax.set_xlabel("Virus ID", fontsize=fontsize)
    ax.set_ylabel("Fraction of Total Reads in Sample", fontsize=fontsize)
    ax.set_xticklabels(top_virs, rotation=45, ha='right')
    # Mark virus IDs that have been observed in blank sequencing reagents in red
    for i, label in enumerate(top_virs):
        if label in blank_df["virus"].values:
            ax.get_xticklabels()[i].set_color('red')
    ax.legend(title="Sample", loc='upper right', fontsize=fontsize-2, title_fontsize=fontsize)
    ax.tick_params(axis='both', which='major', labelsize=fontsize-2)
    ax.grid(axis='y', linestyle='--', color="lightgrey", alpha=1)
    ax.set_axisbelow(True)
    ax.margins(x=0.01)

    # Save figure
    fig.savefig("top_virus_IDs_per_sample.png", dpi=300, bbox_inches="tight")

    fig.show()

    return top_virs

top_virs = plot_top_virs_stacked_bar(adata, top_n=20)
../../_images/translated_notebooks_virus_detection_sc_37_0.png

Show the predicted taxonomies for each virus ID shown in the plot above (these are the species-like operational taxonomic units (OTUs) provided by PalmDB):

[ ]:
top_vir_tax = tax_df[tax_df["rep_ID"].isin(top_virs)][tax_df.columns[1:]].drop_duplicates()
top_vir_tax['rep_ID'] = pd.Categorical(top_vir_tax['rep_ID'], categories=top_virs, ordered=True)
top_vir_tax = top_vir_tax.sort_values('rep_ID')
top_vir_tax
rep_ID phylum class order family genus species strandedness
212430 u101227 Pisuviricota Pisoniviricetes Picornavirales Picornaviridae . . +ssRNA
34276 u27694 Peploviricota Herviviricetes Herpesvirales Herpesviridae Varicellovirus Bubaline alphaherpesvirus 1 dsDNA
263358 u202260 . . . . . . unknown
35782 u34159 . . . . . . unknown
66423 u100002 Lenarviricota Allassoviricetes Levivirales . . . +ssRNA
256038 u181379 . . . . . . unknown
39629 u45195 . . . . . . unknown
238847 u135858 . . . . . . unknown
38543 u41991 . . . . . . unknown
219921 u103829 Negarnaviricota Insthoviricetes Articulavirales Orthomyxoviridae . . -ssRNA
142323 u100041 Kitrinoviricota . . . . . +ssRNA
140792 u100031 Kitrinoviricota Alsuviricetes . . . . +ssRNA
95765 u100012 Lenarviricota Allassoviricetes . . . . +ssRNA
226917 u110641 Duplornaviricota Resentoviricetes Reovirales Reoviridae . . dsRNA
35684 u33987 . . . . . . unknown
36272 u35686 . . . . . . unknown
207936 u100644 Lenarviricota Amabiliviricetes Wolframvirales . . . +ssRNA
7298 u1010 Kitrinoviricota Flasuviricetes Amarillovirales Flaviviridae Flavivirus Dengue virus +ssRNA
144317 u100048 Lenarviricota Amabiliviricetes . . . . +ssRNA
34466 u30588 . . . . . . unknown

Extract reads that aligned to specific virus IDs for further analysis

To confirm the presence of a known virus for which a reference genome is available, you can align the data to the viral (nucleotide) reference genome using the standard kb workflow (simply omit the --aa flag in the kb ref and kb count commands and supply the viral reference genome to kb ref instead of the PalmDB protein reference).

If this fails or no reference genome is available, use kb extract to extract raw reads that aligned to a given virus ID for further analysis (such as BLAST alignment):

[ ]:
# Sequences that matched to [vir_ids_to_extract] will be extracted
# Here, we will extract reads that mapped to the virus ID u101227
# u101227 is the virus ID with the most counts across all samples so we might like to find out more about it
vir_ids_to_extract = ["u101227"]

# Provide a single fastq file to extract sequences from (here, we are using "Patient PBMC (post-infection control)")
fastq_file = "/content/SRR11321528_1_short.fastq"
[ ]:
%%time
# Extract sequences using kb extract
outfolder_extract = "extracted_seqs"
targets = " ".join(vir_ids_to_extract)

!kb extract \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    --aa \
    -k $k \
    -g palmdb_clustered_t2g.txt \
    -i $reference_index \
    -ts $targets \
    -o $outfolder_extract \
    $fastq_file
[2025-01-10 20:22:36,217]   DEBUG [main] Printing verbose output
[2025-01-10 20:22:38,426]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 20:22:38,427]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 20:22:38,427]   DEBUG [main] Creating `extracted_seqs/tmp` directory
[2025-01-10 20:22:38,428]   DEBUG [main] Namespace(list=False, command='extract', tmp=None, keep_tmp=False, verbose=True, fastq='/content/SRR11321528_1_short.fastq', i='palmdb_homo_sapiens_dlist_cdna_dna.idx', targets=['u101227'], target_type='gene', extract_all=False, extract_all_fast=False, extract_all_unmapped=False, mm=False, g='palmdb_clustered_t2g.txt', o='extracted_seqs', t=2, strand=None, aa=True, N=None, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=31)
[2025-01-10 20:22:41,916]    INFO [extract] Performing alignment using kallisto...
[2025-01-10 20:22:41,916]    INFO [extract] Using index palmdb_homo_sapiens_dlist_cdna_dna.idx to generate BUS file to extracted_seqs/tmp from
[2025-01-10 20:22:41,916]    INFO [extract]         /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:41,916]   DEBUG [extract] kallisto bus -i palmdb_homo_sapiens_dlist_cdna_dna.idx -o extracted_seqs/tmp -x bulk -t 2 --num --aa /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:42,018]   DEBUG [extract]
[2025-01-10 20:22:52,284]   DEBUG [extract] [index] k-mer length: 31
[2025-01-10 20:23:17,325]   DEBUG [extract] [index] number of targets: 296,561
[2025-01-10 20:23:17,426]   DEBUG [extract] [index] number of k-mers: 37,541,835
[2025-01-10 20:23:17,427]   DEBUG [extract] [index] number of distinguishing flanking k-mers: 130
[2025-01-10 20:23:17,427]   DEBUG [extract] [quant] running in single-end mode
[2025-01-10 20:23:17,427]   DEBUG [extract] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 20:29:38,975]   DEBUG [extract] [quant] finding pseudoalignments for all files ...
[2025-01-10 20:35:12,343]   DEBUG [extract] [progress] 1M reads processed (0.2% mapped)
[2025-01-10 20:41:23,853]   DEBUG [extract] [progress] 2M reads processed (0.2% mapped)
[2025-01-10 20:47:13,944]   DEBUG [extract] [progress] 3M reads processed (0.2% mapped)
[2025-01-10 20:53:03,094]   DEBUG [extract] [progress] 4M reads processed (0.2% mapped)
[2025-01-10 20:59:01,997]   DEBUG [extract] [progress] 5M reads processed (0.2% mapped)
[2025-01-10 21:05:06,825]   DEBUG [extract] [progress] 6M reads processed (0.2% mapped)
[2025-01-10 21:11:00,966]   DEBUG [extract] [progress] 7M reads processed (0.2% mapped)
[2025-01-10 21:16:30,838]   DEBUG [extract] [progress] 8M reads processed (0.2% mapped)
[2025-01-10 21:22:43,933]   DEBUG [extract] [progress] 9M reads processed (0.2% mapped)
[2025-01-10 21:28:20,131]   DEBUG [extract] [progress] 10M reads processed (0.2% mapped)
[2025-01-10 21:34:19,356]   DEBUG [extract] [progress] 11M reads processed (0.2% mapped)
[2025-01-10 21:40:08,500]   DEBUG [extract] [progress] 12M reads processed (0.2% mapped)
[2025-01-10 21:46:17,800]   DEBUG [extract] [progress] 13M reads processed (0.2% mapped)
[2025-01-10 21:51:49,182]   DEBUG [extract] [progress] 14M reads processed (0.2% mapped)
[2025-01-10 21:58:01,668]   DEBUG [extract] [progress] 15M reads processed (0.2% mapped)
[2025-01-10 22:03:50,793]   DEBUG [extract] [progress] 16M reads processed (0.2% mapped)
[2025-01-10 22:10:08,338]   DEBUG [extract] [progress] 17M reads processed (0.2% mapped)
[2025-01-10 22:15:51,016]   DEBUG [extract] [progress] 18M reads processed (0.2% mapped)
[2025-01-10 22:16:16,355]   DEBUG [extract] [progress] 19M reads processed (0.2% mapped)              done
[2025-01-10 22:16:16,355]   DEBUG [extract] [quant] processed 20,000,000 reads, 39,165 reads pseudoaligned
[2025-01-10 22:16:16,456]   DEBUG [extract]
[2025-01-10 22:16:19,313]    INFO [extract] Alignment complete. Beginning extraction of reads using bustools...
[2025-01-10 22:16:19,313]   DEBUG [extract] Replacing transcript entries with -1 for equivalence classes that map to multiple genes from extracted_seqs/tmp/matrix.ec
[2025-01-10 22:16:19,313]   DEBUG [extract] Fetching equivalence classes that map to multiple genes...
[2025-01-10 22:18:02,036]   DEBUG [extract] Found the following equivalence classes that map to multiple genes: 2 ,12 ,21 ,32 ,33 ,46 ,47 ,50 ,74 ,79 ,82 ,88 ,94 ,105 ,120 ,122 ,127 ,130 ,133 ,134 ,145 ,148 ,155 ,159 ,163 ,175 ,176 ,178 ,188 ,192 ,196 ,205 ,208 ,219 ,221 ,228 ,234 ,235 ,236 ,246 ,250 ,251 ,257 ,281 ,284 ,290 ,292 ,300 ,301 ,306 ,317 ,318 ,330 ,345 ,355 ,363 ,366 ,370 ,378 ,381 ,391 ,393 ,398 ,408 ,415 ,424 ,461 ,464 ,485 ,488 ,491 ,500 ,501 ,508 ,514 ,515 ,516 ,523 ,538 ,555 ,559 ,566 ,600 ,602 ,618 ,627 ,640 ,658 ,663 ,665 ,670 ,672 ,678 ,689 ,709 ,717 ,718 ,730 ,734 ,741 ,744 ,753 ,755 ,773 ,777 ,778 ,783 ,785 ,798 ,799 ,806 ,808 ,811 ,817 ,839 ,855 ,866 ,879 ,887 ,901 ,915 ,919 ,937 ,945 ,948 ,955 ,959 ,963 ,966 ,969 ,978 ,981 ,982 ,984 ,991 ,992
[2025-01-10 22:18:02,508]   DEBUG [extract] matrix.ec file where transcript entries were replaced with -1 for equivalence classes that map to multiple genes saved at extracted_seqs/tmp/matrix_no_mm.ec
[2025-01-10 22:18:03,075]    INFO [extract] Extracting reads for following transcripts for gene ID u101227: u101227, u102725, u102856, u102947, u10335, u103456, u103675, u104127, u10482, u105052, u10615, u106215, u10770, u110499, u111077, u112094, u112930, u11566, u11569, u11836, u11900, u12042, u12244, u123282, u126493, u129004, u129529, u131170, u132107, u132728, u13403, u13457, u134583, u134791, u135143, u135699, u13571, u136670, u137293, u137756, u139241, u139747, u140312, u140369, u14054, u140956, u141387, u143726, u144916, u14643, u148403, u149735, u15117, u152755, u152787, u152990, u153247, u15348, u16191, u163680, u164306, u164471, u164858, u164862, u16616, u16650, u16673, u167681, u168111, u168557, u169370, u169890, u170485, u170643, u170684, u172845, u173262, u173267, u173273, u174860, u17506, u175136, u17570, u176023, u176354, u176685, u176766, u177572, u180974, u18133, u181372, u182074, u182409, u182729, u18497, u18522, u18526, u18552, u187751, u18816, u188403, u189847, u19125, u192470, u192506, u192594, u19269, u192756, u192778, u19413, u19414, u195952, u195974, u196140, u196146, u19676, u19683, u197076, u197735, u198145, u198863, u19904, u199489, u200181, u20080, u20401, u20905, u211976, u21252, u21348, u215888, u215931, u21623, u216267, u216535, u217051, u217464, u217631, u21838, u218639, u21883, u219099, u219536, u220867, u226460, u22963, u23001, u230077, u231334, u232374, u235197, u235199, u237195, u237646, u237705, u237861, u237874, u238129, u238297, u240678, u24119, u242498, u242575, u243228, u24451, u249846, u250515, u250842, u25121, u25122, u251542, u252286, u25326, u253263, u25488, u25633, u25834, u25941, u260723, u260863, u26195, u262007, u26234, u266280, u266533, u266575, u26809, u26837, u26912, u269394, u269502, u27033, u27043, u270944, u271418, u271476, u271967, u273827, u274926, u275199, u276589, u278285, u278547, u278745, u278897, u278900, u278917, u285756, u285785, u285886, u286675, u287636, u294801, u295612, u33024, u33728, u33972, u34195, u37215, u37275, u37365, u37907, u38885, u41220, u41917, u42877, u43050, u43075, u45290, u45700, u45990, u45993, u46809, u47157, u47625, u47650, u47921, u50707, u5313, u55146, u5733, u59672, u61411, u64246, u66038, u67495, u67526, u70657, u7131, u72282, u74997, u7838, u79493, u79948, u80423, u80961, u82533, u83982, u84674, u86925, u88655, u89511, u89629, u89858, u92526, u9318, u93829, u93893, u94411, u94453, u94828, u9563, u9592, u9594, u9731, u97337, u97563, u98014, u9917, u99271, u9953
[2025-01-10 22:18:03,075]    INFO [extract] Capturing records from BUS file extracted_seqs/tmp/output.bus to extracted_seqs/tmp/output_extracted_u101227.bus with capture list extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt
[2025-01-10 22:18:03,075]   DEBUG [extract] bustools capture -o extracted_seqs/tmp/output_extracted_u101227.bus -c extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt -e extracted_seqs/tmp/matrix_no_mm.ec -t extracted_seqs/tmp/transcripts.txt --transcripts extracted_seqs/tmp/output.bus
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing transcripts .. done
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing ECs .. done
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing capture list .. done
[2025-01-10 22:18:04,682]   DEBUG [extract] Read in 39165 BUS records, wrote 741 BUS records
[2025-01-10 22:18:05,797]   DEBUG [extract] extracted_seqs/tmp/output_extracted_u101227.bus passed validation
[2025-01-10 22:18:05,797]    INFO [extract] Sorting BUS file extracted_seqs/tmp/output_extracted_u101227.bus to extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:18:05,798]   DEBUG [extract] bustools sort -o extracted_seqs/tmp/output_extracted_u101227_sorted.bus -T tmp -t 8 -m 2G --flags extracted_seqs/tmp/output_extracted_u101227.bus
[2025-01-10 22:18:05,899]   DEBUG [extract] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2
[2025-01-10 22:18:08,407]   DEBUG [extract] all fits in buffer
[2025-01-10 22:18:09,809]   DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:18:09,809]   DEBUG [extract] reading time 5.5e-05s
[2025-01-10 22:18:09,809]   DEBUG [extract] sorting time 7.1e-05s
[2025-01-10 22:18:09,809]   DEBUG [extract] writing time 0.000146s
[2025-01-10 22:18:09,810]    INFO [extract] Extracting BUS file extracted_seqs/tmp/output_extracted_u101227_sorted.bus to extracted_seqs/u101227
[2025-01-10 22:18:09,810]   DEBUG [extract] bustools extract -o extracted_seqs/u101227 -f /content/SRR11321528_1_short.fastq -N 1 extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:19:34,572]   DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:19:34,695]   DEBUG [main] Removing `extracted_seqs/tmp` directory
CPU times: user 30.3 s, sys: 4.07 s, total: 34.4 s
Wall time: 1h 57min 29s
[ ]:
# Grab the reads extracted by kb extract for each virus ID
def get_all_sequences(fastq_file):
    """Extract all sequences from a gzipped FASTQ file."""
    sequences = []
    with gzip.open(fastq_file, "rt") as f:
        for record in SeqIO.parse(f, "fastq"):
            sequences.append(str(record.seq))
    return sequences

extracted_seqs_dict = {}
for vir_id in vir_ids_to_extract:
    # Grab the reads from the fastq file returned by kb extract
    seqs = get_all_sequences(f"{outfolder_extract}/{vir_id}/1.fastq.gz")
    # Save the extracted reads for each virus ID in a dictionary
    extracted_seqs_dict[vir_id] = seqs

BLAST extracted reads

BLAST the first read extracted for virus ID u101227 using `gget blast <https://pachterlab.github.io/gget/en/blast.html>`__ to see which known reference sequences it aligns to:

[ ]:
# BLAST the first read extracted for u101227 using gget
seq_idx = 0
virus_id = "u101227"

raw_read = extracted_seqs_dict[virus_id][seq_idx]

blast_df = gget.blast(raw_read)
blast_df
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3781DRZ013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]
Description Scientific Name Common Name Taxid Max Score Total Score Query Cover E value Per. Ident Acc. Len Accession
0 Severe acute respiratory syndrome coronavirus ... Severe acute respiratory syndrome coronavirus 2 NaN 2697049 84.2 84.2 40% 4.000000e-12 91.67% 29567 OY708460.1
1 Gossypium raimondii isolate Gr291 retrotranspo... Gossypium raimondii Peruvian cotton 29730 71.3 71.3 45% 3.000000e-08 85.29% 1708 KT895861.1
2 Severe acute respiratory syndrome coronavirus ... Severe acute respiratory syndrome coronavirus 2 NaN 2697049 58.4 58.4 23% 3.000000e-04 97.06% 29816 OY714418.1
3 Synthetic construct LB106 DNA, synthetic long-... synthetic construct NaN 32630 54.7 54.7 19% 3.000000e-03 100.00% 572 LC597715.1
4 Gastrodia elata chromosome GE13 mitochondrion,... Gastrodia elata NaN 91201 52.8 52.8 19% 1.200000e-02 100.00% 34420 OP441103.1

BLAST a few more of the extracted sequences to see if the BLAST results reproduce:

[ ]:
# Set the seed for reproducibility
np.random.seed(0)

# Randomly select indices for three of the extracted sequences
seq_idxs = np.random.randint(0, 200, size=3)
seq_idxs
array([172,  47, 117])
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[0]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E24YMB013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]
Description Scientific Name Common Name Taxid Max Score Total Score Query Cover E value Per. Ident Acc. Len Accession
0 Severe acute respiratory syndrome coronavirus ... Severe acute respiratory syndrome coronavirus 2 NaN 2697049 58.4 58.4 23% 0.0003 97.06% 29829 OY695180.1
1 Gastrodia elata chromosome GE13 mitochondrion,... Gastrodia elata NaN 91201 52.8 52.8 19% 0.0120 100.00% 34420 OP441103.1
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[1]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E34MM6013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]
Description Scientific Name Common Name Taxid Max Score Total Score Query Cover E value Per. Ident Acc. Len Accession
0 Homo sapiens ribosomal protein L39 (RPL39), mRNA Homo sapiens human 9606 141 141 50% 3.000000e-29 100.00% 390 NM_001000.4
1 PREDICTED: Pan troglodytes ribosomal protein L... Pan troglodytes chimpanzee 9598 141 141 50% 3.000000e-29 100.00% 438 XM_063804925.1
2 PREDICTED: Pan paniscus ribosomal protein L39 ... Pan paniscus pygmy chimpanzee 9597 141 141 50% 3.000000e-29 100.00% 500 XM_014343539.5
3 Homo sapiens ribosomal protein L39, mRNA (cDNA... Homo sapiens human 9606 141 141 50% 3.000000e-29 100.00% 419 BC070205.1
4 PREDICTED: Gorilla gorilla gorilla ribosomal p... Gorilla gorilla gorilla western lowland gorilla 9595 141 141 50% 3.000000e-29 100.00% 484 XM_019019406.3
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[2]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E44SY2016. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]
Description Scientific Name Common Name Taxid Max Score Total Score Query Cover E value Per. Ident Acc. Len Accession
0 Severe acute respiratory syndrome coronavirus ... Severe acute respiratory syndrome coronavirus 2 NaN 2697049 78.7 78.7 40% 2.000000e-10 90.00% 29567 OY708460.1
1 Gossypium raimondii isolate Gr291 retrotranspo... Gossypium raimondii Peruvian cotton 29730 71.3 71.3 45% 3.000000e-08 85.29% 1708 KT895861.1
2 Severe acute respiratory syndrome coronavirus ... Severe acute respiratory syndrome coronavirus 2 NaN 2697049 63.9 63.9 25% 6.000000e-06 97.30% 29816 OY714418.1
3 Synthetic construct LB106 DNA, synthetic long-... synthetic construct NaN 32630 54.7 54.7 19% 3.000000e-03 100.00% 572 LC597715.1
4 Gastrodia elata chromosome GE13 mitochondrion,... Gastrodia elata NaN 91201 52.8 52.8 19% 1.200000e-02 100.00% 34420 OP441103.1

Align data to virus reference genome

Some of the reads that aligned to u101227 BLASTed to SARS-CoV2 sequences. Given that u101227 was observed uniformly in all samples, which included two samples from patient PBMCs and two from a cell line, it is possible that these viral sequences were introduced by sample contamination. It is also possible that the BLAST hits reflect adapter sequence or contamination within the SARS-CoV-2 entries in the BLAST database itself, rather than true biological signal in the samples. The predicted taxonomy for u101227 in PalmDB is Picornaviridae, which includes disease-causing viruses like Enteroviruses. If this virus does not have a reference genome available, the BLAST results will not be very reliable.

We can align the dataset to one of the SARS-CoV2 reference genomes by rerunnning the kb ref and kb count commands from above, but providing the virus reference genome and omitting the --aa flag since this is now a nucleotide reference (instead of a protein reference):

[ ]:
# Download SARS-CoV2 reference genome from NCBI
# Finding the correct reference genome for the virus present in this dataset might take several tries and I would recommend alingning to all available reference genomes for a given virus
# We will perform an alignment to a single SARS-CoV2 reference genome here as an exampe
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz
--2025-01-10 22:21:53--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.31, 130.14.250.10, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9591 (9.4K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’

GCF_009858895.2_ASM 100%[===================>]   9.37K  --.-KB/s    in 0s

2025-01-10 22:21:53 (109 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’ saved [9591/9591]

--2025-01-10 22:21:53--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.13, 130.14.250.7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1218 (1.2K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’

GCF_009858895.2_ASM 100%[===================>]   1.19K  --.-KB/s    in 0s

2025-01-10 22:21:53 (75.3 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’ saved [1218/1218]

[ ]:
# We will use a slightly lower k for this alignment since the viral genome is smaller and more prone to mutations
# Set k=31 here for more specificity (at a trade-off with sensitivity)
k = 21
[ ]:
# Generate a SARS-CoV2 reference index
sars_cov2_genome = "GCF_009858895.2_ASM985889v3_genomic.fna.gz"
sars_cov2_gtf = "GCF_009858895.2_ASM985889v3_genomic.gtf.gz"

sars_reference_index = "sars_cov2_ref.idx"
sars_t2g = "sars_cov2_t2g.txt"
sars_fasta = "sars_cov2_transcripts.fa"

!kb ref \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    -k $k \
    --d-list $host_genome_transciptome \
    -i $sars_reference_index \
    -g $sars_t2g \
    -f1 $sars_fasta \
    $sars_cov2_genome $sars_cov2_gtf

# Align data to the SARS-CoV2 reference index generated above
sars_out_folder = "kb_output_sars_cov2"

!kb count \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    -k $k \
    -x 10xv2 \
    --parity single \
    -i $sars_reference_index \
    -g $sars_t2g \
    --h5ad \
    -o $sars_out_folder \
    --batch-barcodes \
    batch.txt
[2025-01-10 22:22:06,079]   DEBUG [main] Printing verbose output
[2025-01-10 22:22:08,288]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:22:08,289]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:22:08,289]   DEBUG [main] Creating `tmp` directory
[2025-01-10 22:22:08,289]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', f1='sars_cov2_transcripts.fa', include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=21, t=2, d_list='homo_sapiens.cdna_dna.fa.gz', d_list_overhang=1, aa=False, workflow='standard', distinguish=False, make_unique=False, overwrite=False, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, fasta='GCF_009858895.2_ASM985889v3_genomic.fna.gz', gtf='GCF_009858895.2_ASM985889v3_genomic.gtf.gz', feature=None, no_mismatches=False, ec_max_size=None, flank=None)
[2025-01-10 22:22:08,289]    INFO [ref] Preparing GCF_009858895.2_ASM985889v3_genomic.fna.gz, GCF_009858895.2_ASM985889v3_genomic.gtf.gz
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp01` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp01`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp02` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp02`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp03` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp03`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp04` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp04`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp05` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp05`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp06` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp06`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp07` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp07`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp08` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp08`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp09` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp09`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp10` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp10`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp11` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp11`.
[2025-01-10 22:22:08,293]    INFO [ref] Splitting genome GCF_009858895.2_ASM985889v3_genomic.fna.gz into cDNA at /content/tmp/tmppr1s84kg
[2025-01-10 22:22:08,296]    INFO [ref] Concatenating 1 cDNAs to sars_cov2_transcripts.fa
[2025-01-10 22:22:08,297]    INFO [ref] Creating transcript-to-gene mapping at sars_cov2_t2g.txt
[2025-01-10 22:22:08,298] WARNING [ref] Using provided k-mer length 21 instead of optimal length 31
[2025-01-10 22:22:08,298]    INFO [ref] Indexing sars_cov2_transcripts.fa to sars_cov2_ref.idx
[2025-01-10 22:22:08,298]   DEBUG [ref] kallisto index -i sars_cov2_ref.idx -k 21 -t 2 -d homo_sapiens.cdna_dna.fa.gz sars_cov2_transcripts.fa
[2025-01-10 22:22:08,399]   DEBUG [ref]
[2025-01-10 22:22:08,400]   DEBUG [ref] [build] loading fasta file sars_cov2_transcripts.fa
[2025-01-10 22:22:08,400]   DEBUG [ref] [build] k-mer length: 21
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Finished
[2025-01-10 22:22:08,901]   DEBUG [ref] CompactedDBG::build(): Estimated number of k-mers occurring at least once: 29244
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::build(): Estimated number of minimizer occurring at least once: 7101
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Processed 29044 k-mers in 11 reads
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Found 29020 unique k-mers
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Number of blocks in Bloom filter is 201
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (1/2)
[2025-01-10 22:22:09,003]   DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (2/2)
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Closed all input files
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (1/2)
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (2/2)
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Before split: 21 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): After split (1/1): 21 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Unitigs split: 0
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Unitigs deleted: 0
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Joining unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): After join: 12 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Joined 9 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] [build] extracting distinguishing flanking k-mers from "homo_sapiens.cdna_dna.fa.gz"
[2025-01-10 22:52:25,513]   DEBUG [ref] [build] identified 192 distinguishing flanking k-mers
[2025-01-10 22:52:25,514]   DEBUG [ref] [build] building MPHF
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] creating equivalence classes ...
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] target de Bruijn graph has k-mer length 21 and minimizer length 13
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] target de Bruijn graph has 385 contigs and contains 29232 k-mers
[2025-01-10 22:52:25,614]   DEBUG [ref]
[2025-01-10 22:52:26,716]   DEBUG [main] Removing `tmp` directory
[2025-01-10 22:52:53,155]   DEBUG [main] Printing verbose output
[2025-01-10 22:52:55,364]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:52:55,365]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:52:55,369]   DEBUG [main] Creating `kb_output_sars_cov2/tmp` directory
[2025-01-10 22:52:55,369]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', x='10xv2', o='kb_output_sars_cov2', num=False, w=None, r=None, t=2, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=True, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=21, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity='single', fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['batch.txt'])
[2025-01-10 22:53:00,312]    INFO [count] Using index sars_cov2_ref.idx to generate BUS file to kb_output_sars_cov2 from
[2025-01-10 22:53:00,312]    INFO [count]         /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,313]   DEBUG [count] kallisto bus -i sars_cov2_ref.idx -o kb_output_sars_cov2 -x 10xv2 -t 2 --batch-barcodes --batch /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,414]   DEBUG [count]
[2025-01-10 22:53:00,414]   DEBUG [count] [bus] will try running read files supplied in batch file
[2025-01-10 22:53:00,414]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2025-01-10 22:53:00,414]   DEBUG [count] [index] k-mer length: 21
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of targets: 11
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of k-mers: 29,232
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of distinguishing flanking k-mers: 192
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] running in single-end mode
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] will process file 1: /content/SRR11321526_1_short.fastq
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] will process file 2: /content/SRR11321526_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321528_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321027_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321027_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321026_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321026_2_short.fastq
[2025-01-10 22:54:13,017]   DEBUG [count] [quant] finding pseudoalignments for all files ...
[2025-01-10 22:55:26,842]   DEBUG [count] [progress] 1M reads processed (0.0% mapped)
[2025-01-10 22:56:38,616]   DEBUG [count] [progress] 2M reads processed (0.0% mapped)
[2025-01-10 22:57:51,633]   DEBUG [count] [progress] 3M reads processed (0.0% mapped)
[2025-01-10 22:59:04,247]   DEBUG [count] [progress] 4M reads processed (0.0% mapped)
[2025-01-10 23:00:18,133]   DEBUG [count] [progress] 5M reads processed (0.0% mapped)
[2025-01-10 23:01:29,261]   DEBUG [count] [progress] 6M reads processed (0.0% mapped)
[2025-01-10 23:02:44,006]   DEBUG [count] [progress] 7M reads processed (0.0% mapped)
[2025-01-10 23:03:58,135]   DEBUG [count] [progress] 8M reads processed (0.0% mapped)
[2025-01-10 23:05:12,056]   DEBUG [count] [progress] 9M reads processed (0.0% mapped)
[2025-01-10 23:06:27,251]   DEBUG [count] [progress] 10M reads processed (0.0% mapped)
[2025-01-10 23:07:41,214]   DEBUG [count] [progress] 11M reads processed (0.0% mapped)
[2025-01-10 23:08:55,498]   DEBUG [count] [progress] 12M reads processed (0.0% mapped)
[2025-01-10 23:10:09,482]   DEBUG [count] [progress] 13M reads processed (0.0% mapped)
[2025-01-10 23:11:23,321]   DEBUG [count] [progress] 14M reads processed (0.0% mapped)
[2025-01-10 23:12:38,384]   DEBUG [count] [progress] 15M reads processed (0.0% mapped)
[2025-01-10 23:13:53,246]   DEBUG [count] [progress] 16M reads processed (0.0% mapped)
[2025-01-10 23:15:06,130]   DEBUG [count] [progress] 17M reads processed (0.0% mapped)
[2025-01-10 23:16:20,647]   DEBUG [count] [progress] 18M reads processed (0.0% mapped)
[2025-01-10 23:17:34,493]   DEBUG [count] [progress] 19M reads processed (0.0% mapped)
[2025-01-10 23:18:41,643]   DEBUG [count] [progress] 20M reads processed (0.0% mapped)
[2025-01-10 23:19:57,201]   DEBUG [count] [progress] 21M reads processed (0.0% mapped)
[2025-01-10 23:21:12,646]   DEBUG [count] [progress] 22M reads processed (0.0% mapped)
[2025-01-10 23:22:26,495]   DEBUG [count] [progress] 23M reads processed (0.0% mapped)
[2025-01-10 23:23:41,101]   DEBUG [count] [progress] 24M reads processed (0.0% mapped)
[2025-01-10 23:25:00,423]   DEBUG [count] [progress] 25M reads processed (0.0% mapped)
[2025-01-10 23:26:15,749]   DEBUG [count] [progress] 26M reads processed (0.0% mapped)
[2025-01-10 23:27:30,788]   DEBUG [count] [progress] 27M reads processed (0.0% mapped)
[2025-01-10 23:28:44,837]   DEBUG [count] [progress] 28M reads processed (0.0% mapped)
[2025-01-10 23:30:00,421]   DEBUG [count] [progress] 29M reads processed (0.0% mapped)
[2025-01-10 23:31:15,614]   DEBUG [count] [progress] 30M reads processed (0.0% mapped)
[2025-01-10 23:32:29,335]   DEBUG [count] [progress] 31M reads processed (0.0% mapped)
[2025-01-10 23:33:42,670]   DEBUG [count] [progress] 32M reads processed (0.0% mapped)
[2025-01-10 23:34:55,149]   DEBUG [count] [progress] 33M reads processed (0.0% mapped)
[2025-01-10 23:36:08,691]   DEBUG [count] [progress] 34M reads processed (0.0% mapped)
[2025-01-10 23:37:21,768]   DEBUG [count] [progress] 35M reads processed (0.0% mapped)
[2025-01-10 23:38:34,436]   DEBUG [count] [progress] 36M reads processed (0.0% mapped)
[2025-01-10 23:39:46,948]   DEBUG [count] [progress] 37M reads processed (0.0% mapped)
[2025-01-10 23:40:59,180]   DEBUG [count] [progress] 38M reads processed (0.0% mapped)
[2025-01-10 23:42:13,862]   DEBUG [count] [progress] 39M reads processed (0.0% mapped)
[2025-01-10 23:43:28,554]   DEBUG [count] [progress] 40M reads processed (0.0% mapped)
[2025-01-10 23:44:42,803]   DEBUG [count] [progress] 41M reads processed (0.0% mapped)
[2025-01-10 23:46:00,436]   DEBUG [count] [progress] 42M reads processed (0.0% mapped)
[2025-01-10 23:47:17,115]   DEBUG [count] [progress] 43M reads processed (0.0% mapped)
[2025-01-10 23:48:33,170]   DEBUG [count] [progress] 44M reads processed (0.0% mapped)
[2025-01-10 23:49:51,295]   DEBUG [count] [progress] 45M reads processed (0.0% mapped)
[2025-01-10 23:51:09,533]   DEBUG [count] [progress] 46M reads processed (0.0% mapped)
[2025-01-10 23:52:27,425]   DEBUG [count] [progress] 47M reads processed (0.0% mapped)
[2025-01-10 23:53:40,746]   DEBUG [count] [progress] 48M reads processed (0.0% mapped)
[2025-01-10 23:53:40,846]   DEBUG [count] [progress] 50M reads processed (0.0% mapped)              done
[2025-01-10 23:53:40,847]   DEBUG [count] [quant] processed 50,000,000 reads, 14 reads pseudoaligned
[2025-01-10 23:53:41,948]   DEBUG [count]
[2025-01-10 23:53:41,964]    INFO [count] Sorting BUS file kb_output_sars_cov2/output.bus to kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:41,964]   DEBUG [count] bustools sort -o kb_output_sars_cov2/tmp/output.s.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/output.bus
[2025-01-10 23:53:44,131]   DEBUG [count] all fits in buffer
[2025-01-10 23:53:45,432]   DEBUG [count] Read in 14 BUS records
[2025-01-10 23:53:45,432]   DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:45,432]   DEBUG [count] sorting time 2e-06s
[2025-01-10 23:53:45,432]   DEBUG [count] writing time 5.8e-05s
[2025-01-10 23:53:45,433]    INFO [count] On-list not provided
[2025-01-10 23:53:45,433]    INFO [count] Copying pre-packaged 10XV2 on-list to kb_output_sars_cov2
[2025-01-10 23:53:45,607]    INFO [count] Inspecting BUS file kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:45,607]   DEBUG [count] bustools inspect -o kb_output_sars_cov2/inspect.json -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:47,311]    INFO [count] Correcting BUS records in kb_output_sars_cov2/tmp/output.s.bus to kb_output_sars_cov2/tmp/output.s.c.bus with on-list kb_output_sars_cov2/10x_version2_whitelist.txt
[2025-01-10 23:53:47,312]   DEBUG [count] bustools correct -o kb_output_sars_cov2/tmp/output.s.c.bus -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:48,415]   DEBUG [count] Found 737280 barcodes in the on-list
[2025-01-10 23:53:48,816]   DEBUG [count] Processed 14 BUS records
[2025-01-10 23:53:48,816]   DEBUG [count] In on-list = 13
[2025-01-10 23:53:48,816]   DEBUG [count] Corrected    = 0
[2025-01-10 23:53:48,816]   DEBUG [count] Uncorrected  = 1
[2025-01-10 23:53:49,918]    INFO [count] Sorting BUS file kb_output_sars_cov2/tmp/output.s.c.bus to kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:49,918]   DEBUG [count] bustools sort -o kb_output_sars_cov2/output.unfiltered.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/tmp/output.s.c.bus
[2025-01-10 23:53:52,025]   DEBUG [count] all fits in buffer
[2025-01-10 23:53:53,326]   DEBUG [count] Read in 13 BUS records
[2025-01-10 23:53:53,326]   DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:53,326]   DEBUG [count] sorting time 1e-06s
[2025-01-10 23:53:53,326]   DEBUG [count] writing time 5.4e-05s
[2025-01-10 23:53:53,330]    INFO [count] Generating count matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes from BUS file kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:53,330]   DEBUG [count] bustools count -o kb_output_sars_cov2/counts_unfiltered/cells_x_genes -g sars_cov2_t2g.txt -e kb_output_sars_cov2/matrix.ec -t kb_output_sars_cov2/transcripts.txt --genecounts --umi-gene kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:54,995]   DEBUG [count] kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx passed validation
[2025-01-10 23:53:54,996]    INFO [count] Writing gene names to file kb_output_sars_cov2/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-01-10 23:53:54,996] WARNING [count] 11 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-01-10 23:53:54,996]    INFO [count] Reading matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx
[2025-01-10 23:53:55,302]    INFO [count] Writing matrix to h5ad kb_output_sars_cov2/counts_unfiltered/adata.h5ad
[2025-01-10 23:53:55,622]   DEBUG [main] Removing `kb_output_sars_cov2/tmp` directory

Though some reads aligned, only 14 out of 50,000,000 reads mapped to this SARS-CoV2 genome, so this was likely not the correct viral reference genome. You could also confirm the presence of virus using the assembly of contigs from the raw sequencing data (for example, using a tool like SPAdes) and the subsequent alignment of the contigs to all viral reference genomes.

For novel viruses, the parallel analysis of viral presence and host gene expression can provide further insights, as was done here (investigate viral tropism) and here (correlate viral presence with altered host gene expression at single-cell resolution).

For inspiration regarding further analysis, each figure from our manuscript can be reproduced using these notebooks.