Viral proteins in single-cell RNA seq data

Click “Runtime” -> “Run all” to run this notebook.

In this tutorial, we will align single-cell RNA sequencing data collected from Dengue virus (DENV)-infected CEM.NK$^{R}$ cells as well as peripheral blood mononuclear cells (PBMC) data collected from a patient suffering from DENV (during and after infection) to viral RdRP protein sequences. Let’s see if we can detect DENV RdRP sequences as expected.

Workflow reference:

Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression. bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168

Data references:
RNA seq data:
https://www.nature.com/articles/s41598-020-65939-5
https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA613041&o=acc_s%3Aa
PalmDB viral protein reference database:
https://www.nature.com/articles/s41586-021-04332-2

Written by: Laura Luebbert (last updated: January 7th, 2025)

[ ]:

# SRR numbers of the sequencing dataset(s) to analyze
srr_numbers = [
    "SRR11321526",
    "SRR11321528",
    "SRR11321027",
    "SRR11321026"
    ]

# Genus and species of the host (separate by "_")
host_species = "homo_sapiens"

# Set this to False to download the complete RNA seq dataset (note: this might exceed the Google Colab disk space)
# With subset_of_data=True, only a subset of the DENV dataset is aligned in this example notebook to decrease runtime and required disk space
subset_of_data = True

# Number of threads to use during alignments
threads = 2

# k-mer length: Increase k for increased specificity (at a potential trade-off with sensitivity)
# k should be an odd integer and should be >=31 for translated alignment
# Since k relates to nucleotides and not amino acids, k corresponds to matching ⌊k/3⌋ amino acids (for k=31, this corresponds to ⌊31/3⌋=10 amino acids)
k = 31

Install software

[ ]:

!pip install -q ffq gget kb_python==0.29.1 anndata==0.10.9

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.6/49.6 kB 2.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.5/36.5 MB 9.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.0/129.0 kB 8.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.1/43.1 MB 13.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.4/50.4 kB 2.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 35.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 12.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.9/50.9 MB 11.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 30.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 33.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 3.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.0/22.0 MB 33.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.8/88.8 kB 6.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.6/40.6 kB 2.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.6/83.6 kB 5.4 MB/s eta 0:00:00
  Building wheel for loompy (setup.py) ... done
  Building wheel for session-info (setup.py) ... done

[ ]:

import anndata
import gget
import glob
import gzip
import json
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import platform
from Bio import SeqIO

%config InlineBackend.figure_format='retina'

[ ]:

# This step will not be necessary after the next kb release
if k<32:
  # Install kallisto v0.50.0 from source
  # This version performs translated search faster while providing the same results as later versions, but it does not support k>31
  # (the features to increase runtime and support k>31 will be incorporated into the next version of kb, so this step will not be necessary after the next release)
  !git clone https://github.com/pachterlab/kallisto.git --branch v0.50.0
  !cd kallisto && mkdir build && cd build && cmake .. && make
  kallisto = "kallisto/build/src/kallisto"
else:
  # Find the kallisto binary installed automatically with kb (or simply remove the --kallisto arguments below)
  kallisto = glob.glob(f"/usr/local/lib/python*/dist-packages/kb_python/bins/{platform.system().lower()}/kallisto/kallisto_k64")[0]

Download raw sequencing data

Here, we will use ffq to download the sequencing data from DENV1 infected CEM.NK^R cells (SRR11321027) and the uninfected control (SRR11321026) as well as PBMC data collected from a patient experiencing a DENV infection at fever day -2 (SRR11321526) and 180 (baseline post-infection) (SRR11321528):

[ ]:

%%time

## Get the FTP download links for raw data using ffq and store the results in a json file
srr_numbers = " ".join(srr_numbers)
!ffq $srr_numbers \
    --ftp \
    -o ffq.json

## Load ffq output
f = open("ffq.json")
data_json = json.load(f)
f.close()

## Download raw data using FTP links fetched by ffq
# NOTE: To decrease the runtime and required disk space of this example notebook,
# we will only download the first [gb_to_download] GB of each fastq file when subset_of_data=True
gb_to_download = 3

# Convert GB to bytes
max_bytes = gb_to_download * 1073741824
# Download data
for dataset in data_json:
    url = dataset["url"]

    if subset_of_data:
      !curl -r 0-$max_bytes -O $url
    else:
      !curl -O $url

# Remove unnecessary file to save disk space
!rm SRR11321528.fastq.gz

[2025-01-10 15:00:38,840]    INFO Parsing run SRR11321526
[2025-01-10 15:00:40,240]    INFO Parsing run SRR11321528
[2025-01-10 15:00:41,497]    INFO Parsing run SRR11321027
[2025-01-10 15:00:42,746]    INFO Parsing run SRR11321026
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  44.8M      0  0:01:08  0:01:08 --:--:-- 22.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.8M      0  0:01:01  0:01:01 --:--:-- 51.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.1M      0  0:01:02  0:01:02 --:--:-- 51.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  50.4M      0  0:01:00  0:01:00 --:--:-- 51.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  40.0M      0  0:01:16  0:01:16 --:--:-- 49.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.3M      0  0:01:02  0:01:02 --:--:-- 50.8M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.6M      0  0:01:01  0:01:01 --:--:-- 51.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  49.7M      0  0:01:01  0:01:01 --:--:-- 50.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3072M  100 3072M    0     0  50.3M      0  0:01:00  0:01:00 --:--:-- 51.7M
CPU times: user 4.67 s, sys: 768 ms, total: 5.44 s
Wall time: 9min 44s

Download the viral protein reference sequences

In this case, we are using a version of the PalmDB database v1 that was optimized for viral sequence detection with kallisto as described in this manuscript. These files are stored in the GitHub repository accompanying the manuscript.

[ ]:

# Download the virus ID to taxonomy mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

--2025-01-10 15:10:22--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19705497 (19M) [text/plain]
Saving to: ‘ID_to_taxonomy_mapping.csv’

ID_to_taxonomy_mapp 100%[===================>]  18.79M  --.-KB/s    in 0.06s

2025-01-10 15:10:23 (318 MB/s) - ‘ID_to_taxonomy_mapping.csv’ saved [19705497/19705497]

--2025-01-10 15:10:23--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4561689 (4.3M) [text/plain]
Saving to: ‘palmdb_clustered_t2g.txt’

palmdb_clustered_t2 100%[===================>]   4.35M  --.-KB/s    in 0.02s

2025-01-10 15:10:23 (216 MB/s) - ‘palmdb_clustered_t2g.txt’ saved [4561689/4561689]

--2025-01-10 15:10:23--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35361991 (34M) [text/plain]
Saving to: ‘palmdb_rdrp_seqs.fa’

palmdb_rdrp_seqs.fa 100%[===================>]  33.72M  --.-KB/s    in 0.1s

2025-01-10 15:10:24 (280 MB/s) - ‘palmdb_rdrp_seqs.fa’ saved [35361991/35361991]

Build a reference index from the viral protein sequences and mask host (here, human) sequences

In addition to building a reference index from the protein sequences, we also want to mask host (in this case, human) sequences. For a more conservative host masking workflow that incorporates an additional step for the removal of host sequences, seethis notebook.

The `--aa argument <https://kallisto.readthedocs.io/en/latest/translated/pseudoalignment.html>`__ tells kb that this is an amino acid reference.

The `--d-list argument <https://kallisto.readthedocs.io/en/latest/index/index_generation.html#the-d-list>`__ is the path to the host genome/transcriptome. These sequences will be masked in the index. Here, we are using `gget ref <https://pachterlab.github.io/gget/en/ref.html>`__ to fetch the latest human genome and transcriptome from Ensembl.

We are using --workflow custom here since we do not have a .gtf file for the PalmDB fasta file.

--kallisto provides the path to the kallisto binary v0.50.0 (if k ≤ 31), which we installed above. This argument can be omitted to use the kallisto binary included in kb.

Building the index can take some time, depending on the number of threads used and the size of the D-list (here, ~20 min), but it only needs to be done once.

[ ]:

# Download the host reference genome and transcriptome from Ensembl
!gget ref -w cdna,dna -d $host_species

# Concatenate host genome and transcriptome into one file
host_genome = glob.glob("*.dna*.fa.gz")[0]
host_transcriptome = glob.glob("*.cdna.all.fa.gz")[0]
host_genome_transciptome = f"{host_species}.cdna_dna.fa.gz"
!cat $host_genome $host_transcriptome > $host_genome_transciptome

15:10:30 - INFO - Fetching reference information for homo_sapiens from Ensembl release: 113.
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 113,
            "release_date": "2024-08-15",
            "release_time": "14:44",
            "bytes": "76M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 113,
            "release_date": "2024-08-15",
            "release_time": "20:18",
            "bytes": "841M"
        }
    }
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75.9M  100 75.9M    0     0  34.4M      0  0:00:02  0:00:02 --:--:-- 34.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  841M  100  841M    0     0  47.0M      0  0:00:17  0:00:17 --:--:-- 51.0M

[ ]:

%%time
# Generate the viral protein reference index with masked host sequences using kallisto
reference_index = f"palmdb_{host_species}_dlist_cdna_dna.idx"

!kb ref \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    --aa \
    -k $k \
    --workflow custom \
    --d-list $host_genome_transciptome \
    -i $reference_index \
    palmdb_rdrp_seqs.fa

[2025-01-10 15:11:01,539]   DEBUG [main] Printing verbose output
[2025-01-10 15:11:03,746]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 15:11:03,746]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 15:11:03,746]   DEBUG [main] Creating `tmp` directory
[2025-01-10 15:11:03,746]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='palmdb_homo_sapiens_dlist_cdna_dna.idx', g=None, f1=None, include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=31, t=2, d_list='homo_sapiens.cdna_dna.fa.gz', d_list_overhang=1, aa=True, workflow='custom', distinguish=False, make_unique=False, overwrite=False, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, fasta='palmdb_rdrp_seqs.fa', gtf=None, feature=None, no_mismatches=False, ec_max_size=None, flank=None)
[2025-01-10 15:11:03,747]    INFO [ref_custom] Indexing palmdb_rdrp_seqs.fa to palmdb_homo_sapiens_dlist_cdna_dna.idx
[2025-01-10 15:11:03,747]   DEBUG [ref_custom] kallisto index -i palmdb_homo_sapiens_dlist_cdna_dna.idx -k 31 -t 2 -d homo_sapiens.cdna_dna.fa.gz --aa palmdb_rdrp_seqs.fa
[2025-01-10 15:11:03,848]   DEBUG [ref_custom]
[2025-01-10 15:11:03,849]   DEBUG [ref_custom] [build] loading fasta file palmdb_rdrp_seqs.fa
[2025-01-10 15:11:03,849]   DEBUG [ref_custom] [build] k-mer length: 31
[2025-01-10 15:11:07,562]   DEBUG [ref_custom] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 15:11:07,563]   DEBUG [ref_custom] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 15:11:14,888]   DEBUG [ref_custom] KmerStream::KmerStream(): Finished
[2025-01-10 15:11:15,289]   DEBUG [ref_custom] CompactedDBG::build(): Estimated number of k-mers occurring at least once: 37641510
[2025-01-10 15:11:15,289]   DEBUG [ref_custom] CompactedDBG::build(): Estimated number of minimizer occurring at least once: 7877811
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Processed 87630084 k-mers in 296561 reads
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Found 37508762 unique k-mers
[2025-01-10 15:11:28,647]   DEBUG [ref_custom] CompactedDBG::filter(): Number of blocks in Bloom filter is 257317
[2025-01-10 15:11:29,048]   DEBUG [ref_custom] CompactedDBG::construct(): Extract approximate unitigs (1/2)
[2025-01-10 15:12:54,026]   DEBUG [ref_custom] CompactedDBG::construct(): Extract approximate unitigs (2/2)
[2025-01-10 15:13:17,004]   DEBUG [ref_custom] CompactedDBG::construct(): Closed all input files
[2025-01-10 15:13:17,005]   DEBUG [ref_custom]
[2025-01-10 15:13:17,005]   DEBUG [ref_custom] CompactedDBG::construct(): Splitting unitigs (1/2)
[2025-01-10 15:13:17,206]   DEBUG [ref_custom]
[2025-01-10 15:13:17,206]   DEBUG [ref_custom] CompactedDBG::construct(): Splitting unitigs (2/2)
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Before split: 2040523 unitigs
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): After split (1/1): 2040523 unitigs
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Unitigs split: 1534
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Unitigs deleted: 0
[2025-01-10 15:13:18,313]   DEBUG [ref_custom]
[2025-01-10 15:13:18,313]   DEBUG [ref_custom] CompactedDBG::construct(): Joining unitigs
[2025-01-10 15:13:27,757]   DEBUG [ref_custom] CompactedDBG::construct(): After join: 2021701 unitigs
[2025-01-10 15:13:27,757]   DEBUG [ref_custom] CompactedDBG::construct(): Joined 19205 unitigs
[2025-01-10 15:13:28,659]   DEBUG [ref_custom] [build] extracting distinguishing flanking k-mers from "homo_sapiens.cdna_dna.fa.gz"
[2025-01-10 15:43:17,553]   DEBUG [ref_custom] [build] identified 130 distinguishing flanking k-mers
[2025-01-10 15:43:26,580]   DEBUG [ref_custom] [build] building MPHF
[2025-01-10 15:43:35,998]   DEBUG [ref_custom] [build] creating equivalence classes ...
[2025-01-10 15:44:36,850]   DEBUG [ref_custom] [build] target de Bruijn graph has k-mer length 31 and minimizer length 23
[2025-01-10 15:44:36,850]   DEBUG [ref_custom] [build] target de Bruijn graph has 2021821 contigs and contains 37541835 k-mers
[2025-01-10 15:44:53,398]   DEBUG [ref_custom]
[2025-01-10 15:44:54,701]    INFO [ref_custom] Finished creating custom index
[2025-01-10 15:44:54,701]   DEBUG [main] Removing `tmp` directory
CPU times: user 12.3 s, sys: 1.98 s, total: 14.3 s
Wall time: 34min 1s

Alternatively, you can download a precomputed PalmDB reference index for use with kb in which the human (or mouse) genome and transcriptome were masked. You can find links to all available precomputed reference indices here.

Note: These indices were computed with k=31.

[ ]:

# # Download precomputed index with masked human genome and transcriptome
# k=31
# !wget https://data.caltech.edu/records/sh33z-hrx98/files/palmdb_human_dlist_cdna_dna.idx?download=1
# !mv palmdb_human_dlist_cdna_dna.idx?download=1 palmdb_human_dlist_cdna_dna.idx

Align data using kallisto translated search

Create a batch.txt file so we can run all fastq files simultaneously (to learn more about batch files, see Box 7 in the Protocols paper). To keep track of which cell barcodes belonged to which SRR file, we will activate the --batch-barcodes option in the alignment step below:

[ ]:

%%time

samples = srr_numbers.split(" ")

# NOTE: To decrease the runtime and disk space of this example notebook,
# we will only align the top [n_seq_to_keep[i]] sequences in each fastq file when subset_of_data=True
n_seq_to_keep = [
  20000000,
  20000000,
  5000000, # No need to keep as many reads for the cell line samples as viral read % is higher
  5000000
]

with open("batch.txt", "w") as batch_file:
  for i, sample in enumerate(samples):
    R1 = f"/content/{sample}_1.fastq.gz"
    R2 = f"/content/{sample}_2.fastq.gz"

    if subset_of_data:
      # Shorten fastq files (skip this step during a real analysis)
      n_rows_to_keep = n_seq_to_keep[i] * 4
      R1_short = R1.split(".fastq.gz")[0] + "_short.fastq"
      R2_short = R2.split(".fastq.gz")[0] + "_short.fastq"

      !zcat $R1 | head -$n_rows_to_keep > $R1_short
      !zcat $R2 | head -$n_rows_to_keep > $R2_short
      # Delete original files to save disk space
      !rm $R1 $R2

      # Write batch file in the following format:
      # sample_name \t R1_filepath \t R2_filepath
      batch_file.write(sample + "\t" + R1_short + "\t" + R2_short + "\n")

    else:
      batch_file.write(sample + "\t" + R1 + "\t" + R2 + "\n")

CPU times: user 3.07 s, sys: 458 ms, total: 3.53 s
Wall time: 7min 41s

Align FASTQs using kallisto translated search:

The --aa argument tells kb that this is an amino acid reference index.
The `-x argument <https://kallisto.readthedocs.io/en/latest/sc/technologies.html>`__ tells kb where to find the cell barcodes and UMIs in the data. Here, we set it to 10xv2 because this data was generated using the Chromium Single-Cell 5′ Reagent version 2 kit.
The --h5ad flag generates a .h5ad file from the count matrix.

NOTE for paired read data: kallisto translated search currently does not yet support the simultaneous alignment of paired reads, so parity should always be set to ‘single’ (--parity single) and the R1 and R2 files obtained using paired read sequencing should be aligned separately. The resulting matrices from the R1 and R2 reads should provide similar viral counts.

[ ]:

%%time
out_folder = "kb_output"

!kb count \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    --aa \
    -k $k \
    -x 10xv2 \
    --parity single \
    -i $reference_index \
    -g palmdb_clustered_t2g.txt \
    --h5ad \
    -o $out_folder \
    --batch-barcodes \
    batch.txt

[2025-01-10 15:52:50,983]   DEBUG [main] Printing verbose output
[2025-01-10 15:52:53,191]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 15:52:53,192]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 15:52:53,192]   DEBUG [main] Creating `kb_output/tmp` directory
[2025-01-10 15:52:53,194]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='palmdb_homo_sapiens_dlist_cdna_dna.idx', g='palmdb_clustered_t2g.txt', x='10xv2', o='kb_output', num=False, w=None, r=None, t=2, m='2G', strand=None, inleaved=False, genomebam=False, aa=True, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=True, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=31, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity='single', fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['batch.txt'])
[2025-01-10 15:52:57,069]    INFO [count] Using index palmdb_homo_sapiens_dlist_cdna_dna.idx to generate BUS file to kb_output from
[2025-01-10 15:52:57,069]    INFO [count]         /content/kb_output/tmp/tmpsb1qpyjx
[2025-01-10 15:52:57,069]   DEBUG [count] kallisto bus -i palmdb_homo_sapiens_dlist_cdna_dna.idx -o kb_output -x 10xv2 -t 2 --aa --batch-barcodes --batch /content/kb_output/tmp/tmpsb1qpyjx
[2025-01-10 15:52:57,171]   DEBUG [count]
[2025-01-10 15:52:57,171]   DEBUG [count] [bus] will try running read files supplied in batch file
[2025-01-10 15:52:57,171]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2025-01-10 15:53:07,012]   DEBUG [count] [index] k-mer length: 31
[2025-01-10 15:53:29,938]   DEBUG [count] [index] number of targets: 296,561
[2025-01-10 15:53:29,939]   DEBUG [count] [index] number of k-mers: 37,541,835
[2025-01-10 15:53:29,939]   DEBUG [count] [index] number of distinguishing flanking k-mers: 130
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] running in single-end mode
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 1: /content/SRR11321526_1_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 2: /content/SRR11321526_2_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 2: /content/SRR11321528_2_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 1: /content/SRR11321027_1_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 2: /content/SRR11321027_2_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 1: /content/SRR11321026_1_short.fastq
[2025-01-10 15:53:29,939]   DEBUG [count] [quant] will process file 2: /content/SRR11321026_2_short.fastq
[2025-01-10 15:59:40,532]   DEBUG [count] [quant] finding pseudoalignments for all files ...
[2025-01-10 16:05:40,481]   DEBUG [count] [progress] 1M reads processed (0.2% mapped)
[2025-01-10 16:11:19,599]   DEBUG [count] [progress] 2M reads processed (0.2% mapped)
[2025-01-10 16:16:56,456]   DEBUG [count] [progress] 3M reads processed (0.2% mapped)
[2025-01-10 16:22:43,587]   DEBUG [count] [progress] 4M reads processed (0.2% mapped)
[2025-01-10 16:28:31,712]   DEBUG [count] [progress] 5M reads processed (0.2% mapped)
[2025-01-10 16:34:26,790]   DEBUG [count] [progress] 6M reads processed (0.2% mapped)
[2025-01-10 16:40:07,766]   DEBUG [count] [progress] 7M reads processed (0.2% mapped)
[2025-01-10 16:45:46,763]   DEBUG [count] [progress] 8M reads processed (0.2% mapped)
[2025-01-10 16:51:19,055]   DEBUG [count] [progress] 9M reads processed (0.2% mapped)
[2025-01-10 16:56:52,839]   DEBUG [count] [progress] 10M reads processed (0.2% mapped)
[2025-01-10 17:02:29,792]   DEBUG [count] [progress] 11M reads processed (0.2% mapped)
[2025-01-10 17:08:07,981]   DEBUG [count] [progress] 12M reads processed (0.2% mapped)
[2025-01-10 17:13:39,961]   DEBUG [count] [progress] 13M reads processed (0.2% mapped)
[2025-01-10 17:19:20,291]   DEBUG [count] [progress] 14M reads processed (0.2% mapped)
[2025-01-10 17:24:58,571]   DEBUG [count] [progress] 15M reads processed (0.2% mapped)
[2025-01-10 17:30:22,827]   DEBUG [count] [progress] 16M reads processed (0.2% mapped)
[2025-01-10 17:35:13,251]   DEBUG [count] [progress] 17M reads processed (0.2% mapped)
[2025-01-10 17:40:20,934]   DEBUG [count] [progress] 18M reads processed (0.2% mapped)
[2025-01-10 17:45:16,769]   DEBUG [count] [progress] 19M reads processed (0.2% mapped)
[2025-01-10 17:50:11,728]   DEBUG [count] [progress] 20M reads processed (0.2% mapped)
[2025-01-10 17:55:20,603]   DEBUG [count] [progress] 21M reads processed (0.2% mapped)
[2025-01-10 18:00:11,547]   DEBUG [count] [progress] 22M reads processed (0.2% mapped)
[2025-01-10 18:05:09,311]   DEBUG [count] [progress] 23M reads processed (0.2% mapped)
[2025-01-10 18:10:18,138]   DEBUG [count] [progress] 24M reads processed (0.2% mapped)
[2025-01-10 18:15:09,731]   DEBUG [count] [progress] 25M reads processed (0.2% mapped)
[2025-01-10 18:20:10,316]   DEBUG [count] [progress] 26M reads processed (0.2% mapped)
[2025-01-10 18:25:08,681]   DEBUG [count] [progress] 27M reads processed (0.2% mapped)
[2025-01-10 18:30:06,926]   DEBUG [count] [progress] 28M reads processed (0.2% mapped)
[2025-01-10 18:35:02,398]   DEBUG [count] [progress] 29M reads processed (0.2% mapped)
[2025-01-10 18:40:06,895]   DEBUG [count] [progress] 30M reads processed (0.2% mapped)
[2025-01-10 18:45:03,551]   DEBUG [count] [progress] 31M reads processed (0.2% mapped)
[2025-01-10 18:50:46,827]   DEBUG [count] [progress] 32M reads processed (0.2% mapped)
[2025-01-10 18:56:18,283]   DEBUG [count] [progress] 33M reads processed (0.2% mapped)
[2025-01-10 19:01:44,666]   DEBUG [count] [progress] 34M reads processed (0.2% mapped)
[2025-01-10 19:07:33,105]   DEBUG [count] [progress] 35M reads processed (0.2% mapped)
[2025-01-10 19:13:10,611]   DEBUG [count] [progress] 36M reads processed (0.2% mapped)
[2025-01-10 19:18:39,088]   DEBUG [count] [progress] 37M reads processed (0.2% mapped)
[2025-01-10 19:24:19,279]   DEBUG [count] [progress] 38M reads processed (0.2% mapped)
[2025-01-10 19:29:47,357]   DEBUG [count] [progress] 39M reads processed (0.2% mapped)
[2025-01-10 19:35:35,178]   DEBUG [count] [progress] 40M reads processed (0.2% mapped)
[2025-01-10 19:41:12,484]   DEBUG [count] [progress] 41M reads processed (0.2% mapped)
[2025-01-10 19:47:02,956]   DEBUG [count] [progress] 42M reads processed (0.2% mapped)
[2025-01-10 19:52:26,931]   DEBUG [count] [progress] 43M reads processed (0.2% mapped)
[2025-01-10 19:58:33,900]   DEBUG [count] [progress] 44M reads processed (0.2% mapped)
[2025-01-10 20:04:37,109]   DEBUG [count] [progress] 45M reads processed (0.2% mapped)
[2025-01-10 20:10:22,079]   DEBUG [count] [progress] 46M reads processed (0.2% mapped)
[2025-01-10 20:15:53,707]   DEBUG [count] [progress] 47M reads processed (0.2% mapped)
[2025-01-10 20:21:27,294]   DEBUG [count] [progress] 48M reads processed (0.2% mapped)
[2025-01-10 20:21:27,395]   DEBUG [count] [progress] 50M reads processed (0.2% mapped)              done
[2025-01-10 20:21:27,395]   DEBUG [count] [quant] processed 50,000,000 reads, 97,542 reads pseudoaligned
[2025-01-10 20:21:27,596]   DEBUG [count]
[2025-01-10 20:21:30,323]    INFO [count] Sorting BUS file kb_output/output.bus to kb_output/tmp/output.s.bus
[2025-01-10 20:21:30,323]   DEBUG [count] bustools sort -o kb_output/tmp/output.s.bus -T kb_output/tmp -t 2 -m 2G kb_output/output.bus
[2025-01-10 20:21:32,449]   DEBUG [count] all fits in buffer
[2025-01-10 20:21:33,751]   DEBUG [count] Read in 97542 BUS records
[2025-01-10 20:21:33,751]   DEBUG [count] reading time 0.001918s
[2025-01-10 20:21:33,751]   DEBUG [count] sorting time 0.017437s
[2025-01-10 20:21:33,752]   DEBUG [count] writing time 0.00732s
[2025-01-10 20:21:33,752]    INFO [count] On-list not provided
[2025-01-10 20:21:33,752]    INFO [count] Copying pre-packaged 10XV2 on-list to kb_output
[2025-01-10 20:21:33,980]    INFO [count] Inspecting BUS file kb_output/tmp/output.s.bus
[2025-01-10 20:21:33,980]   DEBUG [count] bustools inspect -o kb_output/inspect.json -w kb_output/10x_version2_whitelist.txt kb_output/tmp/output.s.bus
[2025-01-10 20:21:35,692]    INFO [count] Correcting BUS records in kb_output/tmp/output.s.bus to kb_output/tmp/output.s.c.bus with on-list kb_output/10x_version2_whitelist.txt
[2025-01-10 20:21:35,693]   DEBUG [count] bustools correct -o kb_output/tmp/output.s.c.bus -w kb_output/10x_version2_whitelist.txt kb_output/tmp/output.s.bus
[2025-01-10 20:21:36,899]   DEBUG [count] Found 737280 barcodes in the on-list
[2025-01-10 20:21:37,300]   DEBUG [count] Processed 86901 BUS records
[2025-01-10 20:21:37,300]   DEBUG [count] In on-list = 77345
[2025-01-10 20:21:37,301]   DEBUG [count] Corrected    = 1231
[2025-01-10 20:21:37,301]   DEBUG [count] Uncorrected  = 8325
[2025-01-10 20:21:38,402]    INFO [count] Sorting BUS file kb_output/tmp/output.s.c.bus to kb_output/output.unfiltered.bus
[2025-01-10 20:21:38,403]   DEBUG [count] bustools sort -o kb_output/output.unfiltered.bus -T kb_output/tmp -t 2 -m 2G kb_output/tmp/output.s.c.bus
[2025-01-10 20:21:40,609]   DEBUG [count] all fits in buffer
[2025-01-10 20:21:41,811]   DEBUG [count] Read in 78576 BUS records
[2025-01-10 20:21:41,811]   DEBUG [count] reading time 0.001695s
[2025-01-10 20:21:41,811]   DEBUG [count] sorting time 0.011695s
[2025-01-10 20:21:41,812]   DEBUG [count] writing time 0.005858s
[2025-01-10 20:21:41,815]    INFO [count] Generating count matrix kb_output/counts_unfiltered/cells_x_genes from BUS file kb_output/output.unfiltered.bus
[2025-01-10 20:21:41,815]   DEBUG [count] bustools count -o kb_output/counts_unfiltered/cells_x_genes -g palmdb_clustered_t2g.txt -e kb_output/matrix.ec -t kb_output/transcripts.txt --genecounts --umi-gene kb_output/output.unfiltered.bus
[2025-01-10 20:21:44,995]   DEBUG [count] kb_output/counts_unfiltered/cells_x_genes.mtx passed validation
[2025-01-10 20:21:44,995]    INFO [count] Writing gene names to file kb_output/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-01-10 20:21:45,561] WARNING [count] 99228 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-01-10 20:21:45,609]    INFO [count] Reading matrix kb_output/counts_unfiltered/cells_x_genes.mtx
[2025-01-10 20:21:46,111]    INFO [count] Writing matrix to h5ad kb_output/counts_unfiltered/adata.h5ad
[2025-01-10 20:21:46,519]   DEBUG [main] Removing `kb_output/tmp` directory
CPU times: user 1min 44s, sys: 15.2 s, total: 1min 59s
Wall time: 4h 29min 17s

Load the generated count matrix

[ ]:

# Open count matrix (AnnData object in h5ad format)
adata = anndata.read_h5ad(f"{out_folder}/counts_unfiltered/adata.h5ad")
adata

AnnData object with n_obs × n_vars = 15369 × 99228

[ ]:

# 'barcode' here includes batch/sample barcode (first 16 characters) + cell barcode
adata.obs


barcode
AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG
AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG
AAAAAAAAAAAAAAAAAAACCTGTCAATACCG
AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA
AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC
...
AAAAAAAAAAAAAAATTTTGCGCCAACGATGG
AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT
AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA
AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC
AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT

15369 rows × 0 columns

Use the sample barcodes to add a column with the original SRR file for each cell:

[ ]:

# Load the sample/batch barcodes
b_file = open("kb_output/matrix.sample.barcodes")
barcodes = b_file.read().splitlines()
b_file.close()

# Load the sample/batch names
s_file = open("kb_output/matrix.cells")
samples = s_file.read().splitlines()
s_file.close()

# Create dataframe that maps sample barcodes to sample names
bc2sample_df = pd.DataFrame()
bc2sample_df["sample_barcode"] = barcodes
bc2sample_df["srr"] = samples

# Add sample name to AnnData object
adata.obs["barcode"] = adata.obs.index.values
adata.obs["sample_barcode"] = [bc[:-16] for bc in adata.obs.index.values]
adata.obs["cell_barcode"] = [bc[-16:] for bc in adata.obs.index.values]
adata.obs = adata.obs.merge(bc2sample_df, on="sample_barcode", how="left").set_index("cell_barcode", drop=False)
adata.obs

	barcode	sample_barcode	cell_barcode	srr
cell_barcode
AAACCTGAGATGTCGG	AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG	AAAAAAAAAAAAAAAA	AAACCTGAGATGTCGG	SRR11321526
AAACCTGAGTGAATTG	AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG	AAAAAAAAAAAAAAAA	AAACCTGAGTGAATTG	SRR11321526
AAACCTGTCAATACCG	AAAAAAAAAAAAAAAAAAACCTGTCAATACCG	AAAAAAAAAAAAAAAA	AAACCTGTCAATACCG	SRR11321526
AAACCTGTCCACTCCA	AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA	AAAAAAAAAAAAAAAA	AAACCTGTCCACTCCA	SRR11321526
AAACCTGTCCATGCTC	AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC	AAAAAAAAAAAAAAAA	AAACCTGTCCATGCTC	SRR11321526
...	...	...	...	...
TTTGCGCCAACGATGG	AAAAAAAAAAAAAAATTTTGCGCCAACGATGG	AAAAAAAAAAAAAAAT	TTTGCGCCAACGATGG	SRR11321026
TTTGGTTAGGTAAACT	AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT	AAAAAAAAAAAAAAAT	TTTGGTTAGGTAAACT	SRR11321026
TTTGGTTTCTCTGAGA	AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA	AAAAAAAAAAAAAAAT	TTTGGTTTCTCTGAGA	SRR11321026
TTTGTCACAGCTCGAC	AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC	AAAAAAAAAAAAAAAT	TTTGTCACAGCTCGAC	SRR11321026
TTTGTCAGTACTTCTT	AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT	AAAAAAAAAAAAAAAT	TTTGTCAGTACTTCTT	SRR11321026

15369 rows × 4 columns

[ ]:

# Create dataframe with a label for each SRR library
samples = [
    'SRR11321526',
    'SRR11321528',
    'SRR11321027',
    'SRR11321026'
    ]

# Label corresponding to each SRR library (samples)
labels = [
    'Patient PBMC\n(fever day -2)',
    'Patient PBMC\n(post-infection control)',
    'DENV-infected CEM.NK$^R$',
    'Uninfected\nControl (CEM.NK$^R$)'
    ]

metadata_df = pd.DataFrame()
metadata_df["SRR"] = samples
metadata_df["label"] = labels
metadata_df

	SRR	label
0	SRR11321526	Patient PBMC\n(fever day -2)
1	SRR11321528	Patient PBMC\n(post-infection control)
2	SRR11321027	DENV-infected CEM.NK$^R$
3	SRR11321026	Uninfected\nControl (CEM.NK$^R$)

Plot counts for Dengue virus (DENV) in each sample

Find the virus IDs for Dengue virus in the PalmDB viral protein reference:

[ ]:

# Load the PalmDB virus ID to virus taxonomy mapping
u_tax_csv = "ID_to_taxonomy_mapping.csv"
tax_df = pd.read_csv(u_tax_csv)

# Find the reference IDs for Dengue virus RdRPs
# Note: Only the rep_ID (representative ID) will occur in the count matrix
tax_df[tax_df["species"].str.contains("Dengue virus")]

	ID	rep_ID	phylum	class	order	family	genus	species	strandedness
7298	u1010	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
7299	u1014	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
7300	u1015	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
7301	u1025	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
7302	u10313	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
...	...	...	...	...	...	...	...	...	...
8360	u9796	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
8361	u9846	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
8362	u9911	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
8363	u9935	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
8364	u9989	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA

1067 rows × 9 columns

[ ]:

# Target virus IDs to plot (here, Dengue virus)
target_virus = "Dengue virus"
target_ids = tax_df[tax_df["species"].str.contains(target_virus)]["rep_ID"].unique()
target_ids

array(['u1010'], dtype=object)

Create bar plot showing raw counts for DENV:

[ ]:

fig, ax = plt.subplots(figsize=(5, 5))
fontsize = 14
width = 0.75

counts = []
for sample in metadata_df["SRR"].values:
    counts.append(adata.X[adata.obs["srr"].str.contains(sample), adata.var.index.isin(target_ids)].sum())

x = np.arange(len(metadata_df["SRR"].values))

ax.bar(x, counts, width=width, color="#003049", edgecolor="black")

# Adjust figure labels and axes
ax.set_title(f"Counts for Dengue virus observed in\nCENV-infected patient and cell line", fontsize=fontsize+2)
ax.set_yscale("symlog")
ax.set_ylabel(f"Total counts for {target_ids[0]} (Dengue virus))", fontsize=fontsize)
# ax.set_xlabel("Sample", fontsize=fontsize)
ax.set_xticks(x, metadata_df["label"].values, rotation=45, ha="right")
ax.tick_params(axis="both", labelsize=fontsize-2)
ax.grid(True, which="both", color="lightgray", ls="--", lw=1)
ax.set_axisbelow(True)

# Save figure
plt.savefig("denv_rdrp_count.png", dpi=300, bbox_inches="tight")

fig.show()

../../_images/translated_notebooks_virus_detection_sc_32_0.png

[ ]:

# NOTE: These counts are low since we only aligned a small subset of the data in this example notebook
counts

[6.0, 0.0, 219.0, 0.0]

Visualize other virus-like sequences observed in the dataset

When interpreting the presence of other virus- or RdRP-like sequences (each identified by its own virus ID), keep in mind that there will likely be many RdRP-like sequences introduced by contamination of laboratory reagents. A (non-comprehensive) list of virus IDs observed in blank sequencing data is available here and we will download it below for comparison to the virus IDs observed in our dataset. Another indication that a sequence may have originated from sample contamination is its uniform presence across all sequenced samples.

[ ]:

# Download list of virus IDs seen in blank sequencing reagents
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
blank_df = pd.read_csv("total_raw_count_per_virus_id_in_laboratory_reagents.csv")
blank_df

--2025-01-10 20:22:00--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54642 (53K) [text/plain]
Saving to: ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’

total_raw_count_per 100%[===================>]  53.36K  --.-KB/s    in 0.002s

2025-01-10 20:22:01 (33.5 MB/s) - ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’ saved [54642/54642]

	count	virus
0	5374210.0	u172514
1	4301784.0	u226460
2	752771.0	u237705
3	456424.0	u202260
4	359783.0	u223701
...	...	...
4573	1.0	u169611
4574	1.0	u41840
4575	1.0	u169761
4576	1.0	u169999
4577	1.0	u158491

4578 rows × 2 columns

Plot the top 20 virus IDs observed in this dataset (raw counts are normalized to the total number of viral reads in each sample, and virus IDs also observed in blank sequencing reagents are marked in red):

[ ]:

# Plot the top 20 virus IDs (counts will be normalized to total counts in each SRR library)
def plot_top_virs_stacked_bar(adata, top_n=20):
    # Convert sparse matrix to dense if needed
    counts_matrix = adata.X
    if not isinstance(counts_matrix, np.ndarray):
        counts_matrix = counts_matrix.toarray()

    # Create a DataFrame with counts and group by SRR library
    df_counts = pd.DataFrame(counts_matrix, columns=adata.var_names, index=adata.obs["srr"])
    grouped_counts = df_counts.groupby(level=0).sum()

    # Normalize counts by total read depth per SRR library
    grouped_counts_normalized = grouped_counts.div(grouped_counts.sum(axis=1), axis=0)

    # Sum normalized counts across all SRR libraries for each virus
    vir_totals = grouped_counts_normalized.sum(axis=0)

    # Get the top N viruses across all SRR libraries
    top_vir_indices = np.argsort(vir_totals)[-top_n:][::-1]
    top_virs = grouped_counts_normalized.columns[top_vir_indices]
    top_counts = grouped_counts_normalized[top_virs]

    # Create a stacked bar plot
    fig, ax = plt.subplots(figsize=(7, 5))
    colors = ['#e34a33', '#fdbb84', '#2b8cbe', '#a6bddb'] * (len(top_counts) // 6 + 1)
    fontsize = 14

    bar_bottom = np.zeros(len(top_virs))
    for i, (srr_id, row_counts) in enumerate(top_counts.iterrows()):
        ax.bar(
            top_virs,
            row_counts,
            bottom=bar_bottom,
            color=colors[i],
            label=metadata_df[metadata_df["SRR"] == srr_id]["label"].values[0].replace("\n", " ")
        )
        bar_bottom += row_counts

    # Adjust figure labels and axes
    ax.set_title(f"Top {top_n} Virus IDs", fontsize=fontsize+2)
    ax.set_xlabel("Virus ID", fontsize=fontsize)
    ax.set_ylabel("Fraction of Total Reads in Sample", fontsize=fontsize)
    ax.set_xticklabels(top_virs, rotation=45, ha='right')
    # Mark virus IDs that have been observed in blank sequencing reagents in red
    for i, label in enumerate(top_virs):
        if label in blank_df["virus"].values:
            ax.get_xticklabels()[i].set_color('red')
    ax.legend(title="Sample", loc='upper right', fontsize=fontsize-2, title_fontsize=fontsize)
    ax.tick_params(axis='both', which='major', labelsize=fontsize-2)
    ax.grid(axis='y', linestyle='--', color="lightgrey", alpha=1)
    ax.set_axisbelow(True)
    ax.margins(x=0.01)

    # Save figure
    fig.savefig("top_virus_IDs_per_sample.png", dpi=300, bbox_inches="tight")

    fig.show()

    return top_virs

top_virs = plot_top_virs_stacked_bar(adata, top_n=20)

../../_images/translated_notebooks_virus_detection_sc_37_0.png

Show the predicted taxonomies for each virus ID shown in the plot above (these are the species-like operational taxonomic units (OTUs) provided by PalmDB):

[ ]:

top_vir_tax = tax_df[tax_df["rep_ID"].isin(top_virs)][tax_df.columns[1:]].drop_duplicates()
top_vir_tax['rep_ID'] = pd.Categorical(top_vir_tax['rep_ID'], categories=top_virs, ordered=True)
top_vir_tax = top_vir_tax.sort_values('rep_ID')
top_vir_tax

	rep_ID	phylum	class	order	family	genus	species	strandedness
212430	u101227	Pisuviricota	Pisoniviricetes	Picornavirales	Picornaviridae	.	.	+ssRNA
34276	u27694	Peploviricota	Herviviricetes	Herpesvirales	Herpesviridae	Varicellovirus	Bubaline alphaherpesvirus 1	dsDNA
263358	u202260	.	.	.	.	.	.	unknown
35782	u34159	.	.	.	.	.	.	unknown
66423	u100002	Lenarviricota	Allassoviricetes	Levivirales	.	.	.	+ssRNA
256038	u181379	.	.	.	.	.	.	unknown
39629	u45195	.	.	.	.	.	.	unknown
238847	u135858	.	.	.	.	.	.	unknown
38543	u41991	.	.	.	.	.	.	unknown
219921	u103829	Negarnaviricota	Insthoviricetes	Articulavirales	Orthomyxoviridae	.	.	-ssRNA
142323	u100041	Kitrinoviricota	.	.	.	.	.	+ssRNA
140792	u100031	Kitrinoviricota	Alsuviricetes	.	.	.	.	+ssRNA
95765	u100012	Lenarviricota	Allassoviricetes	.	.	.	.	+ssRNA
226917	u110641	Duplornaviricota	Resentoviricetes	Reovirales	Reoviridae	.	.	dsRNA
35684	u33987	.	.	.	.	.	.	unknown
36272	u35686	.	.	.	.	.	.	unknown
207936	u100644	Lenarviricota	Amabiliviricetes	Wolframvirales	.	.	.	+ssRNA
7298	u1010	Kitrinoviricota	Flasuviricetes	Amarillovirales	Flaviviridae	Flavivirus	Dengue virus	+ssRNA
144317	u100048	Lenarviricota	Amabiliviricetes	.	.	.	.	+ssRNA
34466	u30588	.	.	.	.	.	.	unknown

Extract reads that aligned to specific virus IDs for further analysis

To confirm the presence of a known virus for which a reference genome is available, you can align the data to the viral (nucleotide) reference genome using the standard kb workflow (simply omit the --aa flag in the kb ref and kb count commands and supply the viral reference genome to kb ref instead of the PalmDB protein reference).

If this fails or no reference genome is available, use kb extract to extract raw reads that aligned to a given virus ID for further analysis (such as BLAST alignment):

[ ]:

# Sequences that matched to [vir_ids_to_extract] will be extracted
# Here, we will extract reads that mapped to the virus ID u101227
# u101227 is the virus ID with the most counts across all samples so we might like to find out more about it
vir_ids_to_extract = ["u101227"]

# Provide a single fastq file to extract sequences from (here, we are using "Patient PBMC (post-infection control)")
fastq_file = "/content/SRR11321528_1_short.fastq"

[ ]:

%%time
# Extract sequences using kb extract
outfolder_extract = "extracted_seqs"
targets = " ".join(vir_ids_to_extract)

!kb extract \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    --aa \
    -k $k \
    -g palmdb_clustered_t2g.txt \
    -i $reference_index \
    -ts $targets \
    -o $outfolder_extract \
    $fastq_file

[2025-01-10 20:22:36,217]   DEBUG [main] Printing verbose output
[2025-01-10 20:22:38,426]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 20:22:38,427]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 20:22:38,427]   DEBUG [main] Creating `extracted_seqs/tmp` directory
[2025-01-10 20:22:38,428]   DEBUG [main] Namespace(list=False, command='extract', tmp=None, keep_tmp=False, verbose=True, fastq='/content/SRR11321528_1_short.fastq', i='palmdb_homo_sapiens_dlist_cdna_dna.idx', targets=['u101227'], target_type='gene', extract_all=False, extract_all_fast=False, extract_all_unmapped=False, mm=False, g='palmdb_clustered_t2g.txt', o='extracted_seqs', t=2, strand=None, aa=True, N=None, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=31)
[2025-01-10 20:22:41,916]    INFO [extract] Performing alignment using kallisto...
[2025-01-10 20:22:41,916]    INFO [extract] Using index palmdb_homo_sapiens_dlist_cdna_dna.idx to generate BUS file to extracted_seqs/tmp from
[2025-01-10 20:22:41,916]    INFO [extract]         /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:41,916]   DEBUG [extract] kallisto bus -i palmdb_homo_sapiens_dlist_cdna_dna.idx -o extracted_seqs/tmp -x bulk -t 2 --num --aa /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:42,018]   DEBUG [extract]
[2025-01-10 20:22:52,284]   DEBUG [extract] [index] k-mer length: 31
[2025-01-10 20:23:17,325]   DEBUG [extract] [index] number of targets: 296,561
[2025-01-10 20:23:17,426]   DEBUG [extract] [index] number of k-mers: 37,541,835
[2025-01-10 20:23:17,427]   DEBUG [extract] [index] number of distinguishing flanking k-mers: 130
[2025-01-10 20:23:17,427]   DEBUG [extract] [quant] running in single-end mode
[2025-01-10 20:23:17,427]   DEBUG [extract] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 20:29:38,975]   DEBUG [extract] [quant] finding pseudoalignments for all files ...
[2025-01-10 20:35:12,343]   DEBUG [extract] [progress] 1M reads processed (0.2% mapped)
[2025-01-10 20:41:23,853]   DEBUG [extract] [progress] 2M reads processed (0.2% mapped)
[2025-01-10 20:47:13,944]   DEBUG [extract] [progress] 3M reads processed (0.2% mapped)
[2025-01-10 20:53:03,094]   DEBUG [extract] [progress] 4M reads processed (0.2% mapped)
[2025-01-10 20:59:01,997]   DEBUG [extract] [progress] 5M reads processed (0.2% mapped)
[2025-01-10 21:05:06,825]   DEBUG [extract] [progress] 6M reads processed (0.2% mapped)
[2025-01-10 21:11:00,966]   DEBUG [extract] [progress] 7M reads processed (0.2% mapped)
[2025-01-10 21:16:30,838]   DEBUG [extract] [progress] 8M reads processed (0.2% mapped)
[2025-01-10 21:22:43,933]   DEBUG [extract] [progress] 9M reads processed (0.2% mapped)
[2025-01-10 21:28:20,131]   DEBUG [extract] [progress] 10M reads processed (0.2% mapped)
[2025-01-10 21:34:19,356]   DEBUG [extract] [progress] 11M reads processed (0.2% mapped)
[2025-01-10 21:40:08,500]   DEBUG [extract] [progress] 12M reads processed (0.2% mapped)
[2025-01-10 21:46:17,800]   DEBUG [extract] [progress] 13M reads processed (0.2% mapped)
[2025-01-10 21:51:49,182]   DEBUG [extract] [progress] 14M reads processed (0.2% mapped)
[2025-01-10 21:58:01,668]   DEBUG [extract] [progress] 15M reads processed (0.2% mapped)
[2025-01-10 22:03:50,793]   DEBUG [extract] [progress] 16M reads processed (0.2% mapped)
[2025-01-10 22:10:08,338]   DEBUG [extract] [progress] 17M reads processed (0.2% mapped)
[2025-01-10 22:15:51,016]   DEBUG [extract] [progress] 18M reads processed (0.2% mapped)
[2025-01-10 22:16:16,355]   DEBUG [extract] [progress] 19M reads processed (0.2% mapped)              done
[2025-01-10 22:16:16,355]   DEBUG [extract] [quant] processed 20,000,000 reads, 39,165 reads pseudoaligned
[2025-01-10 22:16:16,456]   DEBUG [extract]
[2025-01-10 22:16:19,313]    INFO [extract] Alignment complete. Beginning extraction of reads using bustools...
[2025-01-10 22:16:19,313]   DEBUG [extract] Replacing transcript entries with -1 for equivalence classes that map to multiple genes from extracted_seqs/tmp/matrix.ec
[2025-01-10 22:16:19,313]   DEBUG [extract] Fetching equivalence classes that map to multiple genes...
[2025-01-10 22:18:02,036]   DEBUG [extract] Found the following equivalence classes that map to multiple genes: 2 ,12 ,21 ,32 ,33 ,46 ,47 ,50 ,74 ,79 ,82 ,88 ,94 ,105 ,120 ,122 ,127 ,130 ,133 ,134 ,145 ,148 ,155 ,159 ,163 ,175 ,176 ,178 ,188 ,192 ,196 ,205 ,208 ,219 ,221 ,228 ,234 ,235 ,236 ,246 ,250 ,251 ,257 ,281 ,284 ,290 ,292 ,300 ,301 ,306 ,317 ,318 ,330 ,345 ,355 ,363 ,366 ,370 ,378 ,381 ,391 ,393 ,398 ,408 ,415 ,424 ,461 ,464 ,485 ,488 ,491 ,500 ,501 ,508 ,514 ,515 ,516 ,523 ,538 ,555 ,559 ,566 ,600 ,602 ,618 ,627 ,640 ,658 ,663 ,665 ,670 ,672 ,678 ,689 ,709 ,717 ,718 ,730 ,734 ,741 ,744 ,753 ,755 ,773 ,777 ,778 ,783 ,785 ,798 ,799 ,806 ,808 ,811 ,817 ,839 ,855 ,866 ,879 ,887 ,901 ,915 ,919 ,937 ,945 ,948 ,955 ,959 ,963 ,966 ,969 ,978 ,981 ,982 ,984 ,991 ,992
[2025-01-10 22:18:02,508]   DEBUG [extract] matrix.ec file where transcript entries were replaced with -1 for equivalence classes that map to multiple genes saved at extracted_seqs/tmp/matrix_no_mm.ec
[2025-01-10 22:18:03,075]    INFO [extract] Extracting reads for following transcripts for gene ID u101227: u101227, u102725, u102856, u102947, u10335, u103456, u103675, u104127, u10482, u105052, u10615, u106215, u10770, u110499, u111077, u112094, u112930, u11566, u11569, u11836, u11900, u12042, u12244, u123282, u126493, u129004, u129529, u131170, u132107, u132728, u13403, u13457, u134583, u134791, u135143, u135699, u13571, u136670, u137293, u137756, u139241, u139747, u140312, u140369, u14054, u140956, u141387, u143726, u144916, u14643, u148403, u149735, u15117, u152755, u152787, u152990, u153247, u15348, u16191, u163680, u164306, u164471, u164858, u164862, u16616, u16650, u16673, u167681, u168111, u168557, u169370, u169890, u170485, u170643, u170684, u172845, u173262, u173267, u173273, u174860, u17506, u175136, u17570, u176023, u176354, u176685, u176766, u177572, u180974, u18133, u181372, u182074, u182409, u182729, u18497, u18522, u18526, u18552, u187751, u18816, u188403, u189847, u19125, u192470, u192506, u192594, u19269, u192756, u192778, u19413, u19414, u195952, u195974, u196140, u196146, u19676, u19683, u197076, u197735, u198145, u198863, u19904, u199489, u200181, u20080, u20401, u20905, u211976, u21252, u21348, u215888, u215931, u21623, u216267, u216535, u217051, u217464, u217631, u21838, u218639, u21883, u219099, u219536, u220867, u226460, u22963, u23001, u230077, u231334, u232374, u235197, u235199, u237195, u237646, u237705, u237861, u237874, u238129, u238297, u240678, u24119, u242498, u242575, u243228, u24451, u249846, u250515, u250842, u25121, u25122, u251542, u252286, u25326, u253263, u25488, u25633, u25834, u25941, u260723, u260863, u26195, u262007, u26234, u266280, u266533, u266575, u26809, u26837, u26912, u269394, u269502, u27033, u27043, u270944, u271418, u271476, u271967, u273827, u274926, u275199, u276589, u278285, u278547, u278745, u278897, u278900, u278917, u285756, u285785, u285886, u286675, u287636, u294801, u295612, u33024, u33728, u33972, u34195, u37215, u37275, u37365, u37907, u38885, u41220, u41917, u42877, u43050, u43075, u45290, u45700, u45990, u45993, u46809, u47157, u47625, u47650, u47921, u50707, u5313, u55146, u5733, u59672, u61411, u64246, u66038, u67495, u67526, u70657, u7131, u72282, u74997, u7838, u79493, u79948, u80423, u80961, u82533, u83982, u84674, u86925, u88655, u89511, u89629, u89858, u92526, u9318, u93829, u93893, u94411, u94453, u94828, u9563, u9592, u9594, u9731, u97337, u97563, u98014, u9917, u99271, u9953
[2025-01-10 22:18:03,075]    INFO [extract] Capturing records from BUS file extracted_seqs/tmp/output.bus to extracted_seqs/tmp/output_extracted_u101227.bus with capture list extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt
[2025-01-10 22:18:03,075]   DEBUG [extract] bustools capture -o extracted_seqs/tmp/output_extracted_u101227.bus -c extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt -e extracted_seqs/tmp/matrix_no_mm.ec -t extracted_seqs/tmp/transcripts.txt --transcripts extracted_seqs/tmp/output.bus
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing transcripts .. done
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing ECs .. done
[2025-01-10 22:18:03,481]   DEBUG [extract] Parsing capture list .. done
[2025-01-10 22:18:04,682]   DEBUG [extract] Read in 39165 BUS records, wrote 741 BUS records
[2025-01-10 22:18:05,797]   DEBUG [extract] extracted_seqs/tmp/output_extracted_u101227.bus passed validation
[2025-01-10 22:18:05,797]    INFO [extract] Sorting BUS file extracted_seqs/tmp/output_extracted_u101227.bus to extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:18:05,798]   DEBUG [extract] bustools sort -o extracted_seqs/tmp/output_extracted_u101227_sorted.bus -T tmp -t 8 -m 2G --flags extracted_seqs/tmp/output_extracted_u101227.bus
[2025-01-10 22:18:05,899]   DEBUG [extract] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2
[2025-01-10 22:18:08,407]   DEBUG [extract] all fits in buffer
[2025-01-10 22:18:09,809]   DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:18:09,809]   DEBUG [extract] reading time 5.5e-05s
[2025-01-10 22:18:09,809]   DEBUG [extract] sorting time 7.1e-05s
[2025-01-10 22:18:09,809]   DEBUG [extract] writing time 0.000146s
[2025-01-10 22:18:09,810]    INFO [extract] Extracting BUS file extracted_seqs/tmp/output_extracted_u101227_sorted.bus to extracted_seqs/u101227
[2025-01-10 22:18:09,810]   DEBUG [extract] bustools extract -o extracted_seqs/u101227 -f /content/SRR11321528_1_short.fastq -N 1 extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:19:34,572]   DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:19:34,695]   DEBUG [main] Removing `extracted_seqs/tmp` directory
CPU times: user 30.3 s, sys: 4.07 s, total: 34.4 s
Wall time: 1h 57min 29s

[ ]:

# Grab the reads extracted by kb extract for each virus ID
def get_all_sequences(fastq_file):
    """Extract all sequences from a gzipped FASTQ file."""
    sequences = []
    with gzip.open(fastq_file, "rt") as f:
        for record in SeqIO.parse(f, "fastq"):
            sequences.append(str(record.seq))
    return sequences

extracted_seqs_dict = {}
for vir_id in vir_ids_to_extract:
    # Grab the reads from the fastq file returned by kb extract
    seqs = get_all_sequences(f"{outfolder_extract}/{vir_id}/1.fastq.gz")
    # Save the extracted reads for each virus ID in a dictionary
    extracted_seqs_dict[vir_id] = seqs

BLAST extracted reads

BLAST the first read extracted for virus ID u101227 using `gget blast <https://pachterlab.github.io/gget/en/blast.html>`__ to see which known reference sequences it aligns to:

[ ]:

# BLAST the first read extracted for u101227 using gget
seq_idx = 0
virus_id = "u101227"

raw_read = extracted_seqs_dict[virus_id][seq_idx]

blast_df = gget.blast(raw_read)
blast_df

INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3781DRZ013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]

	Description	Scientific Name	Common Name	Taxid	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
0	Severe acute respiratory syndrome coronavirus ...	Severe acute respiratory syndrome coronavirus 2	NaN	2697049	84.2	84.2	40%	4.000000e-12	91.67%	29567	OY708460.1
1	Gossypium raimondii isolate Gr291 retrotranspo...	Gossypium raimondii	Peruvian cotton	29730	71.3	71.3	45%	3.000000e-08	85.29%	1708	KT895861.1
2	Severe acute respiratory syndrome coronavirus ...	Severe acute respiratory syndrome coronavirus 2	NaN	2697049	58.4	58.4	23%	3.000000e-04	97.06%	29816	OY714418.1
3	Synthetic construct LB106 DNA, synthetic long-...	synthetic construct	NaN	32630	54.7	54.7	19%	3.000000e-03	100.00%	572	LC597715.1
4	Gastrodia elata chromosome GE13 mitochondrion,...	Gastrodia elata	NaN	91201	52.8	52.8	19%	1.200000e-02	100.00%	34420	OP441103.1

BLAST a few more of the extracted sequences to see if the BLAST results reproduce:

[ ]:

# Set the seed for reproducibility
np.random.seed(0)

# Randomly select indices for three of the extracted sequences
seq_idxs = np.random.randint(0, 200, size=3)
seq_idxs

array([172,  47, 117])

[ ]:

gget.blast(extracted_seqs_dict[virus_id][seq_idxs[0]], limit=5)

INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E24YMB013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]

	Description	Scientific Name	Common Name	Taxid	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
0	Severe acute respiratory syndrome coronavirus ...	Severe acute respiratory syndrome coronavirus 2	NaN	2697049	58.4	58.4	23%	0.0003	97.06%	29829	OY695180.1
1	Gastrodia elata chromosome GE13 mitochondrion,...	Gastrodia elata	NaN	91201	52.8	52.8	19%	0.0120	100.00%	34420	OP441103.1

[ ]:

gget.blast(extracted_seqs_dict[virus_id][seq_idxs[1]], limit=5)

INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E34MM6013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]

	Description	Scientific Name	Common Name	Taxid	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
0	Homo sapiens ribosomal protein L39 (RPL39), mRNA	Homo sapiens	human	9606	141	141	50%	3.000000e-29	100.00%	390	NM_001000.4
1	PREDICTED: Pan troglodytes ribosomal protein L...	Pan troglodytes	chimpanzee	9598	141	141	50%	3.000000e-29	100.00%	438	XM_063804925.1
2	PREDICTED: Pan paniscus ribosomal protein L39 ...	Pan paniscus	pygmy chimpanzee	9597	141	141	50%	3.000000e-29	100.00%	500	XM_014343539.5
3	Homo sapiens ribosomal protein L39, mRNA (cDNA...	Homo sapiens	human	9606	141	141	50%	3.000000e-29	100.00%	419	BC070205.1
4	PREDICTED: Gorilla gorilla gorilla ribosomal p...	Gorilla gorilla gorilla	western lowland gorilla	9595	141	141	50%	3.000000e-29	100.00%	484	XM_019019406.3

[ ]:

gget.blast(extracted_seqs_dict[virus_id][seq_idxs[2]], limit=5)

INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E44SY2016. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  results_df = pd.read_html(str(dsc_table))[0]

	Description	Scientific Name	Common Name	Taxid	Max Score	Total Score	Query Cover	E value	Per. Ident	Acc. Len	Accession
0	Severe acute respiratory syndrome coronavirus ...	Severe acute respiratory syndrome coronavirus 2	NaN	2697049	78.7	78.7	40%	2.000000e-10	90.00%	29567	OY708460.1
1	Gossypium raimondii isolate Gr291 retrotranspo...	Gossypium raimondii	Peruvian cotton	29730	71.3	71.3	45%	3.000000e-08	85.29%	1708	KT895861.1
2	Severe acute respiratory syndrome coronavirus ...	Severe acute respiratory syndrome coronavirus 2	NaN	2697049	63.9	63.9	25%	6.000000e-06	97.30%	29816	OY714418.1
3	Synthetic construct LB106 DNA, synthetic long-...	synthetic construct	NaN	32630	54.7	54.7	19%	3.000000e-03	100.00%	572	LC597715.1
4	Gastrodia elata chromosome GE13 mitochondrion,...	Gastrodia elata	NaN	91201	52.8	52.8	19%	1.200000e-02	100.00%	34420	OP441103.1

Align data to virus reference genome

Some of the reads that aligned to u101227 BLASTed to SARS-CoV2 sequences. Given that u101227 was observed uniformly in all samples, which included two samples from patient PBMCs and two from a cell line, it is possible that these viral sequences were introduced by sample contamination. It is also possible that the BLAST hits reflect adapter sequence or contamination within the SARS-CoV-2 entries in the BLAST database itself, rather than true biological signal in the samples. The predicted taxonomy for u101227 in PalmDB is Picornaviridae, which includes disease-causing viruses like Enteroviruses. If this virus does not have a reference genome available, the BLAST results will not be very reliable.

We can align the dataset to one of the SARS-CoV2 reference genomes by rerunnning the kb ref and kb count commands from above, but providing the virus reference genome and omitting the --aa flag since this is now a nucleotide reference (instead of a protein reference):

[ ]:

# Download SARS-CoV2 reference genome from NCBI
# Finding the correct reference genome for the virus present in this dataset might take several tries and I would recommend alingning to all available reference genomes for a given virus
# We will perform an alignment to a single SARS-CoV2 reference genome here as an exampe
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz

--2025-01-10 22:21:53--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.31, 130.14.250.10, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9591 (9.4K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’

GCF_009858895.2_ASM 100%[===================>]   9.37K  --.-KB/s    in 0s

2025-01-10 22:21:53 (109 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’ saved [9591/9591]

--2025-01-10 22:21:53--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.13, 130.14.250.7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1218 (1.2K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’

GCF_009858895.2_ASM 100%[===================>]   1.19K  --.-KB/s    in 0s

2025-01-10 22:21:53 (75.3 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’ saved [1218/1218]

[ ]:

# We will use a slightly lower k for this alignment since the viral genome is smaller and more prone to mutations
# Set k=31 here for more specificity (at a trade-off with sensitivity)
k = 21

[ ]:

# Generate a SARS-CoV2 reference index
sars_cov2_genome = "GCF_009858895.2_ASM985889v3_genomic.fna.gz"
sars_cov2_gtf = "GCF_009858895.2_ASM985889v3_genomic.gtf.gz"

sars_reference_index = "sars_cov2_ref.idx"
sars_t2g = "sars_cov2_t2g.txt"
sars_fasta = "sars_cov2_transcripts.fa"

!kb ref \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    -k $k \
    --d-list $host_genome_transciptome \
    -i $sars_reference_index \
    -g $sars_t2g \
    -f1 $sars_fasta \
    $sars_cov2_genome $sars_cov2_gtf

# Align data to the SARS-CoV2 reference index generated above
sars_out_folder = "kb_output_sars_cov2"

!kb count \
    --kallisto $kallisto \
    --verbose \
    -t $threads \
    -k $k \
    -x 10xv2 \
    --parity single \
    -i $sars_reference_index \
    -g $sars_t2g \
    --h5ad \
    -o $sars_out_folder \
    --batch-barcodes \
    batch.txt

[2025-01-10 22:22:06,079]   DEBUG [main] Printing verbose output
[2025-01-10 22:22:08,288]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:22:08,289]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:22:08,289]   DEBUG [main] Creating `tmp` directory
[2025-01-10 22:22:08,289]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', f1='sars_cov2_transcripts.fa', include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=21, t=2, d_list='homo_sapiens.cdna_dna.fa.gz', d_list_overhang=1, aa=False, workflow='standard', distinguish=False, make_unique=False, overwrite=False, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, fasta='GCF_009858895.2_ASM985889v3_genomic.fna.gz', gtf='GCF_009858895.2_ASM985889v3_genomic.gtf.gz', feature=None, no_mismatches=False, ec_max_size=None, flank=None)
[2025-01-10 22:22:08,289]    INFO [ref] Preparing GCF_009858895.2_ASM985889v3_genomic.fna.gz, GCF_009858895.2_ASM985889v3_genomic.gtf.gz
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp01` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp01`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp02` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp02`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp03` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp03`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp04` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp04`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp05` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp05`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp06` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp06`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp07` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp07`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp08` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp08`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp09` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp09`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp10` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp10`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp11` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp11`.
[2025-01-10 22:22:08,293]    INFO [ref] Splitting genome GCF_009858895.2_ASM985889v3_genomic.fna.gz into cDNA at /content/tmp/tmppr1s84kg
[2025-01-10 22:22:08,296]    INFO [ref] Concatenating 1 cDNAs to sars_cov2_transcripts.fa
[2025-01-10 22:22:08,297]    INFO [ref] Creating transcript-to-gene mapping at sars_cov2_t2g.txt
[2025-01-10 22:22:08,298] WARNING [ref] Using provided k-mer length 21 instead of optimal length 31
[2025-01-10 22:22:08,298]    INFO [ref] Indexing sars_cov2_transcripts.fa to sars_cov2_ref.idx
[2025-01-10 22:22:08,298]   DEBUG [ref] kallisto index -i sars_cov2_ref.idx -k 21 -t 2 -d homo_sapiens.cdna_dna.fa.gz sars_cov2_transcripts.fa
[2025-01-10 22:22:08,399]   DEBUG [ref]
[2025-01-10 22:22:08,400]   DEBUG [ref] [build] loading fasta file sars_cov2_transcripts.fa
[2025-01-10 22:22:08,400]   DEBUG [ref] [build] k-mer length: 21
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400]   DEBUG [ref] KmerStream::KmerStream(): Finished
[2025-01-10 22:22:08,901]   DEBUG [ref] CompactedDBG::build(): Estimated number of k-mers occurring at least once: 29244
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::build(): Estimated number of minimizer occurring at least once: 7101
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Processed 29044 k-mers in 11 reads
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Found 29020 unique k-mers
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::filter(): Number of blocks in Bloom filter is 201
[2025-01-10 22:22:08,902]   DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (1/2)
[2025-01-10 22:22:09,003]   DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (2/2)
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Closed all input files
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (1/2)
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (2/2)
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Before split: 21 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): After split (1/1): 21 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Unitigs split: 0
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Unitigs deleted: 0
[2025-01-10 22:22:09,004]   DEBUG [ref]
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Joining unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): After join: 12 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] CompactedDBG::construct(): Joined 9 unitigs
[2025-01-10 22:22:09,004]   DEBUG [ref] [build] extracting distinguishing flanking k-mers from "homo_sapiens.cdna_dna.fa.gz"
[2025-01-10 22:52:25,513]   DEBUG [ref] [build] identified 192 distinguishing flanking k-mers
[2025-01-10 22:52:25,514]   DEBUG [ref] [build] building MPHF
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] creating equivalence classes ...
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] target de Bruijn graph has k-mer length 21 and minimizer length 13
[2025-01-10 22:52:25,614]   DEBUG [ref] [build] target de Bruijn graph has 385 contigs and contains 29232 k-mers
[2025-01-10 22:52:25,614]   DEBUG [ref]
[2025-01-10 22:52:26,716]   DEBUG [main] Removing `tmp` directory
[2025-01-10 22:52:53,155]   DEBUG [main] Printing verbose output
[2025-01-10 22:52:55,364]   DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:52:55,365]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:52:55,369]   DEBUG [main] Creating `kb_output_sars_cov2/tmp` directory
[2025-01-10 22:52:55,369]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', x='10xv2', o='kb_output_sars_cov2', num=False, w=None, r=None, t=2, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=True, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=21, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity='single', fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['batch.txt'])
[2025-01-10 22:53:00,312]    INFO [count] Using index sars_cov2_ref.idx to generate BUS file to kb_output_sars_cov2 from
[2025-01-10 22:53:00,312]    INFO [count]         /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,313]   DEBUG [count] kallisto bus -i sars_cov2_ref.idx -o kb_output_sars_cov2 -x 10xv2 -t 2 --batch-barcodes --batch /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,414]   DEBUG [count]
[2025-01-10 22:53:00,414]   DEBUG [count] [bus] will try running read files supplied in batch file
[2025-01-10 22:53:00,414]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2025-01-10 22:53:00,414]   DEBUG [count] [index] k-mer length: 21
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of targets: 11
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of k-mers: 29,232
[2025-01-10 22:53:00,414]   DEBUG [count] [index] number of distinguishing flanking k-mers: 192
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] running in single-end mode
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] will process file 1: /content/SRR11321526_1_short.fastq
[2025-01-10 22:53:00,414]   DEBUG [count] [quant] will process file 2: /content/SRR11321526_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321528_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321027_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321027_2_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 1: /content/SRR11321026_1_short.fastq
[2025-01-10 22:53:00,415]   DEBUG [count] [quant] will process file 2: /content/SRR11321026_2_short.fastq
[2025-01-10 22:54:13,017]   DEBUG [count] [quant] finding pseudoalignments for all files ...
[2025-01-10 22:55:26,842]   DEBUG [count] [progress] 1M reads processed (0.0% mapped)
[2025-01-10 22:56:38,616]   DEBUG [count] [progress] 2M reads processed (0.0% mapped)
[2025-01-10 22:57:51,633]   DEBUG [count] [progress] 3M reads processed (0.0% mapped)
[2025-01-10 22:59:04,247]   DEBUG [count] [progress] 4M reads processed (0.0% mapped)
[2025-01-10 23:00:18,133]   DEBUG [count] [progress] 5M reads processed (0.0% mapped)
[2025-01-10 23:01:29,261]   DEBUG [count] [progress] 6M reads processed (0.0% mapped)
[2025-01-10 23:02:44,006]   DEBUG [count] [progress] 7M reads processed (0.0% mapped)
[2025-01-10 23:03:58,135]   DEBUG [count] [progress] 8M reads processed (0.0% mapped)
[2025-01-10 23:05:12,056]   DEBUG [count] [progress] 9M reads processed (0.0% mapped)
[2025-01-10 23:06:27,251]   DEBUG [count] [progress] 10M reads processed (0.0% mapped)
[2025-01-10 23:07:41,214]   DEBUG [count] [progress] 11M reads processed (0.0% mapped)
[2025-01-10 23:08:55,498]   DEBUG [count] [progress] 12M reads processed (0.0% mapped)
[2025-01-10 23:10:09,482]   DEBUG [count] [progress] 13M reads processed (0.0% mapped)
[2025-01-10 23:11:23,321]   DEBUG [count] [progress] 14M reads processed (0.0% mapped)
[2025-01-10 23:12:38,384]   DEBUG [count] [progress] 15M reads processed (0.0% mapped)
[2025-01-10 23:13:53,246]   DEBUG [count] [progress] 16M reads processed (0.0% mapped)
[2025-01-10 23:15:06,130]   DEBUG [count] [progress] 17M reads processed (0.0% mapped)
[2025-01-10 23:16:20,647]   DEBUG [count] [progress] 18M reads processed (0.0% mapped)
[2025-01-10 23:17:34,493]   DEBUG [count] [progress] 19M reads processed (0.0% mapped)
[2025-01-10 23:18:41,643]   DEBUG [count] [progress] 20M reads processed (0.0% mapped)
[2025-01-10 23:19:57,201]   DEBUG [count] [progress] 21M reads processed (0.0% mapped)
[2025-01-10 23:21:12,646]   DEBUG [count] [progress] 22M reads processed (0.0% mapped)
[2025-01-10 23:22:26,495]   DEBUG [count] [progress] 23M reads processed (0.0% mapped)
[2025-01-10 23:23:41,101]   DEBUG [count] [progress] 24M reads processed (0.0% mapped)
[2025-01-10 23:25:00,423]   DEBUG [count] [progress] 25M reads processed (0.0% mapped)
[2025-01-10 23:26:15,749]   DEBUG [count] [progress] 26M reads processed (0.0% mapped)
[2025-01-10 23:27:30,788]   DEBUG [count] [progress] 27M reads processed (0.0% mapped)
[2025-01-10 23:28:44,837]   DEBUG [count] [progress] 28M reads processed (0.0% mapped)
[2025-01-10 23:30:00,421]   DEBUG [count] [progress] 29M reads processed (0.0% mapped)
[2025-01-10 23:31:15,614]   DEBUG [count] [progress] 30M reads processed (0.0% mapped)
[2025-01-10 23:32:29,335]   DEBUG [count] [progress] 31M reads processed (0.0% mapped)
[2025-01-10 23:33:42,670]   DEBUG [count] [progress] 32M reads processed (0.0% mapped)
[2025-01-10 23:34:55,149]   DEBUG [count] [progress] 33M reads processed (0.0% mapped)
[2025-01-10 23:36:08,691]   DEBUG [count] [progress] 34M reads processed (0.0% mapped)
[2025-01-10 23:37:21,768]   DEBUG [count] [progress] 35M reads processed (0.0% mapped)
[2025-01-10 23:38:34,436]   DEBUG [count] [progress] 36M reads processed (0.0% mapped)
[2025-01-10 23:39:46,948]   DEBUG [count] [progress] 37M reads processed (0.0% mapped)
[2025-01-10 23:40:59,180]   DEBUG [count] [progress] 38M reads processed (0.0% mapped)
[2025-01-10 23:42:13,862]   DEBUG [count] [progress] 39M reads processed (0.0% mapped)
[2025-01-10 23:43:28,554]   DEBUG [count] [progress] 40M reads processed (0.0% mapped)
[2025-01-10 23:44:42,803]   DEBUG [count] [progress] 41M reads processed (0.0% mapped)
[2025-01-10 23:46:00,436]   DEBUG [count] [progress] 42M reads processed (0.0% mapped)
[2025-01-10 23:47:17,115]   DEBUG [count] [progress] 43M reads processed (0.0% mapped)
[2025-01-10 23:48:33,170]   DEBUG [count] [progress] 44M reads processed (0.0% mapped)
[2025-01-10 23:49:51,295]   DEBUG [count] [progress] 45M reads processed (0.0% mapped)
[2025-01-10 23:51:09,533]   DEBUG [count] [progress] 46M reads processed (0.0% mapped)
[2025-01-10 23:52:27,425]   DEBUG [count] [progress] 47M reads processed (0.0% mapped)
[2025-01-10 23:53:40,746]   DEBUG [count] [progress] 48M reads processed (0.0% mapped)
[2025-01-10 23:53:40,846]   DEBUG [count] [progress] 50M reads processed (0.0% mapped)              done
[2025-01-10 23:53:40,847]   DEBUG [count] [quant] processed 50,000,000 reads, 14 reads pseudoaligned
[2025-01-10 23:53:41,948]   DEBUG [count]
[2025-01-10 23:53:41,964]    INFO [count] Sorting BUS file kb_output_sars_cov2/output.bus to kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:41,964]   DEBUG [count] bustools sort -o kb_output_sars_cov2/tmp/output.s.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/output.bus
[2025-01-10 23:53:44,131]   DEBUG [count] all fits in buffer
[2025-01-10 23:53:45,432]   DEBUG [count] Read in 14 BUS records
[2025-01-10 23:53:45,432]   DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:45,432]   DEBUG [count] sorting time 2e-06s
[2025-01-10 23:53:45,432]   DEBUG [count] writing time 5.8e-05s
[2025-01-10 23:53:45,433]    INFO [count] On-list not provided
[2025-01-10 23:53:45,433]    INFO [count] Copying pre-packaged 10XV2 on-list to kb_output_sars_cov2
[2025-01-10 23:53:45,607]    INFO [count] Inspecting BUS file kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:45,607]   DEBUG [count] bustools inspect -o kb_output_sars_cov2/inspect.json -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:47,311]    INFO [count] Correcting BUS records in kb_output_sars_cov2/tmp/output.s.bus to kb_output_sars_cov2/tmp/output.s.c.bus with on-list kb_output_sars_cov2/10x_version2_whitelist.txt
[2025-01-10 23:53:47,312]   DEBUG [count] bustools correct -o kb_output_sars_cov2/tmp/output.s.c.bus -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:48,415]   DEBUG [count] Found 737280 barcodes in the on-list
[2025-01-10 23:53:48,816]   DEBUG [count] Processed 14 BUS records
[2025-01-10 23:53:48,816]   DEBUG [count] In on-list = 13
[2025-01-10 23:53:48,816]   DEBUG [count] Corrected    = 0
[2025-01-10 23:53:48,816]   DEBUG [count] Uncorrected  = 1
[2025-01-10 23:53:49,918]    INFO [count] Sorting BUS file kb_output_sars_cov2/tmp/output.s.c.bus to kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:49,918]   DEBUG [count] bustools sort -o kb_output_sars_cov2/output.unfiltered.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/tmp/output.s.c.bus
[2025-01-10 23:53:52,025]   DEBUG [count] all fits in buffer
[2025-01-10 23:53:53,326]   DEBUG [count] Read in 13 BUS records
[2025-01-10 23:53:53,326]   DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:53,326]   DEBUG [count] sorting time 1e-06s
[2025-01-10 23:53:53,326]   DEBUG [count] writing time 5.4e-05s
[2025-01-10 23:53:53,330]    INFO [count] Generating count matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes from BUS file kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:53,330]   DEBUG [count] bustools count -o kb_output_sars_cov2/counts_unfiltered/cells_x_genes -g sars_cov2_t2g.txt -e kb_output_sars_cov2/matrix.ec -t kb_output_sars_cov2/transcripts.txt --genecounts --umi-gene kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:54,995]   DEBUG [count] kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx passed validation
[2025-01-10 23:53:54,996]    INFO [count] Writing gene names to file kb_output_sars_cov2/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-01-10 23:53:54,996] WARNING [count] 11 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-01-10 23:53:54,996]    INFO [count] Reading matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx
[2025-01-10 23:53:55,302]    INFO [count] Writing matrix to h5ad kb_output_sars_cov2/counts_unfiltered/adata.h5ad
[2025-01-10 23:53:55,622]   DEBUG [main] Removing `kb_output_sars_cov2/tmp` directory

Though some reads aligned, only 14 out of 50,000,000 reads mapped to this SARS-CoV2 genome, so this was likely not the correct viral reference genome. You could also confirm the presence of virus using the assembly of contigs from the raw sequencing data (for example, using a tool like SPAdes) and the subsequent alignment of the contigs to all viral reference genomes.

For novel viruses, the parallel analysis of viral presence and host gene expression can provide further insights, as was done here (investigate viral tropism) and here (correlate viral presence with altered host gene expression at single-cell resolution).

For inspiration regarding further analysis, each figure from our manuscript can be reproduced using these notebooks.