Viral proteins in single-cell RNA seq data
In this tutorial, we will align single-cell RNA sequencing data collected from Dengue virus (DENV)-infected CEM.NK\(^{R}\) cells as well as peripheral blood mononuclear cells (PBMC) data collected from a patient suffering from DENV (during and after infection) to viral RdRP protein sequences. Let’s see if we can detect DENV RdRP sequences as expected.
Written by: Laura Luebbert (last updated: January 7th, 2025)
[ ]:
# SRR numbers of the sequencing dataset(s) to analyze
srr_numbers = [
"SRR11321526",
"SRR11321528",
"SRR11321027",
"SRR11321026"
]
# Genus and species of the host (separate by "_")
host_species = "homo_sapiens"
# Set this to False to download the complete RNA seq dataset (note: this might exceed the Google Colab disk space)
# With subset_of_data=True, only a subset of the DENV dataset is aligned in this example notebook to decrease runtime and required disk space
subset_of_data = True
# Number of threads to use during alignments
threads = 2
# k-mer length: Increase k for increased specificity (at a potential trade-off with sensitivity)
# k should be an odd integer and should be >=31 for translated alignment
# Since k relates to nucleotides and not amino acids, k corresponds to matching ⌊k/3⌋ amino acids (for k=31, this corresponds to ⌊31/3⌋=10 amino acids)
k = 31
Install software
[ ]:
!pip install -q ffq gget kb_python==0.29.1 anndata==0.10.9
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.6/49.6 kB 2.2 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.5/36.5 MB 9.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.0/129.0 kB 8.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.1/43.1 MB 13.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.4/50.4 kB 2.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 35.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 12.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.9/50.9 MB 11.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 30.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 33.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 3.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.0/22.0 MB 33.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.8/88.8 kB 6.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.6/40.6 kB 2.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.6/83.6 kB 5.4 MB/s eta 0:00:00
Building wheel for loompy (setup.py) ... done
Building wheel for session-info (setup.py) ... done
[ ]:
import anndata
import gget
import glob
import gzip
import json
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import platform
from Bio import SeqIO
%config InlineBackend.figure_format='retina'
[ ]:
# This step will not be necessary after the next kb release
if k<32:
# Install kallisto v0.50.0 from source
# This version performs translated search faster while providing the same results as later versions, but it does not support k>31
# (the features to increase runtime and support k>31 will be incorporated into the next version of kb, so this step will not be necessary after the next release)
!git clone https://github.com/pachterlab/kallisto.git --branch v0.50.0
!cd kallisto && mkdir build && cd build && cmake .. && make
kallisto = "kallisto/build/src/kallisto"
else:
# Find the kallisto binary installed automatically with kb (or simply remove the --kallisto arguments below)
kallisto = glob.glob(f"/usr/local/lib/python*/dist-packages/kb_python/bins/{platform.system().lower()}/kallisto/kallisto_k64")[0]
Download raw sequencing data
Here, we will use ffq to download the sequencing data from DENV1 infected CEM.NK^R cells (SRR11321027) and the uninfected control (SRR11321026) as well as PBMC data collected from a patient experiencing a DENV infection at fever day -2 (SRR11321526) and 180 (baseline post-infection) (SRR11321528):
[ ]:
%%time
## Get the FTP download links for raw data using ffq and store the results in a json file
srr_numbers = " ".join(srr_numbers)
!ffq $srr_numbers \
--ftp \
-o ffq.json
## Load ffq output
f = open("ffq.json")
data_json = json.load(f)
f.close()
## Download raw data using FTP links fetched by ffq
# NOTE: To decrease the runtime and required disk space of this example notebook,
# we will only download the first [gb_to_download] GB of each fastq file when subset_of_data=True
gb_to_download = 3
# Convert GB to bytes
max_bytes = gb_to_download * 1073741824
# Download data
for dataset in data_json:
url = dataset["url"]
if subset_of_data:
!curl -r 0-$max_bytes -O $url
else:
!curl -O $url
# Remove unnecessary file to save disk space
!rm SRR11321528.fastq.gz
[2025-01-10 15:00:38,840] INFO Parsing run SRR11321526
[2025-01-10 15:00:40,240] INFO Parsing run SRR11321528
[2025-01-10 15:00:41,497] INFO Parsing run SRR11321027
[2025-01-10 15:00:42,746] INFO Parsing run SRR11321026
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 44.8M 0 0:01:08 0:01:08 --:--:-- 22.5M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 49.8M 0 0:01:01 0:01:01 --:--:-- 51.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 49.1M 0 0:01:02 0:01:02 --:--:-- 51.0M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 50.4M 0 0:01:00 0:01:00 --:--:-- 51.7M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 40.0M 0 0:01:16 0:01:16 --:--:-- 49.9M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 49.3M 0 0:01:02 0:01:02 --:--:-- 50.8M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 49.6M 0 0:01:01 0:01:01 --:--:-- 51.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 49.7M 0 0:01:01 0:01:01 --:--:-- 50.9M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3072M 100 3072M 0 0 50.3M 0 0:01:00 0:01:00 --:--:-- 51.7M
CPU times: user 4.67 s, sys: 768 ms, total: 5.44 s
Wall time: 9min 44s
Align data using kallisto translated search
Create a batch.txt file so we can run all fastq files simultaneously (to learn more about batch files, see Box 7 in the Protocols paper). To keep track of which cell barcodes belonged to which SRR file, we will activate the --batch-barcodes option in the alignment step below:
[ ]:
%%time
samples = srr_numbers.split(" ")
# NOTE: To decrease the runtime and disk space of this example notebook,
# we will only align the top [n_seq_to_keep[i]] sequences in each fastq file when subset_of_data=True
n_seq_to_keep = [
20000000,
20000000,
5000000, # No need to keep as many reads for the cell line samples as viral read % is higher
5000000
]
with open("batch.txt", "w") as batch_file:
for i, sample in enumerate(samples):
R1 = f"/content/{sample}_1.fastq.gz"
R2 = f"/content/{sample}_2.fastq.gz"
if subset_of_data:
# Shorten fastq files (skip this step during a real analysis)
n_rows_to_keep = n_seq_to_keep[i] * 4
R1_short = R1.split(".fastq.gz")[0] + "_short.fastq"
R2_short = R2.split(".fastq.gz")[0] + "_short.fastq"
!zcat $R1 | head -$n_rows_to_keep > $R1_short
!zcat $R2 | head -$n_rows_to_keep > $R2_short
# Delete original files to save disk space
!rm $R1 $R2
# Write batch file in the following format:
# sample_name \t R1_filepath \t R2_filepath
batch_file.write(sample + "\t" + R1_short + "\t" + R2_short + "\n")
else:
batch_file.write(sample + "\t" + R1 + "\t" + R2 + "\n")
CPU times: user 3.07 s, sys: 458 ms, total: 3.53 s
Wall time: 7min 41s
Align FASTQs using kallisto translated search:
--aa argument tells kb that this is an amino acid reference index.`-x argument <https://kallisto.readthedocs.io/en/latest/sc/technologies.html>`__ tells kb where to find the cell barcodes and UMIs in the data. Here, we set it to 10xv2 because this data was generated using the Chromium Single-Cell 5′ Reagent version 2 kit.--h5ad flag generates a .h5ad file from the count matrix.NOTE for paired read data: kallisto translated search currently does not yet support the simultaneous alignment of paired reads, so parity should always be set to ‘single’ (--parity single) and the R1 and R2 files obtained using paired read sequencing should be aligned separately. The resulting matrices from the R1 and R2 reads should provide similar viral counts.
[ ]:
%%time
out_folder = "kb_output"
!kb count \
--kallisto $kallisto \
--verbose \
-t $threads \
--aa \
-k $k \
-x 10xv2 \
--parity single \
-i $reference_index \
-g palmdb_clustered_t2g.txt \
--h5ad \
-o $out_folder \
--batch-barcodes \
batch.txt
[2025-01-10 15:52:50,983] DEBUG [main] Printing verbose output
[2025-01-10 15:52:53,191] DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 15:52:53,192] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 15:52:53,192] DEBUG [main] Creating `kb_output/tmp` directory
[2025-01-10 15:52:53,194] DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='palmdb_homo_sapiens_dlist_cdna_dna.idx', g='palmdb_clustered_t2g.txt', x='10xv2', o='kb_output', num=False, w=None, r=None, t=2, m='2G', strand=None, inleaved=False, genomebam=False, aa=True, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=True, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=31, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity='single', fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['batch.txt'])
[2025-01-10 15:52:57,069] INFO [count] Using index palmdb_homo_sapiens_dlist_cdna_dna.idx to generate BUS file to kb_output from
[2025-01-10 15:52:57,069] INFO [count] /content/kb_output/tmp/tmpsb1qpyjx
[2025-01-10 15:52:57,069] DEBUG [count] kallisto bus -i palmdb_homo_sapiens_dlist_cdna_dna.idx -o kb_output -x 10xv2 -t 2 --aa --batch-barcodes --batch /content/kb_output/tmp/tmpsb1qpyjx
[2025-01-10 15:52:57,171] DEBUG [count]
[2025-01-10 15:52:57,171] DEBUG [count] [bus] will try running read files supplied in batch file
[2025-01-10 15:52:57,171] DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2025-01-10 15:53:07,012] DEBUG [count] [index] k-mer length: 31
[2025-01-10 15:53:29,938] DEBUG [count] [index] number of targets: 296,561
[2025-01-10 15:53:29,939] DEBUG [count] [index] number of k-mers: 37,541,835
[2025-01-10 15:53:29,939] DEBUG [count] [index] number of distinguishing flanking k-mers: 130
[2025-01-10 15:53:29,939] DEBUG [count] [quant] running in single-end mode
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 1: /content/SRR11321526_1_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 2: /content/SRR11321526_2_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 2: /content/SRR11321528_2_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 1: /content/SRR11321027_1_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 2: /content/SRR11321027_2_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 1: /content/SRR11321026_1_short.fastq
[2025-01-10 15:53:29,939] DEBUG [count] [quant] will process file 2: /content/SRR11321026_2_short.fastq
[2025-01-10 15:59:40,532] DEBUG [count] [quant] finding pseudoalignments for all files ...
[2025-01-10 16:05:40,481] DEBUG [count] [progress] 1M reads processed (0.2% mapped)
[2025-01-10 16:11:19,599] DEBUG [count] [progress] 2M reads processed (0.2% mapped)
[2025-01-10 16:16:56,456] DEBUG [count] [progress] 3M reads processed (0.2% mapped)
[2025-01-10 16:22:43,587] DEBUG [count] [progress] 4M reads processed (0.2% mapped)
[2025-01-10 16:28:31,712] DEBUG [count] [progress] 5M reads processed (0.2% mapped)
[2025-01-10 16:34:26,790] DEBUG [count] [progress] 6M reads processed (0.2% mapped)
[2025-01-10 16:40:07,766] DEBUG [count] [progress] 7M reads processed (0.2% mapped)
[2025-01-10 16:45:46,763] DEBUG [count] [progress] 8M reads processed (0.2% mapped)
[2025-01-10 16:51:19,055] DEBUG [count] [progress] 9M reads processed (0.2% mapped)
[2025-01-10 16:56:52,839] DEBUG [count] [progress] 10M reads processed (0.2% mapped)
[2025-01-10 17:02:29,792] DEBUG [count] [progress] 11M reads processed (0.2% mapped)
[2025-01-10 17:08:07,981] DEBUG [count] [progress] 12M reads processed (0.2% mapped)
[2025-01-10 17:13:39,961] DEBUG [count] [progress] 13M reads processed (0.2% mapped)
[2025-01-10 17:19:20,291] DEBUG [count] [progress] 14M reads processed (0.2% mapped)
[2025-01-10 17:24:58,571] DEBUG [count] [progress] 15M reads processed (0.2% mapped)
[2025-01-10 17:30:22,827] DEBUG [count] [progress] 16M reads processed (0.2% mapped)
[2025-01-10 17:35:13,251] DEBUG [count] [progress] 17M reads processed (0.2% mapped)
[2025-01-10 17:40:20,934] DEBUG [count] [progress] 18M reads processed (0.2% mapped)
[2025-01-10 17:45:16,769] DEBUG [count] [progress] 19M reads processed (0.2% mapped)
[2025-01-10 17:50:11,728] DEBUG [count] [progress] 20M reads processed (0.2% mapped)
[2025-01-10 17:55:20,603] DEBUG [count] [progress] 21M reads processed (0.2% mapped)
[2025-01-10 18:00:11,547] DEBUG [count] [progress] 22M reads processed (0.2% mapped)
[2025-01-10 18:05:09,311] DEBUG [count] [progress] 23M reads processed (0.2% mapped)
[2025-01-10 18:10:18,138] DEBUG [count] [progress] 24M reads processed (0.2% mapped)
[2025-01-10 18:15:09,731] DEBUG [count] [progress] 25M reads processed (0.2% mapped)
[2025-01-10 18:20:10,316] DEBUG [count] [progress] 26M reads processed (0.2% mapped)
[2025-01-10 18:25:08,681] DEBUG [count] [progress] 27M reads processed (0.2% mapped)
[2025-01-10 18:30:06,926] DEBUG [count] [progress] 28M reads processed (0.2% mapped)
[2025-01-10 18:35:02,398] DEBUG [count] [progress] 29M reads processed (0.2% mapped)
[2025-01-10 18:40:06,895] DEBUG [count] [progress] 30M reads processed (0.2% mapped)
[2025-01-10 18:45:03,551] DEBUG [count] [progress] 31M reads processed (0.2% mapped)
[2025-01-10 18:50:46,827] DEBUG [count] [progress] 32M reads processed (0.2% mapped)
[2025-01-10 18:56:18,283] DEBUG [count] [progress] 33M reads processed (0.2% mapped)
[2025-01-10 19:01:44,666] DEBUG [count] [progress] 34M reads processed (0.2% mapped)
[2025-01-10 19:07:33,105] DEBUG [count] [progress] 35M reads processed (0.2% mapped)
[2025-01-10 19:13:10,611] DEBUG [count] [progress] 36M reads processed (0.2% mapped)
[2025-01-10 19:18:39,088] DEBUG [count] [progress] 37M reads processed (0.2% mapped)
[2025-01-10 19:24:19,279] DEBUG [count] [progress] 38M reads processed (0.2% mapped)
[2025-01-10 19:29:47,357] DEBUG [count] [progress] 39M reads processed (0.2% mapped)
[2025-01-10 19:35:35,178] DEBUG [count] [progress] 40M reads processed (0.2% mapped)
[2025-01-10 19:41:12,484] DEBUG [count] [progress] 41M reads processed (0.2% mapped)
[2025-01-10 19:47:02,956] DEBUG [count] [progress] 42M reads processed (0.2% mapped)
[2025-01-10 19:52:26,931] DEBUG [count] [progress] 43M reads processed (0.2% mapped)
[2025-01-10 19:58:33,900] DEBUG [count] [progress] 44M reads processed (0.2% mapped)
[2025-01-10 20:04:37,109] DEBUG [count] [progress] 45M reads processed (0.2% mapped)
[2025-01-10 20:10:22,079] DEBUG [count] [progress] 46M reads processed (0.2% mapped)
[2025-01-10 20:15:53,707] DEBUG [count] [progress] 47M reads processed (0.2% mapped)
[2025-01-10 20:21:27,294] DEBUG [count] [progress] 48M reads processed (0.2% mapped)
[2025-01-10 20:21:27,395] DEBUG [count] [progress] 50M reads processed (0.2% mapped) done
[2025-01-10 20:21:27,395] DEBUG [count] [quant] processed 50,000,000 reads, 97,542 reads pseudoaligned
[2025-01-10 20:21:27,596] DEBUG [count]
[2025-01-10 20:21:30,323] INFO [count] Sorting BUS file kb_output/output.bus to kb_output/tmp/output.s.bus
[2025-01-10 20:21:30,323] DEBUG [count] bustools sort -o kb_output/tmp/output.s.bus -T kb_output/tmp -t 2 -m 2G kb_output/output.bus
[2025-01-10 20:21:32,449] DEBUG [count] all fits in buffer
[2025-01-10 20:21:33,751] DEBUG [count] Read in 97542 BUS records
[2025-01-10 20:21:33,751] DEBUG [count] reading time 0.001918s
[2025-01-10 20:21:33,751] DEBUG [count] sorting time 0.017437s
[2025-01-10 20:21:33,752] DEBUG [count] writing time 0.00732s
[2025-01-10 20:21:33,752] INFO [count] On-list not provided
[2025-01-10 20:21:33,752] INFO [count] Copying pre-packaged 10XV2 on-list to kb_output
[2025-01-10 20:21:33,980] INFO [count] Inspecting BUS file kb_output/tmp/output.s.bus
[2025-01-10 20:21:33,980] DEBUG [count] bustools inspect -o kb_output/inspect.json -w kb_output/10x_version2_whitelist.txt kb_output/tmp/output.s.bus
[2025-01-10 20:21:35,692] INFO [count] Correcting BUS records in kb_output/tmp/output.s.bus to kb_output/tmp/output.s.c.bus with on-list kb_output/10x_version2_whitelist.txt
[2025-01-10 20:21:35,693] DEBUG [count] bustools correct -o kb_output/tmp/output.s.c.bus -w kb_output/10x_version2_whitelist.txt kb_output/tmp/output.s.bus
[2025-01-10 20:21:36,899] DEBUG [count] Found 737280 barcodes in the on-list
[2025-01-10 20:21:37,300] DEBUG [count] Processed 86901 BUS records
[2025-01-10 20:21:37,300] DEBUG [count] In on-list = 77345
[2025-01-10 20:21:37,301] DEBUG [count] Corrected = 1231
[2025-01-10 20:21:37,301] DEBUG [count] Uncorrected = 8325
[2025-01-10 20:21:38,402] INFO [count] Sorting BUS file kb_output/tmp/output.s.c.bus to kb_output/output.unfiltered.bus
[2025-01-10 20:21:38,403] DEBUG [count] bustools sort -o kb_output/output.unfiltered.bus -T kb_output/tmp -t 2 -m 2G kb_output/tmp/output.s.c.bus
[2025-01-10 20:21:40,609] DEBUG [count] all fits in buffer
[2025-01-10 20:21:41,811] DEBUG [count] Read in 78576 BUS records
[2025-01-10 20:21:41,811] DEBUG [count] reading time 0.001695s
[2025-01-10 20:21:41,811] DEBUG [count] sorting time 0.011695s
[2025-01-10 20:21:41,812] DEBUG [count] writing time 0.005858s
[2025-01-10 20:21:41,815] INFO [count] Generating count matrix kb_output/counts_unfiltered/cells_x_genes from BUS file kb_output/output.unfiltered.bus
[2025-01-10 20:21:41,815] DEBUG [count] bustools count -o kb_output/counts_unfiltered/cells_x_genes -g palmdb_clustered_t2g.txt -e kb_output/matrix.ec -t kb_output/transcripts.txt --genecounts --umi-gene kb_output/output.unfiltered.bus
[2025-01-10 20:21:44,995] DEBUG [count] kb_output/counts_unfiltered/cells_x_genes.mtx passed validation
[2025-01-10 20:21:44,995] INFO [count] Writing gene names to file kb_output/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-01-10 20:21:45,561] WARNING [count] 99228 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-01-10 20:21:45,609] INFO [count] Reading matrix kb_output/counts_unfiltered/cells_x_genes.mtx
[2025-01-10 20:21:46,111] INFO [count] Writing matrix to h5ad kb_output/counts_unfiltered/adata.h5ad
[2025-01-10 20:21:46,519] DEBUG [main] Removing `kb_output/tmp` directory
CPU times: user 1min 44s, sys: 15.2 s, total: 1min 59s
Wall time: 4h 29min 17s
Load the generated count matrix
[ ]:
# Open count matrix (AnnData object in h5ad format)
adata = anndata.read_h5ad(f"{out_folder}/counts_unfiltered/adata.h5ad")
adata
AnnData object with n_obs × n_vars = 15369 × 99228
[ ]:
# 'barcode' here includes batch/sample barcode (first 16 characters) + cell barcode
adata.obs
| barcode |
|---|
| AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG |
| AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG |
| AAAAAAAAAAAAAAAAAAACCTGTCAATACCG |
| AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA |
| AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC |
| ... |
| AAAAAAAAAAAAAAATTTTGCGCCAACGATGG |
| AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT |
| AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA |
| AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC |
| AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT |
15369 rows × 0 columns
Use the sample barcodes to add a column with the original SRR file for each cell:
[ ]:
# Load the sample/batch barcodes
b_file = open("kb_output/matrix.sample.barcodes")
barcodes = b_file.read().splitlines()
b_file.close()
# Load the sample/batch names
s_file = open("kb_output/matrix.cells")
samples = s_file.read().splitlines()
s_file.close()
# Create dataframe that maps sample barcodes to sample names
bc2sample_df = pd.DataFrame()
bc2sample_df["sample_barcode"] = barcodes
bc2sample_df["srr"] = samples
# Add sample name to AnnData object
adata.obs["barcode"] = adata.obs.index.values
adata.obs["sample_barcode"] = [bc[:-16] for bc in adata.obs.index.values]
adata.obs["cell_barcode"] = [bc[-16:] for bc in adata.obs.index.values]
adata.obs = adata.obs.merge(bc2sample_df, on="sample_barcode", how="left").set_index("cell_barcode", drop=False)
adata.obs
| barcode | sample_barcode | cell_barcode | srr | |
|---|---|---|---|---|
| cell_barcode | ||||
| AAACCTGAGATGTCGG | AAAAAAAAAAAAAAAAAAACCTGAGATGTCGG | AAAAAAAAAAAAAAAA | AAACCTGAGATGTCGG | SRR11321526 |
| AAACCTGAGTGAATTG | AAAAAAAAAAAAAAAAAAACCTGAGTGAATTG | AAAAAAAAAAAAAAAA | AAACCTGAGTGAATTG | SRR11321526 |
| AAACCTGTCAATACCG | AAAAAAAAAAAAAAAAAAACCTGTCAATACCG | AAAAAAAAAAAAAAAA | AAACCTGTCAATACCG | SRR11321526 |
| AAACCTGTCCACTCCA | AAAAAAAAAAAAAAAAAAACCTGTCCACTCCA | AAAAAAAAAAAAAAAA | AAACCTGTCCACTCCA | SRR11321526 |
| AAACCTGTCCATGCTC | AAAAAAAAAAAAAAAAAAACCTGTCCATGCTC | AAAAAAAAAAAAAAAA | AAACCTGTCCATGCTC | SRR11321526 |
| ... | ... | ... | ... | ... |
| TTTGCGCCAACGATGG | AAAAAAAAAAAAAAATTTTGCGCCAACGATGG | AAAAAAAAAAAAAAAT | TTTGCGCCAACGATGG | SRR11321026 |
| TTTGGTTAGGTAAACT | AAAAAAAAAAAAAAATTTTGGTTAGGTAAACT | AAAAAAAAAAAAAAAT | TTTGGTTAGGTAAACT | SRR11321026 |
| TTTGGTTTCTCTGAGA | AAAAAAAAAAAAAAATTTTGGTTTCTCTGAGA | AAAAAAAAAAAAAAAT | TTTGGTTTCTCTGAGA | SRR11321026 |
| TTTGTCACAGCTCGAC | AAAAAAAAAAAAAAATTTTGTCACAGCTCGAC | AAAAAAAAAAAAAAAT | TTTGTCACAGCTCGAC | SRR11321026 |
| TTTGTCAGTACTTCTT | AAAAAAAAAAAAAAATTTTGTCAGTACTTCTT | AAAAAAAAAAAAAAAT | TTTGTCAGTACTTCTT | SRR11321026 |
15369 rows × 4 columns
[ ]:
# Create dataframe with a label for each SRR library
samples = [
'SRR11321526',
'SRR11321528',
'SRR11321027',
'SRR11321026'
]
# Label corresponding to each SRR library (samples)
labels = [
'Patient PBMC\n(fever day -2)',
'Patient PBMC\n(post-infection control)',
'DENV-infected CEM.NK$^R$',
'Uninfected\nControl (CEM.NK$^R$)'
]
metadata_df = pd.DataFrame()
metadata_df["SRR"] = samples
metadata_df["label"] = labels
metadata_df
| SRR | label | |
|---|---|---|
| 0 | SRR11321526 | Patient PBMC\n(fever day -2) |
| 1 | SRR11321528 | Patient PBMC\n(post-infection control) |
| 2 | SRR11321027 | DENV-infected CEM.NK$^R$ |
| 3 | SRR11321026 | Uninfected\nControl (CEM.NK$^R$) |
Plot counts for Dengue virus (DENV) in each sample
Find the virus IDs for Dengue virus in the PalmDB viral protein reference:
[ ]:
# Load the PalmDB virus ID to virus taxonomy mapping
u_tax_csv = "ID_to_taxonomy_mapping.csv"
tax_df = pd.read_csv(u_tax_csv)
# Find the reference IDs for Dengue virus RdRPs
# Note: Only the rep_ID (representative ID) will occur in the count matrix
tax_df[tax_df["species"].str.contains("Dengue virus")]
| ID | rep_ID | phylum | class | order | family | genus | species | strandedness | |
|---|---|---|---|---|---|---|---|---|---|
| 7298 | u1010 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 7299 | u1014 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 7300 | u1015 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 7301 | u1025 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 7302 | u10313 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8360 | u9796 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 8361 | u9846 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 8362 | u9911 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 8363 | u9935 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 8364 | u9989 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
1067 rows × 9 columns
[ ]:
# Target virus IDs to plot (here, Dengue virus)
target_virus = "Dengue virus"
target_ids = tax_df[tax_df["species"].str.contains(target_virus)]["rep_ID"].unique()
target_ids
array(['u1010'], dtype=object)
Create bar plot showing raw counts for DENV:
[ ]:
fig, ax = plt.subplots(figsize=(5, 5))
fontsize = 14
width = 0.75
counts = []
for sample in metadata_df["SRR"].values:
counts.append(adata.X[adata.obs["srr"].str.contains(sample), adata.var.index.isin(target_ids)].sum())
x = np.arange(len(metadata_df["SRR"].values))
ax.bar(x, counts, width=width, color="#003049", edgecolor="black")
# Adjust figure labels and axes
ax.set_title(f"Counts for Dengue virus observed in\nCENV-infected patient and cell line", fontsize=fontsize+2)
ax.set_yscale("symlog")
ax.set_ylabel(f"Total counts for {target_ids[0]} (Dengue virus))", fontsize=fontsize)
# ax.set_xlabel("Sample", fontsize=fontsize)
ax.set_xticks(x, metadata_df["label"].values, rotation=45, ha="right")
ax.tick_params(axis="both", labelsize=fontsize-2)
ax.grid(True, which="both", color="lightgray", ls="--", lw=1)
ax.set_axisbelow(True)
# Save figure
plt.savefig("denv_rdrp_count.png", dpi=300, bbox_inches="tight")
fig.show()
[ ]:
# NOTE: These counts are low since we only aligned a small subset of the data in this example notebook
counts
[6.0, 0.0, 219.0, 0.0]
Visualize other virus-like sequences observed in the dataset
When interpreting the presence of other virus- or RdRP-like sequences (each identified by its own virus ID), keep in mind that there will likely be many RdRP-like sequences introduced by contamination of laboratory reagents. A (non-comprehensive) list of virus IDs observed in blank sequencing data is available here and we will download it below for comparison to the virus IDs observed in our dataset. Another indication that a sequence may have originated from sample contamination is its uniform presence across all sequenced samples.
[ ]:
# Download list of virus IDs seen in blank sequencing reagents
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
blank_df = pd.read_csv("total_raw_count_per_virus_id_in_laboratory_reagents.csv")
blank_df
--2025-01-10 20:22:00-- https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/refs/heads/main/viruses_in_blank_reagents/total_raw_count_per_virus_id_in_laboratory_reagents.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54642 (53K) [text/plain]
Saving to: ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’
total_raw_count_per 100%[===================>] 53.36K --.-KB/s in 0.002s
2025-01-10 20:22:01 (33.5 MB/s) - ‘total_raw_count_per_virus_id_in_laboratory_reagents.csv’ saved [54642/54642]
| count | virus | |
|---|---|---|
| 0 | 5374210.0 | u172514 |
| 1 | 4301784.0 | u226460 |
| 2 | 752771.0 | u237705 |
| 3 | 456424.0 | u202260 |
| 4 | 359783.0 | u223701 |
| ... | ... | ... |
| 4573 | 1.0 | u169611 |
| 4574 | 1.0 | u41840 |
| 4575 | 1.0 | u169761 |
| 4576 | 1.0 | u169999 |
| 4577 | 1.0 | u158491 |
4578 rows × 2 columns
Plot the top 20 virus IDs observed in this dataset (raw counts are normalized to the total number of viral reads in each sample, and virus IDs also observed in blank sequencing reagents are marked in red):
[ ]:
# Plot the top 20 virus IDs (counts will be normalized to total counts in each SRR library)
def plot_top_virs_stacked_bar(adata, top_n=20):
# Convert sparse matrix to dense if needed
counts_matrix = adata.X
if not isinstance(counts_matrix, np.ndarray):
counts_matrix = counts_matrix.toarray()
# Create a DataFrame with counts and group by SRR library
df_counts = pd.DataFrame(counts_matrix, columns=adata.var_names, index=adata.obs["srr"])
grouped_counts = df_counts.groupby(level=0).sum()
# Normalize counts by total read depth per SRR library
grouped_counts_normalized = grouped_counts.div(grouped_counts.sum(axis=1), axis=0)
# Sum normalized counts across all SRR libraries for each virus
vir_totals = grouped_counts_normalized.sum(axis=0)
# Get the top N viruses across all SRR libraries
top_vir_indices = np.argsort(vir_totals)[-top_n:][::-1]
top_virs = grouped_counts_normalized.columns[top_vir_indices]
top_counts = grouped_counts_normalized[top_virs]
# Create a stacked bar plot
fig, ax = plt.subplots(figsize=(7, 5))
colors = ['#e34a33', '#fdbb84', '#2b8cbe', '#a6bddb'] * (len(top_counts) // 6 + 1)
fontsize = 14
bar_bottom = np.zeros(len(top_virs))
for i, (srr_id, row_counts) in enumerate(top_counts.iterrows()):
ax.bar(
top_virs,
row_counts,
bottom=bar_bottom,
color=colors[i],
label=metadata_df[metadata_df["SRR"] == srr_id]["label"].values[0].replace("\n", " ")
)
bar_bottom += row_counts
# Adjust figure labels and axes
ax.set_title(f"Top {top_n} Virus IDs", fontsize=fontsize+2)
ax.set_xlabel("Virus ID", fontsize=fontsize)
ax.set_ylabel("Fraction of Total Reads in Sample", fontsize=fontsize)
ax.set_xticklabels(top_virs, rotation=45, ha='right')
# Mark virus IDs that have been observed in blank sequencing reagents in red
for i, label in enumerate(top_virs):
if label in blank_df["virus"].values:
ax.get_xticklabels()[i].set_color('red')
ax.legend(title="Sample", loc='upper right', fontsize=fontsize-2, title_fontsize=fontsize)
ax.tick_params(axis='both', which='major', labelsize=fontsize-2)
ax.grid(axis='y', linestyle='--', color="lightgrey", alpha=1)
ax.set_axisbelow(True)
ax.margins(x=0.01)
# Save figure
fig.savefig("top_virus_IDs_per_sample.png", dpi=300, bbox_inches="tight")
fig.show()
return top_virs
top_virs = plot_top_virs_stacked_bar(adata, top_n=20)
Show the predicted taxonomies for each virus ID shown in the plot above (these are the species-like operational taxonomic units (OTUs) provided by PalmDB):
[ ]:
top_vir_tax = tax_df[tax_df["rep_ID"].isin(top_virs)][tax_df.columns[1:]].drop_duplicates()
top_vir_tax['rep_ID'] = pd.Categorical(top_vir_tax['rep_ID'], categories=top_virs, ordered=True)
top_vir_tax = top_vir_tax.sort_values('rep_ID')
top_vir_tax
| rep_ID | phylum | class | order | family | genus | species | strandedness | |
|---|---|---|---|---|---|---|---|---|
| 212430 | u101227 | Pisuviricota | Pisoniviricetes | Picornavirales | Picornaviridae | . | . | +ssRNA |
| 34276 | u27694 | Peploviricota | Herviviricetes | Herpesvirales | Herpesviridae | Varicellovirus | Bubaline alphaherpesvirus 1 | dsDNA |
| 263358 | u202260 | . | . | . | . | . | . | unknown |
| 35782 | u34159 | . | . | . | . | . | . | unknown |
| 66423 | u100002 | Lenarviricota | Allassoviricetes | Levivirales | . | . | . | +ssRNA |
| 256038 | u181379 | . | . | . | . | . | . | unknown |
| 39629 | u45195 | . | . | . | . | . | . | unknown |
| 238847 | u135858 | . | . | . | . | . | . | unknown |
| 38543 | u41991 | . | . | . | . | . | . | unknown |
| 219921 | u103829 | Negarnaviricota | Insthoviricetes | Articulavirales | Orthomyxoviridae | . | . | -ssRNA |
| 142323 | u100041 | Kitrinoviricota | . | . | . | . | . | +ssRNA |
| 140792 | u100031 | Kitrinoviricota | Alsuviricetes | . | . | . | . | +ssRNA |
| 95765 | u100012 | Lenarviricota | Allassoviricetes | . | . | . | . | +ssRNA |
| 226917 | u110641 | Duplornaviricota | Resentoviricetes | Reovirales | Reoviridae | . | . | dsRNA |
| 35684 | u33987 | . | . | . | . | . | . | unknown |
| 36272 | u35686 | . | . | . | . | . | . | unknown |
| 207936 | u100644 | Lenarviricota | Amabiliviricetes | Wolframvirales | . | . | . | +ssRNA |
| 7298 | u1010 | Kitrinoviricota | Flasuviricetes | Amarillovirales | Flaviviridae | Flavivirus | Dengue virus | +ssRNA |
| 144317 | u100048 | Lenarviricota | Amabiliviricetes | . | . | . | . | +ssRNA |
| 34466 | u30588 | . | . | . | . | . | . | unknown |
Extract reads that aligned to specific virus IDs for further analysis
To confirm the presence of a known virus for which a reference genome is available, you can align the data to the viral (nucleotide) reference genome using the standard kb workflow (simply omit the --aa flag in the kb ref and kb count commands and supply the viral reference genome to kb ref instead of the PalmDB protein reference).
If this fails or no reference genome is available, use kb extract to extract raw reads that aligned to a given virus ID for further analysis (such as BLAST alignment):
[ ]:
# Sequences that matched to [vir_ids_to_extract] will be extracted
# Here, we will extract reads that mapped to the virus ID u101227
# u101227 is the virus ID with the most counts across all samples so we might like to find out more about it
vir_ids_to_extract = ["u101227"]
# Provide a single fastq file to extract sequences from (here, we are using "Patient PBMC (post-infection control)")
fastq_file = "/content/SRR11321528_1_short.fastq"
[ ]:
%%time
# Extract sequences using kb extract
outfolder_extract = "extracted_seqs"
targets = " ".join(vir_ids_to_extract)
!kb extract \
--kallisto $kallisto \
--verbose \
-t $threads \
--aa \
-k $k \
-g palmdb_clustered_t2g.txt \
-i $reference_index \
-ts $targets \
-o $outfolder_extract \
$fastq_file
[2025-01-10 20:22:36,217] DEBUG [main] Printing verbose output
[2025-01-10 20:22:38,426] DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 20:22:38,427] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 20:22:38,427] DEBUG [main] Creating `extracted_seqs/tmp` directory
[2025-01-10 20:22:38,428] DEBUG [main] Namespace(list=False, command='extract', tmp=None, keep_tmp=False, verbose=True, fastq='/content/SRR11321528_1_short.fastq', i='palmdb_homo_sapiens_dlist_cdna_dna.idx', targets=['u101227'], target_type='gene', extract_all=False, extract_all_fast=False, extract_all_unmapped=False, mm=False, g='palmdb_clustered_t2g.txt', o='extracted_seqs', t=2, strand=None, aa=True, N=None, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=31)
[2025-01-10 20:22:41,916] INFO [extract] Performing alignment using kallisto...
[2025-01-10 20:22:41,916] INFO [extract] Using index palmdb_homo_sapiens_dlist_cdna_dna.idx to generate BUS file to extracted_seqs/tmp from
[2025-01-10 20:22:41,916] INFO [extract] /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:41,916] DEBUG [extract] kallisto bus -i palmdb_homo_sapiens_dlist_cdna_dna.idx -o extracted_seqs/tmp -x bulk -t 2 --num --aa /content/SRR11321528_1_short.fastq
[2025-01-10 20:22:42,018] DEBUG [extract]
[2025-01-10 20:22:52,284] DEBUG [extract] [index] k-mer length: 31
[2025-01-10 20:23:17,325] DEBUG [extract] [index] number of targets: 296,561
[2025-01-10 20:23:17,426] DEBUG [extract] [index] number of k-mers: 37,541,835
[2025-01-10 20:23:17,427] DEBUG [extract] [index] number of distinguishing flanking k-mers: 130
[2025-01-10 20:23:17,427] DEBUG [extract] [quant] running in single-end mode
[2025-01-10 20:23:17,427] DEBUG [extract] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 20:29:38,975] DEBUG [extract] [quant] finding pseudoalignments for all files ...
[2025-01-10 20:35:12,343] DEBUG [extract] [progress] 1M reads processed (0.2% mapped)
[2025-01-10 20:41:23,853] DEBUG [extract] [progress] 2M reads processed (0.2% mapped)
[2025-01-10 20:47:13,944] DEBUG [extract] [progress] 3M reads processed (0.2% mapped)
[2025-01-10 20:53:03,094] DEBUG [extract] [progress] 4M reads processed (0.2% mapped)
[2025-01-10 20:59:01,997] DEBUG [extract] [progress] 5M reads processed (0.2% mapped)
[2025-01-10 21:05:06,825] DEBUG [extract] [progress] 6M reads processed (0.2% mapped)
[2025-01-10 21:11:00,966] DEBUG [extract] [progress] 7M reads processed (0.2% mapped)
[2025-01-10 21:16:30,838] DEBUG [extract] [progress] 8M reads processed (0.2% mapped)
[2025-01-10 21:22:43,933] DEBUG [extract] [progress] 9M reads processed (0.2% mapped)
[2025-01-10 21:28:20,131] DEBUG [extract] [progress] 10M reads processed (0.2% mapped)
[2025-01-10 21:34:19,356] DEBUG [extract] [progress] 11M reads processed (0.2% mapped)
[2025-01-10 21:40:08,500] DEBUG [extract] [progress] 12M reads processed (0.2% mapped)
[2025-01-10 21:46:17,800] DEBUG [extract] [progress] 13M reads processed (0.2% mapped)
[2025-01-10 21:51:49,182] DEBUG [extract] [progress] 14M reads processed (0.2% mapped)
[2025-01-10 21:58:01,668] DEBUG [extract] [progress] 15M reads processed (0.2% mapped)
[2025-01-10 22:03:50,793] DEBUG [extract] [progress] 16M reads processed (0.2% mapped)
[2025-01-10 22:10:08,338] DEBUG [extract] [progress] 17M reads processed (0.2% mapped)
[2025-01-10 22:15:51,016] DEBUG [extract] [progress] 18M reads processed (0.2% mapped)
[2025-01-10 22:16:16,355] DEBUG [extract] [progress] 19M reads processed (0.2% mapped) done
[2025-01-10 22:16:16,355] DEBUG [extract] [quant] processed 20,000,000 reads, 39,165 reads pseudoaligned
[2025-01-10 22:16:16,456] DEBUG [extract]
[2025-01-10 22:16:19,313] INFO [extract] Alignment complete. Beginning extraction of reads using bustools...
[2025-01-10 22:16:19,313] DEBUG [extract] Replacing transcript entries with -1 for equivalence classes that map to multiple genes from extracted_seqs/tmp/matrix.ec
[2025-01-10 22:16:19,313] DEBUG [extract] Fetching equivalence classes that map to multiple genes...
[2025-01-10 22:18:02,036] DEBUG [extract] Found the following equivalence classes that map to multiple genes: 2 ,12 ,21 ,32 ,33 ,46 ,47 ,50 ,74 ,79 ,82 ,88 ,94 ,105 ,120 ,122 ,127 ,130 ,133 ,134 ,145 ,148 ,155 ,159 ,163 ,175 ,176 ,178 ,188 ,192 ,196 ,205 ,208 ,219 ,221 ,228 ,234 ,235 ,236 ,246 ,250 ,251 ,257 ,281 ,284 ,290 ,292 ,300 ,301 ,306 ,317 ,318 ,330 ,345 ,355 ,363 ,366 ,370 ,378 ,381 ,391 ,393 ,398 ,408 ,415 ,424 ,461 ,464 ,485 ,488 ,491 ,500 ,501 ,508 ,514 ,515 ,516 ,523 ,538 ,555 ,559 ,566 ,600 ,602 ,618 ,627 ,640 ,658 ,663 ,665 ,670 ,672 ,678 ,689 ,709 ,717 ,718 ,730 ,734 ,741 ,744 ,753 ,755 ,773 ,777 ,778 ,783 ,785 ,798 ,799 ,806 ,808 ,811 ,817 ,839 ,855 ,866 ,879 ,887 ,901 ,915 ,919 ,937 ,945 ,948 ,955 ,959 ,963 ,966 ,969 ,978 ,981 ,982 ,984 ,991 ,992
[2025-01-10 22:18:02,508] DEBUG [extract] matrix.ec file where transcript entries were replaced with -1 for equivalence classes that map to multiple genes saved at extracted_seqs/tmp/matrix_no_mm.ec
[2025-01-10 22:18:03,075] INFO [extract] Extracting reads for following transcripts for gene ID u101227: u101227, u102725, u102856, u102947, u10335, u103456, u103675, u104127, u10482, u105052, u10615, u106215, u10770, u110499, u111077, u112094, u112930, u11566, u11569, u11836, u11900, u12042, u12244, u123282, u126493, u129004, u129529, u131170, u132107, u132728, u13403, u13457, u134583, u134791, u135143, u135699, u13571, u136670, u137293, u137756, u139241, u139747, u140312, u140369, u14054, u140956, u141387, u143726, u144916, u14643, u148403, u149735, u15117, u152755, u152787, u152990, u153247, u15348, u16191, u163680, u164306, u164471, u164858, u164862, u16616, u16650, u16673, u167681, u168111, u168557, u169370, u169890, u170485, u170643, u170684, u172845, u173262, u173267, u173273, u174860, u17506, u175136, u17570, u176023, u176354, u176685, u176766, u177572, u180974, u18133, u181372, u182074, u182409, u182729, u18497, u18522, u18526, u18552, u187751, u18816, u188403, u189847, u19125, u192470, u192506, u192594, u19269, u192756, u192778, u19413, u19414, u195952, u195974, u196140, u196146, u19676, u19683, u197076, u197735, u198145, u198863, u19904, u199489, u200181, u20080, u20401, u20905, u211976, u21252, u21348, u215888, u215931, u21623, u216267, u216535, u217051, u217464, u217631, u21838, u218639, u21883, u219099, u219536, u220867, u226460, u22963, u23001, u230077, u231334, u232374, u235197, u235199, u237195, u237646, u237705, u237861, u237874, u238129, u238297, u240678, u24119, u242498, u242575, u243228, u24451, u249846, u250515, u250842, u25121, u25122, u251542, u252286, u25326, u253263, u25488, u25633, u25834, u25941, u260723, u260863, u26195, u262007, u26234, u266280, u266533, u266575, u26809, u26837, u26912, u269394, u269502, u27033, u27043, u270944, u271418, u271476, u271967, u273827, u274926, u275199, u276589, u278285, u278547, u278745, u278897, u278900, u278917, u285756, u285785, u285886, u286675, u287636, u294801, u295612, u33024, u33728, u33972, u34195, u37215, u37275, u37365, u37907, u38885, u41220, u41917, u42877, u43050, u43075, u45290, u45700, u45990, u45993, u46809, u47157, u47625, u47650, u47921, u50707, u5313, u55146, u5733, u59672, u61411, u64246, u66038, u67495, u67526, u70657, u7131, u72282, u74997, u7838, u79493, u79948, u80423, u80961, u82533, u83982, u84674, u86925, u88655, u89511, u89629, u89858, u92526, u9318, u93829, u93893, u94411, u94453, u94828, u9563, u9592, u9594, u9731, u97337, u97563, u98014, u9917, u99271, u9953
[2025-01-10 22:18:03,075] INFO [extract] Capturing records from BUS file extracted_seqs/tmp/output.bus to extracted_seqs/tmp/output_extracted_u101227.bus with capture list extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt
[2025-01-10 22:18:03,075] DEBUG [extract] bustools capture -o extracted_seqs/tmp/output_extracted_u101227.bus -c extracted_seqs/tmp/pull_out_reads_transcript_ids_temp.txt -e extracted_seqs/tmp/matrix_no_mm.ec -t extracted_seqs/tmp/transcripts.txt --transcripts extracted_seqs/tmp/output.bus
[2025-01-10 22:18:03,481] DEBUG [extract] Parsing transcripts .. done
[2025-01-10 22:18:03,481] DEBUG [extract] Parsing ECs .. done
[2025-01-10 22:18:03,481] DEBUG [extract] Parsing capture list .. done
[2025-01-10 22:18:04,682] DEBUG [extract] Read in 39165 BUS records, wrote 741 BUS records
[2025-01-10 22:18:05,797] DEBUG [extract] extracted_seqs/tmp/output_extracted_u101227.bus passed validation
[2025-01-10 22:18:05,797] INFO [extract] Sorting BUS file extracted_seqs/tmp/output_extracted_u101227.bus to extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:18:05,798] DEBUG [extract] bustools sort -o extracted_seqs/tmp/output_extracted_u101227_sorted.bus -T tmp -t 8 -m 2G --flags extracted_seqs/tmp/output_extracted_u101227.bus
[2025-01-10 22:18:05,899] DEBUG [extract] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2
[2025-01-10 22:18:08,407] DEBUG [extract] all fits in buffer
[2025-01-10 22:18:09,809] DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:18:09,809] DEBUG [extract] reading time 5.5e-05s
[2025-01-10 22:18:09,809] DEBUG [extract] sorting time 7.1e-05s
[2025-01-10 22:18:09,809] DEBUG [extract] writing time 0.000146s
[2025-01-10 22:18:09,810] INFO [extract] Extracting BUS file extracted_seqs/tmp/output_extracted_u101227_sorted.bus to extracted_seqs/u101227
[2025-01-10 22:18:09,810] DEBUG [extract] bustools extract -o extracted_seqs/u101227 -f /content/SRR11321528_1_short.fastq -N 1 extracted_seqs/tmp/output_extracted_u101227_sorted.bus
[2025-01-10 22:19:34,572] DEBUG [extract] Read in 741 BUS records
[2025-01-10 22:19:34,695] DEBUG [main] Removing `extracted_seqs/tmp` directory
CPU times: user 30.3 s, sys: 4.07 s, total: 34.4 s
Wall time: 1h 57min 29s
[ ]:
# Grab the reads extracted by kb extract for each virus ID
def get_all_sequences(fastq_file):
"""Extract all sequences from a gzipped FASTQ file."""
sequences = []
with gzip.open(fastq_file, "rt") as f:
for record in SeqIO.parse(f, "fastq"):
sequences.append(str(record.seq))
return sequences
extracted_seqs_dict = {}
for vir_id in vir_ids_to_extract:
# Grab the reads from the fastq file returned by kb extract
seqs = get_all_sequences(f"{outfolder_extract}/{vir_id}/1.fastq.gz")
# Save the extracted reads for each virus ID in a dictionary
extracted_seqs_dict[vir_id] = seqs
BLAST extracted reads
BLAST the first read extracted for virus ID u101227 using `gget blast <https://pachterlab.github.io/gget/en/blast.html>`__ to see which known reference sequences it aligns to:
[ ]:
# BLAST the first read extracted for u101227 using gget
seq_idx = 0
virus_id = "u101227"
raw_read = extracted_seqs_dict[virus_id][seq_idx]
blast_df = gget.blast(raw_read)
blast_df
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3781DRZ013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
results_df = pd.read_html(str(dsc_table))[0]
| Description | Scientific Name | Common Name | Taxid | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Severe acute respiratory syndrome coronavirus ... | Severe acute respiratory syndrome coronavirus 2 | NaN | 2697049 | 84.2 | 84.2 | 40% | 4.000000e-12 | 91.67% | 29567 | OY708460.1 |
| 1 | Gossypium raimondii isolate Gr291 retrotranspo... | Gossypium raimondii | Peruvian cotton | 29730 | 71.3 | 71.3 | 45% | 3.000000e-08 | 85.29% | 1708 | KT895861.1 |
| 2 | Severe acute respiratory syndrome coronavirus ... | Severe acute respiratory syndrome coronavirus 2 | NaN | 2697049 | 58.4 | 58.4 | 23% | 3.000000e-04 | 97.06% | 29816 | OY714418.1 |
| 3 | Synthetic construct LB106 DNA, synthetic long-... | synthetic construct | NaN | 32630 | 54.7 | 54.7 | 19% | 3.000000e-03 | 100.00% | 572 | LC597715.1 |
| 4 | Gastrodia elata chromosome GE13 mitochondrion,... | Gastrodia elata | NaN | 91201 | 52.8 | 52.8 | 19% | 1.200000e-02 | 100.00% | 34420 | OP441103.1 |
BLAST a few more of the extracted sequences to see if the BLAST results reproduce:
[ ]:
# Set the seed for reproducibility
np.random.seed(0)
# Randomly select indices for three of the extracted sequences
seq_idxs = np.random.randint(0, 200, size=3)
seq_idxs
array([172, 47, 117])
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[0]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E24YMB013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
results_df = pd.read_html(str(dsc_table))[0]
| Description | Scientific Name | Common Name | Taxid | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Severe acute respiratory syndrome coronavirus ... | Severe acute respiratory syndrome coronavirus 2 | NaN | 2697049 | 58.4 | 58.4 | 23% | 0.0003 | 97.06% | 29829 | OY695180.1 |
| 1 | Gastrodia elata chromosome GE13 mitochondrion,... | Gastrodia elata | NaN | 91201 | 52.8 | 52.8 | 19% | 0.0120 | 100.00% | 34420 | OP441103.1 |
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[1]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E34MM6013. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
results_df = pd.read_html(str(dsc_table))[0]
| Description | Scientific Name | Common Name | Taxid | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Homo sapiens ribosomal protein L39 (RPL39), mRNA | Homo sapiens | human | 9606 | 141 | 141 | 50% | 3.000000e-29 | 100.00% | 390 | NM_001000.4 |
| 1 | PREDICTED: Pan troglodytes ribosomal protein L... | Pan troglodytes | chimpanzee | 9598 | 141 | 141 | 50% | 3.000000e-29 | 100.00% | 438 | XM_063804925.1 |
| 2 | PREDICTED: Pan paniscus ribosomal protein L39 ... | Pan paniscus | pygmy chimpanzee | 9597 | 141 | 141 | 50% | 3.000000e-29 | 100.00% | 500 | XM_014343539.5 |
| 3 | Homo sapiens ribosomal protein L39, mRNA (cDNA... | Homo sapiens | human | 9606 | 141 | 141 | 50% | 3.000000e-29 | 100.00% | 419 | BC070205.1 |
| 4 | PREDICTED: Gorilla gorilla gorilla ribosomal p... | Gorilla gorilla gorilla | western lowland gorilla | 9595 | 141 | 141 | 50% | 3.000000e-29 | 100.00% | 484 | XM_019019406.3 |
[ ]:
gget.blast(extracted_seqs_dict[virus_id][seq_idxs[2]], limit=5)
INFO:gget.utils:Sequence recognized as nucleotide sequence.
INFO:gget.utils:BLAST will use program 'blastn' with database 'nt'.
INFO:gget.utils:BLAST initiated with search ID S3E44SY2016. Estimated time to completion: 30 seconds.
INFO:gget.utils:Retrieving results...
/usr/local/lib/python3.10/dist-packages/gget/gget_blast.py:327: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
results_df = pd.read_html(str(dsc_table))[0]
| Description | Scientific Name | Common Name | Taxid | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Severe acute respiratory syndrome coronavirus ... | Severe acute respiratory syndrome coronavirus 2 | NaN | 2697049 | 78.7 | 78.7 | 40% | 2.000000e-10 | 90.00% | 29567 | OY708460.1 |
| 1 | Gossypium raimondii isolate Gr291 retrotranspo... | Gossypium raimondii | Peruvian cotton | 29730 | 71.3 | 71.3 | 45% | 3.000000e-08 | 85.29% | 1708 | KT895861.1 |
| 2 | Severe acute respiratory syndrome coronavirus ... | Severe acute respiratory syndrome coronavirus 2 | NaN | 2697049 | 63.9 | 63.9 | 25% | 6.000000e-06 | 97.30% | 29816 | OY714418.1 |
| 3 | Synthetic construct LB106 DNA, synthetic long-... | synthetic construct | NaN | 32630 | 54.7 | 54.7 | 19% | 3.000000e-03 | 100.00% | 572 | LC597715.1 |
| 4 | Gastrodia elata chromosome GE13 mitochondrion,... | Gastrodia elata | NaN | 91201 | 52.8 | 52.8 | 19% | 1.200000e-02 | 100.00% | 34420 | OP441103.1 |
Align data to virus reference genome
Some of the reads that aligned to u101227 BLASTed to SARS-CoV2 sequences. Given that u101227 was observed uniformly in all samples, which included two samples from patient PBMCs and two from a cell line, it is possible that these viral sequences were introduced by sample contamination. It is also possible that the BLAST hits reflect adapter sequence or contamination within the SARS-CoV-2 entries in the BLAST database itself, rather than true biological signal in the samples. The predicted taxonomy for u101227 in PalmDB is Picornaviridae, which includes disease-causing viruses like Enteroviruses. If this virus does not have a reference genome available, the BLAST results will not be very reliable.
We can align the dataset to one of the SARS-CoV2 reference genomes by rerunnning the kb ref and kb count commands from above, but providing the virus reference genome and omitting the --aa flag since this is now a nucleotide reference (instead of a protein reference):
[ ]:
# Download SARS-CoV2 reference genome from NCBI
# Finding the correct reference genome for the virus present in this dataset might take several tries and I would recommend alingning to all available reference genomes for a given virus
# We will perform an alignment to a single SARS-CoV2 reference genome here as an exampe
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz
--2025-01-10 22:21:53-- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.31, 130.14.250.10, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9591 (9.4K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’
GCF_009858895.2_ASM 100%[===================>] 9.37K --.-KB/s in 0s
2025-01-10 22:21:53 (109 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.fna.gz’ saved [9591/9591]
--2025-01-10 22:21:53-- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gtf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.13, 130.14.250.7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1218 (1.2K) [application/x-gzip]
Saving to: ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’
GCF_009858895.2_ASM 100%[===================>] 1.19K --.-KB/s in 0s
2025-01-10 22:21:53 (75.3 MB/s) - ‘GCF_009858895.2_ASM985889v3_genomic.gtf.gz’ saved [1218/1218]
[ ]:
# We will use a slightly lower k for this alignment since the viral genome is smaller and more prone to mutations
# Set k=31 here for more specificity (at a trade-off with sensitivity)
k = 21
[ ]:
# Generate a SARS-CoV2 reference index
sars_cov2_genome = "GCF_009858895.2_ASM985889v3_genomic.fna.gz"
sars_cov2_gtf = "GCF_009858895.2_ASM985889v3_genomic.gtf.gz"
sars_reference_index = "sars_cov2_ref.idx"
sars_t2g = "sars_cov2_t2g.txt"
sars_fasta = "sars_cov2_transcripts.fa"
!kb ref \
--kallisto $kallisto \
--verbose \
-t $threads \
-k $k \
--d-list $host_genome_transciptome \
-i $sars_reference_index \
-g $sars_t2g \
-f1 $sars_fasta \
$sars_cov2_genome $sars_cov2_gtf
# Align data to the SARS-CoV2 reference index generated above
sars_out_folder = "kb_output_sars_cov2"
!kb count \
--kallisto $kallisto \
--verbose \
-t $threads \
-k $k \
-x 10xv2 \
--parity single \
-i $sars_reference_index \
-g $sars_t2g \
--h5ad \
-o $sars_out_folder \
--batch-barcodes \
batch.txt
[2025-01-10 22:22:06,079] DEBUG [main] Printing verbose output
[2025-01-10 22:22:08,288] DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:22:08,289] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:22:08,289] DEBUG [main] Creating `tmp` directory
[2025-01-10 22:22:08,289] DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', f1='sars_cov2_transcripts.fa', include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=21, t=2, d_list='homo_sapiens.cdna_dna.fa.gz', d_list_overhang=1, aa=False, workflow='standard', distinguish=False, make_unique=False, overwrite=False, kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, fasta='GCF_009858895.2_ASM985889v3_genomic.fna.gz', gtf='GCF_009858895.2_ASM985889v3_genomic.gtf.gz', feature=None, no_mismatches=False, ec_max_size=None, flank=None)
[2025-01-10 22:22:08,289] INFO [ref] Preparing GCF_009858895.2_ASM985889v3_genomic.fna.gz, GCF_009858895.2_ASM985889v3_genomic.gtf.gz
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp01` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp01`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp02` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp02`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp03` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp03`.
[2025-01-10 22:22:08,292] WARNING [main] Gene `GU280_gp04` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp04`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp05` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp05`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp06` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp06`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp07` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp07`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp08` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp08`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp09` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp09`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp10` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp10`.
[2025-01-10 22:22:08,293] WARNING [main] Gene `GU280_gp11` has no transcripts. The entire gene will be marked as a transcript and an exon with ID `GU280_gp11`.
[2025-01-10 22:22:08,293] INFO [ref] Splitting genome GCF_009858895.2_ASM985889v3_genomic.fna.gz into cDNA at /content/tmp/tmppr1s84kg
[2025-01-10 22:22:08,296] INFO [ref] Concatenating 1 cDNAs to sars_cov2_transcripts.fa
[2025-01-10 22:22:08,297] INFO [ref] Creating transcript-to-gene mapping at sars_cov2_t2g.txt
[2025-01-10 22:22:08,298] WARNING [ref] Using provided k-mer length 21 instead of optimal length 31
[2025-01-10 22:22:08,298] INFO [ref] Indexing sars_cov2_transcripts.fa to sars_cov2_ref.idx
[2025-01-10 22:22:08,298] DEBUG [ref] kallisto index -i sars_cov2_ref.idx -k 21 -t 2 -d homo_sapiens.cdna_dna.fa.gz sars_cov2_transcripts.fa
[2025-01-10 22:22:08,399] DEBUG [ref]
[2025-01-10 22:22:08,400] DEBUG [ref] [build] loading fasta file sars_cov2_transcripts.fa
[2025-01-10 22:22:08,400] DEBUG [ref] [build] k-mer length: 21
[2025-01-10 22:22:08,400] DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400] DEBUG [ref] KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
[2025-01-10 22:22:08,400] DEBUG [ref] KmerStream::KmerStream(): Finished
[2025-01-10 22:22:08,901] DEBUG [ref] CompactedDBG::build(): Estimated number of k-mers occurring at least once: 29244
[2025-01-10 22:22:08,902] DEBUG [ref] CompactedDBG::build(): Estimated number of minimizer occurring at least once: 7101
[2025-01-10 22:22:08,902] DEBUG [ref] CompactedDBG::filter(): Processed 29044 k-mers in 11 reads
[2025-01-10 22:22:08,902] DEBUG [ref] CompactedDBG::filter(): Found 29020 unique k-mers
[2025-01-10 22:22:08,902] DEBUG [ref] CompactedDBG::filter(): Number of blocks in Bloom filter is 201
[2025-01-10 22:22:08,902] DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (1/2)
[2025-01-10 22:22:09,003] DEBUG [ref] CompactedDBG::construct(): Extract approximate unitigs (2/2)
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Closed all input files
[2025-01-10 22:22:09,004] DEBUG [ref]
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (1/2)
[2025-01-10 22:22:09,004] DEBUG [ref]
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Splitting unitigs (2/2)
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Before split: 21 unitigs
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): After split (1/1): 21 unitigs
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Unitigs split: 0
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Unitigs deleted: 0
[2025-01-10 22:22:09,004] DEBUG [ref]
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Joining unitigs
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): After join: 12 unitigs
[2025-01-10 22:22:09,004] DEBUG [ref] CompactedDBG::construct(): Joined 9 unitigs
[2025-01-10 22:22:09,004] DEBUG [ref] [build] extracting distinguishing flanking k-mers from "homo_sapiens.cdna_dna.fa.gz"
[2025-01-10 22:52:25,513] DEBUG [ref] [build] identified 192 distinguishing flanking k-mers
[2025-01-10 22:52:25,514] DEBUG [ref] [build] building MPHF
[2025-01-10 22:52:25,614] DEBUG [ref] [build] creating equivalence classes ...
[2025-01-10 22:52:25,614] DEBUG [ref] [build] target de Bruijn graph has k-mer length 21 and minimizer length 13
[2025-01-10 22:52:25,614] DEBUG [ref] [build] target de Bruijn graph has 385 contigs and contains 29232 k-mers
[2025-01-10 22:52:25,614] DEBUG [ref]
[2025-01-10 22:52:26,716] DEBUG [main] Removing `tmp` directory
[2025-01-10 22:52:53,155] DEBUG [main] Printing verbose output
[2025-01-10 22:52:55,364] DEBUG [main] kallisto binary located at /content/kallisto/build/src/kallisto
[2025-01-10 22:52:55,365] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2025-01-10 22:52:55,369] DEBUG [main] Creating `kb_output_sars_cov2/tmp` directory
[2025-01-10 22:52:55,369] DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='sars_cov2_ref.idx', g='sars_cov2_t2g.txt', x='10xv2', o='kb_output_sars_cov2', num=False, w=None, r=None, t=2, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=True, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='kallisto/build/src/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', opt_off=False, k=21, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity='single', fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['batch.txt'])
[2025-01-10 22:53:00,312] INFO [count] Using index sars_cov2_ref.idx to generate BUS file to kb_output_sars_cov2 from
[2025-01-10 22:53:00,312] INFO [count] /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,313] DEBUG [count] kallisto bus -i sars_cov2_ref.idx -o kb_output_sars_cov2 -x 10xv2 -t 2 --batch-barcodes --batch /content/kb_output_sars_cov2/tmp/tmp4p630fag
[2025-01-10 22:53:00,414] DEBUG [count]
[2025-01-10 22:53:00,414] DEBUG [count] [bus] will try running read files supplied in batch file
[2025-01-10 22:53:00,414] DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2025-01-10 22:53:00,414] DEBUG [count] [index] k-mer length: 21
[2025-01-10 22:53:00,414] DEBUG [count] [index] number of targets: 11
[2025-01-10 22:53:00,414] DEBUG [count] [index] number of k-mers: 29,232
[2025-01-10 22:53:00,414] DEBUG [count] [index] number of distinguishing flanking k-mers: 192
[2025-01-10 22:53:00,414] DEBUG [count] [quant] running in single-end mode
[2025-01-10 22:53:00,414] DEBUG [count] [quant] will process file 1: /content/SRR11321526_1_short.fastq
[2025-01-10 22:53:00,414] DEBUG [count] [quant] will process file 2: /content/SRR11321526_2_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 1: /content/SRR11321528_1_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 2: /content/SRR11321528_2_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 1: /content/SRR11321027_1_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 2: /content/SRR11321027_2_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 1: /content/SRR11321026_1_short.fastq
[2025-01-10 22:53:00,415] DEBUG [count] [quant] will process file 2: /content/SRR11321026_2_short.fastq
[2025-01-10 22:54:13,017] DEBUG [count] [quant] finding pseudoalignments for all files ...
[2025-01-10 22:55:26,842] DEBUG [count] [progress] 1M reads processed (0.0% mapped)
[2025-01-10 22:56:38,616] DEBUG [count] [progress] 2M reads processed (0.0% mapped)
[2025-01-10 22:57:51,633] DEBUG [count] [progress] 3M reads processed (0.0% mapped)
[2025-01-10 22:59:04,247] DEBUG [count] [progress] 4M reads processed (0.0% mapped)
[2025-01-10 23:00:18,133] DEBUG [count] [progress] 5M reads processed (0.0% mapped)
[2025-01-10 23:01:29,261] DEBUG [count] [progress] 6M reads processed (0.0% mapped)
[2025-01-10 23:02:44,006] DEBUG [count] [progress] 7M reads processed (0.0% mapped)
[2025-01-10 23:03:58,135] DEBUG [count] [progress] 8M reads processed (0.0% mapped)
[2025-01-10 23:05:12,056] DEBUG [count] [progress] 9M reads processed (0.0% mapped)
[2025-01-10 23:06:27,251] DEBUG [count] [progress] 10M reads processed (0.0% mapped)
[2025-01-10 23:07:41,214] DEBUG [count] [progress] 11M reads processed (0.0% mapped)
[2025-01-10 23:08:55,498] DEBUG [count] [progress] 12M reads processed (0.0% mapped)
[2025-01-10 23:10:09,482] DEBUG [count] [progress] 13M reads processed (0.0% mapped)
[2025-01-10 23:11:23,321] DEBUG [count] [progress] 14M reads processed (0.0% mapped)
[2025-01-10 23:12:38,384] DEBUG [count] [progress] 15M reads processed (0.0% mapped)
[2025-01-10 23:13:53,246] DEBUG [count] [progress] 16M reads processed (0.0% mapped)
[2025-01-10 23:15:06,130] DEBUG [count] [progress] 17M reads processed (0.0% mapped)
[2025-01-10 23:16:20,647] DEBUG [count] [progress] 18M reads processed (0.0% mapped)
[2025-01-10 23:17:34,493] DEBUG [count] [progress] 19M reads processed (0.0% mapped)
[2025-01-10 23:18:41,643] DEBUG [count] [progress] 20M reads processed (0.0% mapped)
[2025-01-10 23:19:57,201] DEBUG [count] [progress] 21M reads processed (0.0% mapped)
[2025-01-10 23:21:12,646] DEBUG [count] [progress] 22M reads processed (0.0% mapped)
[2025-01-10 23:22:26,495] DEBUG [count] [progress] 23M reads processed (0.0% mapped)
[2025-01-10 23:23:41,101] DEBUG [count] [progress] 24M reads processed (0.0% mapped)
[2025-01-10 23:25:00,423] DEBUG [count] [progress] 25M reads processed (0.0% mapped)
[2025-01-10 23:26:15,749] DEBUG [count] [progress] 26M reads processed (0.0% mapped)
[2025-01-10 23:27:30,788] DEBUG [count] [progress] 27M reads processed (0.0% mapped)
[2025-01-10 23:28:44,837] DEBUG [count] [progress] 28M reads processed (0.0% mapped)
[2025-01-10 23:30:00,421] DEBUG [count] [progress] 29M reads processed (0.0% mapped)
[2025-01-10 23:31:15,614] DEBUG [count] [progress] 30M reads processed (0.0% mapped)
[2025-01-10 23:32:29,335] DEBUG [count] [progress] 31M reads processed (0.0% mapped)
[2025-01-10 23:33:42,670] DEBUG [count] [progress] 32M reads processed (0.0% mapped)
[2025-01-10 23:34:55,149] DEBUG [count] [progress] 33M reads processed (0.0% mapped)
[2025-01-10 23:36:08,691] DEBUG [count] [progress] 34M reads processed (0.0% mapped)
[2025-01-10 23:37:21,768] DEBUG [count] [progress] 35M reads processed (0.0% mapped)
[2025-01-10 23:38:34,436] DEBUG [count] [progress] 36M reads processed (0.0% mapped)
[2025-01-10 23:39:46,948] DEBUG [count] [progress] 37M reads processed (0.0% mapped)
[2025-01-10 23:40:59,180] DEBUG [count] [progress] 38M reads processed (0.0% mapped)
[2025-01-10 23:42:13,862] DEBUG [count] [progress] 39M reads processed (0.0% mapped)
[2025-01-10 23:43:28,554] DEBUG [count] [progress] 40M reads processed (0.0% mapped)
[2025-01-10 23:44:42,803] DEBUG [count] [progress] 41M reads processed (0.0% mapped)
[2025-01-10 23:46:00,436] DEBUG [count] [progress] 42M reads processed (0.0% mapped)
[2025-01-10 23:47:17,115] DEBUG [count] [progress] 43M reads processed (0.0% mapped)
[2025-01-10 23:48:33,170] DEBUG [count] [progress] 44M reads processed (0.0% mapped)
[2025-01-10 23:49:51,295] DEBUG [count] [progress] 45M reads processed (0.0% mapped)
[2025-01-10 23:51:09,533] DEBUG [count] [progress] 46M reads processed (0.0% mapped)
[2025-01-10 23:52:27,425] DEBUG [count] [progress] 47M reads processed (0.0% mapped)
[2025-01-10 23:53:40,746] DEBUG [count] [progress] 48M reads processed (0.0% mapped)
[2025-01-10 23:53:40,846] DEBUG [count] [progress] 50M reads processed (0.0% mapped) done
[2025-01-10 23:53:40,847] DEBUG [count] [quant] processed 50,000,000 reads, 14 reads pseudoaligned
[2025-01-10 23:53:41,948] DEBUG [count]
[2025-01-10 23:53:41,964] INFO [count] Sorting BUS file kb_output_sars_cov2/output.bus to kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:41,964] DEBUG [count] bustools sort -o kb_output_sars_cov2/tmp/output.s.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/output.bus
[2025-01-10 23:53:44,131] DEBUG [count] all fits in buffer
[2025-01-10 23:53:45,432] DEBUG [count] Read in 14 BUS records
[2025-01-10 23:53:45,432] DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:45,432] DEBUG [count] sorting time 2e-06s
[2025-01-10 23:53:45,432] DEBUG [count] writing time 5.8e-05s
[2025-01-10 23:53:45,433] INFO [count] On-list not provided
[2025-01-10 23:53:45,433] INFO [count] Copying pre-packaged 10XV2 on-list to kb_output_sars_cov2
[2025-01-10 23:53:45,607] INFO [count] Inspecting BUS file kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:45,607] DEBUG [count] bustools inspect -o kb_output_sars_cov2/inspect.json -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:47,311] INFO [count] Correcting BUS records in kb_output_sars_cov2/tmp/output.s.bus to kb_output_sars_cov2/tmp/output.s.c.bus with on-list kb_output_sars_cov2/10x_version2_whitelist.txt
[2025-01-10 23:53:47,312] DEBUG [count] bustools correct -o kb_output_sars_cov2/tmp/output.s.c.bus -w kb_output_sars_cov2/10x_version2_whitelist.txt kb_output_sars_cov2/tmp/output.s.bus
[2025-01-10 23:53:48,415] DEBUG [count] Found 737280 barcodes in the on-list
[2025-01-10 23:53:48,816] DEBUG [count] Processed 14 BUS records
[2025-01-10 23:53:48,816] DEBUG [count] In on-list = 13
[2025-01-10 23:53:48,816] DEBUG [count] Corrected = 0
[2025-01-10 23:53:48,816] DEBUG [count] Uncorrected = 1
[2025-01-10 23:53:49,918] INFO [count] Sorting BUS file kb_output_sars_cov2/tmp/output.s.c.bus to kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:49,918] DEBUG [count] bustools sort -o kb_output_sars_cov2/output.unfiltered.bus -T kb_output_sars_cov2/tmp -t 2 -m 2G kb_output_sars_cov2/tmp/output.s.c.bus
[2025-01-10 23:53:52,025] DEBUG [count] all fits in buffer
[2025-01-10 23:53:53,326] DEBUG [count] Read in 13 BUS records
[2025-01-10 23:53:53,326] DEBUG [count] reading time 5e-06s
[2025-01-10 23:53:53,326] DEBUG [count] sorting time 1e-06s
[2025-01-10 23:53:53,326] DEBUG [count] writing time 5.4e-05s
[2025-01-10 23:53:53,330] INFO [count] Generating count matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes from BUS file kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:53,330] DEBUG [count] bustools count -o kb_output_sars_cov2/counts_unfiltered/cells_x_genes -g sars_cov2_t2g.txt -e kb_output_sars_cov2/matrix.ec -t kb_output_sars_cov2/transcripts.txt --genecounts --umi-gene kb_output_sars_cov2/output.unfiltered.bus
[2025-01-10 23:53:54,995] DEBUG [count] kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx passed validation
[2025-01-10 23:53:54,996] INFO [count] Writing gene names to file kb_output_sars_cov2/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-01-10 23:53:54,996] WARNING [count] 11 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-01-10 23:53:54,996] INFO [count] Reading matrix kb_output_sars_cov2/counts_unfiltered/cells_x_genes.mtx
[2025-01-10 23:53:55,302] INFO [count] Writing matrix to h5ad kb_output_sars_cov2/counts_unfiltered/adata.h5ad
[2025-01-10 23:53:55,622] DEBUG [main] Removing `kb_output_sars_cov2/tmp` directory
Though some reads aligned, only 14 out of 50,000,000 reads mapped to this SARS-CoV2 genome, so this was likely not the correct viral reference genome. You could also confirm the presence of virus using the assembly of contigs from the raw sequencing data (for example, using a tool like SPAdes) and the subsequent alignment of the contigs to all viral reference genomes.
For novel viruses, the parallel analysis of viral presence and host gene expression can provide further insights, as was done here (investigate viral tropism) and here (correlate viral presence with altered host gene expression at single-cell resolution).
For inspiration regarding further analysis, each figure from our manuscript can be reproduced using these notebooks.