lr-kallisto single-nuclei data analysis tutorial
Processing ONT single-nuclei data, starting with seqspec and using splitcode for extraction
Open in Google Colab
[ ]:
!git clone https://github.com/pachterlab/splitcode
!cd splitcode; mkdir build; cd build; cmake ..; make; make install
Cloning into 'splitcode'...
remote: Enumerating objects: 2833, done.
remote: Counting objects: 100% (230/230), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 2833 (delta 132), reused 197 (delta 111), pack-reused 2603 (from 1)
Receiving objects: 100% (2833/2833), 15.76 MiB | 20.61 MiB/s, done.
Resolving deltas: 100% (1557/1557), done.
CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
shared build
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
using htslib
-- Configuring done (2.8s)
-- Generating done (0.0s)
-- Build files have been written to: /content/splitcode/build
[ 7%] Creating directories for 'htslib'
[ 15%] No download step for 'htslib'
[ 23%] No update step for 'htslib'
[ 30%] No patch step for 'htslib'
[ 38%] No configure step for 'htslib'
[ 46%] Performing build step for 'htslib'
bgzf.c: In function ‘bgzf_write’:
bgzf.c:733:5: warning: this ‘if’ clause does not guard... []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmisleading-indentation-Wmisleading-indentation]8;;]
733 | if ( !fp->is_compressed )
| ^~
bgzf.c:736:9: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
736 | const uint8_t *input = (const uint8_t*)data;
| ^~~~~
faidx.c: In function ‘fai_load’:
faidx.c:265:5: warning: this ‘else’ clause does not guard... []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmisleading-indentation-Wmisleading-indentation]8;;]
265 | else
| ^~~~
faidx.c:268:9: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘else’
268 | if (fp == 0) {
| ^~
hts.c: In function ‘hts_idx_init’:
hts.c:522:63: warning: overflow in conversion from ‘uint32_t’ {aka ‘unsigned int’} to ‘int’ changes value from ‘idx->z.last_bin = 4294967295’ to ‘-1’ []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Woverflow-Woverflow]8;;]
522 | idx->z.save_bin = idx->z.save_tid = idx->z.last_tid = idx->z.last_bin = 0xffffffffu;
| ^~~
In file included from /usr/include/string.h:535,
from sam.c:3:
In function ‘strncpy’,
inlined from ‘bam_hdr_write’ at sam.c:130:2:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:10: warning: ‘__builtin_strncpy’ output truncated before terminating nul copying 4 bytes from a string of the same length []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-truncation-Wstringop-truncation]8;;]
95 | return __builtin___strncpy_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | __glibc_objsize (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~
tbx.c: In function ‘tbx_seqnames’:
tbx.c:285:5: warning: this ‘for’ clause does not guard... []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmisleading-indentation-Wmisleading-indentation]8;;]
285 | for (tid=0; tid<m; tid++)
| ^~~
tbx.c:287:9: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘for’
287 | *n = m;
| ^
vcf.c: In function ‘vcf_format’:
vcf.c:1909:25: warning: this ‘if’ clause does not guard... []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmisleading-indentation-Wmisleading-indentation]8;;]
1909 | if ( !first ) kputc(';', s); first = 0;
| ^~
vcf.c:1909:54: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
1909 | if ( !first ) kputc(';', s); first = 0;
| ^~~~~
vcf.c:1948:21: warning: this ‘if’ clause does not guard... []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmisleading-indentation-Wmisleading-indentation]8;;]
1948 | if (!first) kputc(':', s); first = 0;
| ^~
vcf.c:1948:48: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
1948 | if (!first) kputc(':', s); first = 0;
| ^~~~~
In file included from htslib/vcf.h:12,
from htslib/vcfutils.h:8,
from vcfutils.c:1:
vcfutils.c: In function ‘bcf_remove_alleles’:
vcfutils.c:291:25: warning: suggest parentheses around assignment used as truth value []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wparentheses-Wparentheses]8;;]
291 | assert( n=nG_ori );
| ^
cram/cram_io.c: In function ‘cram_populate_ref’:
cram/cram_io.c:1529:34: warning: ‘.tmp_’ directive writing 5 bytes into a region of size between 1 and 4096 []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wformat-overflow=-Wformat-overflow=]8;;]
1529 | sprintf(path_tmp, "%s.tmp_%d", path, /*getpid(),*/ i);
| ^~~~~
cram/cram_io.c:1529:31: note: directive argument in the range [0, 2147483647]
1529 | sprintf(path_tmp, "%s.tmp_%d", path, /*getpid(),*/ i);
| ^~~~~~~~~~~
In file included from /usr/include/stdio.h:894,
from cram/cram_io.c:51:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:38:10: note: ‘__builtin___sprintf_chk’ output between 7 and 4111 bytes into a destination of size 4096
38 | return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
39 | __glibc_objsize (__s), __fmt,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
40 | __va_arg_pack ());
| ~~~~~~~~~~~~~~~~~
In file included from /usr/include/string.h:535,
from cram/cram_io.c:55:
In function ‘strncpy’,
inlined from ‘full_path’ at cram/cram_io.c:2917:6,
inlined from ‘cram_write_SAM_hdr’ at cram/cram_io.c:2987:3:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:10: warning: ‘__builtin_strncpy’ specified bound 4096 equals destination size []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-truncation-Wstringop-truncation]8;;]
95 | return __builtin___strncpy_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | __glibc_objsize (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘strncpy’,
inlined from ‘full_path’ at cram/cram_io.c:2909:2,
inlined from ‘cram_write_SAM_hdr’ at cram/cram_io.c:2987:3:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:10: warning: ‘__builtin_strncpy’ specified bound 4096 equals destination size []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-truncation-Wstringop-truncation]8;;]
95 | return __builtin___strncpy_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | __glibc_objsize (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘strncpy’,
inlined from ‘cram_dopen’ at cram/cram_io.c:3291:2:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:10: warning: ‘__builtin_strncpy’ specified bound 20 equals destination size []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-truncation-Wstringop-truncation]8;;]
95 | return __builtin___strncpy_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | __glibc_objsize (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from cram/sam_header.c:36:
cram/sam_header.c: In function ‘sam_hdr_add_lines’:
cram/sam_header.c:328:20: warning: suggest parentheses around assignment used as truth value []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wparentheses-Wparentheses]8;;]
328 | assert(p->next = t);
| ^
cram/sam_header.c: In function ‘sam_hdr_vadd’:
cram/sam_header.c:449:16: warning: suggest parentheses around assignment used as truth value []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wparentheses-Wparentheses]8;;]
449 | assert(p->next = t);
| ^
In file included from /usr/include/string.h:535,
from cram/string_alloc.c:39:
In function ‘strncpy’,
inlined from ‘string_ndup’ at cram/string_alloc.c:149:5,
inlined from ‘string_dup’ at cram/string_alloc.c:141:12:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:10: warning: ‘__builtin_strncpy’ output truncated before terminating nul copying as many bytes from a string as its length []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-truncation-Wstringop-truncation]8;;]
95 | return __builtin___strncpy_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | __glibc_objsize (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~
cram/string_alloc.c: In function ‘string_dup’:
cram/string_alloc.c:141:12: note: length computed here
141 | return string_ndup(a_str, instr, strlen(instr));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cram/vlen.c: In function ‘vflen’:
cram/vlen.c:121:9: warning: variable ‘i’ set but not used []8;;https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wunused-but-set-variable-Wunused-but-set-variable]8;;]
121 | int i;
| ^
[ 53%] No install step for 'htslib'
[ 61%] Completed 'htslib'
[ 61%] Built target htslib
[ 69%] Building CXX object src/CMakeFiles/splitcode_core.dir/ProcessReads.cpp.o
[ 76%] Building CXX object src/CMakeFiles/splitcode_core.dir/main.cpp.o
[ 84%] Linking CXX static library libsplitcode_core.a
[ 84%] Built target splitcode_core
[ 92%] Building CXX object src/CMakeFiles/splitcode.dir/main.cpp.o
[100%] Linking CXX executable splitcode
[100%] Built target splitcode
[ 61%] Built target htslib
[ 84%] Built target splitcode_core
[100%] Built target splitcode
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/bin/splitcode
[ ]:
#show splitcode version
!splitcode/build/src/splitcode --version
splitcode, version 0.30.0
[ ]:
#Get example data and seqspec with supporting on-list files and configs for barcode correction and merging that is SPLiT-Seq specific
!wget https://zenodo.org/records/13947934/files/lr-kallisto_example.tar.gz
--2024-10-25 18:21:17-- https://zenodo.org/records/13947934/files/lr-kallisto_example.tar.gz
Resolving zenodo.org (zenodo.org)... 188.184.98.238, 188.185.79.172, 188.184.103.159, ...
Connecting to zenodo.org (zenodo.org)|188.184.98.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183118684 (175M) [application/octet-stream]
Saving to: ‘lr-kallisto_example.tar.gz’
lr-kallisto_example 100%[===================>] 174.63M 31.1MB/s in 6.2s
2024-10-25 18:21:24 (28.0 MB/s) - ‘lr-kallisto_example.tar.gz’ saved [183118684/183118684]
[ ]:
#uncompress folder
!tar -xvf lr-kallisto_example.tar.gz
!cp lr-kallisto_example/r*txt .
lr-kallisto_example/
lr-kallisto_example/spec.yaml
lr-kallisto_example/13G-gc_5W.fastq.gz
lr-kallisto_example/config.mergeRT
lr-kallisto_example/r2_3.txt
lr-kallisto_example/config_RT_randO.txt
lr-kallisto_example/config-correct.txt
lr-kallisto_example/config_RT_polyT.txt
lr-kallisto_example/r1_R.txt
lr-kallisto_example/r1_T.txt
[ ]:
#Install seqspec
!pip install git+https://github.com/pachterlab/seqspec
!seqspec index -m rna -s file -t splitcode lr-kallisto_example/spec.yaml > ONT.config
Collecting git+https://github.com/pachterlab/seqspec
Cloning https://github.com/pachterlab/seqspec to /tmp/pip-req-build-74o45k5g
Running command git clone --filter=blob:none --quiet https://github.com/pachterlab/seqspec /tmp/pip-req-build-74o45k5g
Resolved https://github.com/pachterlab/seqspec to commit 1475ff00eac673af900d9be4fb36f23716058da4
Preparing metadata (setup.py) ... done
Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from seqspec==0.3.1) (6.0.2)
Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from seqspec==0.3.1) (4.23.0)
Collecting newick (from seqspec==0.3.1)
Downloading newick-1.9.0-py2.py3-none-any.whl.metadata (8.4 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from seqspec==0.3.1) (2.32.3)
Collecting biopython (from seqspec==0.3.1)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from seqspec==0.3.1) (24.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from biopython->seqspec==0.3.1) (1.26.4)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema->seqspec==0.3.1) (24.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->seqspec==0.3.1) (2024.10.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema->seqspec==0.3.1) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->seqspec==0.3.1) (0.20.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->seqspec==0.3.1) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->seqspec==0.3.1) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->seqspec==0.3.1) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->seqspec==0.3.1) (2024.8.30)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 76.5 MB/s eta 0:00:00
Downloading newick-1.9.0-py2.py3-none-any.whl (15 kB)
Building wheels for collected packages: seqspec
Building wheel for seqspec (setup.py) ... done
Created wheel for seqspec: filename=seqspec-0.3.1-py3-none-any.whl size=47739 sha256=e34bb2aba1bd74bc68ea9195bd9f9e7166aaafc260a437d37eb16329a48a5bb1
Stored in directory: /tmp/pip-ephem-wheel-cache-wp2iswh_/wheels/a3/4d/76/1e56215ad56c936bb0c24c4be01c5e8f7aeb0af528d415423c
Successfully built seqspec
Installing collected packages: newick, biopython, seqspec
Successfully installed biopython-1.84 newick-1.9.0 seqspec-0.3.1
[ ]:
#Make config less error tolerant, allowing one error instead of 3 errors
!sed -i 's/3:3:3/1:1:1/g' ONT.config
[ ]:
#extract barcodes, cDNAs, and umis
!splitcode/build/src/splitcode -c ONT.config -t 2 lr-kallisto_example/13G-gc_5W.fastq.gz -o igvfb01_13G_scmodified/igvfb01_13G_5W.fastq.gz
* Forcing --gzip because all output file names end in .gz
* Using a list of 8 tags (vector size: 8; map size: 6,286; num elements in map: 6,344)
* will process sample 1: lr-kallisto_example/13G-gc_5W.fastq.gz
* processing the reads ...
done
* processed 268,542 reads
[ ]:
import os
samples = [''] #,'13G']
for sample in samples:
files_f = [sample+'f', sample+'c']
files_r = [sample+'r', sample+'rc']
for f in files_r+files_f:
for fastq in ["barcode", "cDNA", "umi"]:
os.system("gunzip -f "+f+"_"+fastq+".fastq.gz")
c = 0
writelines = ""
fastqc=open(sample+'c_barcode.fastq', 'r')
ofastqc=open(sample+'c_barcode.tmp.fastq', 'w+')
for l in fastqc.readlines():
c+=1
if l.startswith("@") or l.startswith("+"):
writelines+=l
elif l != "\n":
writelines+=l[0:8][::-1]+l[8:16][::-1]+l[16:24][::-1]+"\n"
else:
writelines+=l
if c == 4:
ofastqc.write(writelines)
c = 0
writelines = ""
fastqc.close()
ofastqc.close()
os.system("mv "+sample+'c_barcode.tmp.fastq '+sample+'c_barcode.fastq')
fastqc=open(sample+'r_barcode.fastq', 'r')
ofastqc=open(sample+'r_barcode.tmp.fastq', 'w+')
for l in fastqc.readlines():
c+=1
if l.startswith("@") or l.startswith("+"):
writelines+=l
elif l != "\n":
writelines+=l[::-1]+"\n"
else:
writelines+=l
if c == 4:
ofastqc.write(writelines)
c = 0
writelines = ""
fastqc.close()
ofastqc.close()
os.system("mv "+sample+'r_barcode.tmp.fastq '+sample+'r_barcode.fastq')
c = 0
writelines = ""
fastqc=open(sample+'rc_barcode.fastq', 'r')
ofastqc=open(sample+'rc_barcode.tmp.fastq', 'w+')
for l in fastqc.readlines():
c+=1
if l.startswith("@") or l.startswith("+"):
writelines+=l
elif l != "\n":
writelines+=l[16:24]+l[8:16]+l[0:8]+"\n"
else:
writelines+=l
if c == 4:
ofastqc.write(writelines)
c = 0
writelines = ""
fastqc.close()
ofastqc.close()
os.system("mv "+sample+'rc_barcode.tmp.fastq '+sample+'rc_barcode.fastq')
for fastq in ["barcode", "cDNA", "umi"]:
cmd = "cat "
for f in files_f+files_r:
cmd += f+'_'+fastq+'.fastq '
cmd += '> '+sample+'_'+fastq+'.fastq'
print(cmd)
os.system(cmd)
cat f_barcode.fastq c_barcode.fastq r_barcode.fastq rc_barcode.fastq > _barcode.fastq
cat f_cDNA.fastq c_cDNA.fastq r_cDNA.fastq rc_cDNA.fastq > _cDNA.fastq
cat f_umi.fastq c_umi.fastq r_umi.fastq rc_umi.fastq > _umi.fastq
[ ]:
#correct barcodes
!splitcode/build/src/splitcode -c lr-kallisto_example/config-correct.txt --nFastqs=2 --select=0 --gzip -o 13G_cDNA.fastq.gz _cDNA.fastq _barcode.fastq -t 2;
!splitcode/build/src/splitcode -c lr-kallisto_example/config-correct.txt --nFastqs=2 --select=0 --gzip -o 13G_umi.fastq.gz _umi.fastq _barcode.fastq -t 2;
#merge randO and polyT
!splitcode/build/src/splitcode -c lr-kallisto_example/config.mergeRT -o 13G_barcode.fastq 13G_barcode.fastq.gz -t 2;
* Using a list of 384 tags (vector size: 384; map size: 32,059; num elements in map: 42,792)
* will process sample 1: _cDNA.fastq
_barcode.fastq
* processing the reads ...
done
* processed 1,074,168 reads
* Using a list of 384 tags (vector size: 384; map size: 32,059; num elements in map: 42,792)
* will process sample 1: _umi.fastq
_barcode.fastq
* processing the reads ...
done
* processed 1,074,168 reads
* Using a list of 96 tags (vector size: 96; map size: 2,989; num elements in map: 3,168)
* will process sample 1: 13G_barcode.fastq.gz
* processing the reads ...
done
* processed 23,977 reads
[ ]:
#correct barcodes and extract only polyT
!splitcode/build/src/splitcode -c lr-kallisto_example/config_RT_polyT.txt --nFastqs=2 --select=0 --gzip -o 13G_cDNA_polyT.fastq.gz 13G_cDNA.fastq.gz 13G_barcode.fastq.gz -t 2;
!splitcode/build/src/splitcode -c lr-kallisto_example/config_RT_polyT.txt --nFastqs=2 --select=0 --gzip -o 13G_umi_polyT.fastq.gz 13G_umi.fastq.gz 13G_barcode.fastq.gz -t 2;
* Using a list of 288 tags (vector size: 288; map size: 22,690; num elements in map: 31,560)
* will process sample 1: 13G_cDNA.fastq.gz
13G_barcode.fastq.gz
* processing the reads ...
done
* processed 23,977 reads
* Using a list of 288 tags (vector size: 288; map size: 22,690; num elements in map: 31,560)
* will process sample 1: 13G_umi.fastq.gz
13G_barcode.fastq.gz
* processing the reads ...
done
* processed 23,977 reads
[ ]:
#correct barcodes and extract only randO
!splitcode/build/src/splitcode -c lr-kallisto_example/config_RT_randO.txt --nFastqs=2 --select=0 --gzip -o 13G_cDNA_randO.fastq.gz 13G_cDNA.fastq.gz 13G_barcode.fastq.gz -t 2;
!splitcode/build/src/splitcode -c lr-kallisto_example/config_RT_randO.txt --nFastqs=2 --select=0 --gzip -o 13G_umi_randO.fastq.gz 13G_umi.fastq.gz 13G_barcode.fastq.gz -t 2;
* Using a list of 288 tags (vector size: 288; map size: 22,913; num elements in map: 31,647)
* will process sample 1: 13G_cDNA.fastq.gz
13G_barcode.fastq.gz
* processing the reads ...
done
* processed 23,977 reads
* Using a list of 288 tags (vector size: 288; map size: 22,913; num elements in map: 31,647)
* will process sample 1: 13G_umi.fastq.gz
13G_barcode.fastq.gz
* processing the reads ...
done
* processed 23,977 reads
[ ]:
#install gget and kb
!pip install -q kb_python gget
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 49.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.5/36.5 MB 23.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.1/43.1 MB 10.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.0/129.0 kB 7.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 24.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.9/50.9 MB 9.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 15.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.5/49.5 kB 2.6 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 37.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 4.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.0/22.0 MB 33.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.7/85.7 kB 4.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.6/40.6 kB 1.6 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.6/83.6 kB 5.1 MB/s eta 0:00:00
Building wheel for loompy (setup.py) ... done
Building wheel for session-info (setup.py) ... done
[ ]:
#download reference for mouse with k-mer size 31 or 63
!kb ref -k 31 -d mouse -i mouse_k-31.idx -g mouse.t2g
!kb ref -k 63 -d mouse -i mouse_k-63.idx -g mouse.t2g
#alternatively build your own
'''
!kb ref -k 63\
-i mouse_k-63.idx \
-g t2g.txt \
-f1 fasta.fa \
$(gget ref --ftp -w dna,gtf mouse)
'''
[2024-10-25 18:35:34,749] INFO [download] Downloading files for mouse (standard workflow) from https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/v1/mouse_index_standard.tar.xz to tmp/mouse_index_standard.tar.xz
100% 82.8M/82.8M [00:01<00:00, 81.3MB/s]
[2024-10-25 18:35:35,827] INFO [download] Extracting files from tmp/mouse_index_standard.tar.xz
[2024-10-25 18:36:16,725] INFO [download] Downloading files for mouse (standard workflow) from https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/v1/mouse_index_standard_long.tar.xz to tmp/mouse_index_standard_long.tar.xz
100% 74.8M/74.8M [00:01<00:00, 74.2MB/s]
[2024-10-25 18:36:17,785] INFO [download] Extracting files from tmp/mouse_index_standard_long.tar.xz
'\n!kb ref -k 63 -i mouse_k-63.idx -g t2g.txt -f1 fasta.fa $(gget ref --ftp -w dna,gtf mouse)\n'
[ ]:
#Use kb count to pseudoalign and produce count matrices
!kb count -k 63 --long --threshold 0.8 -i mouse_k-63.idx -g mouse.t2g -o output -x '2,0,24:1,0,10:0,0,0' 13G_cDNA.fastq.gz 13G_umi.fastq.gz 13G_barcode.fastq
!kb count -k 63 --long --threshold 0.8 -i mouse_k-63.idx -g mouse.t2g -o output_randO -x '2,0,24:1,0,10:0,0,0' 13G_cDNA_randO.fastq.gz 13G_umi_randO.fastq.gz 13G_barcode_randO.fastq.gz
!kb count -k 63 --long --threshold 0.8 -i mouse_k-63.idx -g mouse.t2g -o output_polyT -x '2,0,24:1,0,10:0,0,0' 13G_cDNA_polyT.fastq.gz 13G_umi_polyT.fastq.gz 13G_barcode_polyT.fastq.gz
[2024-10-25 18:37:10,815] INFO [count] Using index mouse_k-63.idx to generate BUS file to output from
[2024-10-25 18:37:10,815] INFO [count] 13G_cDNA.fastq.gz
[2024-10-25 18:37:10,815] INFO [count] 13G_umi.fastq.gz
[2024-10-25 18:37:10,815] INFO [count] 13G_barcode.fastq
[2024-10-25 18:39:00,697] INFO [count] Sorting BUS file output/output.bus to output/tmp/output.s.bus
[2024-10-25 18:39:04,925] INFO [count] On-list not provided
[2024-10-25 18:39:04,926] INFO [count] Generating on-list output/whitelist.txt from BUS file output/tmp/output.s.bus
[2024-10-25 18:39:06,029] INFO [count] Inspecting BUS file output/tmp/output.s.bus
[2024-10-25 18:39:07,135] INFO [count] Correcting BUS records in output/tmp/output.s.bus to output/tmp/output.s.c.bus with on-list output/whitelist.txt
[2024-10-25 18:39:09,942] INFO [count] Sorting BUS file output/tmp/output.s.c.bus to output/output.unfiltered.bus
[2024-10-25 18:39:13,963] INFO [count] Generating count matrix output/counts_unfiltered/cells_x_genes from BUS file output/output.unfiltered.bus
[2024-10-25 18:39:15,772] INFO [count] Writing gene names to file output/counts_unfiltered/cells_x_genes.genes.names.txt
[2024-10-25 18:39:16,237] WARNING [count] 1599 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2024-10-25 18:39:37,599] INFO [count] Using index mouse_k-63.idx to generate BUS file to output_randO from
[2024-10-25 18:39:37,600] INFO [count] 13G_cDNA_randO.fastq.gz
[2024-10-25 18:39:37,600] INFO [count] 13G_umi_randO.fastq.gz
[2024-10-25 18:39:37,600] INFO [count] 13G_barcode_randO.fastq.gz
[2024-10-25 18:40:53,664] INFO [count] Sorting BUS file output_randO/output.bus to output_randO/tmp/output.s.bus
[2024-10-25 18:40:58,089] INFO [count] On-list not provided
[2024-10-25 18:40:58,090] INFO [count] Generating on-list output_randO/whitelist.txt from BUS file output_randO/tmp/output.s.bus
[2024-10-25 18:40:59,194] INFO [count] Inspecting BUS file output_randO/tmp/output.s.bus
[2024-10-25 18:41:00,299] INFO [count] Correcting BUS records in output_randO/tmp/output.s.bus to output_randO/tmp/output.s.c.bus with on-list output_randO/whitelist.txt
[2024-10-25 18:41:03,615] INFO [count] Sorting BUS file output_randO/tmp/output.s.c.bus to output_randO/output.unfiltered.bus
[2024-10-25 18:41:07,143] INFO [count] Generating count matrix output_randO/counts_unfiltered/cells_x_genes from BUS file output_randO/output.unfiltered.bus
[2024-10-25 18:41:09,139] INFO [count] Writing gene names to file output_randO/counts_unfiltered/cells_x_genes.genes.names.txt
[2024-10-25 18:41:09,688] WARNING [count] 1599 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2024-10-25 18:41:31,869] INFO [count] Using index mouse_k-63.idx to generate BUS file to output_polyT from
[2024-10-25 18:41:31,869] INFO [count] 13G_cDNA_polyT.fastq.gz
[2024-10-25 18:41:31,869] INFO [count] 13G_umi_polyT.fastq.gz
[2024-10-25 18:41:31,869] INFO [count] 13G_barcode_polyT.fastq.gz
[2024-10-25 18:42:50,132] INFO [count] Sorting BUS file output_polyT/output.bus to output_polyT/tmp/output.s.bus
[2024-10-25 18:42:53,947] INFO [count] On-list not provided
[2024-10-25 18:42:53,948] INFO [count] Generating on-list output_polyT/whitelist.txt from BUS file output_polyT/tmp/output.s.bus
[2024-10-25 18:42:55,052] INFO [count] Inspecting BUS file output_polyT/tmp/output.s.bus
[2024-10-25 18:42:56,158] INFO [count] Correcting BUS records in output_polyT/tmp/output.s.bus to output_polyT/tmp/output.s.c.bus with on-list output_polyT/whitelist.txt
[2024-10-25 18:42:58,965] INFO [count] Sorting BUS file output_polyT/tmp/output.s.c.bus to output_polyT/output.unfiltered.bus
[2024-10-25 18:43:02,289] INFO [count] Generating count matrix output_polyT/counts_unfiltered/cells_x_genes from BUS file output_polyT/output.unfiltered.bus
[2024-10-25 18:43:04,179] INFO [count] Writing gene names to file output_polyT/counts_unfiltered/cells_x_genes.genes.names.txt
[2024-10-25 18:43:04,583] WARNING [count] 1599 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[ ]:
# Import packages and read in count matrix from polyT
import scipy
import anndata
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
from sklearn.decomposition import TruncatedSVD
from scipy import sparse, io
import seaborn as sns
matplotlib.rcParams.update({'font.size': 12})
%config InlineBackend.figure_format = 'retina'
# import data
dir = 'output_polyT/counts_unfiltered' #'b01_nanopore_13G_single_cell'#'b01_nanopore_13G__sc_single_cell'
adata = sc.read_mtx(dir+"/cells_x_genes.mtx")
adata.obs.index = pd.read_csv(dir+"/cells_x_genes.barcodes.txt", header=None)[0].values
adata.var.index = pd.read_csv(dir+"/cells_x_genes.genes.names.txt", header=None)[0].values
[ ]:
#Read in merged polyT and randO matrix
dir = 'output/counts_unfiltered' #'b01_nanopore_13G_single_cell'#'b01_nanopore_13G__sc_single_cell'
Iadata = sc.read_mtx(dir+"/cells_x_genes.mtx")
Iadata.obs.index = pd.read_csv(dir+"/cells_x_genes.barcodes.txt", header=None)[0].values
Iadata.var.index = pd.read_csv(dir+"/cells_x_genes.genes.names.txt", header=None)[0].values
[ ]:
adata.var
| Gm15178 |
|---|
| Gm29155 |
| C730036E19Rik |
| Gm29157 |
| Gm29156 |
| ... |
| ENSMUSG00000095019.2 |
| ENSMUSG00000094915.2 |
| ENSMUSG00000079808.4 |
| ENSMUSG00000095041.8 |
| ENSMUSG00000095742.2 |
34183 rows × 0 columns
[ ]:
Iadata.var
| Gm15178 |
|---|
| Gm29155 |
| C730036E19Rik |
| Gm29157 |
| Gm29156 |
| ... |
| ENSMUSG00000095019.2 |
| ENSMUSG00000094915.2 |
| ENSMUSG00000079808.4 |
| ENSMUSG00000095041.8 |
| ENSMUSG00000095742.2 |
34183 rows × 0 columns
[ ]:
Iadata.obs
| AAACATCGAACTCACCTCTCATGC |
|---|
| AAACATCGAAGAGATCTCATTGCA |
| AAACATCGACACAGAAATCCTTAC |
| AAACATCGACAGATTCACTGCTCT |
| AAACATCGACAGATTCCATACTTC |
| ... |
| TTCACGCAGAATCTGATCATTGCA |
| TTCACGCAGATAGACAGCCTATCT |
| TTCACGCAGCTAACGAACTGCTCT |
| TTCACGCAGTCTGTCATTATTCTG |
| TTCACGCATCTTCACACTTCTAAC |
3435 rows × 0 columns
[ ]:
adata.obs
| AAACATCGACACAGAAACTGCTCT |
|---|
| AAACATCGACAGATTCACTGCTCT |
| AAACATCGACAGATTCCATACTTC |
| AAACATCGAGAGTCAAACATTTAC |
| AAACATCGAGATGTACATTTGGCA |
| ... |
| TTCACGCACGCTGATCACTGCTCT |
| TTCACGCAGAACAGGCGCTATCAT |
| TTCACGCAGATAGACAGCCTATCT |
| TTCACGCAGTCTGTCATTATTCTG |
| TTCACGCATCTTCACACTTCTAAC |
1989 rows × 0 columns
[ ]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
sc.pp.calculate_qc_metrics(Iadata, inplace=True)
[ ]:
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
sc.pp.normalize_per_cell(Iadata, counts_per_cell_after=1e4)
sc.pp.log1p(Iadata)
[ ]:
#!pip install matplotlib-venn
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
aadata = [t for t in adata.obs.index]
iadata = [t for t in Iadata.obs.index]
venn2((set(aadata), set(iadata)), ('polyT', 'merged'))
<matplotlib_venn._common.VennDiagram at 0x7d68b98a1ba0>
[ ]:
intersect_g = [gene for gene in adata.var.index if gene in Iadata.var.index]
intersectsc = [t for t in adata.obs.index if t in Iadata.obs.index]
[ ]:
adata[intersectsc, intersect_g]
spearman_gene_ONT_vs_Ill = []
for c in intersectsc:
if np.squeeze(np.asarray(adata[c, intersect_g].X.todense())).all() != 1:
spearman=scipy.stats.spearmanr(np.squeeze(np.asarray(adata[c, intersect_g].X.todense())), np.squeeze(np.asarray(Iadata[c, intersect_g].X.todense())))
spearman_gene_ONT_vs_Ill.append(spearman[0])
#plt.scatter(adata.obs['n_counts'], spearman_gene_ONT_vs_Ill)
plt.hexbin(adata[intersectsc, intersect_g].obs['n_counts'], spearman_gene_ONT_vs_Ill, gridsize=50, cmap='jet', bins='log')
plt.ylabel("Spearman genelevel sn expression ONT vs Ill")
plt.xlabel("Number of ONT UMI/nucleus")
plt.ylim((0,1))
plt.colorbar(label='counts')
plt.show()
[ ]:
%%time
!git clone https://github.com/pachterlab/kallisto.git
!ls kallisto
!mkdir kallisto/build
!ls kallisto
!ls kallisto/ext/htslib/
!cd kallisto/ext/htslib; autoheader; autoconf
!cd kallisto/build; cmake .. -DMAX_KMER_SIZE=64 -; make
Cloning into 'kallisto'...
remote: Enumerating objects: 8684, done.
remote: Counting objects: 100% (1766/1766), done.
remote: Compressing objects: 100% (506/506), done.
remote: Total 8684 (delta 1320), reused 1563 (delta 1258), pack-reused 6918 (from 1)
Receiving objects: 100% (8684/8684), 9.34 MiB | 14.05 MiB/s, done.
Resolving deltas: 100% (5746/5746), done.
astyle.txt ext gen_release.sh INSTALL.md README.md test
CMakeLists.txt func_tests gulpfile.js license.txt src unit_tests
astyle.txt CMakeLists.txt func_tests gulpfile.js license.txt src unit_tests
build ext gen_release.sh INSTALL.md README.md test
bcf_sr_sort.c hfile_gcs.c htslib.pc.in plugin.c tbx.c
bcf_sr_sort.h hfile_internal.h htslib_vars.mk probaln.c test
bgzf.c hfile_libcurl.c INSTALL README textutils.c
bgzip.c hfile_net.c kfunc.c README.md thread_pool.c
config.mk.in hfile_s3.c knetfile.c realn.c thread_pool_internal.h
configure.ac hts.c kstring.c regidx.c vcf.5
cram htsfile.1 LICENSE sam.5 vcf.c
errmod.c htsfile.c Makefile sam.c vcf_sweep.c
faidx.5 hts_internal.h md5.c synced_bcf_reader.c vcfutils.c
faidx.c htslib multipart.c tabix.1
hfile.c htslib.mk NEWS tabix.c
CMake Error: Unknown argument -
CMake Error: Run 'cmake --help' for all supported options.
make: *** No targets specified and no makefile found. Stop.
CPU times: user 126 ms, sys: 20.3 ms, total: 146 ms
Wall time: 5.21 s