Introduction to single-cell RNA-seq: Seurat (R)

Open in Google Colab

This notebook demonstrates pre-processing, QC, and basic analysis of the 1k Human PBMCs dataset from 10x genomics. The approximate running time of the notebook (excluding Seurat installation) is about 15 minutes.

The notebook was written by Joseph Rich and Lior Pachter, based off an older version written by Kyung Hoi (Joseph) Min, Lambda Moses, A. Sina Booeshaghi and Lior Pachter. If you use the methods in this notebook for your analysis please cite the following publications which describe the tools used in the notebook, as well as specific methods they run (these are cited inline in the notebook):

  • Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KHJ, et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol. 2021 Jul;39(7):813–8.

  • Hjörleifsson KE, Sullivan DK, Swarna NP, Holley G, Melsted P, Pachter L. Accurate quantification of single-cell and single-nucleus RNA-seq transcripts using distinguishing flanking k-mers [Internet]. bioRxiv; 2024 [cited 2024 Jul 10]. p. 2022.12.02.518832. Available from: https://www.biorxiv.org/content/10.1101/2022.12.02.518832v3

  • Sullivan DK, Min KH (Joseph), Hjörleifsson KE, Luebbert L, Holley G, Moses L, et al. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq [Internet]. BioRxiv; 2023 Nov [cited 2024 Jan 25]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.11.21.568164

See our kb-python tutorials site to see tutorial notebooks on other RNA-seq workflows.

Setup

[1]:
# This is  used to time the running of the notebook
start_time <- Sys.time()

Install R packages

There are several packages in R built for scRNA-seq data analysis. Here we use Seurat.

[2]:
system.time({
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(c('multtest', "DropletUtils", "biomaRt"), Ncpus = 2)
  install.packages(c("Seurat", "scico", "ggpointdensity", "hexbin"), Ncpus = 2)
})
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

Installing package(s) 'BiocVersion', 'multtest', 'DropletUtils'

also installing the dependencies ‘formatR’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘zlibbioc’, ‘matrixStats’, ‘abind’, ‘lambda.r’, ‘futile.options’, ‘statmod’, ‘GenomeInfoDb’, ‘XVector’, ‘MatrixGenerics’, ‘S4Arrays’, ‘futile.logger’, ‘snow’, ‘SparseArray’, ‘sparseMatrixStats’, ‘rhdf5filters’, ‘limma’, ‘locfit’, ‘R.oo’, ‘R.methodsS3’, ‘sitmo’, ‘BiocGenerics’, ‘Biobase’, ‘SingleCellExperiment’, ‘S4Vectors’, ‘IRanges’, ‘GenomicRanges’, ‘SummarizedExperiment’, ‘BiocParallel’, ‘DelayedArray’, ‘DelayedMatrixStats’, ‘HDF5Array’, ‘rhdf5’, ‘edgeR’, ‘R.utils’, ‘dqrng’, ‘beachmat’, ‘scuttle’, ‘Rhdf5lib’, ‘BH’


Old packages: 'gtable', 'pkgbuild', 'ps', 'withr'

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘bitops’, ‘dotCall64’, ‘gtools’, ‘caTools’, ‘plyr’, ‘tensor’, ‘sp’, ‘spam’, ‘globals’, ‘listenv’, ‘parallelly’, ‘zoo’, ‘lazyeval’, ‘crosstalk’, ‘RcppTOML’, ‘here’, ‘gplots’, ‘reshape2’, ‘gridExtra’, ‘RcppArmadillo’, ‘spatstat.data’, ‘spatstat.univar’, ‘spatstat.random’, ‘spatstat.utils’, ‘spatstat.sparse’, ‘goftest’, ‘deldir’, ‘polyclip’, ‘FNN’, ‘SeuratObject’, ‘cowplot’, ‘fastDummies’, ‘fitdistrplus’, ‘future’, ‘future.apply’, ‘ggrepel’, ‘ggridges’, ‘ica’, ‘igraph’, ‘irlba’, ‘leiden’, ‘lmtest’, ‘patchwork’, ‘pbapply’, ‘plotly’, ‘png’, ‘progressr’, ‘RANN’, ‘RcppAnnoy’, ‘RcppHNSW’, ‘reticulate’, ‘ROCR’, ‘RSpectra’, ‘Rtsne’, ‘scattermore’, ‘sctransform’, ‘spatstat.explore’, ‘spatstat.geom’, ‘uwot’, ‘RcppEigen’, ‘RcppProgress’


    user   system  elapsed
3322.496  270.698 2402.259
[3]:
library(DropletUtils)
library(Matrix)
library(tidyverse)
library(Seurat)
library(ggpointdensity)
library(scico)
library(scales)
theme_set(theme_bw())
Loading required package: SingleCellExperiment

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following object is masked from ‘package:utils’:

    findMatches


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians



Attaching package: ‘Matrix’


The following object is masked from ‘package:S4Vectors’:

    expand


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.5
 forcats   1.0.0      stringr   1.5.1
 ggplot2   3.5.1      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.1
 purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 lubridate::%within%() masks IRanges::%within%()
 dplyr::collapse()     masks IRanges::collapse()
 dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
 dplyr::count()        masks matrixStats::count()
 dplyr::desc()         masks IRanges::desc()
 tidyr::expand()       masks Matrix::expand(), S4Vectors::expand()
 dplyr::filter()       masks stats::filter()
 dplyr::first()        masks S4Vectors::first()
 dplyr::lag()          masks stats::lag()
 tidyr::pack()         masks Matrix::pack()
 ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
 purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
 dplyr::rename()       masks S4Vectors::rename()
 lubridate::second()   masks S4Vectors::second()
 lubridate::second<-() masks S4Vectors::second<-()
 dplyr::slice()        masks IRanges::slice()
 tidyr::unpack()       masks Matrix::unpack()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: SeuratObject

Loading required package: sp


Attaching package: ‘sp’


The following object is masked from ‘package:IRanges’:

    %over%



Attaching package: ‘SeuratObject’


The following object is masked from ‘package:SummarizedExperiment’:

    Assays


The following object is masked from ‘package:GenomicRanges’:

    intersect


The following object is masked from ‘package:GenomeInfoDb’:

    intersect


The following object is masked from ‘package:IRanges’:

    intersect


The following object is masked from ‘package:S4Vectors’:

    intersect


The following object is masked from ‘package:BiocGenerics’:

    intersect


The following objects are masked from ‘package:base’:

    intersect, t



Attaching package: ‘Seurat’


The following object is masked from ‘package:SummarizedExperiment’:

    Assays



Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor


[4]:
# Slightly modified from BUSpaRse, just to avoid installing a few dependencies not used here
read_count_output <- function(dir, name) {
  dir <- normalizePath(dir, mustWork = TRUE)
  m <- readMM(paste0(dir, "/", name, ".mtx"))
  m <- Matrix::t(m)
  m <- as(m, "dgCMatrix")
  # The matrix read has cells in rows
  ge <- ".genes.txt"
  genes <- readLines(file(paste0(dir, "/", name, ge)))
  barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
  colnames(m) <- barcodes
  rownames(m) <- genes
  return(m)
}

Install kb-python

[5]:
system("pip3 install kb-python")
[12]:
getwd()
'/content'

Download the data

[6]:
# Download the file using wget
system("wget -q https://cf.10xgenomics.com/samples/cell-exp/4.0.0/SC3_v3_NextGem_SI_PBMC_CSP_1K/SC3_v3_NextGem_SI_PBMC_CSP_1K_fastqs.tar")

# Extract the first tar file
system("tar -xf SC3_v3_NextGem_SI_PBMC_CSP_1K_fastqs.tar")

Downloading a pre-built index with kb ref

[7]:
system("kb ref -d human -i index.idx -g t2g.txt")

Note

To see more information about building a custom index, refer to the page Generate a reference index in the documentation.

[8]:
# # building a standard custom reference from the human genome
# system("kb ref -i index.idx -g t2g.txt -f1 f1.fa human_reference_path.fa human_gtf_path.gtf")

# # building a custom reference from the human genome with k=55 (where k is the size of each k-mer used in pseudoalignment)
# system("kb ref -i index.idx -g t2g.txt -f1 f1.fa -k 55 human_reference_path.fa human_gtf_path.gtf")

# # building a reference with the nac workflow (will detect both splice-aware and unspliced transcripts)
# system("kb ref -i index.idx -g t2g.txt -f1 f1.fa --nac -f2 f2.fa -c1 c1.fa -c2 c2.fa human_reference_path.fa human_gtf_path.gtf")

Pseudoaligning the scRNA-seq data to the index with kb count

Run kallisto and bustools

The following command will generate an RNA count matrix of cells (rows) by genes (columns) in .mtx format. Notice that this requires providing the index and transcript-to-gene mapping downloaded in the previous step to the -i and -g arguments respectively. Also, since the reads were generated with the 10x Genomics Chromium Single Cell v2 Chemistry, the -x 10xv2 argument is used. To view other supported technologies, run kb --list.

[14]:
# This step runs `kb` to pseudoalign the reads, and then generate the cells x gene matrix in h5ad format.
system("kb count -i index.idx -g t2g.txt -x 10XV3 -t 2 SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L002_R1_001.fastq.gz SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L002_R2_001.fastq.gz SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L003_R1_001.fastq.gz SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L003_R2_001.fastq.gz SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L004_R1_001.fastq.gz SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L004_R2_001.fastq.gz")
[15]:
list.files(".", recursive = TRUE)
  1. '10x_version3_whitelist.txt'
  2. 'counts_unfiltered/cells_x_genes.barcodes.txt'
  3. 'counts_unfiltered/cells_x_genes.genes.names.txt'
  4. 'counts_unfiltered/cells_x_genes.genes.txt'
  5. 'counts_unfiltered/cells_x_genes.mtx'
  6. 'index.idx'
  7. 'inspect.json'
  8. 'kb_info.json'
  9. 'matrix.ec'
  10. 'output.bus'
  11. 'output.unfiltered.bus'
  12. 'run_info.json'
  13. 'sample_data/anscombe.json'
  14. 'sample_data/california_housing_test.csv'
  15. 'sample_data/california_housing_train.csv'
  16. 'sample_data/mnist_test.csv'
  17. 'sample_data/mnist_train_small.csv'
  18. 'sample_data/README.md'
  19. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L002_I1_001.fastq.gz'
  20. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L002_R1_001.fastq.gz'
  21. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L002_R2_001.fastq.gz'
  22. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L003_I1_001.fastq.gz'
  23. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L003_R1_001.fastq.gz'
  24. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L003_R2_001.fastq.gz'
  25. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L004_I1_001.fastq.gz'
  26. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L004_R1_001.fastq.gz'
  27. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_antibody_S1_L004_R2_001.fastq.gz'
  28. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L002_I1_001.fastq.gz'
  29. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L002_R1_001.fastq.gz'
  30. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L002_R2_001.fastq.gz'
  31. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L003_I1_001.fastq.gz'
  32. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L003_R1_001.fastq.gz'
  33. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L003_R2_001.fastq.gz'
  34. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L004_I1_001.fastq.gz'
  35. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L004_R1_001.fastq.gz'
  36. 'SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_fastqs/SC3_v3_NextGem_SI_CSP-Labeled_PBMCs_1K_gex_S1_L004_R2_001.fastq.gz'
  37. 'SC3_v3_NextGem_SI_PBMC_CSP_1K_fastqs.tar'
  38. 't2g.txt'
  39. 'transcripts.txt'
[35]:
# Read matrix into R
res_mat <- read_count_output("counts_unfiltered", name = "cells_x_genes")

Basic QC

Filter empty droplets

Most cell barcodes in the gene by cell matrix correspond to empty droplets. A common way to determine which barcodes are empty droplets and which are real cells is to visualize the knee plot.

The knee plot was introduced in the Drop-seq paper:

The knee plot is a graph of the total UMI counts of each barcode vs. the rank of the cell barcode with respect to UMI counts. The inflection point indicates a change in state, and is used as a cutoff for total UMI counts; barcodes below the cutoff are deemed empty droplets and removed.

[36]:
dim(res_mat)
  1. 39546
  2. 281262
[37]:
tot_counts <- Matrix::colSums(res_mat)
summary(tot_counts)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.00     1.00     1.00    33.27     5.00 55594.00
[38]:
bc_rank <- barcodeRanks(res_mat, lower = 10)
[39]:
#' Knee plot for filtering empty droplets
#'
#' Visualizes the inflection point to filter empty droplets. This function plots
#' different datasets with a different color. Facets can be added after calling
#' this function with `facet_*` functions. Will be added to the next release
#' version of BUSpaRse.
#'
#' @param bc_rank A `DataFrame` output from `DropletUtil::barcodeRanks`.
#' @return A ggplot2 object.
knee_plot <- function(bc_rank) {
  knee_plt <- tibble(rank = bc_rank[["rank"]],
                     total = bc_rank[["total"]]) %>%
    distinct() %>%
    dplyr::filter(total > 0)
  annot <- tibble(inflection = metadata(bc_rank)[["inflection"]],
                  rank_cutoff = max(bc_rank$rank[bc_rank$total > metadata(bc_rank)[["inflection"]]]))
  p <- ggplot(knee_plt, aes(total, rank)) +
    geom_line() +
    geom_hline(aes(yintercept = rank_cutoff), data = annot, linetype = 2) +
    geom_vline(aes(xintercept = inflection), data = annot, linetype = 2) +
    scale_x_log10() +
    scale_y_log10() +
    annotation_logticks() +
    labs(y = "Rank", x = "Total UMIs")
  return(p)
}
[40]:
options(repr.plot.width=9, repr.plot.height=6)
knee_plot(bc_rank)
../../_images/sc_notebooks_seurat_30_0.png
[42]:
res_mat <- res_mat[, tot_counts > metadata(bc_rank)$inflection]
res_mat <- res_mat[Matrix::rowSums(res_mat) > 0,]
dim(res_mat)
  1. 21753
  2. 1225

Visualizing count distributions

High mitochondrial gene content is often used as a proxy for low-quality or dying cells in scRNA-seq data. Here we visualize the distribution of mitochondrial gene content across cells in the dataset.

[46]:
tr2g <- read_tsv("t2g.txt", col_names = c("transcript", "gene", "gene_symbol"), show_col_types = FALSE) %>%
  select(-transcript) %>%
  distinct()
[24]:
# # Convert from Ensembl gene ID to gene symbol
# rownames(res_mat) <- tr2g$gene_symbol[match(rownames(res_mat), tr2g$gene)]
[47]:
seu <- CreateSeuratObject(res_mat, min.cells = 3, min.features = 200)
[50]:
ensembl <- biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")  #* If service is down, try mirror = "useast"
mt_genes <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'external_gene_name'), filters = 'chromosome_name', values = 'MT', mart = ensembl)

assay_gene_names <- rownames(seu[["RNA"]])
assay_gene_names_trimmed <- gsub("\\..*", "", assay_gene_names)
common_genes <- intersect(mt_genes$ensembl_gene_id, assay_gene_names_trimmed)
common_genes_with_version <- assay_gene_names[match(common_genes, assay_gene_names_trimmed)]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = common_genes_with_version)

# seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^mt-")
[51]:
# Visualize QC metrics as a violin plot
options(repr.plot.width=12, repr.plot.height=6)
VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0.1)
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
../../_images/sc_notebooks_seurat_38_1.png
[65]:
options(repr.plot.width=9, repr.plot.height=6)
ggplot(seu@meta.data, aes(nCount_RNA, nFeature_RNA)) +
  geom_hex(bins = 100) +
  scale_fill_scico(palette = "devon", direction = -1, end = 0.9) +
  scale_x_log10(breaks = breaks_log(12)) +
  scale_y_log10(breaks = breaks_log(12)) + annotation_logticks() +
  labs(x = "Total UMI counts", y = "Number of genes detected") +
  theme(panel.grid.minor = element_blank())
../../_images/sc_notebooks_seurat_39_0.png
[53]:
ggplot(seu@meta.data, aes(nCount_RNA, percent.mt)) +
  geom_pointdensity() +
  scale_color_scico(palette = "devon", direction = -1, end = 0.9) +
  labs(x = "Total UMI counts", y = "Percentage mitochondrial")
../../_images/sc_notebooks_seurat_40_0.png

In the above plots, color indicates density.

Cells with a high percentage of mitochondrially encoded transcripts are often removed in QC, as those are likely to be low quality cells. If a cell is lysed in sample preparation, transcripts in the mitochondria are less likely to be lost than transcripts in the cytoplasm due to the mitochondria’s double membrane. As a result, cells that lysed tend to have a higher percentage of mitochondrially encoded transcripts.

Based on the plot above, we have decided to filter out cells with more than 30% mitochondrial gene content.

[54]:
seu <- subset(seu, subset = percent.mt < 30)
[55]:
seu <- NormalizeData(seu) %>% ScaleData()
Normalizing layer: counts

Centering and scaling data matrix

Analysis

Identify highly variable genes

[56]:
seu <- FindVariableFeatures(seu, nfeatures = 3000)
top10 <- head(VariableFeatures(seu), 10)
plot1 <- VariableFeaturePlot(seu, log = FALSE)
LabelPoints(plot = plot1, points = top10, repel = TRUE)
Finding variable features for layer counts

When using repel, set xnudge and ynudge to 0 for optimal results

../../_images/sc_notebooks_seurat_47_1.png

PCA

[57]:
seu <- RunPCA(seu, verbose = FALSE, npcs = 20) # uses HVG by default
ElbowPlot(seu, ndims = 20)
../../_images/sc_notebooks_seurat_49_0.png
[58]:
PCAPlot(seu)
../../_images/sc_notebooks_seurat_50_0.png

Clustering and visualization

There are many algorithms for clustering cells, and while they have been compared in detail in various benchmarks (see e.g., Duo et al. 2018), there is no univerally agreed upon method. Here we demonstrate clustering using Louvain clustering, which is a popular method for clustering single-cell RNA-seq data. The method was published in

  • Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (9 October 2008). “Fast unfolding of communities in large networks”. Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): P10008.

[59]:
seu <- FindNeighbors(seu, dims = 1:10)
seu <- FindClusters(seu)
Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 1171
Number of edges: 37096

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8077
Number of communities: 10
Elapsed time: 0 seconds
[60]:
PCAPlot(seu)
../../_images/sc_notebooks_seurat_53_0.png

Visualization

Visualization of the clustered data can be performed through non-linear dimensionality reduction. The most popular non-linear dimensionality reduction methods are UMAP and t-SNE. Our group has demonstrated that both UMAP and t-SNE’s non-linear natures make their interpretability highly limited. To read more, please see the following reference:

However, for those still interested in these methodss, we demonstrate their use below.

[61]:
seu <- RunTSNE(seu, dims = 1:10)
TSNEPlot(seu)
../../_images/sc_notebooks_seurat_55_0.png
[62]:
seu <- RunUMAP(seu, dims = 1:10, verbose = FALSE)
UMAPPlot(seu)
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”
../../_images/sc_notebooks_seurat_56_1.png
[63]:
Sys.time() - start_time
Time difference of 2.814722 hours

Installing packages took about 26 minutes, which is a drawback of Rcpp. The QC and analysis post-installation takes about 10 minutes from reads to results. This includes downloading the data, filtering, clustering and visualization.

Feedback: please report any issues, or submit pull requests for improvements, in the Github repository where this notebook is located.