【scRW】[4]Single-cell RNA-seq: Quality control set-up

Learning Objectives:

Understand how to bring in data from single-cell RNA-seq experiments

Quality control set-up

1. Exploring the example dataset

在本次研討會上费就，我們將使用單細(xì)胞RNA-seq數(shù)據(jù)集往枣，該數(shù)據(jù)集是Kang等人（2017年）進(jìn)行的一項(xiàng)較大研究的一部分凝垛。在本文中渡八，作者提出了一種利用遺傳變異（eQTL）來確定基因的計(jì)算算法。每個包含單個細(xì)胞的液滴的遺傳同一性（單一）霉撵，并鑒定包含來自不同個體的兩個細(xì)胞的液滴（雙重）蝎抽。
用于測試其算法的數(shù)據(jù)包括從八名狼瘡患者中提取的合并的外周血單個核細(xì)胞（PBMC）静秆，分為對照和干擾素β治療（刺激）條件。

[圖片上傳失敗...(image-e2393b-1590137860500)]*
](https://upload-images.jianshu.io/upload_images/11904209-6ba372a42250af69.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

2. Raw data

This dataset is available on GEO (GSE96583), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802). These BAM files were converted back to FASTQ files, then run through Cell Ranger to obtain the count data that we will be using.

NOTE: The counts for this dataset is also freely available from 10X Genomics and is used as part of the Seurat tutorial.

3. Metadata

除了原始數(shù)據(jù)严衬，我們還需要收集有關(guān)數(shù)據(jù)的信息澄者；這就是所謂的 metadata元數(shù)據(jù) 。通常有一種 temptation ，就是只是開始探索數(shù)據(jù)粱挡，但如果我們不知道該數(shù)據(jù)所源自的樣本赠幕，那并不是很有意義。

以下提供了我們數(shù)據(jù)集的一些相關(guān)metadata元數(shù)據(jù)：

The libraries were prepared using 10X Genomics version 2 chemistry;
The samples were sequenced on the Illumina NextSeq 500;
將來自八名狼瘡患者的PBMC樣本分別分成兩等份;
= 用100 U / mL重組IFN-β激活一份PBMC询筏，持續(xù)6小時榕堰。
= The second aliquot分試樣 was left untreated.
= 6小時后，將每種條件的8個樣品一起收集到兩個最終的收集池中（受激細(xì)胞和對照細(xì)胞）嫌套。我們將使用這兩個合并的樣本逆屡。
分別鑒定了12138和12167個細(xì)胞（去除雙峰后）作為對照樣品和刺激后的合并樣品。
由于樣本是PBMC踱讨，因此我們期望有免疫細(xì)胞魏蔗，例如：
= B cells
= T cells
= NK cells
= monocytes
= macrophages
= possibly megakaryocytes

建議您在執(zhí)行QC之前最好知道預(yù)期的細(xì)胞類型。因?yàn)檫@樣當(dāng)任何細(xì)胞類型的低復(fù)雜度（只有部分基因的轉(zhuǎn)錄本）或線粒體表達(dá)水平較高的細(xì)胞可以幫助我們提出剔除QC不合格的細(xì)胞群體痹筛。

預(yù)期上述細(xì)胞類型都不應(yīng)該具有低復(fù)雜度或具有高線粒體含量莺治。

4. Setting up the R environment

研究涉及大量數(shù)據(jù)的最重要部分之一就是如何最好地對其進(jìn)行管理。我們傾向于對分析進(jìn)行優(yōu)先級劃分帚稠，但是數(shù)據(jù)管理中還有許多其他重要方面經(jīng)常被人們忽略谣旁，因此他們對新數(shù)據(jù)沒有先見之明。 HMS Data Management Working Group
深入討論了除數(shù)據(jù)創(chuàng)建和分析之外要考慮的一些事項(xiàng)滋早。

數(shù)據(jù)管理的一個重要方面是組織榄审。對于您進(jìn)行和分析數(shù)據(jù)的每個實(shí)驗(yàn)，通過 a planned storage space (directory structure) 來進(jìn)行組織是最佳實(shí)踐杆麸。我們將對單細(xì)胞分析進(jìn)行此操作搁进。

打開 RStudio 并創(chuàng)建一個名為single_cell_rnaseq的新R項(xiàng)目。然后角溃，創(chuàng)建以下目錄：

single_cell_rnaseq/
├── data
├── results
└── figures

5. Download data & New script

Control sample
Stimulated sample

# February 2020
# HBC single-cell RNA-seq workshop

# Single-cell RNA-seq analysis - QC

將Rscript保存為quality_control.R拷获。您的工作目錄應(yīng)如下所示：

New script

6.Loading libraries

Now, we can load the necessary libraries:

# Load libraries
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)

7.Loading single-cell RNA-seq count data

Regardless of the technology or pipeline used to process your single-cell RNA-seq sequence data, the output will generally be the same. That is, for each individual sample you will have the following three files:

1.a file with the cell IDs, representing all cells quantified
2.a file with the gene IDs, representing all genes quantified
3.a matrix of counts per gene for every cell

We can explore these files by clicking on the data/ctrl_raw_feature_bc_matrix folder:

7.1 `barcodes.tsv`

這是一個文本文件篮撑，其中包含該樣品存在的所有細(xì)胞條形碼减细。條形碼以矩陣文件中顯示的數(shù)據(jù)順序列出（即，這些是列名; i.e. these are the column names）赢笨。

barcodes.tsv

7.2. `features.tsv`

This is a text file which contains the identifiers of the quantified genes. The source of the identifier can vary depending on what reference (i.e. Ensembl, NCBI, UCSC) you use in the quantification methods, but most often these are official gene symbols. The order of these genes corresponds to the order of the rows in the matrix file (i.e. these are the row names).

image.png

7.3 `matrix.mtx`

這是一個文本文件未蝌，其中包含計(jì)數(shù)值矩陣。行與上面的基因ID相關(guān)聯(lián)茧妒，列與細(xì)胞條形碼對應(yīng)萧吠。請注意，此矩陣中有許多零值桐筏。

matrix.mtx

Loading this data into R requires us to use functions that allow us to efficiently combine these three files into a single count matrix. However, instead of creating a regular matrix data structure, the functions we will use create a sparse matrix to improve the amount of space, memory and CPU required to work with our huge count matrix.

Different methods for reading in data include:

readMM(): This function is from the Matrix package and will turn our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they are combined. For specific code and instructions on how to do this please see our additional material.
Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input. In this way individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix for you. We will be using this function to load in our data!

7.4 Reading in a single sample (`read10X()`)

When working with 10X data and its proprietary software Cell Ranger, you will always have an outs directory. Within this directory you will find a number of different files including:

web_summary.html: report that explores different QC metrics, including the mapping metrics, filtering thresholds, estimated number of cells after filtering, and information on the number of reads and genes per cell after filtering.
BAM alignment files: files used for visualization of the mapped reads and for re-creation of FASTQ files, if needed
filtered_feature_bc_matrix: folder containing all files needed to construct the count matrix using data filtered by Cell Ranger
raw_feature_bc_matrix: folder containing all files needed to construct the count matrix using the raw unfiltered data

We are mainly interested in the raw_feature_bc_matrix as we wish to perform our own QC and filtering while accounting for the biology of our experiment/biological system.

If we had a single sample, we could generate the count matrix and then subsequently create a Seurat object:

# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

min.features 參數(shù)指定每個細(xì)胞需要檢測的最小基因數(shù)量纸型。此參數(shù)將過濾掉質(zhì)量較差的單元，這些單元可能只是封裝了隨機(jī)條形碼 random barcodes 而沒有任何單元any cell present。通常狰腌，檢測不到100個基因的細(xì)胞不考慮進(jìn)行分析除破。

Seurat automatically creates some metadata for each of the cells when you use the Read10X() function to read in data. This information is stored in the meta.data slot within the Seurat object (see more in the note below).

The Seurat object is a custom list-like object 自定義列表狀對象 that has well-defined spaces to store specific information/data. You can find more information about the slots in the Seurat object at this link.

# Explore the metadata
head(ctrl@meta.data)

What do the columns of metadata mean?

orig.ident: this often contains the sample identity if known, but will default to “SeuratProject”
nCount_RNA: number of UMIs per cell
nFeature_RNA: number of genes detected per cell

7.5 Reading in multiple samples with a `for loop`

In practice, you will likely have several samples that you will need to read in data for using one of the 2 functions we discussed earlier (Read10X() or readMM()). So, to make the data import into R more efficient we can use a for loop, that will interate over a series of commands for each of the inputs given.

In R, it has the following structure/syntax:

## DO NOT RUN

for (variable in input){
    command1
    command2
    command3
}

The for loop we will be using today will iterate over the two sample “files” and execute two commands for each sample - (1) read in the count data (Read10X()) and (2) create the Seurat objects from the read in data (CreateSeuratObject()):

# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj)
}

We can break down the for loop to describe the different lines of code:

Step1: specify input

For this dataset, we have two samples that we would like to create a Seurat object for:
ctrl_raw_feature_bc_matrix
stim_raw_feature_bc_matrix

We can specify these samples in the input part for our for loop as elements of a vector using c(). We are assigning these to a variable and we can call that variable anything we would like (try to give it a name that makes sense). In this example, we called the variable file.

During the execution of the above loop, file will first contain the value “ctrl_raw_feature_bc_matrix”, run through the commands all the way through to assign(). Next, it will contain the value “stim_raw_feature_bc_matrix” and once again run through all the commands. If you had 15 folders as input, instead of 2, the above code will run through 15 times, for each of your data folders.

## DO NOT RUN

# Create each individual Seurat object
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
}

Step 2: Read in data for the input

We can continue our for loop by adding a line to read in data with Read10X():

Here, we need to specify the path to the file, so we will prepend the data/ directory to our sample folder name using the paste0() function.

Step 3: Create Seurat object from the 10X count data

Now, we can create the Seurat object by using the CreateSeuratObject() function, adding in the argument project, where we can add the sample name.

## DO NOT RUN

        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)

Step 4: Assign Seurat object to a new variable based on sample

The last command assigns the Seurat object created (seurat_obj) to a new variable. In this way, when we iterate and move on to the next sample in our input we will not overwrite the Seurat object created in the previous iteration:

## DO NOT RUN
  
        assign(file, seurat_obj)
}

Now that we have created both of these objects, let’s take a quick look at the metadata to see how it looks:

# Check the metadata in the new Seurat objects
head(ctrl_raw_feature_bc_matrix@meta.data)
head(stim_raw_feature_bc_matrix@meta.data)

Next, we need to merge these objects together into a single Seurat object. This will make it easier to run the QC steps for both sample groups together and enable us to easily compare the data quality for all the samples.

We can use the merge() function from the Seurat package to do this:

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))

Because the same cell IDs can be used for different samples, we add a sample-specific prefix to each of our cell IDs using the add.cell.id argument. If we look at the metadata of the merged object we should be able to see the prefixes in the rownames:

# Check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat@meta.data)
tail(merged_seurat@meta.data)