【scRW】[4]Single-cell RNA-seq: Quality control set-up

Learning Objectives:

Understand how to bring in data from single-cell RNA-seq experiments


Quality control set-up

1. Exploring the example dataset

在本次研討會上费就,我們將使用單細(xì)胞RNA-seq數(shù)據(jù)集往枣,該數(shù)據(jù)集是Kang等人(2017年)進(jìn)行的一項(xiàng)較大研究的一部分凝垛。在本文中渡八,作者提出了一種利用遺傳變異(eQTL)來確定基因的計(jì)算算法。 每個包含單個細(xì)胞的液滴的遺傳同一性(單一)霉撵,并鑒定包含來自不同個體的兩個細(xì)胞的液滴(雙重)蝎抽。
用于測試其算法的數(shù)據(jù)包括從八名狼瘡患者中提取的合并的外周血單個核細(xì)胞(PBMC)静秆,分為對照和干擾素β治療(刺激)條件。

[圖片上傳失敗...(image-e2393b-1590137860500)]*
](https://upload-images.jianshu.io/upload_images/11904209-6ba372a42250af69.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

2. Raw data

This dataset is available on GEO (GSE96583), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802). These BAM files were converted back to FASTQ files, then run through Cell Ranger to obtain the count data that we will be using.

NOTE: The counts for this dataset is also freely available from 10X Genomics and is used as part of the Seurat tutorial.

3. Metadata

除了原始數(shù)據(jù)严衬,我們還需要收集有關(guān)數(shù)據(jù)的信息澄者; 這就是所謂的 metadata元數(shù)據(jù) 。 通常有一種 temptation ,就是只是開始探索數(shù)據(jù)粱挡,但如果我們不知道該數(shù)據(jù)所源自的樣本赠幕,那并不是很有意義。

以下提供了我們數(shù)據(jù)集的一些相關(guān)metadata元數(shù)據(jù):

  • The libraries were prepared using 10X Genomics version 2 chemistry;
  • The samples were sequenced on the Illumina NextSeq 500;
  • 將來自八名狼瘡患者的PBMC樣本分別分成兩等份;
    = 用100 U / mL重組IFN-β激活一份PBMC询筏,持續(xù)6小時榕堰。
    = The second aliquot分試樣 was left untreated.
    = 6小時后,將每種條件的8個樣品一起收集到兩個最終的收集池中(受激細(xì)胞和對照細(xì)胞)嫌套。 我們將使用這兩個合并的樣本逆屡。
  • 分別鑒定了12138和12167個細(xì)胞(去除雙峰后)作為對照樣品和刺激后的合并樣品。
  • 由于樣本是PBMC踱讨,因此我們期望有免疫細(xì)胞魏蔗,例如:
    = B cells
    = T cells
    = NK cells
    = monocytes
    = macrophages
    = possibly megakaryocytes

建議您在執(zhí)行QC之前最好知道預(yù)期的細(xì)胞類型。因?yàn)檫@樣當(dāng)任何細(xì)胞類型的低復(fù)雜度(只有部分基因的轉(zhuǎn)錄本)或線粒體表達(dá)水平較高的細(xì)胞可以幫助我們提出剔除QC不合格的細(xì)胞群體痹筛。

預(yù)期上述細(xì)胞類型都不應(yīng)該具有低復(fù)雜度或具有高線粒體含量莺治。

4. Setting up the R environment

研究涉及大量數(shù)據(jù)的最重要部分之一就是如何最好地對其進(jìn)行管理。 我們傾向于對分析進(jìn)行優(yōu)先級劃分帚稠,但是數(shù)據(jù)管理中還有許多其他重要方面經(jīng)常被人們忽略谣旁,因此他們對新數(shù)據(jù)沒有先見之明。 HMS Data Management Working Group
深入討論了除數(shù)據(jù)創(chuàng)建和分析之外要考慮的一些事項(xiàng)滋早。

數(shù)據(jù)管理的一個重要方面是組織榄审。 對于您進(jìn)行和分析數(shù)據(jù)的每個實(shí)驗(yàn),通過 a planned storage space (directory structure) 來進(jìn)行組織是最佳實(shí)踐杆麸。 我們將對單細(xì)胞分析進(jìn)行此操作搁进。

打開 RStudio 并創(chuàng)建一個名為single_cell_rnaseq的新R項(xiàng)目。 然后角溃,創(chuàng)建以下目錄:

single_cell_rnaseq/
├── data
├── results
└── figures

5. Download data & New script

Control sample
Stimulated sample

# February 2020
# HBC single-cell RNA-seq workshop

# Single-cell RNA-seq analysis - QC

將Rscript保存為quality_control.R拷获。 您的工作目錄應(yīng)如下所示:

New script

6.Loading libraries

Now, we can load the necessary libraries:

# Load libraries
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)

7.Loading single-cell RNA-seq count data

Regardless of the technology or pipeline used to process your single-cell RNA-seq sequence data, the output will generally be the same. That is, for each individual sample you will have the following three files:

1.a file with the cell IDs, representing all cells quantified
2.a file with the gene IDs, representing all genes quantified
3.a matrix of counts per gene for every cell

We can explore these files by clicking on the data/ctrl_raw_feature_bc_matrix folder:

7.1 barcodes.tsv

這是一個文本文件篮撑,其中包含該樣品存在的所有細(xì)胞條形碼减细。 條形碼以矩陣文件中顯示的數(shù)據(jù)順序列出(即,這些是列名; i.e. these are the column names)赢笨。

barcodes.tsv

7.2. features.tsv

This is a text file which contains the identifiers of the quantified genes. The source of the identifier can vary depending on what reference (i.e. Ensembl, NCBI, UCSC) you use in the quantification methods, but most often these are official gene symbols. The order of these genes corresponds to the order of the rows in the matrix file (i.e. these are the row names).

image.png

7.3 matrix.mtx

這是一個文本文件未蝌,其中包含計(jì)數(shù)值矩陣。 行與上面的基因ID相關(guān)聯(lián)茧妒,列與細(xì)胞條形碼對應(yīng)萧吠。 請注意,此矩陣中有許多零值桐筏。

matrix.mtx

Loading this data into R requires us to use functions that allow us to efficiently combine these three files into a single count matrix. However, instead of creating a regular matrix data structure, the functions we will use create a sparse matrix to improve the amount of space, memory and CPU required to work with our huge count matrix.

  • Different methods for reading in data include:
  1. readMM(): This function is from the Matrix package and will turn our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they are combined. For specific code and instructions on how to do this please see our additional material.
  2. Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input. In this way individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix for you. We will be using this function to load in our data!

7.4 Reading in a single sample (read10X())

When working with 10X data and its proprietary software Cell Ranger, you will always have an outs directory. Within this directory you will find a number of different files including:

  • web_summary.html: report that explores different QC metrics, including the mapping metrics, filtering thresholds, estimated number of cells after filtering, and information on the number of reads and genes per cell after filtering.
  • BAM alignment files: files used for visualization of the mapped reads and for re-creation of FASTQ files, if needed
  • filtered_feature_bc_matrix: folder containing all files needed to construct the count matrix using data filtered by Cell Ranger
  • raw_feature_bc_matrix: folder containing all files needed to construct the count matrix using the raw unfiltered data

We are mainly interested in the raw_feature_bc_matrix as we wish to perform our own QC and filtering while accounting for the biology of our experiment/biological system.

If we had a single sample, we could generate the count matrix and then subsequently create a Seurat object:

# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

min.features 參數(shù)指定每個細(xì)胞需要檢測的最小基因數(shù)量纸型。 此參數(shù)將過濾掉質(zhì)量較差的單元,這些單元可能只是封裝了隨機(jī)條形碼 random barcodes 而沒有任何單元any cell present。 通常狰腌,檢測不到100個基因的細(xì)胞不考慮進(jìn)行分析除破。

Seurat automatically creates some metadata for each of the cells when you use the Read10X() function to read in data. This information is stored in the meta.data slot within the Seurat object (see more in the note below).

The Seurat object is a custom list-like object 自定義列表狀對象 that has well-defined spaces to store specific information/data. You can find more information about the slots in the Seurat object at this link.

# Explore the metadata
head(ctrl@meta.data)

What do the columns of metadata mean?

  • orig.ident: this often contains the sample identity if known, but will default to “SeuratProject”
  • nCount_RNA: number of UMIs per cell
  • nFeature_RNA: number of genes detected per cell

7.5 Reading in multiple samples with a for loop

In practice, you will likely have several samples that you will need to read in data for using one of the 2 functions we discussed earlier (Read10X() or readMM()). So, to make the data import into R more efficient we can use a for loop, that will interate over a series of commands for each of the inputs given.

In R, it has the following structure/syntax:

## DO NOT RUN

for (variable in input){
    command1
    command2
    command3
}

The for loop we will be using today will iterate over the two sample “files” and execute two commands for each sample - (1) read in the count data (Read10X()) and (2) create the Seurat objects from the read in data (CreateSeuratObject()):

# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj)
}

We can break down the for loop to describe the different lines of code:

Step1: specify input

For this dataset, we have two samples that we would like to create a Seurat object for:
ctrl_raw_feature_bc_matrix
stim_raw_feature_bc_matrix

We can specify these samples in the input part for our for loop as elements of a vector using c(). We are assigning these to a variable and we can call that variable anything we would like (try to give it a name that makes sense). In this example, we called the variable file.

During the execution of the above loop, file will first contain the value “ctrl_raw_feature_bc_matrix”, run through the commands all the way through to assign(). Next, it will contain the value “stim_raw_feature_bc_matrix” and once again run through all the commands. If you had 15 folders as input, instead of 2, the above code will run through 15 times, for each of your data folders.

## DO NOT RUN

# Create each individual Seurat object
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
}
Step 2: Read in data for the input

We can continue our for loop by adding a line to read in data with Read10X():

Here, we need to specify the path to the file, so we will prepend the data/ directory to our sample folder name using the paste0() function.

Step 3: Create Seurat object from the 10X count data

Now, we can create the Seurat object by using the CreateSeuratObject() function, adding in the argument project, where we can add the sample name.

## DO NOT RUN

        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file) 
Step 4: Assign Seurat object to a new variable based on sample

The last command assigns the Seurat object created (seurat_obj) to a new variable. In this way, when we iterate and move on to the next sample in our input we will not overwrite the Seurat object created in the previous iteration:

## DO NOT RUN
  
        assign(file, seurat_obj)
}

Now that we have created both of these objects, let’s take a quick look at the metadata to see how it looks:

# Check the metadata in the new Seurat objects
head(ctrl_raw_feature_bc_matrix@meta.data)
head(stim_raw_feature_bc_matrix@meta.data)

Next, we need to merge these objects together into a single Seurat object. This will make it easier to run the QC steps for both sample groups together and enable us to easily compare the data quality for all the samples.

We can use the merge() function from the Seurat package to do this:

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))

Because the same cell IDs can be used for different samples, we add a sample-specific prefix to each of our cell IDs using the add.cell.id argument. If we look at the metadata of the merged object we should be able to see the prefixes in the rownames:

# Check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat@meta.data)
tail(merged_seurat@meta.data)
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市琼腔,隨后出現(xiàn)的幾起案子瑰枫,更是在濱河造成了極大的恐慌,老刑警劉巖丹莲,帶你破解...
    沈念sama閱讀 218,122評論 6 505
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件光坝,死亡現(xiàn)場離奇詭異,居然都是意外死亡甥材,警方通過查閱死者的電腦和手機(jī)盯另,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,070評論 3 395
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來洲赵,“玉大人土铺,你說我怎么就攤上這事“鬻蓿” “怎么了悲敷?”我有些...
    開封第一講書人閱讀 164,491評論 0 354
  • 文/不壞的土叔 我叫張陵,是天一觀的道長俭令。 經(jīng)常有香客問我后德,道長,這世上最難降的妖魔是什么抄腔? 我笑而不...
    開封第一講書人閱讀 58,636評論 1 293
  • 正文 為了忘掉前任瓢湃,我火速辦了婚禮,結(jié)果婚禮上赫蛇,老公的妹妹穿的比我還像新娘绵患。我一直安慰自己,他們只是感情好悟耘,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,676評論 6 392
  • 文/花漫 我一把揭開白布落蝙。 她就那樣靜靜地躺著,像睡著了一般暂幼。 火紅的嫁衣襯著肌膚如雪筏勒。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,541評論 1 305
  • 那天旺嬉,我揣著相機(jī)與錄音管行,去河邊找鬼。 笑死邪媳,一個胖子當(dāng)著我的面吹牛捐顷,可吹牛的內(nèi)容都是我干的荡陷。 我是一名探鬼主播,決...
    沈念sama閱讀 40,292評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼迅涮,長吁一口氣:“原來是場噩夢啊……” “哼亲善!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起逗柴,我...
    開封第一講書人閱讀 39,211評論 0 276
  • 序言:老撾萬榮一對情侶失蹤蛹头,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后戏溺,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體渣蜗,經(jīng)...
    沈念sama閱讀 45,655評論 1 314
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,846評論 3 336
  • 正文 我和宋清朗相戀三年旷祸,在試婚紗的時候發(fā)現(xiàn)自己被綠了耕拷。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 39,965評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡托享,死狀恐怖骚烧,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情闰围,我是刑警寧澤赃绊,帶...
    沈念sama閱讀 35,684評論 5 347
  • 正文 年R本政府宣布,位于F島的核電站羡榴,受9級特大地震影響碧查,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜校仑,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,295評論 3 329
  • 文/蒙蒙 一忠售、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧迄沫,春花似錦稻扬、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,894評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至困后,卻和暖如春乐纸,著一層夾襖步出監(jiān)牢的瞬間衬廷,已是汗流浹背摇予。 一陣腳步聲響...
    開封第一講書人閱讀 33,012評論 1 269
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留吗跋,地道東北人侧戴。 一個月前我還...
    沈念sama閱讀 48,126評論 3 370
  • 正文 我出身青樓宁昭,卻偏偏與公主長得像,于是被迫代替她去往敵國和親酗宋。 傳聞我的和親對象是個殘疾皇子积仗,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,914評論 2 355