As single-cell technologies advance, growing data volumes have driven an exponential increase in computational requirements. When analyzing single-cell data, handling a sparse matrix of 100,000 cells is already challenging for Seurat. The HDF5 format is now used to store large biological datasets, and for single-cell work it can hold data for millions of cells.
The Linnarsson lab developed loom, an HDF5-based data structure for storing single-cell data together with its associated attributes. They also released a tool for it, loompy.
The Satija lab developed loomR, an R-based tool for working with loom files.
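#Installation
- loomR requires hdf5r to be installed first. The HDF5 system library that hdf5r depends on can be installed as follows: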
System | Command |
---|---|
OS X (using Homebrew) | brew install hdf5 |
Debian-based systems (including Ubuntu) | sudo apt-get install libhdf5-dev |
Systems supporting yum and RPMs | sudo yum install hdf5-devel |
install.packages("hdf5r")
or
devtools::install_github("hhoeflin/hdf5r")
- Install loomR
# Install devtools from CRAN
install.packages("devtools")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
# Load loomR
library(loomR)
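#Introduction to loom objects
loomR provides an R6-based loom object. R6 objects are similar to S4 objects, but R6 uses `$` to access fields (i.e. `my.object$field.name`), and R6 methods are called directly on the object (i.e. `my.object$method()`).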
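To make the field and method syntax concrete, here is a minimal sketch of an R6 class. The `Counter` class and its `count` field and `add` method are hypothetical names made up for illustration; they are not part of loomR. It assumes the R6 package is installed.

```r
library(R6)

# Hypothetical class for illustration only; it is not part of loomR.
Counter <- R6Class(
  classname = "Counter",
  public = list(
    count = 0,
    add = function(n = 1) {
      # R6 objects are modified in place through `self`
      self$count <- self$count + n
      invisible(self)
    }
  )
)

my.object <- Counter$new()
my.object$add(2)   # call a method directly on the object
my.object$add(3)
my.object$count    # access a field with $
```

Because R6 objects are modified in place, there is no need to reassign the result of `my.object$add()`. The loom calls later in this post, such as `lfile$add.row.attribute(...)`, follow the same pattern.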
# Connecting to a loom object
Standard R objects load their data into memory, whereas a loom object is only a connection to a file on disk. Use loomR::create to create a loom file, or use Convert to turn a Seurat object into a loom file.
# Connect to the loom file in read/write mode
lfile <- connect(filename = "pbmc.loom", mode = "r+")
lfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
## name obj_type dataset.dims dataset.type_class
## col_attrs H5I_GROUP <NA> <NA>
## col_graphs H5I_GROUP <NA> <NA>
## layers H5I_GROUP <NA> <NA>
## matrix H5I_DATASET 2700 x 13714 H5T_FLOAT
## row_attrs H5I_GROUP <NA> <NA>
## row_graphs H5I_GROUP <NA> <NA>
A loom file has six parts: one dataset (matrix) and five groups (layers, row_attrs, col_attrs, row_graphs, and col_graphs).
- matrix: the data for n genes and m cells
- layers: processed versions of matrix, such as normalized data
- row_attrs: gene metadata
- col_attrs: cell metadata
- row_graphs and col_graphs: graphs whose nodes are genes and cells, respectively
# Working with loom data
- Use the `[[` operator or the `$` sigil
# Viewing the `matrix` dataset with the double subset [[ operator. You can
# also use the $ sigil, i.e. lfile$matrix
lfile[["matrix"]]
## Class: H5D
## Dataset: /matrix
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple Dims=2700 x 13714 Maxdims=Inf x Inf
## Chunk: 32 x 32
- View datasets inside a group
# Viewing a dataset in the 'col_attrs' group with the double subset [[
# operator and full UNIX-style path
lfile[["col_attrs/cell_names"]]
## Class: H5D
## Dataset: /col_attrs/cell_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
## STRSIZE H5T_VARIABLE;
## STRPAD H5T_STR_NULLTERM;
## CSET H5T_CSET_ASCII;
## CTYPE H5T_C_S1;
## }
## Space: Type=Simple Dims=2700 Maxdims=Inf
## Chunk: 1024
# Viewing a dataset in the 'row_attrs' group with S3 $ chaining
lfile$row.attrs$gene_names
## Class: H5D
## Dataset: /row_attrs/gene_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
## STRSIZE H5T_VARIABLE;
## STRPAD H5T_STR_NULLTERM;
## CSET H5T_CSET_ASCII;
## CTYPE H5T_C_S1;
## }
## Space: Type=Simple Dims=13714 Maxdims=Inf
## Chunk: 1024
Everything returned above is only a description of the data. To get the actual values, use `[`: `[, ]` returns all the data, while passing indices returns just the corresponding slice, which avoids loading everything into memory.
# Access the upper left corner of the data matrix
lfile[["matrix"]][1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
# Access the full data matrix (here, using the $ instead of the [[ operator
# to access the matrix)
full.matrix <- lfile$matrix[, ]
dim(x = full.matrix)
## [1] 2700 13714
# Access all gene names
gene.names <- lfile[["row_attrs/gene_names"]][]
head(x = gene.names)
## [1] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9"
## [5] "LINC00115" "NOC2L"
- loom objects have a get.attribute.df() method that gathers several pieces of metadata into a single data frame.
- get.attribute.df() first takes a margin, 1 (rows, genes) or 2 (columns, cells), and then a character vector of metadata attribute names.
# Pull three bits of metadata from the column attributes
attrs <- c("nUMI", "nGene", "orig.ident")
attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
head(x = attr.df)
## nUMI nGene orig.ident
## AAACATACAACCAC 2419 779 SeuratProject
## AAACATTGAGCTAC 4903 1352 SeuratProject
## AAACATTGATCAGC 3147 1129 SeuratProject
## AAACCGTGCTTCCG 2639 960 SeuratProject
## AAACCGTGTATGCG 980 521 SeuratProject
## AAACGCACTGGTAC 2163 781 SeuratProject
#Matrices in loomR
For efficiency, the HDF5 library accesses the underlying data matrix in transposed form: genes are stored in columns and cells in rows. In loomR, however, row.attrs still refers to genes and col.attrs to cells, which is easy to mix up.
# Print the number of genes
lfile[["row_attrs/gene_names"]]$dims
## [1] 13714
# Is the number of genes the same as the second dimension (typically
# columns) for the matrix?
lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]
## [1] TRUE
# For the sake of consistency within the single-cell community, we've
# reversed the dimensions for the `shape` field. As such, the number of
# genes is stored in `lfile$shape[1]`; the number of cells is stored in the
# second field
lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]
## [1] TRUE
- Retrieve data for a subset of genes or cells:
# Pull gene expression data for all genes, for the first 5 cells. Note that
# we're using the row position for cells
data.subset <- lfile[["matrix"]][1:5, ]
dim(x = data.subset)
## [1] 5 13714
# You can transpose this matrix if you wish to restore the standard
# orientation
data.subset <- t(x = data.subset)
dim(x = data.subset)
## [1] 13714 5
# Pull gene expression data for the gene MS4A1. Note that we're using the
# column position for genes
data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
head(x = data.gene)
## [1] 0 6 0 0 0 0
#Adding data to loom
- add.layer()
- add.row.attribute()
- add.col.attribute()
Each of these methods only needs a named list of matrices or vectors.
# Generate random ENSEMBL IDs for demonstration purposes
ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
# Use add.row.attribute to add the IDs. Note that if you want to overwrite an
# existing value, set overwrite = TRUE
lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
lfile[["row_attrs"]]
## Class: H5Group
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Group: /row_attrs
## Listing:
## name obj_type dataset.dims dataset.type_class
## ensembl.id H5I_DATASET 13714 H5T_STRING
## gene_names H5I_DATASET 13714 H5T_STRING
# Find the ENSEMBL ID for TCL1A
lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]
## [1] "ENSG00009584"
#Chunk-based iteration
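Large files can be processed iteratively by reading and handling one portion of the data at a time. loomR has built-in map and apply methods that run a function on each chunk as it is read.
- map: does not need to read the whole file into memory at once, but the result it returns is held in memory.
- Compute the number of UMIs per cell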
# Map rowSums to `matrix`, using 500 cells at a time, returning a vector
nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix",
display.progress = FALSE)
# How long is nUMI_map? It should be equal to the number of cells in
# `matrix`
length(x = nUMI_map) == lfile$matrix$dims[1]
## [1] TRUE
- The MARGIN parameter: 1 means rows, i.e. genes; 2 means columns, i.e. cells
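The idea behind chunk-based mapping can be sketched in base R. This is only a conceptual stand-in, not how loomR is implemented: an ordinary in-memory matrix plays the role of the on-disk dataset, and `chunked_row_map` is a hypothetical helper invented for this example.

```r
# Stand-in for the on-disk dataset: 7 cells (rows) x 4 genes (columns).
set.seed(1)
mat <- matrix(rpois(28, lambda = 2), nrow = 7, ncol = 4)

# Hypothetical helper, not a loomR function: apply FUN to blocks of rows.
chunked_row_map <- function(m, FUN, chunk.size) {
  starts <- seq(1, nrow(m), by = chunk.size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk.size - 1, nrow(m))
    # Only this block of rows is handled at a time
    FUN(m[rows, , drop = FALSE])
  })
  # The combined result lives in memory, as with map
  unlist(pieces)
}

n_umi <- chunked_row_map(mat, rowSums, chunk.size = 3)

# The chunked result matches the non-chunked calculation
all(n_umi == rowSums(mat))
```

With a real loom file, each block would be read from disk on demand, so only `chunk.size` cells need to be in memory at any moment.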
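apply differs from map in that the results are written to the loom file (on disk, not in memory), so memory is only consumed by the computation itself. You just need to specify the name of the dataset in which to store the result.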
# Apply rowSums to `matrix`, using 500 cells at a time, storing in
# `col_attrs/umi_apply`
lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500,
dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
lfile$col.attrs$umi_apply
## Class: H5D
## Dataset: /col_attrs/umi_apply
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple Dims=2700 Maxdims=Inf
## Chunk: 1024
# Ensure that all values are the same as doing a non-chunked calculation
all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))
## [1] TRUE
- Selective-chunking
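The operations above work on either a subset of cells (with all genes) or a subset of genes (with all cells). It is also possible to restrict the computation to selected genes and cells; use the index.use parameter for this.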
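As a rough illustration of selective chunking, the base R sketch below iterates in chunks over only a chosen set of rows. It is a conceptual stand-in: `chunked_subset_map` is a hypothetical helper, the matrix is an in-memory placeholder for the loom dataset, and treating `index.use` as a vector of positions along the iterated dimension is an assumption about how the loomR parameter behaves.

```r
# Stand-in for the on-disk dataset: 10 cells (rows) x 4 genes (columns).
set.seed(2)
mat <- matrix(rpois(40, lambda = 2), nrow = 10, ncol = 4)

# Hypothetical helper, not a loomR function: chunk only over the rows
# listed in index.use, chunk.size rows at a time.
chunked_subset_map <- function(m, FUN, chunk.size, index.use) {
  groups <- split(index.use, ceiling(seq_along(index.use) / chunk.size))
  pieces <- lapply(groups, function(rows) FUN(m[rows, , drop = FALSE]))
  unlist(pieces, use.names = FALSE)
}

cells.use <- c(2, 3, 5, 8, 9)
n_umi_subset <- chunked_subset_map(mat, rowSums, chunk.size = 2,
                                   index.use = cells.use)

# Same values as subsetting first and summing without chunks
all(n_umi_subset == rowSums(mat[cells.use, , drop = FALSE]))
```

The result has one value per selected cell, in the order given by `cells.use`.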
#Closing loom objects
A loom object is a connection to a file, so once writing is finished the file must be closed.
lfile$close_all()
#loomR and Seurat
loom is only supported by the version of Seurat on the GitHub 'loom' branch.
devtools::install_github(repo = "satijalab/seurat", ref = "loom")
library(Seurat)
#Creating a loom object from a Seurat object (converting between Seurat and loom)
- Convert a Seurat object to a loom file
# Load the pbmc_small dataset included in Seurat
data("pbmc_small")
pbmc_small
## An object of class seurat in project SeuratProject
## 230 genes across 80 samples.
# Convert from Seurat to loom. Convert takes an object in 'from', a name of
# a class in 'to', and, for conversions to loom, a filename
pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom",
display.progress = FALSE)
pfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc_small.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
## name obj_type dataset.dims dataset.type_class
## col_attrs H5I_GROUP <NA> <NA>
## col_graphs H5I_GROUP <NA> <NA>
## layers H5I_GROUP <NA> <NA>
## matrix H5I_DATASET 80 x 230 H5T_FLOAT
## row_attrs H5I_GROUP <NA> <NA>
## row_graphs H5I_GROUP <NA> <NA>
#Seurat standard workflow
# Normalize data, and find variable genes using the typical Seurat workflow
pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE,
do.plot = FALSE)
head(x = pbmc_small@hvg.info)
## gene.mean gene.dispersion gene.dispersion.scaled
## PCMT1 3.942220 7.751848 2.808417
## PPBP 5.555949 7.652876 1.216898
## LYAR 4.231004 7.577377 1.528749
## VDAC3 4.128322 7.383980 1.296982
## KHDRBS1 3.562833 7.367928 2.476809
## IGLL5 3.758330 7.319567 2.018088
# Run the same workflow using the loom object
NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
# Normalized data goes into the 'norm_data' layer, variable gene information
# goes into 'row_attrs'. Are the results equal?
par(mfrow = c(1, 2))
plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data",
xlab = "loom", ylab = "Seurat")
plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[],
"gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")
- Close the loom object
pfile$close_all()
#Original sources
Introduction to loomR
mojaveazure/loomR
Guided Clustering of the Mouse Cell Atlas: loom edition