loomR介紹及使用指南

隨著單細(xì)胞技術(shù)的發(fā)展，數(shù)據(jù)量增加使得計(jì)算需求呈指數(shù)增長(zhǎng)某宪。分析單細(xì)胞數(shù)據(jù)時(shí)兴喂，使用稀100000個(gè)細(xì)胞的系數(shù)矩陣處理對(duì)于Seurat 來說就很有挑戰(zhàn)性衣迷。HDF5 格式現(xiàn)在被用于儲(chǔ)存

生物大數(shù)據(jù)，單細(xì)胞可以儲(chǔ)存上百萬個(gè)細(xì)胞的數(shù)據(jù)壶谒。

Linnarson實(shí)驗(yàn)室開發(fā)了基于HDF5的數(shù)據(jù)結(jié)構(gòu)-loom loompy佃迄，用于儲(chǔ)存單細(xì)胞數(shù)據(jù)以及數(shù)據(jù)相關(guān)的屬性信息呵俏。并且，他們還發(fā)布了一個(gè)工具loompy吼肥。

satijalab實(shí)驗(yàn)室開發(fā)了一個(gè)基于R的loom工具-loomR

#安裝

loomR 需要預(yù)安裝hdf5r

System	Command
OS X (using Homebrew)	`brew install hdf5`
Debian-based systems (including Ubuntu)	`sudo apt-get install libhdf5-dev`
Systems supporting yum and RPMs	`sudo yum install hdf5-devel`

install.packages("hdf5r")
或
devtools::install_github("hhoeflin/hdf5r")

安裝loomR

# Install devtools from CRAN
install.packages("devtools")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")

# Load loomR
library(loomR)

#loom對(duì)象介紹

loomR 內(nèi)置基于R6的loom對(duì)象缀皱。R6 對(duì)象與S4 類似动猬，R6 使用 $代替@ (i.e. `my.object$ field.name)赁咙， R6方法也可以直接調(diào)用（i.e.my.object$method()`）彼水。

# 連接loom對(duì)象

標(biāo)準(zhǔn)的R對(duì)象時(shí)將數(shù)據(jù)加載到內(nèi)存中，loom對(duì)象只是建立一個(gè)與磁盤文件的連接链瓦。使用loomR::create可以創(chuàng)建一個(gè)loom文件慈俯，或者使用Convert將Seurat 對(duì)象轉(zhuǎn)換成loom文件。

# Connect to the loom file in read/write mode
lfile <- connect(filename = "pbmc.loom", mode = "r+")
lfile

## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET 2700 x 13714          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

一個(gè)loom包含6各部分，一個(gè)數(shù)據(jù)集（matrix）步鉴，以及5個(gè)組layers, row_attrs, col_attrs, row_graphs, and col_graphs璃哟。

matrix: n個(gè)基因m個(gè)細(xì)胞
layers：matrix處理后的數(shù)據(jù)随闪，例如標(biāo)準(zhǔn)化后的數(shù)據(jù)。
row_attrs：基因得到metadata
col_attrs: 細(xì)胞metadata
row_graphs col_graphs

# 對(duì)loom數(shù)據(jù)進(jìn)行操作

使用[[或$符號(hào)

# Viewing the `matrix` dataset with the double subset [[ operator You can
# also use the $ sigil, i.e. lfile$matrix
lfile[["matrix"]]
## Class: H5D
## Dataset: /matrix
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700 x 13714     Maxdims=Inf x Inf
## Chunk: 32 x 32

查看組內(nèi)的數(shù)據(jù)集

# Viewing a dataset in the 'col_attrs' group with the double subset [[
# operator and full UNIX-style path
lfile[["col_attrs/cell_names"]]

## Class: H5D
## Dataset: /col_attrs/cell_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024

# Viewing a dataset in the 'row_attrs' group with S3 $ chaining
lfile$row.attrs$gene_names
## Class: H5D
## Dataset: /row_attrs/gene_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=13714     Maxdims=Inf
## Chunk: 1024

上面返回都是數(shù)據(jù)的描述，如果想獲取真實(shí)的數(shù)據(jù)当宴，需要使用[户矢；[,]表示返回所有數(shù)據(jù)，或著使用索引獲取對(duì)應(yīng)的數(shù)據(jù)捌年，這樣就可以避免將所有數(shù)據(jù)導(dǎo)入內(nèi)存礼预。

# Access the upper left corner of the data matrix
lfile[["matrix"]][1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

# Access the full data matrix (here, using the $ instead of the [[ operator
# to access the matrix)
full.matrix <- lfile$matrix[, ]
dim(x = full.matrix)

## [1]  2700 13714

# Access all gene names
gene.names <- lfile[["row_attrs/gene_names"]][]
head(x = gene.names)

## [1] "AL627309.1"    "AP006222.2"    "RP11-206L10.2" "RP11-206L10.9"
## [5] "LINC00115"     "NOC2L"

loom對(duì)象有一個(gè)get.attribute.df()方法逆瑞，可以獲取各種metadata 信息整合成數(shù)據(jù)框（data frame）
get.attribute.df()首先需要一個(gè)方向获高，1（行吻育，基因）， 2（列摊趾，細(xì)胞）；然后是一個(gè)metadata 數(shù)據(jù)名字組成的列表漩绵。

# Pull three bits of metadata from the column attributes
attrs <- c("nUMI", "nGene", "orig.ident")
attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
head(x = attr.df)

##                nUMI nGene    orig.ident
## AAACATACAACCAC 2419   779 SeuratProject
## AAACATTGAGCTAC 4903  1352 SeuratProject
## AAACATTGATCAGC 3147  1129 SeuratProject
## AAACCGTGCTTCCG 2639   960 SeuratProject
## AAACCGTGTATGCG  980   521 SeuratProject
## AAACGCACTGGTAC 2163   781 SeuratProject

#loomR的Matrices

問了提高效率止吐，HDF5庫對(duì)底層數(shù)據(jù)矩陣的訪問進(jìn)行了轉(zhuǎn)置碍扔★踔兀基因儲(chǔ)存在列，細(xì)胞儲(chǔ)存在行溶耘；但是LoomR中凳兵，row.attrs還是表示基因，col.attrs表示細(xì)胞吟孙；這點(diǎn)容易讓人混淆杰妓。

# Print the number of genes
lfile[["row_attrs/gene_names"]]$dims

## [1] 13714

# Is the number of genes the same as the second dimension (typically
# columns) for the matrix?
lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]

## [1] TRUE

# For the sake of consistency within the single-cell community, we've
# reversed the dimensions for the `shape` field.  As such, the number of
# genes is stored in `lfile$shape[1]`; the number of cells is stored in the
# second field
lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]

## [1] TRUE

獲取部分基因或細(xì)胞數(shù)據(jù)：

# Pull gene expression data for all genes, for the first 5 cells Note that
# we're using the row position for cells
data.subset <- lfile[["matrix"]][1:5, ]
dim(x = data.subset)

## [1]     5 13714

# You can transpose this matrix if you wish to restore the standard
# orientation
data.subset <- t(x = data.subset)
dim(x = data.subset)

## [1] 13714     5

# Pull gene expression data for the gene MS4A1 Note that we're using the
# column position for genes
data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
head(x = data.gene)

## [1] 0 6 0 0 0 0

#添加數(shù)據(jù)到loom

add.layer()
add.row.attribute()
add.col.attribute()

這些方法只需要一個(gè)命名的矩陣或者向量列表巷挥。

# Generate random ENSEMBL IDs for demonstration purposes
ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
# Use add.row.attribute to add the IDs Note that if you want to overwrite an
# existing value, set overwrite = TRUE
lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
lfile[["row_attrs"]]
## Class: H5Group
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Group: /row_attrs
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##  ensembl.id H5I_DATASET        13714         H5T_STRING
##  gene_names H5I_DATASET        13714         H5T_STRING

# Find the ENSEMBL ID for TCL1A
lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]

## [1] "ENSG00009584"

#Chunk-based iteration

處理大文件，可以每次讀取部分?jǐn)?shù)據(jù)胜嗓，進(jìn)行處理（迭代的方法）辞州。loomR 內(nèi)置了map或apply方法，可以每次對(duì)讀入的數(shù)據(jù)進(jìn)行處理埃元。

map方法：不需要一次讀取文件所有內(nèi)容到內(nèi)存中阔拳，但是返回的結(jié)果在內(nèi)存中类嗤。
計(jì)算每個(gè)細(xì)胞的UMI數(shù)

# Map rowSums to `matrix`, using 500 cells at a time, returning a vector
nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix", 
    display.progress = FALSE)
# How long is nUMI_map? It should be equal to the number of cells in
# `matrix`
length(x = nUMI_map) == lfile$matrix$dims[1]
## [1] TRUE

- MARGIN參數(shù)： 1表示行土浸，或者基因黄伊；2表示列还最，或者細(xì)胞
apply毡惜，與map方法的差異在于，數(shù)據(jù)處理結(jié)果儲(chǔ)存到loom 文件中（在磁盤扶叉，不在內(nèi)存）枣氧；因此消耗的內(nèi)存只是運(yùn)算部分达吞。指定結(jié)果儲(chǔ)存的文件名就可以了。

# Apply rowSums to `matrix`, using 500 cells at a time, storing in
# `col_attrs/umi_apply`
lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500, 
    dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
lfile$col.attrs$umi_apply

## Class: H5D
## Dataset: /col_attrs/umi_apply
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024

# Ensure that all values are the same as doing a non-chunked calculation
all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))

## [1] TRUE

Selective-chunking

上面以到的數(shù)據(jù)處理要么是部分細(xì)胞（所有基因）酪劫，部分基因（所有細(xì)胞）覆糟。但是也可以在基因和細(xì)胞同時(shí)進(jìn)行選擇搪桂，使用index.use參數(shù)就可以了。

#Closing `loom` objects

loom對(duì)象是文件的連接酗电，寫入到文件完成之后撵术，必需關(guān)閉文件嫩与。

lfile$close_all()

#loomR and Seurat

loom只是支持gitHub ‘loom’ branch的Seurat

devtools::install_github(repo = "satijalab/seurat", ref = "loom")
library(Seurat)

#Creating a `loom` object from a Seurat object (converting between Seurat and loom)

Seurat對(duì)象轉(zhuǎn)換為 loom 文件

# Load the pbmc_small dataset included in Seurat
data("pbmc_small")
pbmc_small

## An object of class seurat in project SeuratProject 
##  230 genes across 80 samples.

# Convert from Seurat to loom Convert takes and object in 'from', a name of
# a class in 'to', and, for conversions to loom, a filename
pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom", 
    display.progress = FALSE)
pfile

## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc_small.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET     80 x 230          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

#Seurat standard workflow

# Normalize data, and find variable genes using the typical Seurat workflow
pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE, 
    do.plot = FALSE)
head(x = pbmc_small@hvg.info)

##         gene.mean gene.dispersion gene.dispersion.scaled
## PCMT1    3.942220        7.751848               2.808417
## PPBP     5.555949        7.652876               1.216898
## LYAR     4.231004        7.577377               1.528749
## VDAC3    4.128322        7.383980               1.296982
## KHDRBS1  3.562833        7.367928               2.476809
## IGLL5    3.758330        7.319567               2.018088

# Run the same workflow using the loom object
NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
# Normalized data goes into the 'norm_data' layer, variable gene information
# goes into 'row_attrs' Are the results equal?
par(mfrow = c(1, 2))
plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data", 
    xlab = "loom", ylab = "Seurat")
plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[], 
    "gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")

download.png

close loom 對(duì)象

pfile$close_all()

#原文

Introduction to loomR
mojaveazure/loomR
Guided Clustering of the Mouse Cell Atlas: loom edition

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末处坪，一起剝皮案震驚了整個(gè)濱河市同窘，隨后出現(xiàn)的幾起案子想邦，更是在濱河造成了極大的恐慌委刘，老刑警劉巖锡移，帶你破解...
沈念sama閱讀 219,427評(píng)論 6贊 508
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件呕童，死亡現(xiàn)場(chǎng)離奇詭異，居然都是意外死亡罩抗，警方通過查閱死者的電腦和手機(jī)拉庵，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,551評(píng)論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來套蒂，“玉大人钞支，你說我怎么就攤上這事〔俚叮” “怎么了烁挟？”我有些...
開封第一講書人閱讀 165,747評(píng)論 0贊 356
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長(zhǎng)撼嗓。經(jīng)常有香客問我粉捻，道長(zhǎng)肩刃，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,939評(píng)論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮叛氨，結(jié)果婚禮上徙邻，老公的妹妹穿的比我還像新娘淳地。我一直安慰自己颇象，他們只是感情好，可當(dāng)我...
茶點(diǎn)故事閱讀 67,955評(píng)論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布姐直。她就那樣靜靜地躺著撞叽，像睡著了一般科展。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上砂竖，一...
開封第一講書人閱讀 51,737評(píng)論 1贊 305
城市分裂傳說
那天测摔，我揣著相機(jī)與錄音浙于，去河邊找鬼。笑死檀轨，一個(gè)胖子當(dāng)著我的面吹牛参萄，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播筒溃，決...
沈念sama閱讀 40,448評(píng)論 3贊 420
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼烦周，長(zhǎng)吁一口氣：“原來是場(chǎng)噩夢(mèng)啊……” “哼读慎！你這毒婦竟也來了？” 一聲冷哼從身側(cè)響起崇摄，我...
開封第一講書人閱讀 39,352評(píng)論 0贊 276
萬榮殺人案實(shí)錄
序言：老撾萬榮一對(duì)情侶失蹤厕氨，失蹤者是張志新（化名）和其女友劉穎命斧，沒想到半個(gè)月后，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體泞莉，經(jīng)...
沈念sama閱讀 45,834評(píng)論 1贊 317
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡斯嚎，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,992評(píng)論 3贊 338
?白月光啟示錄
正文我和宋清朗相戀三年钉疫，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了固阁。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點(diǎn)故事閱讀 40,133評(píng)論 1贊 351
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡客税，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出赐纱，到底是詐尸還是另有隱情，我是刑警寧澤起胰，帶...
沈念sama閱讀 35,815評(píng)論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布，位于F島的核電站戒劫，受9級(jí)特大地震影響淘邻，放射性物質(zhì)發(fā)生泄漏枚尼。R本人自食惡果不足惜蜻直，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,477評(píng)論 3贊 331
男人毒藥：我在死后第九天來索命
文/蒙蒙一赎瑰、第九天我趴在偏房一處隱蔽的房頂上張望压储。院中可真熱鬧，春花似錦、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,022評(píng)論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽坊饶。三九已至染厅，卻和暖如春尔苦，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 33,147評(píng)論 1贊 272
情欲美人皮
我被黑心中介騙來泰國(guó)打工拗胜，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人迷殿。一個(gè)月前我還...
沈念sama閱讀 48,398評(píng)論 3贊 373
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親踊挠。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 45,077評(píng)論 2贊 355