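Bioinformatics Learning Basics: R Language 06, Matching and Reordering

Original article: https://hbctraining.github.io/Intro-to-R/lessons/06_matching_reordering.html

A Chinese-language summary by an expert: http://www.reibang.com/p/91f3aaa73b3f

This post is a copy of the original text, with my own notes and exercise answers added.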

Learning Objectives

  • Implement matching and re-ordering data within data structures.

Matching data

Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In this case, the biological assay is gene expression and data was generated using RNA-Seq.

Let’s read in our expression data (RPKM matrix) that we downloaded previously:

rpkm_data <- read.csv("data/counts.rpkm.csv")

Take a look at the first few lines of the data matrix to see what’s in there.

head(rpkm_data)

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.

ncol(rpkm_data)
nrow(metadata)

What we want to know is: do we have data for every sample for which we have metadata?

The %in% operator

Although its documentation is brief (it shares a help page with match(), which you can open with ?match), this operator is widely used and convenient once you get the hang of it. The operator is used with the following syntax:

vector1_of_values %in% vector2_of_values

It will take a vector as input to the left and will evaluate each element to see if there is a match in the vector that follows on the right of the operator. The two vectors do not have to be the same size. This operation will return a vector of the same length as vector1 containing logical values to indicate whether or not there was a match. Take a look at the example below:

A <- c(1,3,5,7,9,11)   # odd numbers
B <- c(2,4,6,8,10,12)  # even numbers

# test to see if each of the elements of A is in B  
A %in% B

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Since vector A contains only odd numbers and vector B contains only even numbers, there is no overlap and so the vector returned contains a FALSE for each element. Let’s change a couple of numbers inside vector B to match vector A:

A <- c(1,3,5,7,9,11)   # odd numbers
B <- c(2,4,6,8,1,5)  # add some odd numbers in 

# test to see if each of the elements of A is in B
A %in% B

## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

The logical vector returned denotes which elements in A are also in B and which are not.
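The earlier point that the two vectors need not be the same size can be sketched with character vectors as well. This is only an illustration: the sample names below are invented and are not taken from the lesson's dataset.

```r
# Hypothetical sample names, invented for illustration
samples_of_interest <- c("sample2", "sample9")
all_samples <- c("sample1", "sample2", "sample3", "sample4")

# The result has one logical value per element of the left-hand vector,
# regardless of how long the right-hand vector is
hits <- samples_of_interest %in% all_samples
hits
length(hits)
```

The length of the result always follows the vector on the left of the operator, which is why swapping the two sides answers a different question.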

We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to TRUE. Therefore, we can use the output logical vector to subset our data and return only those elements of A that are also in B:

intersection <- A %in% B
intersection

A[intersection]

In these previous examples, the vectors were small and so it’s easy to count by eye; but when we work with large datasets this is not practical. A quick way to assess whether or not we had any matches would be to use the any function to see if any of the values contained in vector A are also in vector B:

any(A %in% B)

The all function is also useful. Given a logical vector, it will tell you whether all values are TRUE. If there is at least one FALSE value, the all function will return FALSE, and you know that not all of the elements of A are contained in B.

all(A %in% B)
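Beyond a yes/no answer, it can help to know how many matches there are and where they sit. The following is a small sketch using the same A and B as above; sum() and which() are base R functions that the lesson itself does not use here, so treat this as a supplementary illustration.

```r
A <- c(1, 3, 5, 7, 9, 11)
B <- c(2, 4, 6, 8, 1, 5)

in_B <- A %in% B

any_match <- any(in_B)    # is at least one element of A in B?
all_match <- all(in_B)    # are all elements of A in B?
n_match   <- sum(in_B)    # number of matches (each TRUE counts as 1)
where     <- which(in_B)  # positions in A of the matching elements
```

Counting with sum() is often the quickest sanity check on a large vector, since it tells you at a glance whether a FALSE from all() means one missing value or thousands.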


Exercise 1

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A.
  2. Subset the B vector to only return those values that are also in A.

B %in% A
B[B %in% A]

Suppose we had two vectors that had the same values but just not in the same order. We could also use all to test for that. Rather than using the %in% operator we would use == and compare each element to the same position in the other vector. Unlike the %in% operator, for this to work you must have two vectors that are of equal length.

A <- c(10,20,30,40,50)
B <- c(50,40,30,20,10)  # same numbers but backwards 

# test to see if each element of A is in B
A %in% B

# test to see if each element of A is in the same position in B
A == B

# use all() to check if they are a perfect match
all(A == B)
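The equal-length requirement matters because == recycles the shorter vector rather than stopping with an error. Here is a minimal sketch of that pitfall; the vector names and values are invented for illustration, and setequal() is shown only as one possible way to ask whether two vectors hold the same values regardless of order.

```r
long_vec  <- c(10, 20, 10, 20)
short_vec <- c(10, 20)   # half the length of long_vec

# == silently recycles short_vec, so this is TRUE even though
# the two vectors differ in length
recycled <- all(long_vec == short_vec)

# Checking the lengths first avoids being misled by recycling
same_length_and_order <- length(long_vec) == length(short_vec) &&
  all(long_vec == short_vec)

# setequal() ignores order and asks only whether the same values occur
same_values <- setequal(short_vec, c(20, 10))
```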

Let’s try this on our data and see whether we have metadata information for all samples in our expression data. We’ll start by creating two vectors: one with the row names of the metadata and one with the column names of the RPKM data. rownames and colnames are base functions in R which allow you to extract the row and column names as a vector:

x <- rownames(metadata)
y <- colnames(rpkm_data)

Now check to see that all of x are in y:

all(x %in% y)

Note that we can use nested functions in place of x and y:

all(rownames(metadata) %in% colnames(rpkm_data))
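Had this check returned FALSE, a natural next step would be to list which names are missing. The sketch below uses two invented vectors as stand-ins for rownames(metadata) and colnames(rpkm_data); it is not run on the lesson's actual files.

```r
# Invented stand-ins for rownames(metadata) and colnames(rpkm_data)
meta_names <- c("sample1", "sample2", "sample3")
data_names <- c("sample3", "sample1")

# Negate the %in% result to keep the names that have no match
missing_from_data <- meta_names[!(meta_names %in% data_names)]

# setdiff() gives the same answer when the names are unique
missing_setdiff <- setdiff(meta_names, data_names)
```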

We know that all samples are present, but are they in the same order?

all(rownames(metadata) == colnames(rpkm_data))

It looks like all of the samples are there, but they will need to be reordered. To reorder our genomic samples, we first need to learn different ways to reorder data. Therefore, we will step away from our genomic data briefly to learn about reordering, then return to it at the end of this lesson.


Exercise 2

We have a list of IDs for marker genes of particular interest. We want to extract count information associated with each of these genes, without having to scroll through our matrix of count data. We can do this using the %in% operator to extract the information for those genes from rpkm_data.

  1. Create a vector for your important gene IDs, and use the %in% operator to determine whether these genes are contained in the row names of our rpkm_data dataset.

important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970")

important_genes %in% rownames(rpkm_data)

  2. Extract the rows containing the important genes from your rpkm_data dataset using the %in% operator.

rpkm_data[rownames(rpkm_data) %in% important_genes, ]

  3. Extra Credit: Using the important_genes vector, extract the rows containing the important genes from your rpkm_data dataset without using the %in% operator.

rpkm_data[important_genes, ]
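The two answers above select the same rows but can return them in a different order. A small sketch on a toy data frame makes the difference visible; the gene IDs, sample names, and values here are invented and merely stand in for rpkm_data.

```r
# Toy data frame standing in for rpkm_data; all names and values are invented
toy <- data.frame(sample1 = c(1.5, 0, 3.2, 8.1),
                  sample2 = c(2.0, 0.4, 2.9, 7.7),
                  row.names = c("geneA", "geneB", "geneC", "geneD"))
wanted <- c("geneD", "geneB")

# Logical subsetting keeps the rows in the order they appear in toy
by_in <- toy[rownames(toy) %in% wanted, ]

# Indexing by name returns the rows in the order given in wanted
by_name <- toy[wanted, ]

rownames(by_in)
rownames(by_name)
```

Which form is preferable depends on whether the order of the data or the order of the query vector should win.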

Reordering data using indices

Indexing [ ] can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values.

teaching_team <- c("Mary", "Meeta", "Radhika")

Remember that we can return values in a vector by specifying their position or index:

teaching_team[c(2, 3)] # Extracting values from a vector
teaching_team

We can also extract the values and reorder them:

teaching_team[c(3, 2)] # Extracting values and reordering them

Similarly, we can extract all of the values and reorder them:

teaching_team[c(3, 1, 2)]

If we want to save our results, we need to assign to a variable:

reorder_teach <- teaching_team[c(3, 1, 2)] # Saving the results to a variable
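The same index-based reordering carries over to the columns of a data frame, which is how it will be used on the expression data later. A minimal sketch, assuming a small invented data frame:

```r
# Toy data frame; column names and values are invented for illustration
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# Leave the row index empty and give the new column order after the comma
df_reordered <- df[, c(3, 1, 2)]
colnames(df_reordered)
```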

The match function

Now that we know how to reorder using indices, we can use the match() function to match the values in two vectors. We’ll be using it to evaluate which samples are present in both our counts and metadata data frames, and then to re-order the columns of the counts data to match the row names of the metadata.

match() takes at least 2 arguments:

  1. a vector of values in the order you want
  2. a vector of values to be reordered

The function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let’s create vectors first and second to demonstrate how it works:

first <- c("A","B","C","D","E")
second <- c("B","D","E","A","C")  # same letters but different order

How would you reorder the second vector to match the first using indices?

If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements. This is where the match function comes in really handy:

match(first,second)
## [1] 4 1 5 2 3

The function should return a vector of size length(first). Each number that is returned represents the index of the second vector where the matching value was observed.

Now, we can just use the indices to reorder the elements of the second vector to be in the same positions as the matching elements in the first vector:

reorder_idx <- match(first,second) # Saving indices for how to reorder `second` to match `first`

second[reorder_idx]  # Reordering the second vector to match the order of the first vector
second_reordered <- second[reorder_idx]  # Reordering and saving the output to a variable
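One way to confirm that the reordering worked is to compare the result with the target vector. The sketch below repeats the vectors from above so that it runs on its own; identical() is a base R function used here as an extra check and is not part of the original lesson.

```r
first  <- c("A", "B", "C", "D", "E")
second <- c("B", "D", "E", "A", "C")

second_reordered <- second[match(first, second)]

# Element-by-element comparison, collapsed to a single TRUE/FALSE
all_equal <- all(second_reordered == first)

# identical() additionally requires the same length and type
exactly_same <- identical(second_reordered, first)
```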

Now that we know how match() works, let’s change vector second so that only a subset are retained:

first <- c("A","B","C","D","E")
second <- c("D","B","A")  # remove values

And try to match() again:

match(first,second)

## [1]  3  2 NA  1 NA

NOTE: By default, match() returns NA for values that have no match. You can specify a different value to return instead using the nomatch argument. Also, if more than one matching value is found, only the first is reported.
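The behaviour described in this note can be sketched as follows. The choice of nomatch = 0 is just one option, picked here because zero indices are dropped when subsetting; the duplicate example at the end uses an invented vector.

```r
first  <- c("A", "B", "C", "D", "E")
second <- c("D", "B", "A")

idx_na   <- match(first, second)               # unmatched values become NA
idx_zero <- match(first, second, nomatch = 0)  # unmatched values become 0

# NA indices produce NA elements; zero indices are silently dropped
with_na    <- second[idx_na]
without_na <- second[idx_zero]

# With duplicated values, only the first matching position is returned
first_hit <- match("A", c("A", "B", "A"))
```

Dropping unmatched values silently can hide missing samples, so keeping the default NA and checking for it explicitly is often the safer habit.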

Reordering genomic data using match() function

Using the match function, we would now like to match the row names of our metadata to the column names of our expression data, so these will be the arguments for match. Using these two arguments we will retrieve a vector of match indices. The resulting vector represents the re-ordering of the column names in our data matrix needed to make them identical to the row names in metadata:

rownames(metadata)

colnames(rpkm_data)

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))
genomic_idx

Now we can create a new data matrix in which columns are re-ordered based on the match indices:

rpkm_ordered  <- rpkm_data[,genomic_idx]

Check and see what happened by using head. You can also verify that the column names of this new data matrix match the metadata row names by using the all function:

head(rpkm_ordered)
all(rownames(metadata) == colnames(rpkm_ordered))
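The whole workflow can be rehearsed on toy data before running it on real files. This is a sketch under stated assumptions: toy_meta and toy_rpkm are invented stand-ins for metadata and rpkm_data, and the stopifnot() guard is an added precaution rather than a step from the lesson.

```r
# Toy stand-ins for metadata and rpkm_data; all names and values are invented
toy_meta <- data.frame(genotype = c("Wt", "KO", "Wt"),
                       row.names = c("sample1", "sample2", "sample3"))
toy_rpkm <- data.frame(sample3 = c(5, 0),
                       sample1 = c(2, 1),
                       sample2 = c(9, 4),
                       row.names = c("geneA", "geneB"))

# Positions in the column names that correspond to each metadata row
idx <- match(rownames(toy_meta), colnames(toy_rpkm))

# Stop early if any metadata sample has no matching column
stopifnot(!anyNA(idx))

toy_ordered <- toy_rpkm[, idx]
in_same_order <- all(rownames(toy_meta) == colnames(toy_ordered))
```

Checking for NA before subsetting matters because indexing a data frame's columns with an NA index fails, and the resulting error message does not say which sample was missing.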

Now that our samples are ordered the same in our metadata and counts data, if these were raw counts we could proceed to perform differential expression analysis with this dataset.
