Gene Family Analysis 2 || Database Search and Member Identification

Gene family workflow: Gene Family Analysis (1)

==================================================================

Database search and member identification

1) Database search: commonly used databases (plants)

· TAIR: http://www.arabidopsis.org/
· phytozome: https://phytozome.jgi.doe.gov/pz/portal.html
· Ensemblplants: http://plants.ensembl.org/index.html
· Pfam: http://pfam.xfam.org/ (a classification system for protein domain annotation)

2) Obtaining already-identified family members (needed for multi-species phylogenetic trees)

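1. For details on a specific family, refer to the published Arabidopsis gene families: https://www.arabidopsis.org/browse/genefamily/index.jsp
2. Using the IDs given in the literature, go to one of the databases above (phytozome, etc.), download the protein sequence file for that species, and locate the corresponding members.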

Note: the number reported in the literature may be lower than the actual number, because the databases have been updated since.

3) Obtaining not-yet-identified family members (i.e., the gene family analysis itself)

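Files to prepare: the hmm file, the protein database, and the Arabidopsis sequences of the target protein family.
Alignment tools: blast and hmmer are generally used.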

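[1] Preparing the hmm file: so that everyone can follow along hands-on, I picked a common Arabidopsis family, 14-3-3, also called GRF. As shown in the figure below, download the protein sequences as At_14_3_3.fa. Source: https://www.arabidopsis.org/browse/genefamily/index.jsp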

At 14-3-3.png

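1. Upload the first Arabidopsis 14-3-3 protein, At4g09000, or a family protein sequence from an already-characterized species, to phmmer search. Check which hmm model corresponds to the domain this 14-3-3 protein contains (e.g. PF00244), then follow the link to pfam and download the hmm model.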

14-3-3 domain.png

model.png

Download hmm.png

2. Alternatively, the literature gives the pfam accession for 14-3-3 as PF00244; search for it in pfam directly, and the rest is the same as above.

[2] Identifying members with hmmer and blast.

1. Downloading and installing hmmer

The official site no longer provides the Windows build of version 3.0, so all analyses below are run in a Linux environment.

# The latest version at the time of writing is 3.2.1 (13 June 2018)
$ wget http://eddylab.org/software/hmmer/hmmer-3.2.1.tar.gz
$ mkdir -p ~/biosoft    # tar -C needs the target directory to exist
$ tar zxf hmmer-3.2.1.tar.gz -C ~/biosoft/
$ cd ~/biosoft/hmmer-3.2.1
$ ./configure 
$ make 
$ make check 
$ echo 'PATH=$PATH:~/biosoft/hmmer-3.2.1/src/' >> ~/.bashrc
$ source ~/.bashrc
$ hmmsearch -h
Usage: hmmsearch [options] <hmmfile> <seqdb>
Options directing output:
  -o <f>           : direct output to file <f>, not stdout
  -A <f>           : save multiple alignment of all hits to file <f>
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

hmmer identification: using the pumpkin (Cucurbita moschata) protein database

pengzw@super-server:~$ wget http://pfam.xfam.org/family/PF00244/hmm
pengzw@super-server:~$ mv hmm 14-3-3.hmm
pengzw@super-server:~$ wget ftp://cucurbitgenomics.org/pub/cucurbit/genome/Cucurbita_moschata/v1/Cmoschata_v1.protein.fa.gz
pengzw@super-server:~$ gunzip Cmoschata_v1.protein.fa.gz
pengzw@super-server:~$ hmmsearch -o ./14-3-3_hmm.txt 14-3-3.hmm Cmoschata_v1.protein.fa
pengzw@super-server:~$ ls
14-3-3.hmm  14-3-3_hmm.txt   Cmoschata_v1.protein.fa

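Note:
1. In some databases the protein set includes alternative splicing isoforms. The analysis requires keeping only one representative sequence per gene, usually the longest one. The Arabidopsis database already provides a file with a single sequence per gene, which can be downloaded directly.
2. Ways to keep the longest transcript: (1) the Fasta Longet Representative function in TBtools, written by CJ; (2) a python script: see 生信媛.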
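For readers who would rather script option (2), here is a minimal Python sketch of the "keep the longest isoform per gene" idea. It is not the 生信媛 script. It assumes that the first word of each FASTA header is the isoform ID and that isoforms of one gene differ only by a trailing `.N` suffix (as in `AT4G09000.1`). Other databases name isoforms differently, so `gene_id` would need adjusting. The demo sequences are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Assumed naming scheme: isoform ID = gene ID + '.N' (N is a number)."""
    seq_id = header.split()[0]
    base, dot, suffix = seq_id.rpartition(".")
    return base if dot and suffix.isdigit() else seq_id


def longest_per_gene(records):
    """Map each gene ID to the (header, sequence) of its longest isoform."""
    best = {}
    for header, seq in records:
        gene = gene_id(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


# Made-up demo data: GeneA has two isoforms, GeneB has one.
demo = """>GeneA.1 first isoform
MKTAYIAK
>GeneA.2 second isoform
MKTAYIAKQRQISFVK
>GeneB.1
MSTE
""".splitlines()

longest = longest_per_gene(read_fasta(demo))
for header, seq in longest.values():
    print(f">{header}\n{seq}")
```

On a real file, pass an open file handle instead of `demo`, for example `longest_per_gene(read_fasta(open("Cmoschata_v1.protein.fa")))`.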

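The result is shown below. There are 19 members above the horizontal line (the inclusion threshold line in the hmmsearch output). Hits below the line can basically be ruled out, but to be safe I usually check them anyway.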

hmm result.png
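Reading the hits off the text report works for one family, but a table is easier to process. The sketch below assumes hmmsearch was rerun with the `--tblout` option listed in the help output above, writing to a file I call `14-3-3_hmm.tbl` (a hypothetical name). It keeps hits whose full-sequence E-value is at or below a cutoff. The 0.01 default here is my assumption, chosen to mirror hmmsearch's default inclusion threshold as I understand it. The two demo rows are made up and only imitate the column layout.

```python
def parse_tblout(lines, max_evalue=0.01):
    """Return (target, evalue, score) for hits passing the E-value cutoff.

    In --tblout output, column 1 is the target name, column 5 the
    full-sequence E-value and column 6 the full-sequence bit score.
    Lines starting with '#' are comments.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target = fields[0]
        evalue = float(fields[4])
        score = float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return hits


# Made-up rows in the --tblout column order (not real results).
demo = """# target name  accession  query name  accession  E-value  score  bias ...
protA  -  14-3-3  PF00244  1.2e-100  330.1  0.1  1.5e-100  329.8  0.1  1.0  1  0  0  1  1  1  1  made-up strong hit
protB  -  14-3-3  PF00244  0.5  5.2  0.0  0.7  4.8  0.0  1.2  1  0  0  1  1  0  0  made-up weak hit
""".splitlines()

hits = parse_tblout(demo)
for target, evalue, score in hits:
    print(target, evalue, score)
```

With a real table, `parse_tblout(open("14-3-3_hmm.tbl"))` returns the candidate list. Raising `max_evalue` is one way to pull in the below-the-line hits for a manual check.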

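2. blast identification: BLAST the Arabidopsis gene family proteins against the species under study (here, its protein set) and keep sequences with at least 50% identity (adjust the cutoff as needed, or follow the literature). At_14_3_3.fa was downloaded from the Arabidopsis database and tidied up.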

pengzw@super-server:~$ makeblastdb -in Cmoschata_v1.protein.fa -dbtype prot -parse_seqids -out Cmoschata_v1.protein.db
# blast alignment:
pengzw@super-server:~$ blastp -query At_14_3_3.fa -db Cmoschata_v1.protein.db -out 14-3-3.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5
# keep hits with identity (column 3) >= 50%; '>' overwrites, so reruns do not append duplicate rows
pengzw@super-server:~$ awk '$3>=50' 14-3-3.blast > 50.txt

Result: after removing duplicates, there are 14 hits.

blast 50%.png
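Duplicates appear because several Arabidopsis queries hit the same pumpkin protein. Here is a small Python sketch of the filter-and-deduplicate step, assuming the standard 12-column tabular layout produced by `-outfmt 6` (query ID, subject ID, percent identity, ...). The function name and the demo rows are made up for illustration.

```python
def subjects_above_identity(lines, min_identity=50.0):
    """Map each subject ID to its best percent identity, keeping only
    subjects with at least one hit at or above min_identity."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]           # column 2: subject (database) sequence ID
        identity = float(fields[2])   # column 3: percent identity
        if identity >= min_identity and identity > best.get(subject, 0.0):
            best[subject] = identity
    return best


# Made-up outfmt 6 rows: s1 is hit by two queries, s2 falls below the cutoff.
demo = [
    "q1\ts1\t82.5\t250\t40\t1\t1\t250\t3\t252\t1e-120\t400",
    "q2\ts1\t78.0\t248\t50\t2\t1\t248\t3\t250\t1e-110\t380",
    "q1\ts2\t45.0\t200\t100\t5\t1\t200\t10\t209\t1e-30\t120",
    "q2\ts3\t50.0\t240\t110\t4\t1\t240\t5\t244\t1e-60\t210",
]

best = subjects_above_identity(demo)
for subject in sorted(best):
    print(subject, best[subject])
```

`len(best)` is then the number of unique candidates, the figure reported as 14 for the real data above.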

3. Domain confirmation: pfam, phmmer search, SMART, or NCBI Batch CD-Search (I prefer phmmer).

Recommendation: if there are too many sequences, download a local copy of pfam. For details see another post of mine, lnRNA | pfam prediction: http://www.reibang.com/p/3a19db8f3cbb

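Collect the IDs from the hmmer result, extract the corresponding protein sequences (with TBtools; thanks again to CJ), then submit them to the phmmer website and check whether each one contains only a single 14-3-3 domain.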

phmmer.png

phmmer result.png
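If TBtools is not at hand, the ID-based extraction can also be done with a few lines of Python. This is a rough sketch rather than a replacement for TBtools. It assumes each ID matches the first word of a FASTA header exactly, and the demo records are made up.

```python
def extract_by_id(fasta_lines, wanted_ids):
    """Return the FASTA lines (headers and sequence) of the wanted records."""
    wanted = set(wanted_ids)
    keep = False
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            # Match on the first word of the header only.
            keep = bool(words) and words[0] in wanted
        if keep and line:
            out.append(line)
    return out


# Made-up protein records; only protA and protC are requested.
demo = """>protA some description
MKTAYIAKQR
QISFVK
>protB
MSTE
>protC
MAAREE
""".splitlines()

selected = extract_by_id(demo, ["protA", "protC"])
print("\n".join(selected))
```

Writing `"\n".join(selected)` to a file gives a FASTA that can be uploaded to the phmmer website.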

4. Summarizing the results

Combine steps 1, 2 and 3 and take the intersection, cross-checking on the command line or in Excel. The final result: pumpkin has 12 14-3-3 family members.
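As one way to do the cross-check without Excel, the intersection is a single set operation in Python. The three ID sets below are made-up placeholders. In practice each would be read from the ID list produced by the corresponding step.

```python
def shared_members(*id_lists):
    """Return the sorted IDs that occur in every one of the given lists."""
    sets = [set(ids) for ids in id_lists]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Made-up ID lists standing in for the three steps.
hmmer_ids = ["p1", "p2", "p3", "p4", "p5"]   # step 1: hmmsearch hits
blast_ids = ["p1", "p2", "p3", "p6"]          # step 2: blast hits >= 50% identity
domain_ids = ["p1", "p2", "p4", "p6"]         # step 3: single-domain confirmed

members = shared_members(hmmer_ids, blast_ids, domain_ids)
print(len(members), members)

# IDs found by hmmer but missing from the final set, worth a manual look.
dropped = sorted(set(hmmer_ids) - set(members))
print(dropped)
```

Listing the dropped IDs is optional, but it shows which borderline candidates one of the steps rejected.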

5. Summary and notes:

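· Only one domain: hmmer is fast but may return a lot of hits. For example, MAPK, MAPKK and MAPKKK all share the pkinase domain, and the families are separated according to the branches of the phylogenetic tree. In such cases the hits need to be verified against the blast result.
· Two or more domains: search with each domain and take the intersection of the two smaller result sets. This can also be verified against the blast result.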