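I'd like to recommend an efficient tool for PCA analysis: a one-step workflow that goes from a VCF file straight to finished plots, which makes it very beginner-friendly. I have been using it since Xiaoming first started developing it, back when it was still called MingPCACluster. It was finally published this May. Congratulations!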
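VCF2PCACluster is PCA and clustering software built for population SNP data in VCF format. It also accepts other input formats such as Genotype, so a single input file is enough to run PCA, clustering, and plotting in one step. It is simple, easy to use, and efficient. Its main features are:
- SNP site filtering, e.g. tri-allelic sites, MAF, etc.
- Five methods for computing the kinship matrix
- PCA based on the kinship matrix
- Three clustering algorithms applied to the PCA results
- Visualization of the PCA and clustering results
Main highlight: PCA and clustering plots are generated efficiently in a single step. The design emphasizes high speed and low memory use, and one command takes a single input file through to PCA results and plots, which keeps it user-friendly.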
Repository: https://github.com/hewm2008/VCF2PCACluster
In my experience it is indeed much faster than most other software.
Installation
git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster; chmod 755 -R bin/*
./bin/VCF2PCACluster -h ### print help information
Main parameters
# for more Help document please see the manual. Para [-i] is show for [-InVCF], Para [-o] is show for [-OutPut]
Usage: VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix [options]
-InVCF <str> Input SNP VCF Format
-InKinship <str> Input SNP K Kinship File Format
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
-KinshipMethod <int> Method of Kinship [1-5],defaut [1]
1:Normalized_IBS(Yang/BaldingNicolsKinship)
2:Centered_IBS(VanRaden)
3:IBSKinshipImpute 4:IBSKinship 5:p_dis
-ClusterMethod <str> Method For Cluster[EM/Kmean/DBSCAN/None] [EM]
-help v1.40 Show more Parameters and help [hewm2008]
InFile:
-InGenotype <str> InPut Genotype File for no VCF file
-InSubSample <str> Only keep samples from subsample List for PCA[ALLsample]
-InSampleGroup <str> InFile of sample Group info,format(sample groupA)
SNP Filtering:
-MAF <float> Min minor allele frequency filter [0.001]
-Miss <float> Max ratio of miss allele filter [0.25]
-Het <float> Max ratio of het allele filter [1.00]
-HWE <float> Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
-Fchr <str> Filter the chrX chr[chrX,chrY,X,Y]
-KeepRemainVCF keep the VCF after filter
Clustering:
-RandomCenter Random diff-center to Re-Run Cluster for Kmean
-BestKManually <int> manually set the Best K (Num of Cluster) (auto)
-BestKRatio <float> Get the best K Cluster by deta-SSE Ratio[0.15]
-MinPointNum <int> Minimum point number of D-cluster[4]
-Epsilon <float> Epsilon for DBSCAN_Distance/EM_convergence (auto)
-Iterations <int> iterations number for EM clustering[1000]
OutPut:
-PCnum <int> Num of PC eig [10]
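As a rough illustration of how the options above combine, here is a minimal sketch of a typical run. Only the flag names come from the help text. The file names (`in.vcf.gz`, `sample.group`, `pca_out`), the sample IDs, the chosen thresholds, and the tab separator in the group file are assumptions made for illustration. The script prints the command and only executes it when the binary and the input file are actually present.

```shell
#!/bin/sh
# Hypothetical paths and names; adjust to your own data.
bin="./bin/VCF2PCACluster"
vcf="in.vcf.gz"
prefix="pca_out"
group="sample.group"

# Two-column group file (sample ID, group label), following the
# "format(sample groupA)" hint in the help text. The tab separator
# and the sample IDs S1..S3 are assumptions.
printf '%s\t%s\n' S1 groupA S2 groupA S3 groupB > "$group"

# Collect the arguments: stricter MAF than the default, default missing rate,
# kinship method 1 (Normalized_IBS), K-means clustering, 10 PCs.
set -- -InVCF "$vcf" -OutPut "$prefix" -InSampleGroup "$group" \
       -MAF 0.05 -Miss 0.25 -KinshipMethod 1 -ClusterMethod Kmean -PCnum 10

run_line="$bin $*"
echo "$run_line"

# Execute only if the tool and the input exist; otherwise this is a dry run.
if [ -x "$bin" ] && [ -f "$vcf" ]; then
    "$bin" "$@"
fi
```

Printing the command before running it makes it easy to record the exact parameters alongside the results.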
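The software's Chinese and English documentation is already very detailed; for specifics see: https://github.com/hewm2008/VCF2PCACluster
One shortcoming is that we often don't need the clustering results, so clustering would ideally be optional. Otherwise I still have to draw the plots myself, and in that case why not just use PLINK?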
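For what it's worth, the help text above lists `None` among the accepted values of `-ClusterMethod`. Assuming that value skips the clustering step (this has not been verified here, so check the manual), a PCA-only run might be sketched as follows. The input and output names are hypothetical.

```shell
#!/bin/sh
# Hypothetical paths and names; adjust to your own data.
bin="./bin/VCF2PCACluster"
vcf="in.vcf.gz"
prefix="pca_only"

# Assumption: "-ClusterMethod None" disables clustering, as the option list suggests.
set -- -InVCF "$vcf" -OutPut "$prefix" -ClusterMethod None

run_line="$bin $*"
echo "$run_line"

# Execute only if the tool and the input exist; otherwise this is a dry run.
if [ -x "$bin" ] && [ -f "$vcf" ]; then
    "$bin" "$@"
fi
```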
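A few more words
The bioinformatics tools Xiaoming develops are all about being low-key, simple, and practical, for example LDBlockShow, PopLDdecay, RectChr, Reseqtools, and NGenomeSyn. As his former colleague, I have long been a loyal fan of these tools, and I hope he keeps developing useful bioinformatics software. Everyone is welcome to use and cite them.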