https://broadinstitute.github.io/picard/picard-metric-definitions.html
https://broadinstitute.github.io/picard/index.html
Picard is a set of command-line tools for working with high-throughput sequencing data and related formats such as SAM/BAM and VCF. For the file formats, see Hts-specs, the SAM specification and the VCF specification.
Usage:
java jvm-args -jar picard.jar PicardToolName OPTION1=value1 OPTION2=value2...
All metric classes
- AlignmentSummaryMetrics: summary statistics for alignment results (SAM/BAM), produced by CollectAlignmentSummaryMetrics and written to a file with the extension ".alignment_summary_metrics".
- ClusteredCrosscheckMetric: results of clustered fingerprint crosschecking.
- CollectHiSeqXPfFailMetrics.PFFailDetailedMetric: a metric class for describing PF-failing reads from an Illumina HiSeqX lane.
- CollectHiSeqXPfFailMetrics.PFFailSummaryMetric: Metrics produced by the GetHiSeqXPFFailMetrics program.
- CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
- CollectQualityYieldMetrics.QualityYieldMetrics: metrics describing the overall quality of a BAM file.
- CollectVariantCallingMetrics.VariantCallingDetailMetrics: SNP- and indel-related metrics for a given VCF file.
- CollectVariantCallingMetrics.VariantCallingSummaryMetrics: same as above.
- CollectWgsMetrics.WgsMetrics: metrics for evaluating whole-genome sequencing results.
- CollectWgsMetricsWithNonZeroCoverage.WgsMetricsWithNonZeroCoverage: same as above.
- CrosscheckMetric: results of fingerprint crosschecking.
- DuplicationMetrics: metrics computed while marking duplicates in a SAM file.
- ErrorSummaryMetrics: summary metrics computed by CollectSequencingArtifactMetrics, giving the error rate for each type of base change.
- ExtractIlluminaBarcodes.BarcodeMetric: metrics computed by ExtractIlluminaBarcodes, which analyzes the data in the Basecalling directory and assigns each read to a barcode.
- FingerprintingDetailMetrics: detailed metrics about a single SNP/haplotype comparison within a fingerprint.
- FingerprintingSummaryMetrics: summary fingerprinting metrics and statistics comparing the sequencing data.
- GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
- GcBiasMetrics:
- GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
- GenotypeConcordanceContingencyMetrics: Class that holds metrics about the Genotype Concordance contingency tables.
- GenotypeConcordanceDetailMetrics: Class that holds detail metrics about Genotype Concordance.
- GenotypeConcordanceSummaryMetrics: Class that holds summary metrics about Genotype Concordance.
- HsMetrics: Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments.
- IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
- IlluminaLaneMetrics: Embodies characteristics that describe a lane.
- IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
- IndependentReplicateMetric: A class to store information relevant for biological rate estimation.
- InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
- JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
- MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
- MergeableMetricBase: An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated.
- MultilevelMetrics:
- RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
- RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC.
- RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC.
- SequencingArtifactMetrics.BaitBiasDetailMetrics: Bait bias artifacts broken down by context.
- SequencingArtifactMetrics.BaitBiasSummaryMetrics: Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
- SequencingArtifactMetrics.PreAdapterDetailMetrics: Pre-adapter artifacts broken down by context.
- SequencingArtifactMetrics.PreAdapterSummaryMetrics: Summary analysis of a single pre-adapter artifact.
- TargetedPcrMetrics: Metrics class for the analysis of reads obtained from targeted pcr experiments e.g.
- UmiMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.
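Detailed functionality
CollectHsMetrics:
Collects metrics for targeted (hybrid-selection) sequencing.
This command reads a SAM/BAM file. Hybrid selection (HS, also called hybrid capture) is a widely used technique for targeted sequencing, for example exome sequencing; for more information see the GATK Dictionary entry.
The command requires:
1) The alignments (SAM/BAM).
2) The capture intervals (supplied by the vendor of the capture kit). If they are in BED format, convert them with the BedToIntervalList tool into the interval_list format that Picard expects.
3) Optionally, a reference sequence; if one is given, the AT_DROPOUT and GC_DROPOUT metrics are computed as well.
Regions with very high or very low GC content tend to have a higher sequencing error rate, so fewer reads map to them, which lowers mapping efficiency and coverage.
Use PER_TARGET_COVERAGE to get per-target information such as GC content and sequencing depth.
Metrics labelled pct are fractions (0 to 1), not percentages.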
java -jar picard.jar CollectHsMetrics \
I=input.bam \
O=hs_metrics.txt \
R=reference_sequence.fasta \
BAIT_INTERVALS=bait.interval_list \
TARGET_INTERVALS=target.interval_list
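# BAIT_INTERVALS may be the same as TARGET_INTERVALS (I do not fully understand this yet)
Difference between bait and target:
Very few reads are removed when bait coverage is computed, so the bait metrics give a direct view of how well the wet-lab capture worked. When target coverage is computed, many bases are removed because they contribute little to variant calling. The descriptions of the PCT_EXC metrics explain why so many reads are filtered out for the target calculations. Most of the filters can be tuned through parameters.
For a detailed description of the output, see CollectHsMetrics.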
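To look at the values in hs_metrics.txt programmatically, a minimal Python sketch is shown here. It assumes the usual Picard metrics layout: comment lines starting with "#", then a tab-separated header row, then one row of values, then a blank line before the optional histogram section. The function name read_picard_metrics and the sample text (including its numbers) are hypothetical and only illustrate the parsing.

```python
import io


def read_picard_metrics(handle):
    """Return the first metrics row of a Picard metrics file as a dict of strings."""
    header = None
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # comment and section-marker lines
        if not line.strip():
            if header is not None:
                break  # blank line after the metrics table ends the section
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment line holds the column names
        else:
            return dict(zip(header, fields))
    raise ValueError("no metrics row found")


# Hypothetical, heavily shortened file content (real files have many more columns).
sample = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# CollectHsMetrics INPUT=input.bam\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tBAIT_TERRITORY\tTARGET_TERRITORY\n"
    "bait\t2000\t1000\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
    "coverage_or_base_quality\thigh_quality_coverage_count\n"
    "0\t10\n"
)

row = read_picard_metrics(io.StringIO(sample))
print(row["TARGET_TERRITORY"])
```

For a real file, pass an open file handle instead of the io.StringIO object. All values come back as strings, so convert them with int() or float() before doing arithmetic.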
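The metrics reported by CollectHsMetrics fall into three groups.
1) Basic sequencing metrics, from which the other metrics are derived, such as genome size, total number of reads and number of aligned reads.
bait_set: name of the bait set used for hybrid capture.
bait_territory: number of bases that lie on one or more baits.
target_territory: number of unique bases covered by the target regions.
bait_design_efficiency: design efficiency, the ratio target_territory/bait_territory. A value of 1 means a perfect design; 0.5 means that half of the bait bases lie outside the target regions.
PF_READS: total number of reads that pass the vendor's filter.
PF_BASES_ALIGNED: number of unique bases that pass filter (PF) and are aligned to the genome with a mapping score > 0.
on_bait_bases: number of PF_BASES_ALIGNED bases aligned to the baited regions of the genome.
genome_size: number of bases in the reference genome used for alignment.
total_reads: total number of reads in the SAM file.
pf_reads: total number of reads that pass the platform/vendor quality filter.
pf_bases: number of bases in the PF_READS.
pf_unique_reads: number of PF reads that are not marked as duplicates.
pf_uq_reads_aligned: number of pf_unique_reads aligned to the reference with a mapping score > 0.
pf_bases_aligned: total number of aligned bases.
pf_uq_bases_aligned: total number of bases in the pf_uq_reads_aligned reads.
on_target_bases: total number of bases aligned to the target regions.
pct_pf_reads: fraction of the raw reads that pass filter (pf_reads/total_reads).
pct_pf_uq_reads: fraction of the raw reads that pass filter and are not duplicates (pf_unique_reads/total_reads).
pct_pf_uq_reads_aligned: fraction of the PF reads that are non-duplicate and aligned to the reference.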
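The ratio metrics in this first group are simple divisions of the count metrics. The following Python sketch recomputes three of them from hypothetical counts. The numbers are made up for illustration, the column names are written in the upper-case form used in Picard output, and ratio is a helper introduced here.

```python
def ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, not taken from a real run.
counts = {
    "TOTAL_READS": 1_000_000,
    "PF_READS": 950_000,
    "PF_UNIQUE_READS": 855_000,
    "BAIT_TERRITORY": 2_000_000,
    "TARGET_TERRITORY": 1_000_000,
}

derived = {
    # 0.5 here: half of the bait bases lie outside the target regions.
    "BAIT_DESIGN_EFFICIENCY": ratio(counts["TARGET_TERRITORY"], counts["BAIT_TERRITORY"]),
    "PCT_PF_READS": ratio(counts["PF_READS"], counts["TOTAL_READS"]),
    "PCT_PF_UQ_READS": ratio(counts["PF_UNIQUE_READS"], counts["TOTAL_READS"]),
}

for name, value in derived.items():
    print(f"{name}\t{value:.3f}")
```

Recomputing the ratios this way is mostly useful as a sanity check on a metrics file, or when counts from several files have to be summed before the ratios are taken.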
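2) Experiment quality, such as the number or fraction of bases aligned on, near or off the baits, the fold-80 base penalty, the captured library size and the capture (HS) penalties. These metrics are computed before filtering on, for example, low mapping quality, low base quality and duplicate reads.
near_bait_bases: number of bases aligned near a bait, i.e. within a fixed interval around it but not on the bait itself.
off_bait_bases: number of bases aligned away from any baited region.
pct_selected_bases: (near_bait_bases + on_bait_bases)/PF_BASES_ALIGNED.
pct_off_bait: off_bait_bases/PF_BASES_ALIGNED.
on_bait_vs_selected: fraction of the bases on or near baits that fall on the baits, on_bait_bases/(on_bait_bases + near_bait_bases).
fold_80_base_penalty: coverage-uniformity metric, the fold over-coverage needed to bring 80% of the bases in non-zero-coverage targets up to the mean coverage. Lower is better; the ideal value is 1.
hs_library_size: estimated number of unique molecules in the captured library.
hs_penalty_10x: capture penalty incurred to get 80% of the target bases to 10X. That is, for a design with 10 Mb of target, reaching 10X coverage requires sequencing until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
hs_penalty_20x: the same penalty for getting 80% of the target bases to 20X coverage.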
hs_penalty_30x
hs_penalty_40x
hs_penalty_50x
hs_penalty_100x
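The selection ratios and the HS penalty formula above can be sketched in Python as follows. The input counts and the penalty value 5.0 are hypothetical, and selection_metrics and required_aligned_bases are names introduced here for illustration; the formulas are the ones given in the metric descriptions.

```python
def selection_metrics(on_bait_bases, near_bait_bases, off_bait_bases, pf_bases_aligned):
    """Compute the bait selection ratios from base counts."""
    selected = on_bait_bases + near_bait_bases
    return {
        "PCT_SELECTED_BASES": selected / pf_bases_aligned,
        "PCT_OFF_BAIT": off_bait_bases / pf_bases_aligned,
        "ON_BAIT_VS_SELECTED": on_bait_bases / selected,
    }


def required_aligned_bases(target_territory, desired_coverage, hs_penalty):
    """Aligned PF bases needed to get 80% of target bases to the desired coverage."""
    return target_territory * desired_coverage * hs_penalty


# Hypothetical counts: 600 on-bait, 200 near-bait and 200 off-bait bases.
ratios = selection_metrics(600, 200, 200, 1000)
print(ratios)

# 10 Mb of target, 10X wanted, hypothetical HS_PENALTY_10X of 5.0.
needed = required_aligned_bases(10**7, 10, 5.0)
print(f"{needed:.0f} aligned PF bases")
```

A penalty of 1 would mean that sequencing exactly territory times coverage is enough; larger penalties reflect off-bait reads, duplicates and uneven coverage that all have to be paid for with extra sequencing.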
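3) Target coverage, used to judge how reliable downstream analysis will be, such as the mean coverage of the target regions, the fraction of bases at each coverage level and the fraction of bases removed by each filter. These metrics are computed after all filters have been applied.
mean_bait_coverage: mean coverage over all baits.
pct_usable_bases_on_bait: fraction of the available PF bases that are aligned, de-duplicated and on bait.
pct_usable_bases_on_target: fraction of the available PF bases that are aligned, de-duplicated and on target.
fold_enrichment: fold by which the baited region has been enriched above the genomic background.
mean_target_coverage: mean coverage of the target regions.
median_target_coverage: median coverage of the target regions.
max_target_coverage: maximum coverage in the target regions.
min_target_coverage: minimum coverage in the target regions.
zero_cvg_targets_pct: fraction of targets in which no base reached a coverage of 1.
Fraction of bases excluded by each filter:
pct_exc_dupe: bases in reads marked as duplicates.
pct_exc_adapter: bases in adapter sequence.
pct_exc_mapq: bases in reads with low mapping quality.
pct_exc_baseq: bases with low base quality.
pct_exc_overlap: bases that were the second observation from an insert with overlapping reads (when the two mates of a pair cover the same base, it is counted only once).
pct_exc_off_target: bases aligned outside the target regions.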
Fraction of bases at each coverage level:
pct_target_bases_1x: fraction of target bases with at least 1X coverage.
pct_target_bases_2x
pct_target_bases_10x
pct_target_bases_20x
pct_target_bases_30x
pct_target_bases_40x
pct_target_bases_50x
pct_target_bases_100x
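The coverage-level fractions follow the same pattern for every threshold: count the target bases whose depth is at least N and divide by the number of target bases. A minimal Python sketch, assuming a plain list of per-base depths that has already been filtered; the depths and the function name pct_target_bases_at are hypothetical, and Picard itself computes these values internally.

```python
def pct_target_bases_at(depths, thresholds=(1, 2, 10, 20, 30, 40, 50, 100)):
    """Fraction of target bases whose depth is at least each threshold."""
    total = len(depths)
    return {
        f"PCT_TARGET_BASES_{t}X": sum(d >= t for d in depths) / total
        for t in thresholds
    }


# Hypothetical per-base depths for a 10-base target.
depths = [0, 1, 5, 12, 25, 25, 40, 60, 100, 150]
fractions = pct_target_bases_at(depths)

for name, value in fractions.items():
    print(f"{name}\t{value:.1f}")
```

Because every threshold counts all bases at or above it, the fractions can only stay equal or decrease as the threshold rises.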
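at_dropout: how far regions with low GC content (GC <= 50%) fall below the mean coverage. The value is a fraction: the share of total reads that should have mapped to low-GC regions but mapped elsewhere.
gc_dropout: the same measure for regions with high GC content (GC >= 50%).
het_snp_sensitivity: theoretical sensitivity for heterozygous (HET) SNPs.
het_snp_q: Phred-scaled Q score of the theoretical HET SNP sensitivity.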
sample
library
read_group