官網(wǎng)是用來學(xué)習(xí)的最好的地方(然而我也沒怎么看...)膜钓!
WDL是什么坚洽?
- WDL是由
Broad Institute
開發(fā)的一種流程開發(fā)語言韭畸,全稱workflow description language
喻犁。
WDL編輯器
使用Sublime Txet就可以,安裝WDL插件就可以漂亮地顯示W(wǎng)DL代碼了上岗。
一個WDL代碼示例
這是一個我老板用來跑RRBS的pipeline福荸,以此來學(xué)習(xí)。
workflow bismarkRRBS {
String file_basename
String masterFolder
File Fastq_R1
File? Fastq_R2
Boolean paired
Boolean MspI
Int clip_R1
Int? clip_R2
Int three_prime_clip_R1
Int? three_prime_clip_R2
String genome_folder
Int threadN
String targetFolder = masterFolder + "/" + file_basename + "/"
call makeDir as md1 {
input:
targetFolder = targetFolder
}
call makeDir as md2 {
input:
targetFolder = md1.outFolder + "report"
}
call trim_galore {
input:
Fastq_R1 = Fastq_R1,
Fastq_R2 = Fastq_R2,
clip_R1 = clip_R1,
clip_R2 = clip_R2,
three_prime_clip_R1 = three_prime_clip_R1,
three_prime_clip_R2 = three_prime_clip_R2,
paired = paired,
MspI = MspI,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder,
file_basename = file_basename
}
call bismark {
input:
R1_trimmed = trim_galore.R1_trimmed,
R2_trimmed = trim_galore.R2_trimmed,
paired = paired,
threadN = threadN,
file_basename = file_basename,
genome_folder = genome_folder,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder
}
call processBam {
input:
file_basename = file_basename,
alignedBam = bismark.alignedBam,
genome_folder = genome_folder,
alignedReport = bismark.alignedReport,
paired = paired,
threadN = threadN,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder
}
output {
File DedupBam = processBam.DedupBam
File DedupBam_index = processBam.DedupBam_index
}
}
task trim_galore {
File Fastq_R1
File? Fastq_R2
Boolean paired
Boolean MspI
Int clip_R1
Int? clip_R2
Int three_prime_clip_R1
Int? three_prime_clip_R2
String targetFolder
String reportFolder
String file_basename
String s1 = if paired then "--paired" else ""
String s2 = s1 + if (clip_R1 > 0) then " --clip_R1 " + clip_R1 else ""
String s3 = s2 + if (three_prime_clip_R1 > 0) then " --three_prime_clip_R1 " + three_prime_clip_R1 else ""
String s4 = s3 + if(paired && clip_R2 > 0) then " --clip_R2 " + clip_R2 else ""
String s5 = s4 + if(MspI) then " --rrbs " else ""
String commandText = s5 + if(paired && three_prime_clip_R2 > 0) then " --three_prime_clip_R2 " + three_prime_clip_R2 else ""
command {
mkdir trimmed
trim_galore ${commandText} -o trimmed --basename ${file_basename} ${Fastq_R1} ${Fastq_R2}
cp trimmed/*report.txt ${reportFolder}
}
output{
File R1_trimmed = if paired then "trimmed/${file_basename}_R1_val_1.fq.gz" else "trimmed/${file_basename}_trimmed.fq.gz"
File R2_trimmed = if paired then "trimmed/${file_basename}_R2_val_2.fq.gz" else "/dev/null"
}
}
task bismark {
File R1_trimmed
File? R2_trimmed
Boolean paired
Int threadN
String file_basename
String genome_folder
String targetFolder
String reportFolder
String commandText = if paired then " -1 " + R1_trimmed + " -2 " + R2_trimmed else R1_trimmed
command {
bismark --genome_folder ${genome_folder} \
${commandText} \
-p ${threadN} \
--rg_id ${file_basename} \
--rg_sample ${file_basename} \
--basename ${file_basename} \
--temp_dir temp_dir \
-o aligned
cp aligned/*report.txt ${reportFolder}
}
output {
File alignedBam = glob("aligned/*.bam")[0]
File alignedReport = glob("aligned/*report.txt")[0]
}
}
task processBam {
File alignedBam
File alignedReport
File genome_folder
String file_basename
Boolean paired
Int threadN
String targetFolder
String reportFolder
String commandText = if paired then "-p" else "-s"
command {
mkdir Dedup
mkdir Call
sambamba sort --natural-sort -m 8GB --tmpdir temp_dir -t ${threadN} -o ${file_basename}_nSorted.bam ${alignedBam}
bismark_methylation_extractor ${commandText} --comprehensive --bedGraph --gzip --genome_folder ${genome_folder} \
--multicore ${threadN} \
-o Call ${file_basename}_nSorted.bam
mv Call/*splitting_report.txt ${file_basename}_splitting_report.txt
mv Call/*M-bias.txt ${file_basename}_M-bias.txt
bismark2report --alignment_report ${alignedReport} \
--mbias_report ${file_basename}_M-bias.txt \
--splitting_report ${file_basename}_splitting_report.txt \
-o ${file_basename}_report.html
sambamba sort -m 8GB --tmpdir temp_dir -t ${threadN} -o ${file_basename}.bam ${file_basename}_nSorted.bam
cp ${file_basename}.bam ${targetFolder}
cp ${file_basename}.bam.bai ${targetFolder}
cp ${file_basename}*.txt ${reportFolder}
cp ${file_basename}_report.html ${targetFolder}
cp -R Call ${targetFolder}
}
output {
File DedupBam = "${file_basename}.bam"
File DedupBam_index = "${file_basename}.bam.bai"
}
}
task makeDir {
String targetFolder
command {
if [ ! -d $targetFolder ] ; then
mkdir ${targetFolder}
else
rm -rf ${targetFolder}
mkdir ${targetFolder}
fi
}
output {
String outFolder = "${targetFolder}"
}
}
WDL使用流程
WDL 腳本核心結(jié)構(gòu)包括 5 個基本組件:workflow
, task
, call
, command
以及 output
肴掷。workflow
即是描述整個流程的框架敬锐,其中調(diào)用( call
)了不同的 task
。每一個 task
中又定義了各自的命令 command
和輸出結(jié)果 output
呆瞻,位于workflow
外部台夺。通過json文件傳入?yún)?shù)。
結(jié)構(gòu)概覽:
# workflow部分
workflow your_workflow_name {
String var1
Int var2
File var3
...
call task_name1 {
input:
var1 = ...
...
}
...
output {
...
}
}
# task部分
task task_name1 {
String var1
Int var2
File var3
...
command {
...
}
output{
...
{
{
...
編寫workflow部分
這一部分就是編寫你的整個流程痴脾,包括聲明變量颤介、調(diào)用(call)任務(wù)(task),定義輸出等部分。
語法結(jié)構(gòu)為:
workflow workflow_name {
var_type1 var1
var_type2 var2
...
call task1 {
input:
...
}
call task1 {
input:
...
}
output {
File ...
# 其實不太理解workflow部分的output有什么作用
}
在 WDL 中共有五種基本變量類型:
-
String
表示字符串 -
Float
為浮點數(shù) -
Int
為整數(shù) -
Boolean
為布爾型 -
File
表示文件
創(chuàng)建變量時,必須聲明變量類型:
String str1 = "hello"
另外滚朵,WDL 還支持復(fù)合類型:
-
Array
為數(shù)組 -
Map
為字典 -
Object
為對象
有些聲明加上了一個 ?
冤灾,這表示該變量為可選變量。
使用變量:
${var_name}
字符可以使用
+
直接相加辕近,合并后的字符之間無空格韵吨,相對應(yīng)R語言的paste0()
函數(shù)if語句:
if xxx then xxx else
,可讀性很高
按照示例代碼解釋一遍:
# 創(chuàng)建名為bismarkRRBS的workflow
workflow bismarkRRBS {
# 聲明所有要使用的變量
String file_basename
String masterFolder
File Fastq_R1
File? Fastq_R2
Boolean paired
Boolean MspI
Int clip_R1
Int? clip_R2
Int three_prime_clip_R1
Int? three_prime_clip_R2
String genome_folder
Int threadN
String targetFolder = masterFolder + "/" + file_basename + "/"
# 調(diào)用任務(wù)亏推。要讀懂call部分学赛,還是要先看定義的task部分
call makeDir as md1 {
input:
targetFolder = targetFolder
}
call makeDir as md2 {
input:
targetFolder = md1.outFolder + "report"
}
call trim_galore {
input:
Fastq_R1 = Fastq_R1,
Fastq_R2 = Fastq_R2,
clip_R1 = clip_R1,
clip_R2 = clip_R2,
three_prime_clip_R1 = three_prime_clip_R1,
three_prime_clip_R2 = three_prime_clip_R2,
paired = paired,
MspI = MspI,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder,
file_basename = file_basename
}
call bismark {
input:
R1_trimmed = trim_galore.R1_trimmed,
R2_trimmed = trim_galore.R2_trimmed,
paired = paired,
threadN = threadN,
file_basename = file_basename,
genome_folder = genome_folder,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder
}
call processBam {
input:
file_basename = file_basename,
alignedBam = bismark.alignedBam,
genome_folder = genome_folder,
alignedReport = bismark.alignedReport,
paired = paired,
threadN = threadN,
targetFolder = md1.outFolder,
reportFolder = md2.outFolder
}
output {
File DedupBam = processBam.DedupBam
File DedupBam_index = processBam.DedupBam_index
}
}
編寫task部分
task部分語法:
task task_name {
var_type1 var1
...
command {
...
}
output {
...
}
}
task的output部分可以被后面的task所調(diào)用,不用output聲明則不能被調(diào)用吞杭。
以結(jié)構(gòu)最簡單的創(chuàng)建目錄task為例:
# 創(chuàng)建名為makeDir的task
task makeDir {
# 聲明該task中要使用的變量
String targetFolder
# 編寫該task的命令
command {
if [ ! -d $targetFolder ] ; then
mkdir ${targetFolder}
else
rm -rf ${targetFolder}
mkdir ${targetFolder}
fi
}
# 該task的定義了名為outFolder內(nèi)容為"${targetFolder}"的一個變量
output {
String outFolder = "${targetFolder}"
}
}
創(chuàng)建json文件
編寫好的wdl腳本只是一個pipeline盏浇,沒有具體(specific)的參數(shù)。所用使用json文件進(jìn)行傳參芽狗。json文件以.json為后綴绢掰,其格式為:
{
“workflow_name.var1":"specific_var1",
“workflow_name.var1":"specific_var2",
...
}
需要注意的是,除了最后一行不需要加逗號童擎,其余行必須加逗號滴劲。
創(chuàng)建json文件的代碼:
# 創(chuàng)建包含配置信息的list
xList = list(var1 = ...,
var2 = ...,
var3 = ...,
...)
# 編寫writeJSON函數(shù)
writeJSON <- function(xList, mainTag, fileOut){
names(xList) = paste0(mainTag, ".", names(xList))
NM = names(xList)
OUT = file(fileOut, "w")
cat("{\n", file = OUT, append = FALSE, sep = "")
for(i in 1:length(xList)){
value = xList[[i]]
if(i < length(xList))
cat("\t\"", NM[i], "\":\"", value, "\",\n", file=OUT, append = TRUE, sep = "")
else
cat("\t\"", NM[i], "\":\"", value, "\"\n", file=OUT, append = TRUE, sep = "")
}
cat("}\n", file = OUT, append = TRUE, sep = "")
close(OUT)
}
JSON = paste0("JSON/", Tag, ".json")
writeJSON(xList, "workflow_name", JSON)
Tag
指的是數(shù)據(jù)樣本的Tag。
writeJSON函數(shù)
可以寫到另外一個文件中顾复,然后再source
調(diào)用比較方便班挖。
運行WDL腳本
不需要安裝,只需要用調(diào)用java包即可芯砸。
wget https://github.com/broadinstitute/cromwell/releases/download/47/cromwell-47.jar
java -jar cromwell-47.jar run test.wdl --inputs test.json
遞交任務(wù)
通過使用qsub命令來對資源使用限量進(jìn)行設(shè)置萧芙,并重定向stderr和stdout。
cmd = "java -jar cromwell-47.jar run test.wdl --inputs test.json"
cmd = paste0("echo ", "\"", cmd, "\"")
cmd = paste(cmd, "| qsub -N", jobName)
cmd = paste(cmd, "-l h_cpu=720:00:00 -V -m n -cwd -pe smp", Ncpu)
cmd = paste0(cmd, " -l h_vmem=", Memory, "G")
cmd = paste(cmd, "-o", outFile, "-e", errFile)
systerm(cmd)
同樣可以寫成一個函數(shù)假丧,source
調(diào)用双揪。
多樣本處理
以上講的是單個或單組樣本的workflow,但一般情況下包帚,會得到多個raw_data渔期,這樣就需要循環(huán)上面的workflow。寫個for循環(huán)即可渴邦,從1
到樣本的數(shù)量
循環(huán)疯趟,提取每個樣本的Tag
,并以此作為jobName
和要保存的文件的文件夾名几莽。