不能直接輸入vcf真的很煩啊,氨基酸序列fasta太長了還會直接Segmentation Fault
過濾獲得PASS的vcf
我只用了gatk里的af-onl-gnomAD文件過濾應(yīng)該還可以annovar基于篩選的注釋,然后用vcftools之類的東西篩選一下
vcftools --gzvcf /path/to/sample.filter.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > /path/to/sample_PASS_only.vcf.gz
(應(yīng)該還可以annovar基于篩選的注釋再篩選一下,但是那樣就需要把annovar生成的vcf用convert2annovar.pl
轉(zhuǎn)換成avinput格式再用annotate_variation.pl
進行refGene注釋再coding_change.pl
轉(zhuǎn)換成氨基酸序列触菜,配對vcf包含正常組織和腫瘤組織至少2個樣本,convert2annovar.pl
會讓選擇全都轉(zhuǎn)換成分開的文件還是只轉(zhuǎn)換第1個杭隙,腫瘤是第2個及以后的放妈,所以如果這樣的話可能需要刪掉正常樣本那一列?)
cat input.vcf |awk -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$11}' > output.vcf
annovar轉(zhuǎn)換成氨基酸fasta
annovartable_annovar.pl
注釋的時候不要選--remove
湿右,就可以保留中間文件sample.exonic_variant_function
coding_change.pl --includesnp --onlyAltering path/to/sample.exonic_variant_function /path/to/annovar/humandb/hg38_refGene.txt /path/to/annovar/humandb/hg38_refGeneMrna.fa --outfile /path/to/outdir/sample.fasta
轉(zhuǎn)換fasta序列
修改序列ID格式
sed -i 's/>line.*NM_/>NM_/g' /path/to/outdir/*fasta
sed -i 's/ /;/g' /path/to/outdir/*fasta
去掉不需要的序列诅妹,此處參考NeoPredPipe腳本vcf_manipulate.py,不直接用這個軟件是因為報錯太多不想處理了-_- |||
#!/usr/bin/env python
import os
from Bio import SeqIO
# 指定需要處理的文件夾路徑
folder_path = '/path/to/outdir/fastaFiles_1/'
id_list = ['WILDTYPE', 'immediate-stopgain', 'stop-loss']
# 遍歷文件夾中的所有 fasta 文件
for file_name in os.listdir(folder_path):
file_path = os.path.join(folder_path, file_name)
# 打開文件并讀取所有序列
records = list(SeqIO.parse(file_path, 'fasta'))
# 遍歷每個序列毅人,如果 ID 不包含上述關(guān)鍵詞吭狡,將其保留到新列表中
filtered_records = []
for record in records:
if not any(keyword in record.id for keyword in id_list):
filtered_records.append(record)
# 將篩選后的序列寫回文件
if len(filtered_records) > 0: # 如果有篩選結(jié)果才更新文件
with open(file_path, 'w') as f:
SeqIO.write(filtered_records, f, 'fasta')
節(jié)選突變附近的序列,此處參考NeoPredPipe腳本NeoPredPipe.py
#!/usr/bin/env python
import sys
import os
import subprocess
from Bio import SeqIO
folder_path = '/public/home/xieruoqi/data/1.CR/neoantigen/fastaFiles_1/'
def ExtractSeq(record, pos, n, frameshift=False):
'''
Extracts the proper range of amino acids from the sequence and the epitope length
:param seq: Sequence of mutated amino acid sequence.
:param pos: Location of altered amino acid.
:param n: Length of epitope for predictions.
:return: Returns an amino acid sequence of appropriate lengths.
'''
seq = str(record.seq)
if pos<n: # Cases where start is less than n
miniseq = seq[0:pos+(n)]
elif len(seq)-pos < n: # Cases where end is less than n
miniseq = seq[pos - (n - 1):len(seq)]
else:
miniseq = seq[pos-(n-1):pos+(n)]
if frameshift:
if pos<n:
miniseq = seq[0:len(seq)] # When start is not n away from pos, we go until the end of the sequence
else:
miniseq = seq[pos-(n-1):len(seq)] # All other cases, we still go until the end of the sequence
return(miniseq)
epitopeLens = [8,9,10,11]
for file_name in os.listdir(folder_path):
file_path = os.path.join(folder_path, file_name)
eps = {n: 0 for n in epitopeLens}
epsIndels = {n: 0 for n in epitopeLens}
for n in epitopeLens:
mySeqs = []
mySeqsIndels = []
for record in SeqIO.parse(file_path, 'fasta'):
if 'silent' not in record.id.lower() and 'startloss' not in record.id.lower():
try:
pos = int(record.id.replace(";;",";").split(";")[5].split('-')[0])-1
except ValueError:
try:
pos = int(record.id.replace(";;",";").split(";")[6].split('-')[0])-1
except IndexError:
sys.exit("ERROR: Could not process fasta line in reformatted fasta: %s" % (record.id))
if 'dup' in record.id.lower() or 'del' in record.id.lower() or 'ins' in record.id.lower() or 'fs' in record.id:
miniseq = ExtractSeq(record, pos, n, True)
mySeqsIndels.append(">"+record.id[0:100]+"\n"+miniseq+"\n")
else:
miniseq = ExtractSeq(record, pos, n)
mySeqs.append(">"+record.id+"\n"+miniseq+"\n")
eps[n] = mySeqs
epsIndels[n] = mySeqsIndels
tmpFiles = {}
for n in epitopeLens:
tmpFasta = file_path.replace(".fasta",".tmp.%s.fasta"%(n))
tmpFastaIndels = file_path.replace(".fasta",".tmp.%s.Indels.fasta"%(n))
tmpFiles.update({n:tmpFasta})
tmpFiles.update({str(n)+'.Indels':tmpFastaIndels})
with open(tmpFasta, 'w') as outFile:
for line in eps[n]:
outFile.write(line)
with open(tmpFastaIndels, 'w') as outFile:
for line in epsIndels[n]:
outFile.write(line)
HLA分型提取及格式轉(zhuǎn)換
先提取可能包含6號染色體HLA區(qū)域reads丈莺,不要用腫瘤樣本划煮,用正常組織樣本
然后選一個HLA分型預(yù)測軟件,這里用的是OptiType
大多數(shù)軟件都要求輸入fq
samtools view -hb /path/to/sample_blood.final.bam \
chr6:28510120-33480577 \
chr6_GL000250v2_alt:1066038-4433734 \
chr6_GL000251v2_alt:1283988-4540572 \
chr6_GL000252v2_alt:1063230-4372611 \
chr6_GL000253v2_alt:1062914-4548533 \
chr6_GL000254v2_alt:1062887-4416229 \
chr6_GL000255v2_alt:1063190-4323464 \
chr6_GL000256v2_alt:1106450-4577757 \
> /path/to/sample_blood.chr6.bam && \
samtools view -hb -@ 8 -f 12 /path/to/sample_blood.final.bam \
> /path/to/sample_blood.unmapped.bam && \
samtools merge /path/to/sample_blood.merge.bam /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam && \
samtools sort -n /path/to/sample_blood.merge.bam -@ 8 -o /path/to/sample_blood.hla.bam && \
samtools fastq -@ 8 /path/to/sample_blood.hla.bam \
-1 /path/to/sample_blood.hla.1.fq \
-2 /path/to/sample_blood.hla.2.fq \
-s /dev/null && \
rm /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam /path/to/sample_blood.merge.bam && \
OptiTypePipeline.py -i /path/to/sample_blood.hla.1.fq /path/to/sample_blood.hla.2.fq --dna -v -o /path/to/sample
然后把輸出的tsv文件轉(zhuǎn)換為netMHCpan
需要的格式
cat /path/to/sample/*/*.tsv |awk -v OFS=',' 'NR==2 {print $2,$3,$4,$5,$6,$7}' > /path/to/sample/sample.HLA.txt && \
sed -i "s/,/,HLA-/g" /path/to/sample/sample.HLA.txt && \
sed -i "s/*//g" /path/to/sample/sample.HLA.txt && \
sed -i "s/^/HLA-/g" /path/to/sample/sample.HLA.txt
然后就可以順利運行netMHCpan啦缔俄,不過我還是不太理解為什么netMHCpan預(yù)測的Affinity對不上BindLevel