先搬一段VarScan的介绍http://dkoboldt.github.io/varscan/
VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.
It can be used to detect different types of variation:
- Germline variants (SNPs an dindels) in individual samples or pools of samples
- Multi-sample variants (shared or private) in multi-sample datasets (with mpileup)
- Somatic mutations, LOH events, and germline variants in tumor-normal pairs
- Somatic copy number alterations (CNAs) in tumor-normal exome data
从上面可以看出,VarScan可以用于Germline variants以及Somatic mutations,因此我用这软件来Call SNP。。。
至于为什么选用VarScan来call variant,VarScan是这么说的:对于NGS数据,大多数variant caller都是基于贝叶斯算法来识别变异位点,然后通过可信度来评估,但会受extreme read depth,pooled samples,contaminated or impure samples的影响;而VarScan是基于启发式算法来寻找变异位点,可以很好的meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance
安装很简单,下载jar文件即可使用,https://sourceforge.net/projects/varscan/files/
用samtools mpileup生成mpileup文件作为VarScan的输入文件,这里跟samtools+bcftools流程的是不同的,所以只需要输出mpileup文件即可,那么就不用加-g参数了,还有其他参数倒是可以加上
-q, -min-MQ 比对的mapping quality
-d, --max-depth 最大测序深度,过滤掉超深度测序的位点
samtools mpileup -q 1 -d 30000 -f ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta merged_marked.bam 1>merged_marked.mpileup 2>mpileup.log
既然是call snp,那么肯定是用VarScan2的mpileup2snp命令,具体参数可以查看http://dkoboldt.github.io/varscan/using-varscan.html#v2.3_mpileup2snp,其他命令也类似
USAGE: java -jar VarScan.jar mpileup2snp [mpileup file] OPTIONS mpileup file - The SAMtools mpileup file
参数不多,全部列出来也可以:
--min-coverage Minimum read depth at a position to make a call(位点的测序深度,覆盖度)默认 8
--min-reads2 Minimum supporting reads at a position to call variants(paired reads,同上)默认 2
--min-avg-qual Minimum base quality at a position to count a read(位点上的reads碱基质量)默认 15
--min-var-freq Minimum variant allele frequency threshold(不太懂这个阈值是怎么卡的。。。。。。)默认 0.01
--min-freq-for-hom Minimum frequency to call homozygote(这个还是不懂)默认 0.75
--p-value Default p-value threshold for calling variants(call出来的varians的可行度)默认是0.99
--strand-filter Ignore variants with >90% support on one strand 默认1
--output-vcf If set to 1, outputs in VCF format
--variants Report only variant (SNP/indel) positions (mpileup2cns only)(对于mpileup2snp就无视这个了) [0]
由上可看出,大部分默认参数即可,特定情况再依情况修改
java -jar VarScan.jar mpileup2snp merged_marked.mpileup --output-vcf 1 >varscan.snp.vcf
但是VarScan的http://dkoboldt.github.io/varscan/germline-calling.html里面对于input文件(mpileup)又一个卡阈值,如下:
Only bases meeting the minimum base quality (default: 20) from reads meeting the minimum mapping quality (default: 1) are considered. The coverage (number of qualifying bases) is calculated. If this meets the minimum threshold (default: 20), VarScan examines each allele that was observed, testing to see if it:
Was supported by at least the minimum number of supporting reads [--min-reads2]
Meets the minimum allele frequency threshold [--min-var-freq]
Passes a basic strand-bias filter (if --strand-filter set to 1)
Has a Fisher's Exact Test p-value below the threshold (if --p-value specified)简单的说就是满足minimum base quality=20,minimum mapping quality=1,minimum coverage=20的位点才call variant,怎么跟mpileup2snp的默认参数又有些不一致了呢,我暂时也没搞清楚
过滤位点,VarScan call出来的vcf文件有点跟其他软件不同,最明显的地方就是没有QUAL值,毕竟算法不一样嘛,但是其他常用的参数比如GQ,DP还是都有值的
varscan也有自己filter命令,内容很简单http://dkoboldt.github.io/varscan/using-varscan.html#v2.3_filter,可以看看
所以也可以从vcf的第10列的信息来过滤,脚本可以简单的过滤低GQ和DP值的位点
本文出自于http://www.bioinfo-scrounger.com转载请注明出处