TransDecoder按照其官网的说明,主要用于识别转录本序列中的潜在的编码区域,也就是预测CDS。转录本可以由RNA-Seq数据通过Trinity组装来的,也可以由RNA-Seq比对到参考基因组上构建的转录本。 TransDecoder识别可能的编码区域是基于以下几个标准:
- a minimum length open reading frame (ORF) is found in a transcript sequence
- a log-likelihood score similar to what is computed by the GeneID software is > 0
- the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames
- if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc)
- optional the putative peptide has a match to a Pfam domain above the noise cutoff score
其是基于马尔可夫模型,这点跟ESTScan软件(基于隐马尔科夫模型)有较大区别。但是相比较ESTScan,TransDecoder软件使用的简便性是非常令人满意的。
从FASTA格式文件预测编码区域
提取FASTA序列中最长的ORF
TransDecoder.LongOrfs -t target_transcripts.fasta
可以通过-m 参数来设定ORF的最小长度,按照其说法,随着ORF长度的减少,其假阳性的概率也会随之上升,其默认值是100。所以也可以按照自己的需求进行调整-m 参数
可选的步骤
通过Blast或者pfam搜索已知蛋白序列来识别ORF
预测可能的编码区域
TransDecoder.Predict -t target_transcripts.fasta [ homology options ]
这里要注意的是,运行该程序时,需跟第一步的当前路径保持一致。因此最终的文件可以在当前目录找到,也就是后缀为.pep, .cds, .gff3和.bed的文件
我觉得第二步不必按照其要求来做,一般来说,我们会使用TransDecoder对无参转录组的拼接结果序列预测其CDS,所以我们可以先将拼接序列用BLAST比对nr以及swissprot蛋白数据库,然后提取其比对上的同源序列的位置来识别CDS,最后再通过TransDecoder的第一步和第三步来预测那些未比对上的序列的CDS。
从有参考基因组的转录结果GTF文件预测编码区域
需要有参转录组比对后拼接的转录本的GTF文件以及参考基因组序列
util/cufflinks_gtf_genome_to_cdna_fasta.pl transcripts.gtf test.genome.fasta > transcripts.fasta
将GTF文件转化为GFF3文件
util/cufflinks_gtf_to_alignment_gff3.pl transcripts.gtf > transcripts.gff3
接着就跟上面的步骤一样了
TransDecoder.LongOrfs -t transcripts.fasta (optionally, identify peptides with homology to known proteins) TransDecoder.Predict -t transcripts.fasta [ homology options ]
最后生成一个基于有参基因组的编码区域注释文件
util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3
输出文件说明
其实结果文件中我们最关注的就2个文件,一个是预测后的CDS序列,也就是在当前路径下的后缀为.cds的文件;另一个则为其对应的蛋白序列文件.pep。
TransDecoder对每个文件有详细的说明:
- longest_orfs.pep: all ORFs meeting the minimum length criteria, regardless of coding potential.
- longest_orfs.gff3: positions of all ORFs as found in the target transcripts
- longest_orfs.cds: the nucleotide coding sequence for all detected ORFs
- longest_orfs.cds.top_500_longest: the top 500 longest ORFs, used for training a Markov model for coding sequences.
- hexamer.scores: log likelihood score for each k-mer (coding/random)
- longest_orfs.cds.scores : the log likelihood sum scores for each ORF across each of the 6 reading frames
- longest_orfs.cds.scores.selected : the accessions of the ORFs that were selected based on the scoring criteria (described at top)
- longest_orfs.cds.best_candidates.gff3 : the positions of the selected ORFs in transcripts
Then, the final outputs are reported in your current working directory:
- transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
- transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
- transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
- transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.
参考自:http://transdecoder.github.io/
本文出自于http://www.bioinfo-scrounger.com转载请注明出处