众所周知lncRNA属于RNA中的非编码RNA,在转录调控中扮演者重要角色。并且现在听说lncRNA的研究也很火热,使用TCGA的数据对lncRNA的研究也是非常常见的需求。而我们如果想对TCGA的lncRNA进行定量则必须从TCGA RNA-Seq的表达量数据中提取出lncRNA的那部分数据。 我之前也没做过lncRNA相关的研究,所以就从网上搜了下,整理了下思路
首先下载TCGA中的RNA-Seq数据,因为新版的TCGA数据库是用Ensembl id作为基因的命名的,所以这些RNA-Seq基因表达量文件中必然也包含了lncRNA的表达量数据
接着要获取TCGA RNA表达量的Ensembl id中属于lncRNA的id
最后从总的RNA表达量文件中提取lncRNA的表达量数据
第一步已经在前面的博文里做了,是以乳腺癌的TCGA数据为例
第二步获取ensembl id与lncRNA的对应关系
这问题的解决思路第一个想起来应该去GENCODE数据库中寻找,因为逛过ENCODE人类相关数据库的话,会发现有个
Long non-coding RNA gene annotation
选项,果断下载了其gtf文件,GENCODE对这个文件的表述:evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90) - long non-coding RNAs,从中也能看出该文件的版本跟TCGA使用的版本也保持一致,因此只要对该gtf文件进行处理就能获得属于lncRNA的ensembl id粗略看几行gtf格式
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2"; chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1"; chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1"; chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1"; chr1 HAVANA exon 30976 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
因为TCGA是基于基因水平进行定量分析的,那么我需要将第3列是gene的行中的ensembl id,gene name提取出来,ensembl id需要去掉版本号;而至于为什么提gene name,因为我想最后以gene name来表示各个gene。
perl -F'\t' -alne 'next unless ($F[2] eq "gene");$F[8] =~ /gene_id\s+"(\w+)\.\d+";.*?gene_name\s+"(\S+)";/; print "$1\t$2"' gencode.v27.long_noncoding_RNAs.gtf >lncRNA_ensembl2name.list
lncRNA_ensembl2name.list文件总共有15778行,其中12行是重复的数据,那么去重后是15766行
如果仔细研究下,会发现lncRNA应该有其特定的几个gene_type,从上述的gtf文件就可看出,那么哪些gene_type属于lncRNA呢,首先我们从GENCODE的gtf文件中先提取下看看
perl -F'\t' -alne '$F[8] =~ /gene_type\s+"(\S+?)"/;print $1' gencode.v27.long_noncoding_RNAs.gtf |sort |uniq
几个gene_type结果如下:
3prime_overlapping_ncRNA antisense_RNA bidirectional_promoter_lncRNA lincRNA macro_lncRNA non_coding processed_transcript sense_intronic sense_overlapping TEC
总共有10种类型,那么ENSEMBL官网是怎么对lncRNA定义及分类的呢。http://www.ensembl.org/Help/Faq?id=468中lncRNA的biotype分类有:
3prime overlapping ncrna ambiguous orf antisense antisense RNA lincRNA ncrna host processed transcript sense intronic sense overlapping
ENSEMBL官网还有一个对lncRNA biotype的定义http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html,怎么同是ENSEMBL官网,怎么还有差别?;其实后者是Havana biotypes,ENSEMBL是这么说的:The full list of Havana biotypes can be found here,那么这个Havana对gene的分类(biotype)是怎么样的,如下:
Non coding 3prime_overlapping_ncRNA Antisense lincRNA (long interspersed ncRNA) Retained_intron Sense_intronic Sense_overlapping Macro_lncRNA Bidirectional lncRNA
其实我们最终还是要看ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中的gene_biotype有哪些。。然后我们再根据gene_biotype中属于lncRNA的ensembl id提取出来,先看看Homo_sapiens.GRCh38.90.chr.gtf总共有哪些:
perl -F'\t' -alne '$F[8] =~ /gene_biotype\s+"(\S+?)";/; print $1' Homo_sapiens.GRCh38.90.chr.gtf |sort |uniq
结果比较长:
3prime_overlapping_ncRNA antisense_RNA bidirectional_promoter_lncRNA IG_C_gene IG_C_pseudogene IG_D_gene IG_J_gene IG_J_pseudogene IG_pseudogene IG_V_gene IG_V_pseudogene lincRNA macro_lncRNA miRNA misc_RNA Mt_rRNA Mt_tRNA non_coding polymorphic_pseudogene processed_pseudogene processed_transcript protein_coding pseudogene ribozyme rRNA scaRNA scRNA sense_intronic sense_overlapping snoRNA snRNA sRNA TEC TR_C_gene TR_D_gene TR_J_gene TR_J_pseudogene TR_V_gene TR_V_pseudogene transcribed_processed_pseudogene transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene translated_processed_pseudogene unitary_pseudogene unprocessed_pseudogene vaultRNA
综合一下,以下biotype可以认为属于lncRNA?(其实我也不确定这样可行否)
3prime_overlapping_ncRNA antisense_RNA bidirectional_promoter_lncRNA lincRNA macro_lncRNA non_coding processed_transcript sense_intronic sense_overlapping
从这里可以看出来了,ENSEMBL的gtf文件中属于lncRNA的有以上9种,除了没有TEC外,其他跟从GENCODE提取出来的完全一致。ENSEMBL对TEC定义是这样的:This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of novel protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.我理解为TEC biotype是GENCODE认为可能是新的转录本,但是需要进行实验验证,这就可以解释为何从GENCODE lncRNA的gtf文件提取出来的gene type中有TEC
最后我需要将ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中属于lncRNA的ensembl id给提取出来,当然也顺便提取了gene name和biotype
#!/usr/bin/perl -w use strict; my $biotype = shift @ARGV; my %biotype_list; open my $fh1, $biotype or die; while (<$fh1>) { chomp; $biotype_list{$_} = 1; } close $fh1; my $gtf = shift @ARGV; open my $fh2, $gtf or die; while (<$fh2>) { chomp; my @array = split /\t/, $_; next unless ($array[2] eq "gene"); $array[8] =~ /gene_id\s+"(\S+?)";.*gene_name\s+"(\S+?)";.*gene_biotype\s+"(\S+?)";/; my $geneid = $1; my $genename = $2; my $genebiotype = $3; if ($biotype_list{$genebiotype}) { print "$geneid\t$genename\t$genebiotype\n"; } } close $fh2;
输出结果有3列,第一列是ensembl id,第二列是gene name,第三列是biotype,总共有14700行,然后跟GENCODE的结果比较下,后者包含了前者,说明结果是正常的,因为后者还多一个TEC的biotype。
最后我例行跟生信人工具再比较下,因为其也有一个ENSG_ID_LNC.txt文件,里面记录了ensembl id跟lncRNA的对应关系,当然也有biotype信息,先粗略看下biotype有几类,结果如下:
3prime_overlapping_ncRNA antisense bidirectional_promoter_lncRNA lincRNA macro_lncRNA processed_transcript sense_intronic sense_overlapping
跟我上面总结的结果比少了一个non_coding的biotype,那么再看一下相同的结果有多少,结果显示只有4000不到的对应关系是相同的,其实就是说至于4000不到的ensembl id对应的lncRNA的gene name是相同的,应该是lncRNA的命名的差别,只是还没查到其ensembl对应的lncRNA是什么来历;有些是lncRNA命名的版本号不同。
最后检查了好久,才晓得原来生信人用的人类ENSEMBL版本是GRCh37.p13,而我用的是GRCh38.p10,TCGA用的版本也是GRCh38,具体可参看网址GDC TCGA的历史版本,其在2016年就使用GRCh38了,比如mRNA-Seq数据则是BAM files aligned to GRCh38 using STAR 2-pass strategy,所以还是纯手工的比较靠谱,软件毕竟是别人开发的。。。
第三步,只要基于ensembl id和lncRNA id的对应关系,从TCGA的表达矩阵中属于lncRNA的那些基因的表达量提取出来即可,然后再按照需要再考虑要不要转化为ensembl的gene name
Summary
从上述几步应该可以顺利的将TCGA中lncRNA的数据从RNA-Seq中提取出来,至于拿lncRNA数据来获得什么具有生物意义的结论,我暂时还没学会。。从TCGA的数据下载到这篇的提取lncRNA可看出,TCGA数据的使用也不是一个太容易的事情。
本文出自于http://www.bioinfo-scrounger.com转载请注明出处