Julia Retey, Ph.D. Genedata
Sebastien Ribrioux, Ph.D. Genedata
You’ve just received the freshly assembled genomic sequence of your favorite organism. Now what?
A deluge of genome sequences has attended the rise of next-generation sequencing (NGS) technologies. However, learning more about the genotype-phenotype relationship can be challenging. Here are some tips to identify and understand the function of genes in a novel genome.
- Run an RNAseq experiment. In organisms where no transcript splicing occurs (principally prokaryotes), ab initio prediction methods (implemented, for example, in the Glimmer program) can identify most of the protein-coding sequences. For all other organisms (principally eukaryotes), these algorithms have their limitations: correctly identifying exon-intron boundaries remains a considerable challenge, and noncoding RNAs are ignored altogether. A more straightforward and comprehensive way to identify protein-coding sequences relies on empirical evidence, namely short-read sequencing of mRNA (RNAseq), which has become affordable with NGS. Using a paired-end read technology, representative RNA samples can be sequenced at the same time as the genomic DNA.
- Map the short reads to the genome. A prerequisite to identifying full-length transcripts is mapping the RNAseq reads to the assembled genome. A tool that can map reads across exon-intron boundaries (e.g., TopHat) is required to achieve the highest possible accuracy, and is indispensable for protein sequence identification (Step 4).
- Identify transcripts. Exons can be identified from the mapped reads, and transcripts can then be built from the exons. To get the most from the data, it is best to employ a tool that identifies the different splice variants of each gene (e.g., Cufflinks).
- Generate protein sequences. Proteins are the end product of a coding gene, and a plethora of tools can predict function based on protein sequence (Step 5). Extract the longest open reading frame (ORF) from each transcript, using a tool such as getorf from the EMBOSS suite.
- Annotate proteins. A lot of information about a gene product can be inferred from sequence similarity. Functional domains can be identified using dedicated resources (e.g., Pfam), and/or function can be predicted through sequence homology with proteins in other organisms (e.g., by running BLAST searches against UniProt).
- Store and analyze genome annotation. Having generated all the annotation for a genome, it’s important to secure effective data mining tools. This becomes particularly important as the number of sequenced genomes grows and the data also need to be stored for other types of analysis (e.g., phenotype-genotype analysis).
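To make the protein-generation step (Step 4) more concrete, here is a minimal Python sketch of longest-ORF extraction from a single transcript. It is an illustration under stated assumptions, not a reimplementation of getorf: an ORF is taken to run from an ATG start codon to the first in-frame stop codon, both strands are scanned, and the standard genetic code is used. The function names `longest_orf` and `translate` are hypothetical, chosen for this example only.

```python
from typing import Optional

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (characters other than ACGT are kept as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf(transcript: str, both_strands: bool = True) -> Optional[str]:
    """Return the longest ATG-to-stop ORF (stop codon included), or None.

    Assumption for this sketch: an ORF must begin with ATG and end with an
    in-frame stop codon; ORFs truncated at the transcript ends are ignored.
    """
    seq = transcript.upper()
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    best = None
    for strand in strands:
        for frame in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if best is None or len(orf) > len(best):
                        best = orf
                    start = None  # close this ORF and look for the next ATG
    return best


def translate(orf: str) -> str:
    """Translate an ORF into a protein string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    transcript = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
    orf = longest_orf(transcript)
    print(orf)             # ATGAAATTTGGGTAA
    print(translate(orf))  # MKFG
```

In practice each assembled transcript would be passed through such a function, and the resulting protein sequences handed on to the annotation step. Whether to accept ORFs without a start codon, or ORFs truncated at a transcript end, is a design choice that depends on how complete the assembled transcripts are.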