As more and more genomes are sequenced, genomic information is growing explosively. Because the resources available for manual annotation are limited, automatic annotation will be the method of choice for many groups. The GO Consortium coordinated an effort to maximize and optimize the GO annotation of a large and representative set of key genomes, known as 'reference genomes'. The goal of this project was to annotate twelve reference genomes completely, so that those annotations can be used to seed the automatic annotation of other genomes effectively.
Reference Species and Databases
The reference genomes and responsible database groups are:
- Arabidopsis thaliana (The Arabidopsis Information Resource (TAIR))
- Caenorhabditis elegans (WormBase)
- Danio rerio (zebrafish; Zebrafish Information Network (ZFIN))
- Dictyostelium discoideum (dictyBase)
- Drosophila melanogaster (FlyBase)
- Escherichia coli (PortEco)
- Gallus gallus (AgBase)
- Homo sapiens (human; UniProtKB-Gene Ontology Annotation [UniProtKB-GOA] at EBI)
- Mus musculus (Mouse Genome Informatics)
- Rattus norvegicus (Rat Genome Database (RGD))
- Saccharomyces cerevisiae (Saccharomyces Genome Database (SGD))
- Schizosaccharomyces pombe (PomBase)
Reference Genome Project publications
- Reference Genome Group of the Gene Ontology Consortium. The Gene Ontology's Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput. Biol. Jul 2009;5(7):e1000431. PMID:19578431; doi:10.1371/journal.pcbi.1000431
Overview of project goal and strategy
The goals of the Reference Genome Project are:
- provide a set of comprehensive experimental GO annotations for all gene products in all twelve Reference Genomes
- provide tools for using these annotations to infer GO annotations for all fully sequenced genomes
The expected benefits of this approach are:
- Better annotations: each model organism has unique strengths for probing gene function, and bringing this information together helps to interpret experimental results, which improves the accuracy and consistency of annotations.
- More annotations: homology relationships allow accurate inference of functions for genes that have not been characterized experimentally.
- Improvements in the Gene Ontology: cross-organism discussion about annotations frequently leads to new terms being added to the Gene Ontology.
The annotation process consists of the following steps:
- Identify the initial set of target genes in (typically) one species. Target genes are selected from one of the following four categories:
  - Orthologs of human disease genes
  - Topical or ‘hot’ genes
  - Genes conserved from E. coli to human but currently lacking GO annotation
  - Genes involved in biochemical and/or signaling pathways
- Identify the ortholog(s)/homolog(s) of the selected target genes in all Reference Genome species, using phylogenetic trees in the PANTHER database. Not every species will have an ortholog or homolog of each selected gene.
- Curators from each model organism database collect available literature about the genes in their respective organism.
- Curators assign GO terms based on experimental data.
- Review existing GO annotations to make sure they conform to agreed GO annotation standards (see below).
- Overlay all annotations on the phylogenetic tree of the gene family. Annotations are reviewed for consistency, and modified if necessary. Ancestral nodes in the tree are annotated, allowing reliable, traceable inference of annotations based on homology. These processes are carried out using the PAINT (Phylogenetic Annotation INference Tool) software operating on the trees in the PANTHER database.
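The tree-based inference in the last step can be illustrated with a minimal sketch. This is not PAINT's implementation: the `Node` class, the `infer_annotations` function, the node names, and the toy tree are all hypothetical, and the GO identifiers serve only as illustrative labels. The sketch assumes that a term assigned to an ancestral node is inherited by every descendant gene, and it records which ancestral node each inferred term came from so that the inference stays traceable.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a toy gene-family tree (hypothetical; not the PANTHER format)."""
    name: str
    children: list["Node"] = field(default_factory=list)
    # GO terms supported by experimental evidence (set on leaf genes).
    experimental: set[str] = field(default_factory=set)
    # GO terms a curator assigned to this ancestral node.
    ancestral: set[str] = field(default_factory=set)


def infer_annotations(node, inherited=None):
    """Return {leaf name: {GO term: ancestral node it was inherited from}}.

    Terms a leaf already carries with experimental evidence are skipped,
    so the result contains only homology-based inferences.
    """
    inherited = dict(inherited or {})
    for term in node.ancestral:
        # Keep the most ancient node as the source of the term.
        inherited.setdefault(term, node.name)

    if not node.children:  # leaf: a present-day gene
        return {
            node.name: {
                term: source
                for term, source in inherited.items()
                if term not in node.experimental
            }
        }

    result = {}
    for child in node.children:
        result.update(infer_annotations(child, inherited))
    return result


# Toy family: one term annotated at the root, one at an internal node.
family = Node(
    "ancestor_all",
    ancestral={"GO:0003677"},
    children=[
        Node("yeast_gene", experimental={"GO:0003677"}),
        Node(
            "ancestor_animals",
            ancestral={"GO:0006355"},
            children=[Node("mouse_gene"), Node("fly_gene")],
        ),
    ],
)

inferred = infer_annotations(family)
for gene, terms in inferred.items():
    print(gene, terms)
```

In this toy family, `yeast_gene` receives no inferred terms because its only inherited term is already experimentally supported, while `mouse_gene` and `fly_gene` each inherit one term from the root and one from `ancestor_animals`.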
How does this project differ from standard GO annotation?
The main results of this process are:
- additional quality assurance of experimental GO annotations by viewing each annotation in the context of annotations for related genes
- a set of high-quality inferred GO annotations derived from the annotated phylogenetic trees
- a fully traceable evidence trail for all annotations, both experimental and inferred
Reference Genome annotations follow these agreed standards:
- Experimental evidence codes (IDA, IPI, IMP, IGI, IEP) should be used where possible. The ultimate objective is to provide experimentally based annotations for all gene products from these organisms.
- Terms inferred from sequence or structural similarity (ISS) should only be used where the terms are supported by experimental evidence for the similar sequence.
- Non-traceable author statements (NAS) should not be used.
- No new annotations should be based on traceable author statement (TAS); existing terms assigned with TAS should gradually be replaced with the appropriate experimental evidence code based on the primary literature.
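The evidence-code rules above can be expressed as a small check. The following is a sketch only, not part of any GO Consortium tool: the function name `check_evidence`, its parameters, and the returned labels are hypothetical, and any evidence code the rules above do not mention is simply flagged for manual review.

```python
# Experimental evidence codes listed in the standards above.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def check_evidence(code, *, is_new=True, source_is_experimental=False):
    """Classify an annotation by its evidence code (hypothetical helper).

    code: GO evidence code of the annotation, e.g. "IDA".
    is_new: True for a newly created annotation, False for an existing one.
    source_is_experimental: for ISS only, whether the term is supported by
        experimental evidence for the similar sequence.
    Returns one of "accept", "reject", "replace", or "review".
    """
    code = code.upper()
    if code in EXPERIMENTAL_CODES:
        return "accept"
    if code == "ISS":
        # Allowed only when the similar sequence has experimental support.
        return "accept" if source_is_experimental else "reject"
    if code == "NAS":
        return "reject"
    if code == "TAS":
        # No new TAS annotations; existing ones should be replaced over time.
        return "reject" if is_new else "replace"
    # Codes not covered by the rules above need a curator's judgment.
    return "review"


print(check_evidence("IMP"))
print(check_evidence("ISS", source_is_experimental=True))
print(check_evidence("TAS", is_new=False))
```

Returning a label rather than raising an error keeps the check usable as a reporting step over an existing annotation set, where TAS annotations are expected and only need to be queued for replacement.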