The Reference Genome Annotation Project

The GO Consortium coordinated an effort to maximize and optimize GO annotations for a large and representative set of key genomes, known as 'reference genomes'. The Reference Genome Annotation Project aimed to completely annotate twelve reference genomes, producing a resource that may effectively seed automatic annotation efforts of other genomes. The project was funded by the main GO Consortium grant from the US National Institutes of Health, Human Genome Research Institute.

Trained, highly skilled GO curators formed the Reference Genome GO Annotation Team, which coordinated annotation, facilitated implementation of GO Consortium annotation priorities, and provided quantitative measures to assess progress toward the goal of broad and deep annotation of the reference genomes. The work of this annotation team was published at:

Reference Species and Databases

The reference genomes and responsible database groups are:

Sequences used for building the trees can be obtained from the EBI's Reference Proteomes website. To access the complete proteomes for other species, please visit the UniProt Complete Proteomes collection.

Approach

The Reference Genome Project issued a set of comprehensive experimental GO annotations for all gene products in all twelve Reference Genomes and provided tools for using these annotations to infer GO annotations for all fully sequenced genomes.

Evolutionary relationships are the glue in the Reference Genome Project, and related genes across the twelve Reference Genomes were curated simultaneously. The results of this work provided:

  • Better annotations: each model organism has unique strengths for probing gene function, and bringing this information together helps to interpret experimental results, which improves the accuracy and consistency of annotations.
  • More annotations: homology relationships allow accurate inference of functions for genes that have not been characterized experimentally.
  • Improvements in the Gene Ontology: cross-organism discussion about annotations frequently leads to new terms being added to the Gene Ontology.

Curation process

  1. The curation process first identified an initial set of target genes in (typically) one species. Genes that met the requirements of at least one of the following categories were selected:
    • Orthologs of human disease genes.
    • Topical or "currently hot" genes.
    • Genes conserved from E. coli to human but currently lacking GO annotation.
    • Genes involved in biochemical and/or signaling pathways.
  2. Later, curators identified the ortholog(s)/homolog(s) of the selected target genes in all Reference Genome species, from phylogenetic trees in the PANTHER database. Not all species had orthologs/homologs to selected genes.
  3. Curators from each model organism database collected available literature about the genes in their respective organism.
  4. GO terms were assigned based on experimental data.
  5. Existing GO annotations were reviewed to ensure they conformed to agreed GO annotation standards (see below).
  6. Curators then overlaid all annotations on the phylogenetic tree of the gene family. Annotations were reviewed for consistency, and modified when necessary. Ancestral nodes in the tree were annotated, allowing reliable, traceable inference of annotations based on homology. These processes were carried out using the Phylogenetic Annotation INference Tool (PAINT), operating on the trees in the PANTHER database.
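
The idea behind step 6 can be illustrated with a toy sketch. This is not PAINT's actual algorithm or data model: the Node class, the infer function, and the gene and node names are all hypothetical, and the tree is invented for illustration. It only shows how a term placed on an ancestral node could be passed down to descendant genes while recording which node each inferred term came from.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One node of a gene family tree (hypothetical, simplified structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # GO terms backed by experimental evidence (meaningful for leaf genes).
    experimental: Set[str] = field(default_factory=set)
    # GO terms a curator placed on this node (meaningful for ancestral nodes).
    curated: Set[str] = field(default_factory=set)


def infer(node: Node,
          inherited: Optional[Dict[str, str]] = None,
          out: Optional[Dict[str, Dict[str, str]]] = None) -> Dict[str, Dict[str, str]]:
    """Return {leaf name: {GO term: ancestral node it was inherited from}}.

    Terms a leaf already has experimentally are not reported as inferred.
    """
    inherited = dict(inherited or {})   # copy so sibling branches stay independent
    out = {} if out is None else out
    for term in node.curated:
        inherited.setdefault(term, node.name)  # keep the most ancestral source
    if not node.children:
        out[node.name] = {
            term: source
            for term, source in inherited.items()
            if term not in node.experimental
        }
    for child in node.children:
        infer(child, inherited, out)
    return out


# Invented example: one gene has experimental support for a term, and a
# curator has annotated the common ancestor with that term.
tree = Node(
    name="ancestor_1",
    curated={"GO:0003677"},
    children=[
        Node(name="human_geneA", experimental={"GO:0003677"}),
        Node(name="mouse_geneA"),
        Node(name="yeast_geneA"),
    ],
)
inferred = infer(tree)
```

In this sketch, each inferred term maps back to the ancestral node that supplied it, which is one simple way to keep the inference traceable.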

Results

The main results of this process were:

  • Additional quality assurance of experimental GO annotations by viewing each annotation in the context of annotations for related genes
  • A set of high-quality inferred GO annotations derived from the annotated phylogenetic trees
  • A fully traceable evidence trail for all annotations, both experimental and inferred

The reference genome databases agreed to follow more stringent guidelines than those used for standard GO annotation:

  • Experimental evidence codes (IDA, IPI, IMP, IGI, IEP) were used where possible, providing experimentally-based annotations for all gene products from these organisms.
  • Terms inferred from sequence or structural similarity (ISS) were only used when the terms were supported by experimental evidence for the similar sequence.
  • Non-traceable author statements (NAS) were not used.
  • No new annotations were based on traceable author statements (TAS); an attempt was made to gradually replace existing terms assigned with TAS with the appropriate experimental evidence code based on the primary literature.
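
The guidelines above can be read as a small decision rule per annotation. The following is a minimal sketch under that reading; the function name review, its parameters, and the returned labels are hypothetical and not part of any GO Consortium tool. Evidence codes not mentioned in the guidelines are reported as not covered rather than judged.

```python
# Experimental evidence codes named in the guidelines.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def review(code: str,
           is_new: bool = True,
           iss_source_is_experimental: bool = False) -> str:
    """Classify one annotation by its evidence code (illustrative labels only).

    is_new: True for a newly created annotation, False for an existing one.
    iss_source_is_experimental: for ISS only, whether the term is supported by
        experimental evidence for the similar sequence.
    """
    if code in EXPERIMENTAL_CODES:
        return "accept"
    if code == "ISS":
        return "accept" if iss_source_is_experimental else "reject"
    if code == "NAS":
        return "reject"
    if code == "TAS":
        # New TAS annotations were not made; existing ones were to be
        # replaced gradually using the primary literature.
        return "reject" if is_new else "replace"
    return "not covered"


decisions = {
    "IDA": review("IDA"),
    "ISS without support": review("ISS"),
    "ISS with support": review("ISS", iss_source_is_experimental=True),
    "NAS": review("NAS"),
    "new TAS": review("TAS", is_new=True),
    "existing TAS": review("TAS", is_new=False),
}
```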

Availability

All GO annotations from this project are included in the gene association files that each group submits to GO. Annotations can also be viewed using the GO search engine and browser AmiGO 2.
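
As a rough illustration of working with a gene association file, the sketch below tallies evidence codes. It assumes the tab-delimited GAF layout, in which comment lines start with "!" and the evidence code is the seventh column. The function name count_evidence_codes is hypothetical, and the sample rows (database name, identifiers, reference) are invented placeholders, not real annotations.

```python
import io
from collections import Counter


def count_evidence_codes(handle) -> Counter:
    """Count evidence codes (column 7) in a GAF-formatted text stream."""
    counts = Counter()
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows rather than guess
        counts[fields[6]] += 1
    return counts


def sample_row(symbol: str, go_id: str, evidence: str) -> str:
    """Build one invented 17-column row for demonstration purposes."""
    return "\t".join([
        "ExampleDB", "EX:" + symbol, symbol, "", go_id, "PMID:0000000",
        evidence, "", "F", "", "", "protein", "taxon:9606", "20100101",
        "ExampleDB", "", "",
    ])


SAMPLE = "\n".join([
    "!gaf-version: 2.1",
    sample_row("geneA", "GO:0003677", "IDA"),
    sample_row("geneB", "GO:0003677", "IMP"),
    sample_row("geneC", "GO:0003677", "IDA"),
]) + "\n"

evidence_counts = count_evidence_codes(io.StringIO(SAMPLE))
```

For a real file, the same function could be given an open file handle in place of the in-memory sample.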