Introduction to the Gene Ontology

The following pages describe in detail the aims of the Gene Ontology Consortium, explain how the ontologies work and how they are structured, and offer details on annotation and file formats. Further information is available on the Frequently Asked Questions page.


What Does the Gene Ontology Consortium Do?

Biologists spend considerable time and effort searching for all available information about their areas of research. Terminology varies widely between sources, which hampers effective searching by both researchers and computers. For instance, when searching for new antibiotic targets, researchers may wish to find all gene products that are involved in bacterial protein synthesis and also have significantly different sequences or structures from those in humans. If one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', recognizing that the two descriptions are functionally equivalent will be difficult for the researcher, and even harder for a computer.

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to incorporate many databases, including several of the world's major repositories for plant, animal, and microbial genomes. See the GO Consortium page for a full list of member organizations.

The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, the development and maintenance of the ontologies themselves; second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases; and third, the development of tools that facilitate the creation, maintenance and use of ontologies.
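
As a rough illustration of how a controlled vocabulary addresses the 'translation' versus 'protein synthesis' problem, the following Python sketch maps every name and synonym to a single term identifier. The identifiers (EX:0001, EX:0002), the synonym lists and the helper names are hypothetical placeholders chosen for this example; they are not real GO records or part of any GO tool.

```python
# Toy controlled vocabulary. Identifiers and synonyms are illustrative
# placeholders, not real GO records.
VOCABULARY = {
    "EX:0001": {
        "name": "translation",
        "aspect": "biological_process",
        "synonyms": {"protein synthesis", "protein biosynthesis"},
    },
    "EX:0002": {
        "name": "oxidoreductase activity",
        "aspect": "molecular_function",
        "synonyms": set(),
    },
}


def build_label_index(vocabulary):
    """Map every lower-cased name and synonym to its term identifier."""
    index = {}
    for term_id, term in vocabulary.items():
        for label in {term["name"], *term["synonyms"]}:
            index[label.lower()] = term_id
    return index


def resolve(label, index):
    """Return the term identifier for a free-text label, or None if unknown."""
    return index.get(label.strip().lower())


LABEL_INDEX = build_label_index(VOCABULARY)

# Two databases use different wording for the same process.
id_from_database_a = resolve("translation", LABEL_INDEX)
id_from_database_b = resolve("Protein synthesis", LABEL_INDEX)
print(id_from_database_a == id_from_database_b)
```

Once both databases store the shared identifier rather than free text, a query for one wording also finds records written with the other.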

The use of GO terms by collaborating databases facilitates uniform queries across all of them. The controlled vocabularies are structured so that they can be queried at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity.
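
Querying at different levels can be sketched with a small parent-child structure. In this minimal Python example, the term identifiers, the hierarchy between them and the gene product names are all invented for illustration and do not reflect the actual arrangement of terms in GO. A gene product annotated to a specific term is also returned by a query for any broader term above it.

```python
# Toy ontology fragment: each term lists its direct parents.
# Identifiers, relationships and gene products are illustrative placeholders.
PARENTS = {
    "EX:signal_transduction": [],
    "EX:receptor_signaling": ["EX:signal_transduction"],
    "EX:receptor_tyrosine_kinase_signaling": ["EX:receptor_signaling"],
    "EX:g_protein_coupled_signaling": ["EX:receptor_signaling"],
}

# Annotations made at different levels of detail, depending on what is known.
ANNOTATIONS = {
    "gene_product_A": {"EX:receptor_tyrosine_kinase_signaling"},
    "gene_product_B": {"EX:g_protein_coupled_signaling"},
    "gene_product_C": {"EX:signal_transduction"},
}


def ancestors(term_id, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def products_annotated_to(term_id, annotations, parents):
    """Gene products annotated to term_id or to any more specific term."""
    return sorted(
        product
        for product, terms in annotations.items()
        if any(term_id in ancestors(term, parents) for term in terms)
    )


broad = products_annotated_to("EX:signal_transduction", ANNOTATIONS, PARENTS)
narrow = products_annotated_to(
    "EX:receptor_tyrosine_kinase_signaling", ANNOTATIONS, PARENTS
)
print(broad)
print(narrow)
```

The broad query returns all three gene products, while the narrow query returns only the one annotated to the more specific term.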

Please use the links provided below to learn more about the ontologies and the scope of GO, and to find more information about GO annotation tools, downloads and how you can contribute to GO.


The Scope of GO

It is important to clearly state the scope of GO, explaining what it does and does not cover. The following areas are outside the scope of GO, and terms in these domains will not appear in the ontologies:

  • Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
  • Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because "causing cancer" is not the normal function of any gene.
  • Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see the OBO website for more information).
  • Protein domains or structural features.
  • Protein-protein interactions.
  • Environment, evolution and expression.
  • Anatomical or histological features above the level of cellular components, including cell types.

GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.

GO is not a dictated standard that mandates nomenclature across databases. Groups participate out of self-interest, and cooperate to arrive at a consensus.

GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following:

  • Knowledge changes, and updates lag behind.
  • Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
  • GO does not attempt to describe every aspect of biology; its scope is limited to the domains described above.