Documentation

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases. Founded in 1998, the project began as a collaboration between three model organism databases, FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). The GO Consortium (GOC) has since grown to incorporate many databases, including several of the world's major repositories for plant, animal, and microbial genomes. The GO Contributors page lists all member organizations.

The GO project has developed three structured ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, the development and maintenance of the ontologies themselves; second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases; and third, the development of tools that facilitate the creation, maintenance and use of ontologies.

The use of GO terms by collaborating databases facilitates uniform queries across all of them. Controlled vocabularies are structured so they can be queried at different levels; for example, users may query GO to find all gene products in the mouse genome that are involved in signal transduction, or zoom in on all receptor tyrosine kinases that have been annotated. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity.

Shared vocabularies are an important step towards unifying biological databases, but additional work is still necessary as knowledge changes, updates lag behind, and individual curators evaluate data differently. The GO aims to serve as a platform where curators can agree on stating how and why a specific term is used, and how to consistently apply it, for example, to establish relationships between gene products.

The Scope of GO

The following areas are outside the scope of GO, and terms in these domains will not appear in the ontologies:

  • Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
  • Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term, because causing cancer is a property of abnormal (reprogrammed) cells rather than the normal function of a gene.
  • Attributes of sequence such as "intron" or "exon" parameters belong in a separate sequence ontology; see the Open Biological and Biomedical Ontologies website for more information.
  • Protein domains or structural features.
  • Protein-protein interactions.
  • Environment, evolution and expression.
  • Anatomical or histological features above the level of cellular components, including cell types.

Further information is available on the Frequently Asked Questions (FAQ) page.

Ontology Documentation

The Gene Ontology project provides controlled vocabularies of defined terms representing gene product properties. These cover three domains: Cellular Component, the parts of a cell or its extracellular environment; Molecular Function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and Biological Process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

The GO ontology is structured as a directed acyclic graph where each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-agnostic, and includes terms applicable to prokaryotes and eukaryotes, and single and multicellular organisms.

In an example of GO annotation, the gene product "cytochrome c" can be described by the Molecular Function term "oxidoreductase activity", the Biological Process terms "oxidative phosphorylation" and "induction of cell death", and the Cellular Component terms "mitochondrial matrix" and "mitochondrial inner membrane".
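The graph structure described above, and the ability to query it at different levels, can be sketched in a few lines of Python. The term names and parent links below are a simplified, hypothetical fragment chosen for illustration, not an excerpt of the actual ontology, and the gene names in the usage note are placeholders. The sketch only shows that a term may have more than one parent and that a query on a broad term also collects gene products annotated to its more specific descendants.

```python
# Hypothetical, simplified fragment of a term graph: child -> list of parents.
# Real GO terms carry identifiers and typed relationships; this is a sketch.
PARENTS = {
    "receptor tyrosine kinase activity": ["kinase activity", "signaling receptor activity"],
    "kinase activity": ["catalytic activity"],
    "signaling receptor activity": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def annotated_under(query, annotations, parents=PARENTS):
    """Gene products annotated to `query` or to any of its descendants."""
    return sorted(
        gene for gene, term in annotations
        if term == query or query in ancestors(term, parents)
    )
```

Given a list of (gene, term) pairs such as `[("geneA", "receptor tyrosine kinase activity"), ("geneB", "kinase activity")]`, calling `annotated_under("kinase activity", ...)` returns both genes, because the more specific term is a descendant of the broader one.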

Ontologies

Cellular Component

These terms describe a component of a cell that is part of a larger object, such as an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Biological Process

A biological process term describes a series of events accomplished by one or more organized assemblies of molecular functions. Examples of broad biological process terms are "cellular physiological process" or "signal transduction". Examples of more specific terms are "pyrimidine metabolic process" or "alpha-glucoside transport". The general rule to assist in distinguishing between a biological process and a molecular function is that a process must have more than one distinct step.

A biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

Molecular Function

Molecular function terms describe activities that occur at the molecular level, such as "catalytic activity" or "binding activity". GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are "catalytic activity" and "transporter activity"; examples of narrower functional terms are "adenylate cyclase activity" or "Toll receptor binding".

It is easy to confuse a gene product name with its molecular function; for that reason GO molecular functions are often appended with the word "activity".

Details about the ontologies

Annotation

Annotation is the process of assigning GO terms to gene products. The annotation data in the GO database is contributed by members of the GO Consortium, and the Consortium continuously encourages new groups to start contributing their annotations. The list of links below offers details on the GO annotation policies and the annotation process, and directs users to other pages of interest on GO annotation conventions, the standard operating procedures used by some consortium members, and the GO annotation file format guide.

Annotation Extension

Each annotation in the Gene Ontology (GO) pairs a single gene product identifier to a single term from the ontology. This very powerful format can also restrict the descriptiveness of a specific instance of a function or a sub-cellular location, as there must be a pre-existing (pre-composed) term in the ontology that provides full details of the specific aspects of the function.

It is not always possible to create individual terms that precisely describe the context of each activity, for example, the cellular or anatomical location, the dependency on other processes, or specific protein targets. Annotation Extensions give the curator a less restrictive environment in which to combine additional terms in a single annotation and provide a more detailed functional description for an individual gene product. Extensions in GO annotations allow GO terms to be further specified, using gene product and chemical identifiers, or terms from GO and external OBO ontologies.

When curators choose to use an Annotation Extension, they are effectively creating on-the-fly cross-product terms (post-composition). The combinatorial term created by an Annotation Extension is not added to the ontology, but Ontology Editors may choose to create an appropriate GO term describing the information in the Extension at a later stage.
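As a rough illustration of post-composition, the sketch below parses an extension string into structured pairs. It assumes extensions are written as `relation(identifier)`, with commas joining parts that apply together and pipes separating independent statements; that notation is an assumption here and the Annotation Extension Guide remains the authoritative reference. The function name and the identifiers in the usage note are hypothetical.

```python
import re

# Assumed notation: relation(identifier), e.g. "part_of(DB:ID)".
_EXTENSION = re.compile(r"^\s*([A-Za-z_]\w*)\(([^()]+)\)\s*$")


def parse_extension_field(field):
    """Parse an extension string into a list of groups.

    Each group is a list of (relation, identifier) pairs that apply together;
    separate groups are treated as independent statements.
    """
    groups = []
    if not field.strip():
        return groups
    for group_text in field.split("|"):
        group = []
        for part in group_text.split(","):
            match = _EXTENSION.match(part)
            if match is None:
                raise ValueError(f"not a relation(identifier) expression: {part!r}")
            group.append((match.group(1), match.group(2).strip()))
        groups.append(group)
    return groups
```

For example, `parse_extension_field("has_input(X:1),occurs_in(Y:2)|part_of(Z:3)")` yields two groups: one combining `has_input` and `occurs_in`, and a second containing only `part_of`. The combined statement lives in the annotation, not in the ontology.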

To learn more about capturing information in the Annotation Extension field of the GO annotation format, please visit the Annotation Extension Guide on the GO Wiki.

Examples of capturing information in annotation extensions, such as specific substrates, products or targets, contextual information (e.g., spatial and temporal information), and details on data that cannot be captured by the current annotation format, are available in the Annotation Extension Examples section of the GO Wiki.

Additional general information about annotation can be found in the GO Annotation Policies guide.

Plan of Action

GO Database

The GO database is a relational database composed of the GO ontologies as well as the annotations of genes and gene products to terms in those ontologies. Housing both the ontologies and the annotations in a single database allows powerful queries of the annotations using the ontology. The GO database is the source of all data available through the legacy AmiGO 1.8 browser and search engine.

The GO database is maintained as a MySQL database, built at regular intervals. GO database builds can be downloaded and installed on a local machine or can be queried remotely.

File Format Guide

GO Formats

The GO File Format Guide documents the structure and syntax of the files available on the GO website, to assist users who need to read, write parsers for, or create these files. The following file formats are documented separately:

The combined GO annotation and ontology data is stored as a MySQL database; see the GO database documentation for more information, including the database schema.

OBO is fully supported via the OWL-API.

  • OBO format tools in GitHub: a wrapper for the Java (OWL-API) implementation of a parser for OBOF1.4 syntax and an implementation of the OBOF1.4 mapping to OWL (uses the OWL API).
  • OWL API in GitHub: a Java API for creating, manipulating and serialising OWL Ontologies.

 

Ontology Flat File Formats

The GO Consortium uses the OBO flat file format to store the ontology data. The current version is OBO 1.4, although the ontology data is also available in the previous version, OBO 1.2. The GO Consortium no longer uses or supports files in the legacy GO format. Should you require a file in this format, the command-line script obo2flat can be used to interconvert between OBO format and the legacy GO format. obo2flat is a Java script and comes as part of the OBO-Edit package; instructions on usage are provided in the OBO-Edit User Guide.
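For readers who need to write their own parser, the following is a minimal sketch of reading `[Term]` stanzas from OBO flat file text. It handles only the `id`, `name` and `is_a` tags, ignores trailing modifiers and escape sequences, and is not a substitute for the supported OWL-API tooling. The function name and the `EX:` identifiers in the usage note are hypothetical placeholders.

```python
def parse_obo_terms(text):
    """Collect id, name and is_a parents from [Term] stanzas of OBO text."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue  # header lines, other stanza types, blanks, comments
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms
```

Given a stanza with `id: EX:0000001`, `name: example child` and `is_a: EX:0000002 ! example parent`, the result maps `EX:0000001` to a record holding the name and the single parent identifier. The `is_a` lists can then be used to build the parent links of the graph.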

OBO-XML Format

OBO-XML is a direct XML serialization of the OBO 1.2 format specification. The schema is specified using RELAX-NG compact syntax: obo-xml.rnc. Currently, only the ontology is available as OBO-XML.

OWL RDF/XML Format

OWL is a standard for ontology languages, produced by the W3C. Details of the translation used for GO are available on the official OboInOwl page.

FASTA Format

Sequence data for gene products in the GO database is available in standard FASTA format from the GO database archives.

Annotation Tools, Downloads, and Beyond

Annotation Tools

Annotation is the practice of capturing the activities and localization of a gene product with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information on how this is done can be found in the Guide to GO Annotation Policies. Members of the GO Consortium make their annotation data freely available to the public as part of the data accessed by AmiGO 2, the GO browser and search engine. Annotation data sets from individual databases can be found on the GO annotations page.

In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view of gene functions. Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis or reproduction. See the GO Slim Guide for more information.
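The idea behind a GO slim can be sketched as mapping each annotation up the graph to the nearest high-level terms and counting gene products per slim term. The code below is an illustrative sketch only: the function name is hypothetical, the inputs are plain Python structures rather than GO files, and typed relationships are ignored.

```python
def slim_counts(annotations, parents, slim_terms):
    """Map each gene's terms up to slim terms and count genes per slim term.

    annotations: iterable of (gene, term) pairs
    parents:     dict mapping a term to a list of its parent terms
    slim_terms:  the high-level terms that make up the slim
    """
    def closure(term):
        # The term itself plus everything reachable through parent links.
        seen, stack = {term}, [term]
        while stack:
            for parent in parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    slim_set = set(slim_terms)
    genes_per_slim = {slim: set() for slim in slim_terms}
    for gene, term in annotations:
        for slim in closure(term) & slim_set:
            genes_per_slim[slim].add(gene)
    return {slim: len(genes) for slim, genes in genes_per_slim.items()}
```

Dividing each count by the number of gene products in the genome gives the proportion involved in, for example, signal transduction. A gene is counted once per slim term even if several of its annotations map to that term.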

Downloads

All data from the GO project is freely available. Visit the 'Downloads' page to obtain the ontology data in a number of different formats, including XML and MySQL. The GO file format guide has more information on these formats.

If you need lists of the genes or gene products that have been associated with a particular GO term, the current Annotations table tracks the number of annotations and provides links to the gene association files for each of the collaborating databases.

Contributing to GO

The GO project is constantly evolving, and we welcome feedback from all users. Learn more about how you can contribute to the GO by visiting our instructions page.

Beyond GO

GO allows us to annotate genes and their products with a limited set of attributes. For example, the GO does not allow for the description of genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their involvement in disease. It is not necessary for the GO to do these things because other ontologies are being developed for these purposes. The GO Consortium supports the development of other ontologies, and all the tools for editing and curating ontologies are freely available to the public. A list of freely available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the Open Biomedical Ontologies website. A larger list, which includes the ontologies listed at OBO and also other controlled vocabularies that do not fulfill the OBO criteria, is available at the Ontology Working Group section of the Microarray Gene Expression Data (MGED) Network site.

Cross-products:

The existence of several ontologies will also allow us to create 'cross-products' that maximize the utility of each ontology while avoiding redundancy. For example, by combining the developmental terms in the GO process ontology with a second ontology that describes Drosophila anatomical structures, we could create an ontology of fly development. We could repeat this process for other organisms without having to clutter up GO with large numbers of species-specific terms. Similarly, we could create an ontology of biosynthetic pathways by combining the biosynthesis terms in the GO process ontology with a chemical ontology.

Mappings to other classification systems:

GO is not the only attempt to build structured controlled vocabularies for genome annotation, nor is it the only such series of catalogs in current use. The GO project provides mappings between GO and these other systems, although we caution that these mappings are neither complete nor exact and should only be used as a guide.

Other Useful Links

  • SourceForge links: Useful links for the SourceForge site, including both general and GO-specific pages.
  • CVS Help: Help page for those who wish to access the anonymous GO CVS repository.
  • GO Editor Guides: A set of guides for curators editing the ontologies.

Contributing to GO

Research groups may contribute to the Gene Ontology Consortium (GOC) by providing suggestions for updating the ontology (e.g. requests for new terms) or by providing annotations, that is, associations between genes or gene products and ontology terms. Suggested edits are reviewed by the ontology editors and implemented where appropriate.

The following pages explain how you can contribute to the project. Please begin by choosing whether you wish to contribute annotations or terms to the Gene Ontology.

 

We welcome your contributions.

Preparing GO Annotations for Submission

This page documents the steps required when supplying Gene Ontology annotations to the GO Consortium (GOC). For general information on how to carry out GO annotation, please see the GO Annotation Policies Guide.

Steps to prepare GO annotations for submission to the GOC

1. Contact the Gene Ontology Consortium

Please contact the GOC before carrying out the annotation work; this will ensure that GOC mentors and trainers can be of assistance in producing data sets in agreement with the GOC annotation policies and format requirements.

 

2. Provide a GAF2.0 formatted file

Research groups looking to supply Gene Ontology annotations to the Consortium must submit an appropriately formatted annotation file that conforms to syntactic and semantic requirements of the Consortium. The primary GO annotation format is the Gene Association Format (GAF) 2.0, or GAF2.0. This page contains details on how to build and populate the GAF2.0 File.

Please ensure that:

  • Submissions are made using this flat, tab-delimited format file: GAF2.0
  • The file has the correct file header
  • The file has the correct number of columns, even if some of them are not populated with data
  • If the file contains column names, these must be commented out using an exclamation mark ! at the start of the line
  • The file contains no leading or trailing spaces
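The checklist above lends itself to a quick pre-submission check. The following is a minimal sketch, assuming the header line reads `!gaf-version: 2.0` and that GAF 2.0 rows have 17 tab-separated columns; the function name is hypothetical and the check covers only the syntactic points listed here, not the semantic requirements of the Consortium.

```python
GAF_COLUMNS = 17  # assumed column count for GAF 2.0 rows


def check_gaf(lines):
    """Return a list of (line_number, problem) for the lines of a GAF 2.0 file."""
    problems = []
    lines = [line.rstrip("\n") for line in lines]
    if not lines or not lines[0].startswith("!gaf-version: 2.0"):
        problems.append((1, "missing '!gaf-version: 2.0' header"))
    for number, line in enumerate(lines, start=1):
        if line != line.strip(" "):
            problems.append((number, "leading or trailing spaces"))
        if line.startswith("!") or not line:
            continue  # comment and header lines carry no annotation columns
        columns = line.split("\t")
        if len(columns) != GAF_COLUMNS:
            problems.append(
                (number, f"expected {GAF_COLUMNS} columns, found {len(columns)}")
            )
    return problems
```

An empty result means none of these particular problems were found. Unpopulated columns still count, because splitting on tabs keeps empty fields, so a row with blank optional columns passes as long as the tabs are present.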
 

2.1 Make annotations to UniProtKB accessions or NCBI identifiers

  • Human data, MODs: The ideal object identifiers for annotations are stable database identifiers. That is, ideally, all annotations should describe the activities or locations of protein accessions from the UniProt KnowledgeBase (UniProtKB) that are present in the UniProt Reference Proteome files.
  • Non-MODs: If this is not possible, research groups should first ensure that alternative identifiers are also stable, and then provide identifier mapping files (i.e. gp2protein, gp2rna; see below), where equivalent UniProtKB or NCBI identifiers should be supplied. A gp_unlocalized file should also be provided where no sequence or genomic location is known for a gene identifier.
  • If mapping to UniProtKB or NCBI identifiers is not a possibility: In this case the research group should contact the GOC to explore the alternatives.
 

2.2 Provide a database name

Each research group must provide a database name, which will be used to acknowledge the annotation set and to appropriately credit your work. This name would be visible in the 'assigned_by' field (Column 15) of all the annotation lines that the group is contributing. This name will also be added to the list of annotation providers.

 

2.3 Include bibliographic references

Each annotation line must include the citation of a bibliographic reference, which details the methods and results from which the annotation was made. The reference should be either a PubMed identifier or an abstract (GO_REF) describing how the annotation was made. Please see the Gene Ontology Reference Collection for a list of all current GO references.
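A reference check of this kind can be sketched as follows. The sketch assumes the reference sits in the sixth tab-separated column of a GAF 2.0 row, that several references may be separated by `|`, and that identifiers look like `PMID:` or `GO_REF:` followed by digits; the function name is hypothetical and the identifiers in the usage note are placeholders.

```python
import re

# Assumed identifier shapes: "PMID:<digits>" or "GO_REF:<digits>".
REFERENCE = re.compile(r"^(PMID|GO_REF):\d+$")


def has_pmid_or_go_ref(gaf_row):
    """True if column 6 of a GAF row cites a PMID or GO_REF identifier."""
    columns = gaf_row.rstrip("\n").split("\t")
    if len(columns) < 6:
        return False
    references = columns[5].split("|")
    return any(REFERENCE.match(ref.strip()) for ref in references)
```

A row whose sixth column is `PMID:12345` passes, as does one listing `EX:1|GO_REF:0000001`, while a row citing only an identifier of another kind is flagged for review.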

 

3. State whether or not regular updates will be submitted

For research groups conducting curation, it is not always necessary to commit to supplying regular updates for their annotations. When the research team chooses to enter 'Longer-term Annotation Contribution / Collaboration' as the submitting group, a primary point of contact must also be identified so that requests may be redirected and proper action on such requests may be taken in a timely manner. The GOC will take responsibility for corrections and updates to datasets included in non-recurring submissions or those from annotation groups that become 'inactive' annotation providers.

   

4. Identifier Mapping Files

Providing complete identifier mapping files is necessary for:

  • Downloading sequences from UniProtKB and NCBI. These sequences are used for inferencing annotations in a phylogenetic context using the Phylogenetic Annotation and Inference Tool (PAINT).
  • Searching for GO annotations in AmiGO, using other database cross-reference IDs (UniProt or NCBI).
  • Helping to keep track of IDs and annotations, removing duplicates, etc.

Please be aware: when identifier mapping is carried out, due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.

4.1 gp2protein file

  • The gp2protein format specifications are described here.
  • The gp2protein mapping file must contain the complete list of protein-coding genes in the respective organism (or community), including those proteins not annotated to GO.
  • The first column should contain all gene or gene product identifiers (these are typically MOD-specific identifiers). The second column should contain mappings to canonical identifiers. Protein coding genes must map to UniProtKB accessions (preferably Swiss-Prot, otherwise TrEMBL). If identifiers are unavailable in UniProtKB, NCBI identifiers (NP_ and XP_) are permissible.
  • If the annotation group is satisfied with identifier mappings from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB will take the responsibility of supplying the external ID -> UniProtKB mapping to the GO.
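The requirements above can be sketched as two simple checks: that every mapped identifier points to a UniProtKB or NCBI (NP_ or XP_) identifier, and that every annotated identifier has a row in the mapping. The sketch assumes a two-column, tab-separated file in which targets carry a `UniProtKB:` prefix and several targets may be separated by `;`. Those details and the function names are assumptions, so the gp2protein format specification should be consulted for the actual rules.

```python
def load_gp2protein(lines):
    """Read a two-column mapping: identifier<TAB>canonical identifier(s)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment lines
        identifier, _, targets = line.partition("\t")
        # Assumption: several targets may be separated by ';'.
        mapping[identifier] = [t for t in targets.split(";") if t]
    return mapping


def unmapped_protein_ids(mapping):
    """Identifiers whose targets are neither UniProtKB nor NCBI NP_/XP_."""
    def acceptable(target):
        accession = target.split(":")[-1]
        return target.startswith("UniProtKB:") or accession.startswith(("NP_", "XP_"))

    return sorted(
        identifier
        for identifier, targets in mapping.items()
        if not any(acceptable(target) for target in targets)
    )


def missing_from_mapping(annotated_ids, mapping):
    """Annotated identifiers that have no row in the mapping file."""
    return sorted(set(annotated_ids) - set(mapping))
```

Running both checks before submission highlights rows with an empty or unexpected second column and annotated identifiers that were left out of the mapping file.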

4.2 gp2rna file

  • The gp2rna format specifications are similar to those of gp2protein files. The differences between the two are described here.
  • If the annotation file includes non-coding RNAs (ncRNAs), then the corresponding gp2rna file must include all ncRNA genes currently identified in the genome build, including ncRNAs not annotated to GO.
  • Functional ncRNAs must map to NCBI (NR_ or XR_) if available; if unavailable, leave the field blank.

4.3 gp_unlocalized file

  • If your database supplies gene identifiers that have been manually curated from the literature, where no sequence or genomic location is known (such genes have sometimes been described as 'unlocalized genes', 'single heritable traits' or 'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized file.
  • This file should contain a list of all the non-genome localized gene identifiers available, including those not annotated to GO.
  • The file must meet the gp_unlocalized format specification, which should be similar to the gp2protein file format.

4.4 Exceptions for Macromolecular Complexes

  • If the annotation file includes macromolecular complexes as the subject of the annotation, no corresponding entry is required for the gp2protein file. Only gene or gene product mappings should be included.
  • Groups must regularly update their gp2protein or gp2rna files (i.e. in response to UniProtKB's feedback on inclusion of obsolete or secondary UniProtKB accessions in a group's gp2protein, or in the case NCBI identifiers are made obsolete). For groups who provide authoritative files for a species, or who are funded by the GO NIH grant, please consult the description of GO annotation activities by central GO Consortium members.