the Gene Ontology

GO Annotation File Format Guide

This page documents the file formats used to store gene associations (annotations), data capturing the attributes of gene products using terms from the Gene Ontology, and the QC checks run upon the data submitted by members of the GO Consortium. For more general information on annotation, please see the GO annotation guide.

  • Annotation File Format Guide
  • GAF 1.0
  • GAF 2.0
  • Annotation File Format Quality Control Script
  • Errors Checked
  • Taxon IDs
  • Script command line options

Annotation File Format Guide

The Gene Ontology Consortium stores annotation data, the representation of gene product attributes using GO terms, in tab-delimited plain text files. Each line in the file represents a single association between a gene product and a GO term with a certain evidence code and the reference to support the link.
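
As an illustration of the one-association-per-line layout, the following minimal sketch splits a single line into named columns. It assumes the 15-column GAF 1.0 layout; the column labels, the helper name `parse_gaf_line`, and the example row are chosen here for illustration and are not taken from the quality control script.

```python
from typing import Dict, List

# Column labels for a 15-column GAF 1.0 line, in file order.
# The labels are illustrative; the format specification is authoritative.
GAF1_COLUMNS: List[str] = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_By",
]


def parse_gaf_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited annotation line into named columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GAF1_COLUMNS):
        raise ValueError(
            f"expected {len(GAF1_COLUMNS)} columns, got {len(fields)}"
        )
    return dict(zip(GAF1_COLUMNS, fields))


# Made-up example row for illustration only, not copied from a real file.
example = "\t".join([
    "SGD", "S000000001", "ABC1", "", "GO:0005739", "PMID:1234567", "IDA",
    "", "C", "hypothetical protein", "", "gene", "taxon:4932", "20091201",
    "SGD",
])
record = parse_gaf_line(example)
print(record["GO_ID"], record["Evidence"], record["DB_Reference"])
```

Empty strings stand for optional columns that were left blank, which is why the split keeps every field instead of discarding empty ones.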

There are two annotation file formats: GAF 1.0, the format currently used by the GO Consortium, and GAF 2.0, an extension of that format which captures some extra information for each annotation.

GAF 1.0

  • GAF 1.0 format specification [used and recommended by the GO Consortium]

GAF 2.0

This format is not yet in use by the GO Consortium; it will be introduced in 2010.

  • GAF 2.0 format specification


Annotation File Format Quality Control Script

This Perl script is a quality control check that validates the format of gene association files and partially checks the data they contain. It is run on all gene association files before they are loaded into the GO database, and the results of this filtering step are reported back to the submitting group.

This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. The checks provided define the minimum standard format for the repository. Suggestions are welcome for enhancements to this process. Download the script directly, via the GO web CVS interface, or from the directory go/software/utilities in the GO CVS repository.

Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory (go/gene-associations/submission/). The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory (go/gene-associations/) and subsequently used to load the GO database.

Errors Checked

The input file is checked for the following types of errors. Any row of the gene association file that is found to contain an error is removed from the final output file.

For each line, the script checks the number of columns and the cardinality of each column, looks for leading or trailing whitespace, and runs a number of specific checks on the data in particular columns.
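
The structural part of these checks could look roughly like the sketch below. It is not the Perl script's own logic: the function name `structural_errors` and the assumption of 15 columns (GAF 1.0) are introduced here for illustration, and the cardinality check is left out.

```python
from typing import List

EXPECTED_COLUMNS = 15  # assumption: GAF 1.0 column count


def structural_errors(line: str) -> List[str]:
    """Return a list of structural problems found in one annotation line."""
    errors: List[str] = []
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        errors.append(
            f"expected {EXPECTED_COLUMNS} columns, found {len(fields)}"
        )
    # Report every field that starts or ends with whitespace.
    for number, value in enumerate(fields, start=1):
        if value != value.strip():
            errors.append(
                f"column {number} has leading or trailing whitespace"
            )
    return errors


good_line = "\t".join(["x"] * EXPECTED_COLUMNS)
bad_line = "\t".join([" x"] + ["x"] * (EXPECTED_COLUMNS - 2))
print(structural_errors(good_line))
print(structural_errors(bad_line))
```

Collecting all problems for a line, instead of stopping at the first, makes the report sent back to a submitting group more useful.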

The specific checks verify that the Qualifier, Evidence, Aspect, and DB Object Type columns use only their defined terms. The DB:Reference, Taxon and GO ID columns are checked for minimal form, and the Date is verified to match the YYYYMMDD format.
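
A minimal sketch of a few of these column checks follows. The regular expressions and the Aspect values (P, F, C) are assumptions about what "minimal form" and "defined terms" mean in practice; the format specification and the script itself are authoritative. The dictionary keys are illustrative labels, not names used by the script.

```python
import re
from datetime import datetime
from typing import Dict, List

ASPECTS = {"P", "F", "C"}                          # assumed Aspect values
GO_ID_RE = re.compile(r"GO:\d{7}")                 # assumed minimal form
TAXON_RE = re.compile(r"taxon:\d+(\|taxon:\d+)?")  # assumed minimal form


def is_valid_date(value: str) -> bool:
    """True if value is exactly eight digits forming a real YYYYMMDD date."""
    if not re.fullmatch(r"\d{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def column_errors(row: Dict[str, str]) -> List[str]:
    """Check selected columns of one parsed annotation row."""
    errors: List[str] = []
    if row["Aspect"] not in ASPECTS:
        errors.append("Aspect is not a defined term")
    if not GO_ID_RE.fullmatch(row["GO_ID"]):
        errors.append("GO ID is malformed")
    if not TAXON_RE.fullmatch(row["Taxon"]):
        errors.append("Taxon is malformed")
    if not is_valid_date(row["Date"]):
        errors.append("Date is not YYYYMMDD")
    return errors


row = {"Aspect": "C", "GO_ID": "GO:0005739",
       "Taxon": "taxon:10090", "Date": "20091221"}
print(column_errors(row))
print(column_errors({**row, "Date": "2009-12-21", "GO_ID": "GO:5739"}))
```

The same set-membership pattern used for Aspect would apply to the Qualifier, Evidence, and DB Object Type columns, given their lists of defined terms.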

Column 1 and every other database abbreviation used within the gene association file are checked to confirm that the abbreviation (case insensitive) is defined within the GO database cross-references.
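
The case-insensitive lookup can be sketched as follows. This assumes that the cross-reference file lists each abbreviation on a line beginning with `abbreviation:`; that layout, the helper names, and the sample text are assumptions made for illustration and should be checked against the actual GO.xrf_abbs file.

```python
from typing import Set


def load_abbreviations(xrf_text: str) -> Set[str]:
    """Collect lower-cased abbreviations from cross-reference text."""
    abbreviations: Set[str] = set()
    for line in xrf_text.splitlines():
        if line.startswith("abbreviation:"):
            abbreviations.add(line.split(":", 1)[1].strip().lower())
    return abbreviations


def is_known_db(name: str, abbreviations: Set[str]) -> bool:
    """Case-insensitive membership test for a database abbreviation."""
    return name.strip().lower() in abbreviations


# Illustrative stand-in for the cross-reference file contents.
sample_xrf = """abbreviation: SGD
database: Saccharomyces Genome Database

abbreviation: MGI
database: Mouse Genome Informatics
"""
known = load_abbreviations(sample_xrf)
print(is_known_db("sgd", known), is_known_db("XYZ", known))
```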

The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.
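
One way to separate current, obsolete and unknown GO IDs is sketched below. It relies on the OBO convention that each `[Term]` stanza has an `id:` line and that obsolete terms carry `is_obsolete: true`. The function names and the sample stanzas are illustrative, and details such as `alt_id` handling are ignored here.

```python
from typing import Set, Tuple


def load_go_ids(obo_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (current_ids, obsolete_ids) from OBO-formatted text."""
    current: Set[str] = set()
    obsolete: Set[str] = set()
    term_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[3:].strip()
            current.add(term_id)
        elif in_term and term_id and line == "is_obsolete: true":
            current.discard(term_id)
            obsolete.add(term_id)
    return current, obsolete


def classify(go_id: str, current: Set[str], obsolete: Set[str]) -> str:
    """Label a GO ID as 'ok', 'obsolete' or 'invalid'."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"


# Illustrative stanzas with placeholder names, not real ontology content.
sample_obo = """[Term]
id: GO:0000001
name: example current term

[Term]
id: GO:0000005
name: example obsolete term
is_obsolete: true

[Typedef]
id: part_of
"""
current_ids, obsolete_ids = load_go_ids(sample_obo)
for go_id in ("GO:0000001", "GO:0000005", "GO:9999999"):
    print(go_id, classify(go_id, current_ids, obsolete_ids))
```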

All IEA annotations that are more than one year old are removed. This filtering step uses the date of annotation stated in column 14, so the accuracy of the date column is very important.
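
The age test might be expressed as in the sketch below, which treats "one year" as 365 days. That cutoff and the function name `is_stale_iea` are assumptions; the script may compute the interval differently.

```python
from datetime import date, datetime, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: one year counted as 365 days


def is_stale_iea(evidence: str, date_str: str, today: date) -> bool:
    """True if the row is an IEA annotation dated more than a year ago."""
    if evidence != "IEA":
        return False
    annotated = datetime.strptime(date_str, "%Y%m%d").date()
    return (today - annotated) > ONE_YEAR


reference_day = date(2010, 3, 19)  # fixed day so the example is repeatable
print(is_stale_iea("IEA", "20081201", reference_day))
print(is_stale_iea("IEA", "20091201", reference_day))
print(is_stale_iea("IDA", "20000101", reference_day))
```

Passing the reference day as a parameter, instead of reading the system clock inside the function, keeps the check easy to test.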

Taxon IDs

A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.
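
The taxon restriction amounts to a lookup from taxon ID to the one project allowed to submit it, as in the sketch below. Only the Mus musculus entry comes from the text above; the project key `mgi`, the other project keys, and the function name are hypothetical labels introduced for this example.

```python
from typing import Dict

# Taxon ID -> project permitted to include it. Only this one entry is
# taken from the text; the key "mgi" is a hypothetical project label.
TAXON_OWNERS: Dict[str, str] = {
    "taxon:10090": "mgi",
}


def taxon_allowed(taxon: str, project: str) -> bool:
    """True if the project may include rows with this taxon ID."""
    owner = TAXON_OWNERS.get(taxon)
    # Unrestricted taxa are allowed in any project's file.
    return owner is None or owner == project.lower()


print(taxon_allowed("taxon:10090", "mgi"))
print(taxon_allowed("taxon:10090", "sgd"))
print(taxon_allowed("taxon:99999", "sgd"))
```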

Script command line options

Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the specified file defined with the -i option.

Usage

A. check a file for any errors, obsolete GO IDs or old IEA annotations

filter-gene-association.pl -i gene_association.sgd.gz

B. filter any problems and output the validated lines, including headers

filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output

C. check file without the taxid checking on, and write the bad lines to STDOUT

filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines

System requirements

The script is written using basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and greater.

Submitted by Mike Cherry, 2005-10-19



Last modified Monday, 21-Dec-2009 01:40:01 PST
Copyright © 1999-2010 the Gene Ontology