File Format Guide
The GO File Format Guide documents the structure and syntax of the GO files available on the GO website, to assist users who need to read, write parsers for, or create these files.
See also the GO annotation file format guide for the format used in the gene association files.
Anatomy of a GO Term
Terms and unique identifiers
The structure of a GO term is very simple. At a bare minimum, each GO entry consists of a term name (e.g. cell) and a unique, zero-padded seven-digit identifier (or accession number) prefixed by GO: (e.g. GO:0005623), which serves as a unique identifier and database cross-reference. The same number range is used across all three ontologies. The numeric portion of a GO ID has no 'meaning' and no relation to the position of the term in the ontologies; instead, ranges of GO IDs are assigned to specific groups or individual curators, so a GO ID can be used to trace who added a term.
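The identifier format described above can be checked mechanically. The following is a minimal Python sketch, assuming only what the text states (the prefix GO: followed by exactly seven digits); the function names are hypothetical and not part of any GO tool.

```python
import re

# Pattern taken from the description above: "GO:" followed by exactly
# seven ASCII digits. The names below are illustrative assumptions.
GO_ID_PATTERN = re.compile(r"GO:[0-9]{7}")


def is_valid_go_id(text):
    """Return True if text is exactly 'GO:' plus seven digits."""
    return GO_ID_PATTERN.fullmatch(text) is not None


def format_go_id(number):
    """Build a GO ID from an integer by zero-padding it to seven digits."""
    if not 0 <= number <= 9_999_999:
        raise ValueError("GO ID number must fit in seven digits")
    return f"GO:{number:07d}"


print(is_valid_go_id("GO:0005623"))
print(is_valid_go_id("GO:5623"))
print(format_go_id(5623))
```

A check like this only confirms that a string is well formed; it says nothing about whether the ID has actually been assigned to a term.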
Secondary IDs
Terms may have one or more secondary IDs, which are alternate IDs that refer to the term. Secondary IDs arise when two or more terms are identical in meaning and are merged into a single term. All term IDs are preserved so that no information (for example, annotations to the merged IDs) is lost. More information on the protocols involved can be found in the documentation on term merges.
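One way a parser might handle secondary IDs is to keep a lookup table that resolves any ID, primary or secondary, to the surviving primary ID. The sketch below is an assumption about how such a table could be built, not a description of any GO software; the IDs are placeholders and do not represent real merges.

```python
def build_id_index(terms):
    """Map every primary and secondary ID to its primary ID.

    `terms` maps a primary ID to the list of its secondary IDs.
    """
    index = {}
    for primary_id, secondary_ids in terms.items():
        index[primary_id] = primary_id
        for secondary_id in secondary_ids:
            index[secondary_id] = primary_id
    return index


# Placeholder IDs for illustration only; these are not real GO merges.
terms = {
    "GO:9000001": ["GO:9000002", "GO:9000003"],
    "GO:9000004": [],
}
index = build_id_index(terms)
print(index["GO:9000002"])
```

With such an index, an annotation made to a merged ID can still be resolved to the current term.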
Synonyms
Any term may, but does not need to, include one or more synonyms (e.g. type I programmed cell death is a synonym of apoptosis). Synonyms are assigned a relationship to the primary term string; see the documentation on synonyms for more information.
Database cross-references
Another optional element is one or more general database cross-references (dbxrefs), which refer to an identical object in another database. For instance, the molecular function term retinal isomerase activity has the database cross-reference EC:5.2.1.3, which is the accession number of this enzyme activity in the Enzyme Commission database. A complete list of the database cross-references and database abbreviations used by GO is available.
Definition and Comment
GO terms should be equipped with a text definition, which includes an indication of the source of the definition. Terms may also have a comment, which gives more information about the term and its usage.
Ontology Flat File Formats
There are two types of ontology flat file format, the older GO flat file format and the newer OBO flat file format. The GO flat file format is now deprecated but will continue to be provided alongside the new format.
See also the Java OBO parser guide, which gives details of the OBO parser implemented as part of OBO-Edit, and how to use it.
GO RDF-XML Format
The GO RDF-XML version of GO, which includes all three ontologies and the definitions, can be downloaded from the GO database archive. The document type definition (DTD) is available from the GO FTP site.
The GO RDF-XML file is built from the flat files and the gene association files on a monthly basis.
Here's a GO RDF-XML snapshot (with some lines wrapped for legibility):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go>
<go:go xmlns:go="xml-dtd/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <go:version timestamp="Wed May 9 23:55:02 2001" />
  <rdf:RDF>
    <go:term rdf:about="go#GO:0003673">
      <go:accession>GO:0003673</go:accession>
      <go:name>Gene_Ontology</go:name>
      <go:definition></go:definition>
    </go:term>
    <go:term rdf:about="go#GO:0003674">
      <go:accession>GO:0003674</go:accession>
      <go:name>molecular_function</go:name>
      <go:definition>The action characteristic of a gene product.</go:definition>
      <go:part-of rdf:resource="go#GO:0003673" />
      <go:dbxref>
        <go:database_symbol>go</go:database_symbol>
        <go:reference>curators</go:reference>
      </go:dbxref>
    </go:term>
    <go:term rdf:about="go#GO:0016209">
      <go:accession>GO:0016209</go:accession>
      <go:name>antioxidant</go:name>
      <go:definition></go:definition>
      <go:isa rdf:resource="go#GO:0003674" />
      <go:association>
        <go:evidence evidence_code="ISS">
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>fbrf0105495</go:reference>
          </go:dbxref>
        </go:evidence>
        <go:gene_product>
          <go:name>CG7217</go:name>
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>FBgn0038570</go:reference>
          </go:dbxref>
        </go:gene_product>
      </go:association>
      <go:association>
        <go:evidence evidence_code="ISS">
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>fbrf0105495</go:reference>
          </go:dbxref>
        </go:evidence>
        <go:gene_product>
          <go:name>Jafrac1</go:name>
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>FBgn0040309</go:reference>
          </go:dbxref>
        </go:gene_product>
      </go:association>
    </go:term>
  </rdf:RDF>
</go:go>
The basic unit of the GO RDF-XML database is the go:term element. Because the XML id and idref attributes are limited (for instance, they cannot represent multiple parentage), RDF is used as the linking mechanism; it provides a much more flexible system for representing these relationships. To follow the links, note that the term molecular function ; GO:0003674 has the attribute
rdf:about="go#GO:0003674"
This is roughly equivalent to
id="go#GO:0003674"
In RDF, unique URLs are used as IDs so that they are universally unique. Now, note that the term antioxidant ; GO:0016209 has the tag
<go:isa
rdf:resource="go#GO:0003674" />
This shows that its parent is molecular function ; GO:0003674. This tag represents the relationship "GO:0016209 isa GO:0003674" or, in plain English, "antioxidant is a molecular function". The other type of parentage relationship is go:part-of. The term molecular function ; GO:0003674 has the tag
<go:part-of
rdf:resource="go#GO:0003673" />
This shows the relationship "molecular function is part of the Gene Ontology".
In addition, each term can have one each of go:name, go:accession and go:definition, and any number of go:dbxref and go:association elements. go:name, go:accession and go:definition are self-explanatory. go:dbxref represents the term in an external database, and go:association represents a gene association of the term. Each go:association can contain a go:evidence element, which holds a go:dbxref to the evidence supporting the association, and a go:gene_product element, which holds the gene symbol and a go:dbxref.
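The link-following described above can be sketched in Python with the standard library's xml.etree.ElementTree. This is an illustrative sketch rather than an official GO parser: the sample is a trimmed copy of the snapshot above, the function name parent_links is hypothetical, and the code assumes that every rdf:resource value ends in "#" followed by the GO ID, as in the snapshot.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the snapshot's xmlns declarations.
GO_NS = "xml-dtd/go.dtd#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the snapshot: three terms and their parent links only.
SAMPLE = """<go:go xmlns:go="xml-dtd/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="go#GO:0003673">
      <go:accession>GO:0003673</go:accession>
      <go:name>Gene_Ontology</go:name>
    </go:term>
    <go:term rdf:about="go#GO:0003674">
      <go:accession>GO:0003674</go:accession>
      <go:name>molecular_function</go:name>
      <go:part-of rdf:resource="go#GO:0003673" />
    </go:term>
    <go:term rdf:about="go#GO:0016209">
      <go:accession>GO:0016209</go:accession>
      <go:name>antioxidant</go:name>
      <go:isa rdf:resource="go#GO:0003674" />
    </go:term>
  </rdf:RDF>
</go:go>"""


def parent_links(xml_text):
    """Return (child ID, relationship, parent ID) tuples in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for term in root.iter(f"{{{GO_NS}}}term"):
        child = term.findtext(f"{{{GO_NS}}}accession")
        for relationship in ("isa", "part-of"):
            for link in term.findall(f"{{{GO_NS}}}{relationship}"):
                resource = link.get(f"{{{RDF_NS}}}resource")
                # Keep only the part after '#', e.g. "GO:0003674".
                parent = resource.rpartition("#")[2]
                links.append((child, relationship, parent))
    return links


for child, relationship, parent in parent_links(SAMPLE):
    print(child, relationship, parent)
```

Because a term may carry several go:isa or go:part-of elements, the function returns a flat list of links rather than a single parent per term, which accommodates multiple parentage.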
OBO-XML Format
OBO-XML is a direct XML serialization of the OBO 1.2 Format specification. The schema is specified using RELAX-NG compact syntax: obo-xml.rnc. Currently, only the ontology is available as OBO-XML.
OWL Format
OWL is a standard for ontology languages, produced by the W3C. Details of the translation used for GO are available on the official OboInOwl page.
MySQL Format
The MySQL version of GO can be downloaded from the GO database archives. Four databases are built and made available for download:
- termdb: ontologies, definitions and mappings to other dbs
- assocdb: the above, plus associations to gene products
- seqdb: the above, plus protein sequences for some of the gene products
- seqdblite: the above, with IEA associations stripped out (this is the version that drives AmiGO)
There are two download options for each of these databases, giving eight possible files. You only need to download one of them. Do not attempt to parse these files yourself; they are meant to be loaded into a MySQL database. There is also a Perl API for advanced queries on the database. For full details, see the README file in the archive. To obtain documentation for the GO database, download either of two files from the archive:
- go_YYYYMM-schema-mysql.sql: the MySQL table creation statements, plus documentation
- go_YYYYMM-schema-html: designed for viewing with a web browser; does not contain full documentation
Further documentation on the GO database can be found in the GO database guide.
FASTA Format
A FASTA version of the gene products in the database is available from the database archives.
Mappings to Other Classification Systems
Mappings of GO have been made to many other classification systems; a full list is available on the Mappings to GO page. The syntax of these files is as follows:
The source of the external file is given in the line beginning !Uses:
!Uses:http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_reports.spl, 15 aug 2000.
The line syntax for mappings is
external database:term identifier (id/name) > GO:GO term name ; GO:id
For example:
TIGR_role:11030 73 Amino acid biosynthesis Glutamate family > GO:glutamine family amino-acid biosynthesis ; GO:0009084
all on a single line. A term from an external system can also map to several GO terms; each additional GO term is appended with a further >. For example:
MultiFun:1.5.1.18 Isoleucine/valine > GO:isoleucine biosynthesis ; GO:0009097 > GO:valine biosynthesis ; GO:0009099
If no equivalent GO term exists for a term from another classification system, GO:. should be added as a mapping. For example:
MultiFun:1.5 Building block biosynthesis > GO:.
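The mapping syntax above is simple enough to parse by splitting each line. The following Python sketch assumes that fields are separated by " > " and " ; " exactly as in the examples, and that lines beginning with ! (such as the !Uses: line) carry no mapping; the function name parse_mapping_line is hypothetical.

```python
def parse_mapping_line(line):
    """Parse one mapping line into (database, external term, GO terms).

    Returns None for blank lines and for lines starting with '!'.
    GO terms are returned as a list of (name, GO ID) tuples; the list is
    empty when the line maps to 'GO:.' (no equivalent GO term).
    """
    line = line.strip()
    if not line or line.startswith("!"):
        return None
    source, *targets = [part.strip() for part in line.split(" > ")]
    # The database abbreviation ends at the first colon.
    database, _, external_term = source.partition(":")
    go_terms = []
    for target in targets:
        if target == "GO:.":
            continue  # explicit "no equivalent GO term" marker
        name_part, _, go_id = target.rpartition(" ; ")
        if name_part.startswith("GO:"):
            name_part = name_part[len("GO:"):]
        go_terms.append((name_part, go_id))
    return database, external_term, go_terms


example = ("MultiFun:1.5.1.18 Isoleucine/valine > "
           "GO:isoleucine biosynthesis ; GO:0009097 > "
           "GO:valine biosynthesis ; GO:0009099")
print(parse_mapping_line(example))
print(parse_mapping_line("MultiFun:1.5 Building block biosynthesis > GO:."))
```

Returning an empty list for GO:. keeps unmapped external terms visible to the caller instead of silently dropping them.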