About GPAD/GPI files
The Gene Ontology Consortium stores annotation data, the representation of gene product attributes using GO terms, in tab-delimited text files. Gene Product Association Data (GPAD) and Gene Product Information (GPI) companion files reduce the redundancy of the Gene Association File (GAF). In a GAF, each non-header line represents a single association between a gene product and a GO term, with a certain evidence code and the reference that supports the link, so the information about a gene product is repeated on every line that annotates it. The GPAD/GPI file system normalizes the data by separating the annotations and the metadata about gene and gene product entities into two files. GPAD/GPI is intended for internal GO use.
GO also provides annotations as GAF files and recommends use of the GAF format for most use cases. For more general information on annotation, please see the Introduction to GO annotation.
Gene Product Information (GPI) 2.0 file guidelines
This page is a summary of the Gene Product Information (GPI) 2.0 format; for full technical details and changes from GPI 1.2, see the GitHub specification page. Note that the GPI file is the companion file for the GPAD file. Both files should be submitted together using the same version.
Changes from GPI 1.2 to GPI 2.0
- Characters allowed in all fields have been explicitly specified
- Extensions in file names are: *.gpad and *.gpi
Header
- The gpi-version: header must read 2.0 for this format.
Columns
- Columns 1 & 2 in GPI 1.2 are now combined in a single column containing an ID in CURIE syntax, e.g. UniProtKB:P56704.
- NCBI taxon IDs are to be prefixed with NCBITaxon: to indicate the source of the ID, e.g. NCBITaxon:6239
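The two column changes can be illustrated with a minimal Python sketch. The function names `combine_id` and `to_ncbitaxon` are hypothetical, and the sketch assumes that the older taxon value is either a bare number or a number behind some other prefix such as `taxon:`; check the GitHub specification before relying on that assumption.

```python
def combine_id(db: str, object_id: str) -> str:
    """Join the separate DB and DB_Object_ID columns into one CURIE."""
    return f"{db}:{object_id}"


def to_ncbitaxon(taxon: str) -> str:
    """Return the taxon ID with the NCBITaxon: prefix.

    Assumption: the input is a bare number or '<prefix>:<number>'.
    """
    number = taxon.split(":")[-1]
    if not number.isdigit():
        raise ValueError(f"not a numeric taxon ID: {taxon!r}")
    return f"NCBITaxon:{number}"


print(combine_id("UniProtKB", "P56704"))  # UniProtKB:P56704
print(to_ncbitaxon("taxon:6239"))         # NCBITaxon:6239
```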
GPI Header
All annotation files MUST start with a single line denoting the file format. The database/group generating the file, as listed in dbxrefs.yaml, and the ISO-8601 formatted date the file was generated MUST also be included in the header. Example for GPI 2.0:
!gpi-version: 2.0
!generated-by: SGD
!date-generated: 2024-05-01
The group in the generated-by field must be present in the dbxrefs.yaml file. The date must be in YYYY-MM-DD format, conforming to the date portion of the ISO 8601 standard. Submitting groups may choose to include optional additional information in a file header by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers. For example:
!URL: http://www.yeastgenome.org/
!Project-release: WS275
!Funding: NHGRI grant number HG012212
!go-version: https://doi.org/10.5281/zenodo.8436609
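The header rules above could be checked with something like the following Python sketch. The function name `parse_gpi_header` is hypothetical, and the sketch does not look up the generated-by group in dbxrefs.yaml; it only checks that a value is present. Optional header lines are collected but not validated.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_gpi_header(lines):
    """Read the leading '!' lines of a GPI file and check the required ones."""
    lines = list(lines)
    if not lines or not lines[0].startswith("!gpi-version:"):
        raise ValueError("first line must declare the file format")
    header = {}
    for line in lines:
        if not line.startswith("!"):
            break  # first data row: the header is over
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip()] = value.strip()
    if header["gpi-version"] != "2.0":
        raise ValueError("gpi-version header must read 2.0")
    if not header.get("generated-by"):
        raise ValueError("generated-by header is required")
    stamp = header.get("date-generated", "")
    if not ISO_DATE.fullmatch(stamp):
        raise ValueError("date-generated must be YYYY-MM-DD")
    date.fromisoformat(stamp)  # raises ValueError for impossible dates
    return header


header = parse_gpi_header([
    "!gpi-version: 2.0",
    "!generated-by: SGD",
    "!date-generated: 2024-05-01",
    "!URL: http://www.yeastgenome.org/",
])
print(header["generated-by"])  # SGD
```

Splitting each line at the first colon only keeps values such as URLs intact.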
GPI file fields
The file format comprises 11 tab-delimited fields. Fields with multiple values (for example, gene product synonyms) should separate the values with pipes (|).
Column | Content | Required? | Cardinality | Example |
---|---|---|---|---|
1 | DB:DB_Object_ID | required | 1 | UniProtKB:Q4VCS5-1 |
2 | DB_Object_Symbol | required | 1 | AMOT |
3 | DB_Object_Name | optional | 0 or greater | Angiomotin |
4 | DB_Object_Synonym(s) | optional | 0 or greater | KIAA1071 |
5 | DB_Object_Type | required | 1 | PR:000000001 |
6 | DB_Object_Taxon | required | 1 | NCBITaxon:9606 |
7 | Encoded_by | optional | 0 or greater | HGNC:17810 |
8 | Parent_Protein | optional | 0 or 1 | UniProtKB:Q4VCS5 |
9 | Protein_Containing_Complex_Members | optional | 0 or greater | SGD:S000003821|SGD:S000001456|SGD:S000005047 |
10 | DB_Xref(s) | optional | 0 or greater | NCBIGene:154796|ENSEMBL:ENSG00000126016 |
11 | Gene_Product_Properties | optional | 0 or greater | db_subset=Swiss-Prot |
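As an illustration of the column layout, the sketch below splits one row on tabs and the multi-valued columns on pipes. `GpiRecord` and `parse_gpi_line` are hypothetical names, the sample row is assembled from the Example column of the table purely for demonstration, and the name column is kept as a single string here as a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpiRecord:
    db_object_id: str
    symbol: str
    name: str
    synonyms: List[str]
    object_type: str
    taxon: str
    encoded_by: List[str]
    parent_protein: List[str]
    complex_members: List[str]
    db_xrefs: List[str]
    properties: List[str]


def split_multi(value: str) -> List[str]:
    """Split a pipe-separated column; an empty column gives an empty list."""
    return value.split("|") if value else []


def parse_gpi_line(line: str) -> GpiRecord:
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 11:
        raise ValueError(f"expected 11 columns, found {len(cols)}")
    for index in (0, 1, 4, 5):  # columns 1, 2, 5 and 6 are required
        if not cols[index]:
            raise ValueError(f"column {index + 1} is required")
    return GpiRecord(
        cols[0], cols[1], cols[2], split_multi(cols[3]), cols[4], cols[5],
        split_multi(cols[6]), split_multi(cols[7]), split_multi(cols[8]),
        split_multi(cols[9]), split_multi(cols[10]),
    )


row = "\t".join([
    "UniProtKB:Q4VCS5-1", "AMOT", "Angiomotin", "KIAA1071", "PR:000000001",
    "NCBITaxon:9606", "HGNC:17810", "UniProtKB:Q4VCS5", "",
    "NCBIGene:154796|ENSEMBL:ENSG00000126016", "db_subset=Swiss-Prot",
])
record = parse_gpi_line(row)
print(record.db_xrefs)  # ['NCBIGene:154796', 'ENSEMBL:ENSG00000126016']
```

Only the line ending is stripped before splitting, so trailing empty columns are preserved and the column count stays at 11.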
GPI 2.0 example content
SGD:S000005027 Sal1 ADP/ATP transporter YNL083W PR:000000001 NCBITaxon:559292 UniProtKB:D6W196
Complex:
SGD:S000217643 CBF1:MET4:MET28 sulfur metabolism transcription factor complex GO:0032991 NCBITaxon:559292 SGD:S000003821|SGD:S000001456|SGD:S000005047 ComplexPortal:CPX-1016
ncRNA:
RNAcentral:URS0000527F89_9606 Homo sapiens (human) hsa-miR-145-5p SO:0000276 NCBITaxon:9606 HGNC:31532 NCBIGene:406937|ENSEMBL:ENSG00000276365
Definitions and requirements for field contents
1. DB:DB Object ID
A unique identifier for the item being annotated. The DB prefix is the database from which the DB Object ID is drawn and must be one of the values from the set of GO database cross-references. The DB:DB Object ID is the combined identifier for the database object. Examples:
UniProtKB:P99999
SGD:S000002164
MGI:MGI:1919306
The identifier usually references the canonical form of a gene or gene product, including functional RNAs. Identifiers may also describe gene variants or distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the identifier does not refer to the canonical form of a gene or gene product, the Gene Product Information (GPI) file should contain information about the canonical form.
This field is mandatory, cardinality 1.
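Because an identifier such as MGI:MGI:1919306 contains more than one colon, the DB prefix has to be taken from the text before the first colon only. A minimal Python sketch follows; `split_curie` is a hypothetical helper, and `known_prefixes` stands in for the set of GO database cross-references, which the caller would have to load separately.

```python
def split_curie(curie, known_prefixes=None):
    """Split DB:DB_Object_ID at the first colon into (DB, DB_Object_ID)."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a two-part identifier: {curie!r}")
    if known_prefixes is not None and prefix not in known_prefixes:
        raise ValueError(f"unknown database prefix: {prefix!r}")
    return prefix, local_id


print(split_curie("UniProtKB:P99999"))  # ('UniProtKB', 'P99999')
print(split_curie("MGI:MGI:1919306"))   # ('MGI', 'MGI:1919306')
```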
2. DB Object Symbol
A (unique and valid) symbol to which the DB:DB_Object_ID is matched. No white spaces allowed.
The text entered in the DB_Object_Symbol should refer to the entity in DB:DB_Object_ID. The DB_Object_Symbol field should contain a symbol that is recognizable to a biologist wherever possible (gene product symbol, abbreviation widely used in the literature, ORF name, etc.). It is not a unique identifier or an accession number (unlike the DB:DB_Object_ID), although IDs can be used as a DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated). For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in DB:DB_Object_ID, but with the same gene symbol in the DB_Object_Symbol column.
This field is mandatory, cardinality 1.
3. DB Object Name
The name of the gene or gene product in DB:DB_Object_ID. The text entered in the DB_Object_Name should refer to the entity in DB:DB_Object_ID. White spaces are allowed in this field.
This field is not mandatory, cardinality 0, 1.
4. DB Object Synonym
Alternative names for the entity in DB:DB_Object_ID. These entries may be a gene symbol or other text. Note that we strongly recommend that synonyms are included in the GPI file, as this aids the searching of GO.
This field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene).
5. DB Object Type
An ontology identifier for the biological entity in DB:DB_Object_ID which is annotated with GO. This field uses Sequence Ontology, Protein Ontology, and GO IDs and must correspond to one of the permitted GPI entity types or a more granular child term. Common entries include:
- protein PR:000000001
- protein-coding gene SO:0001217
- gene SO:0000704
- ncRNA SO:0000655
- any subtype of ncRNA in the Sequence Ontology, including ncRNA-coding gene SO:0001263
- protein-containing complex GO:0032991
The object type listed in the DB_Object_Type field must match the database entry identified by the DB:DB_Object_ID.
This field is mandatory, cardinality 1.
6. DB Object Taxon
The NCBI taxon ID of the species encoding the DB:DB_Object_ID, including the prefix NCBITaxon:, e.g. NCBITaxon:9606.
This field is mandatory, cardinality 1.
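A shallow format check for columns 5 and 6 might look like the Python sketch below. `check_type_and_taxon` is a hypothetical name. The sketch only tests that the type ID comes from the Sequence Ontology, Protein Ontology or GO and that the taxon has the NCBITaxon: prefix followed by digits; it does not verify that the type is a permitted entity type or a child of one, which would require the ontologies themselves.

```python
import re

TYPE_PREFIXES = {"SO", "PR", "GO"}
TAXON_PATTERN = re.compile(r"NCBITaxon:[0-9]+")


def check_type_and_taxon(object_type, taxon):
    """Return a list of problems found in the type and taxon columns."""
    problems = []
    prefix, sep, local_id = object_type.partition(":")
    if not sep or not local_id or prefix not in TYPE_PREFIXES:
        problems.append(f"DB_Object_Type is not an SO, PR or GO ID: {object_type!r}")
    if not TAXON_PATTERN.fullmatch(taxon):
        problems.append(f"DB_Object_Taxon lacks the NCBITaxon: form: {taxon!r}")
    return problems


print(check_type_and_taxon("PR:000000001", "NCBITaxon:9606"))  # []
print(len(check_type_and_taxon("protein", "9606")))            # 2
```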
7. Encoded by
For proteins and transcripts, Encoded by refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries.
8. Parent Protein
When column 1 refers to a protein isoform or modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries.
9. Protein Containing Complex Members
When column 1 references a protein-containing complex, this column contains the gene-centric reference protein accessions.
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries.
10. DB Xrefs
Identifiers for the object in DB:DB_Object_ID found in other databases. Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:Q60FP0. For gene products in model organism databases, DB_Xrefs must include the UniProtKB ID, and may also include NCBI gene or protein IDs, etc.
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries.
11. Gene Product Properties
The Properties column can be filled with a pipe-separated list of values in the format "property_name=property_value". There is a fixed vocabulary for the property names, and this list can be extended when necessary. Supported properties will include: 'GO annotation complete' and 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, kidney, etc.), and 'Database subset' (e.g. Swiss-Prot, TrEMBL).
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries.
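The properties column can be read into a dictionary along the lines of the Python sketch below. `parse_properties` is a hypothetical name, and the sketch does not check property names against the fixed vocabulary; it tolerates spaces around the equals sign and keeps every value, in case a property name appears more than once.

```python
def parse_properties(field):
    """Turn 'name=value|name=value' into a dict mapping names to value lists."""
    properties = {}
    if not field:
        return properties
    for item in field.split("|"):
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected property_name=property_value, got {item!r}")
        properties.setdefault(name.strip(), []).append(value.strip())
    return properties


print(parse_properties("db_subset=Swiss-Prot"))  # {'db_subset': ['Swiss-Prot']}
```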