Gene Product Information (GPI) file description
The Gene Product Information (GPI) file represents every annotatable entity in an organism: protein-coding genes, non-coding RNA genes, protein isoforms (i.e., splice variants) and modified forms, such as cleaved forms or proteins modified by post-translational modifications. The entities should be non-redundant.
This file is used to normalize annotations to single genes, and to map different identifiers for the same entity across different resources.
GPI 2.0 file format
This page is a summary of the GPI 2.0 file format; for full technical details and changes from the previous format, GPI 1.2, see the Full GPI 2.0 Specification page.
GPI File Header
Each line of the file header must be prefixed with an exclamation mark (!).
Mandatory elements of the GPI 2.0 file header are:
- gpi-version
- the name of the database or group generating the file, as listed in the dbxrefs.yaml file
- the date the file was generated, conforming to the date portion of the ISO 8601 standard, i.e. YYYY-MM-DD
- Example GPI 2.0 header:
  !gpi-version: 2.0
  !generated-by: SGD
  !date-generated: 2024-05-01
- Additional information may also be included, for example project URL and funding sources. For example:
  !URL: http://www.yeastgenome.org/
  !Project-release: WS275
  !Funding: NHGRI grant number HG012212
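The header rules above can be checked mechanically. The following is a minimal sketch, not an official validator: the function names `parse_gpi_header` and `header_problems` are hypothetical, the key names are taken from the example header, and it assumes that all header lines come before the first data line.

```python
import re
from datetime import date

# Key names as they appear in the example header above.
REQUIRED_KEYS = ("gpi-version", "generated-by", "date-generated")


def parse_gpi_header(lines):
    """Collect '!key: value' lines into a dict, stopping at the first data line."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            break  # assumption: the header is contiguous at the top of the file
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header


def header_problems(header):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in header]
    stamp = header.get("date-generated")
    if stamp is not None:
        try:
            # Require the exact YYYY-MM-DD shape, then check it is a real date.
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", stamp):
                raise ValueError(stamp)
            date.fromisoformat(stamp)
        except ValueError:
            problems.append("date-generated is not a valid YYYY-MM-DD date")
    return problems
```

Optional header lines such as `!URL:` are kept in the returned dictionary but are not checked.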
GPI File Contents
The GPI 2.0 file comprises 11 tab-delimited fields. For fields that allow multiple values, the values should be separated by pipes (|).
Required fields (cardinality 1) are in bold. Within the table, pipes that separate multiple values in a single field are written as \|.
Column | Content | Cardinality | Example 1: protein | Example 2: isoform | Example 3: protein complex | Example 4: modified form | Example 5: ncRNA |
---|---|---|---|---|---|---|---|
1 | **DB:Object ID** | 1 | UniProtKB:Q4VCS5 | UniProtKB:Q4VCS5-1 | SGD:S000217643 | PR:Q9DAQ4-1 | RNAcentral:URS0000527F89_9606 |
2 | **Object Symbol** | 1 | AMOT | AMOT | CBF1:MET4:MET28 | m1700003E16Rik/iso:m1 | URS0000527F89_9606 |
3 | Object Name | 0 or 1 | Angiomotin | Angiomotin | sulfur metabolism transcription factor complex | uncharacterized protein C2orf81 homolog isoform m1 (mouse) | Homo sapiens (human) hsa-miR-145-5p |
4 | Object_Synonym(s) | 0 or > | KIAA1071 | KIAA1071 | | m1700003E16Rik/iso:m1 PR:000000001 | |
5 | **Object Type** | 1 | PR:000000001 | PR:000000001 | GO:0032991 | PR:000000001 | SO:0000276 |
6 | **Object Taxon** | 1 | NCBITaxon:9606 | NCBITaxon:9606 | NCBITaxon:559292 | NCBITaxon:10090 | NCBITaxon:9606 |
7 | Encoded by | 0 or > | HGNC:17810 | HGNC:17810 | | MGI:MGI:1919087 | HGNC:31532 |
8 | Parent Protein | 0 or 1 | | UniProtKB:Q4VCS5 | | PR:Q9DAQ4 | |
9 | Protein Complex Members | 0 or > | | | SGD:S000003821\|SGD:S000001456\|SGD:S000005047 | | |
10 | Cross-reference(s) | 0 or > | NCBIGene:154796\|ENSEMBL:ENSG00000126016 | NCBIGene:154796\|ENSEMBL:ENSG00000126016 | ComplexPortal:CPX-1016 | UniProtKB:Q9DAQ4-1 | ENSG00000276365 |
11 | Gene Product Properties | 0 or > | db_subset=Swiss-Prot | | | | |
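To illustrate the layout of a data line, here is a minimal sketch of a row parser. It assumes exactly 11 tab-separated fields per line with empty optional fields kept as empty strings. The short field names (`id`, `symbol`, and so on) and the function name `parse_gpi_row` are hypothetical labels chosen for this example, not part of the format.

```python
# Hypothetical short names for the 11 columns, in file order.
COLUMNS = (
    "id", "symbol", "name", "synonyms", "type", "taxon",
    "encoded_by", "parent_protein", "complex_members", "xrefs", "properties",
)
# Columns whose cardinality allows more than one pipe-separated value.
MULTI_VALUED = {"synonyms", "encoded_by", "complex_members", "xrefs", "properties"}
# Columns with cardinality 1.
REQUIRED = {"id", "symbol", "type", "taxon"}


def parse_gpi_row(line):
    """Split one data line into a dict; multi-valued columns become lists."""
    # Strip only the line ending: trailing tabs mark empty trailing fields.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {}
    for name, raw in zip(COLUMNS, fields):
        if name in REQUIRED and not raw:
            raise ValueError(f"required field {name!r} is empty")
        if name in MULTI_VALUED:
            row[name] = raw.split("|") if raw else []
        else:
            row[name] = raw
    return row


# Example 1 from the table above, written as a single tab-delimited line.
example = (
    "UniProtKB:Q4VCS5\tAMOT\tAngiomotin\tKIAA1071\tPR:000000001\t"
    "NCBITaxon:9606\tHGNC:17810\t\t\t"
    "NCBIGene:154796|ENSEMBL:ENSG00000126016\tdb_subset=Swiss-Prot"
)
row = parse_gpi_row(example)
```

Representing empty multi-valued fields as empty lists lets downstream code iterate over them without special-casing missing values.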
Definitions and requirements for GPI 2.0 field contents
1. DB:Object ID
- A unique identifier for the entity being annotated, composed of two elements: a DB prefix, which identifies the database and must be described in the GO dbxrefs.yaml file, and a DB Object ID, which is the alphanumeric identifier corresponding to the entity. The DB:DB Object ID is the combined identifier for the database object. Examples:
- UniProtKB:P99999
- SGD:S000002164
- MGI:MGI:1919306
- The identifier may reference the canonical form of a gene or gene product, including functional RNAs, as well as gene variants and distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the identifier is not a canonical gene or gene product identifier, the corresponding canonical form must be referenced in Column 8 (Parent Protein) of the GPI file.
- Cardinality = 1
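Because the local part of an identifier can itself contain a colon (as in MGI:MGI:1919306), the prefix should be taken from the text before the first colon only. A minimal sketch follows; `split_object_id` is a hypothetical helper, and `KNOWN_PREFIXES` is a small stand-in for the prefixes that would in practice be loaded from dbxrefs.yaml.

```python
# Stand-in for the prefixes listed in dbxrefs.yaml (assumption: loaded elsewhere).
KNOWN_PREFIXES = {"UniProtKB", "SGD", "MGI"}


def split_object_id(object_id, known_prefixes=KNOWN_PREFIXES):
    """Return (prefix, local_id), splitting on the first colon only."""
    prefix, sep, local_id = object_id.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a two-part identifier: {object_id!r}")
    if prefix not in known_prefixes:
        raise ValueError(f"unknown DB prefix: {prefix!r}")
    return prefix, local_id


prefix, local_id = split_object_id("MGI:MGI:1919306")
```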
2. Object Symbol
- The unique symbol corresponding to the DB:Object_ID in Column 1; usually the name of the gene. No white spaces allowed.
- The symbol is not a unique identifier or an accession number (unlike the DB:Object_ID), but if the entity does not have a symbol, the DB:Object_ID may be used as Object Symbol. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in DB:Object_ID, but with the same gene symbol in the Object_Symbol column.
- Cardinality = 1
3. Object Name
- The name of the gene or gene product corresponding to the DB:Object_ID in Column 1. White spaces are allowed in this field.
- Cardinality = 0 or 1
4. Object Synonym
- Alternative names for the entity in DB:Object_ID in Column 1. These entries may be a gene symbol, clone ID, or any other label or identifier. Object synonyms are useful for searching.
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
5. Object Type
- An ontology identifier describing the class of biological entity of the DB:Object_ID in Column 1. The ontology identifier must be a value from Protein Ontology for proteins, Gene Ontology for protein complexes, or Sequence Ontology for all other entities. Allowed entity types:
- PR:000000001: protein
- GO:0032991: protein-containing complex
- SO:0001217: protein-coding gene
- SO:0000704: gene
- SO:0000655: ncRNA or any SO child term
- SO:0001263: ncRNA-coding gene or any SO child term
- Note on object types: This field should describe the type of biological object as defined by the contributing database. For example, WormBase identifiers represent genes, PomBase identifiers represent protein-coding genes, and SGD identifiers represent proteins.
- GO strongly recommends against using ‘gene’ or ‘gene product’, as these types do not distinguish between proteins and ncRNAs.
- Cardinality = 1
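A check against the allowed entity types could look like the sketch below. The four terms without children can be compared directly, but accepting "any SO child term" requires an ontology lookup, which is not shown here. The `ancestors_of` argument is a hypothetical callable, supplied by the caller, that returns the Sequence Ontology ancestors of a term. The table's ncRNA example (SO:0000276) is a case that would need such a lookup.

```python
# Terms that must match exactly.
ALLOWED_EXACT = {"PR:000000001", "GO:0032991", "SO:0001217", "SO:0000704"}
# Terms for which any SO child term is also allowed.
ALLOWED_WITH_CHILDREN = {"SO:0000655", "SO:0001263"}


def is_allowed_object_type(term, ancestors_of=None):
    """Return True if term is an allowed type or a child of an extensible one.

    ancestors_of is a hypothetical callable: term -> iterable of ancestor IDs.
    Without it, only the listed terms themselves are accepted.
    """
    if term in ALLOWED_EXACT or term in ALLOWED_WITH_CHILDREN:
        return True
    if ancestors_of is None:
        return False
    return bool(ALLOWED_WITH_CHILDREN & set(ancestors_of(term)))
```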
6. Object Taxon
- The NCBI taxon ID of the organism (species or strain) encoding the DB:Object_ID from Column 1, in the format NCBITaxon:numerical_identifier.
- Cardinality = 1
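The taxon format lends itself to a simple pattern check. This sketch only verifies the shape of the value, assuming the numerical identifier is a run of decimal digits. It does not confirm that the ID exists in the NCBI taxonomy, and `is_valid_taxon` is a hypothetical name.

```python
import re

# "NCBITaxon:" followed by one or more decimal digits, and nothing else.
TAXON_PATTERN = re.compile(r"NCBITaxon:[0-9]+")


def is_valid_taxon(value):
    """Return True if value has the NCBITaxon:numerical_identifier shape."""
    return TAXON_PATTERN.fullmatch(value) is not None
```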
7. Encoded by
- For proteins and transcripts, Encoded by refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
8. Parent Protein
- When the DB:Object_ID in Column 1 describes a protein isoform or a modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
- Cardinality = 0, 1
9. Protein-Containing Complex Members
- When the DB:Object_ID in Column 1 describes a protein-containing complex, this column contains the gene-centric reference protein accessions.
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
10. Database cross-references (DB_Xrefs)
- Identifiers for the object in DB:Object_ID found in other databases. Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:Q60FP0. For proteins in model organism databases, DB_Xrefs must include the corresponding UniProtKB ID, and may also include NCBI gene or protein IDs, etc.
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
11. Gene Product Properties
- The Properties column can be filled with a pipe-separated list of values in the format “property_name = property_value”. There is a fixed vocabulary for the property names, and this list can be extended when necessary. Supported properties will include: ‘GO annotation complete’ and ‘Phenotype annotation complete’ (the value for these two properties would be a date), ‘Target set’ (e.g. Reference Genome, kidney, etc.), and ‘Database subset’ (e.g. Swiss-Prot, TrEMBL).
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
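The property list can be unpacked as sketched below. This is an illustration under assumptions: `parse_properties` is a hypothetical helper, whitespace around the equals sign is tolerated, and repeated property names are collected into a list because the text above does not say whether a name may appear more than once. It does not check names against the fixed vocabulary.

```python
def parse_properties(field):
    """Parse 'name=value|name=value' into a dict mapping each name to a list of values."""
    properties = {}
    if not field:
        return properties  # cardinality 0: the column is empty
    for item in field.split("|"):
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"not in property_name=property_value form: {item!r}")
        properties.setdefault(name.strip(), []).append(value.strip())
    return properties


# The value from Example 1 in the table above.
props = parse_properties("db_subset=Swiss-Prot")
```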