April 21, ; Last Update:

How and when are gene symbols and names assigned?
How can I obtain the genomic sequence for a gene?
Notification of changes in Gene.
Differing Representations of RefSeqs.
How does Gene maintain certain types of information?
How are they maintained?
How are they reported from the web?
How are they reported on the ftp site?
Why can I sometimes display a record, but then cannot retrieve it by a query?
In what order are exons presented in ASN.1?
Why are links from Gene to EST not comprehensive?
How does Gene represent genes spanning the origin of replication of a circular genome?
What is a readthrough locus and how is it represented?
How can I determine the position of genes and exons for my species of interest?
How can I retrieve all records for my species of interest?
How can I identify genes that have related pseudogenes?
How can I find all genes located within a specific region of a chromosome?
Why did many bacterial GeneIDs disappear?

This section includes more details about sources, updates, and conventions for genes of uncertain function (LOC symbols).

Sources of nomenclature include species-specific nomenclature committees, with great appreciation, as enumerated here and here; the gene name (symbol) and protein names provided in submissions used as sources for RefSeq records; and symbols and full descriptions submitted by contributors of information about loci not defined by sequence.

Gene attempts to maintain current nomenclature.

You may notice, for example, that symbols in genomic RefSeq annotation, Genome Data Viewer, HomoloGene or UniGene, and their respective ftp sites, are not the same as those you find in Gene. RefSeq, for example, does not resubmit the full annotation of a genomic sequence to the nucleotide database each time a symbol changes.

The symbols seen in Genome Data Viewer and in RefSeqs for contigs, scaffolds, and chromosomes, however, should be the same, because all are updated only with each major re-annotation of a genome.

It may help to consider that the GeneID is unique across all taxa. You can therefore convert any GeneID to its current names by using the definitions provided in the file available as ftp:

Gene does not enforce uniqueness in preferred symbols. If the same symbol has been assigned to different genes, and a nomenclature committee has not provided a unique name for these genes, Gene will not impose its own solution.
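The GeneID-to-name conversion described above can be sketched roughly as follows. This is only an illustration: the path of the file is truncated in the text, so the column layout assumed here (tab-delimited, GeneID in the second column, symbol in the third, full description in the ninth) and the sample rows are assumptions made for the example, not a statement of the actual file format.

```python
import csv
from typing import Dict, Iterable, Tuple


def load_gene_names(lines: Iterable[str]) -> Dict[int, Tuple[str, str]]:
    """Map GeneID -> (symbol, description) from tab-delimited rows.

    Assumed layout (hypothetical): column 2 = GeneID, column 3 = symbol,
    column 9 = full description. Lines starting with '#' are headers.
    """
    names: Dict[int, Tuple[str, str]] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, header lines, and rows too short to parse.
        if not row or row[0].startswith("#") or len(row) < 9:
            continue
        names[int(row[1])] = (row[2], row[8])
    return names


# Made-up rows for illustration only; two different GeneIDs share a symbol.
SAMPLE = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\tdbXrefs\tchromosome\tmap_location\tdescription",
    "1\t1001\tABC1\t-\t-\t-\t1\t-\thypothetical example gene 1",
    "2\t2002\tABC1\t-\t-\t-\t3\t-\thypothetical example gene 2",
]

names = load_gene_names(SAMPLE)
for gene_id, (symbol, description) in sorted(names.items()):
    print(gene_id, symbol, description)
```

Keying the lookup on the GeneID rather than the symbol means that the two sample rows, which share a symbol, remain distinct entries.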

In other words, please consider use of the GeneID rather than a symbol as the stable identifier of a gene.

Symbols beginning with LOC. A LOC symbol is not retained when a replacement symbol has been identified, although queries by the LOC term are still supported.

Names beginning with ‘similar to’. The protein sequences are compared to public protein sequence records from several model organisms.

If a significant match is found, and the name is informative, then the automatic annotation process previously constructed the name of the model by combining ‘similar to’ and the name of the matching protein. This is not necessarily the case.

To the greatest extent possible, each protein-coding gene in mitochondria has been assigned the same name (symbol and full description) across species.


In some instances, this is at variance with the symbol assigned by species-specific nomenclature committees.

In those instances, the species-specific nomenclature is provided, but not as the default.

When a gene is annotated on a RefSeq for a chromosome or scaffold, there is an embedded display of the annotation of that gene. A YouTube video describing how to obtain genomic sequence in this manner is also available. Clicking on ‘gene’ results in a display in GenBank format of that subsequence.

For a limited number of genes in the human genome, gene-specific genomic RefSeqs, termed RefSeqGenes, have been created.

Gene maintains an RSS feed that is used to notify subscribers of current or future changes in Gene and any of its reports. If this is of interest to you, please subscribe.

These links result in a display of RefSeqs specific to the gene in the Nucleotide or Protein databases, as appropriate. The diagram of the placement of RefSeq transcripts in the Transcripts and Products Section is based on the annotation of the positions of exons and coding sequences on the indicated RefSeq.

In most cases, this RefSeq is for the chromosome record of the reference assembly. If there are alternate assemblies, they can be selected for display from the Gene Table display. For some genomes, the genomic RefSeqs are updated independently of the annotated product RNAs, with the latter being updated more frequently.

This means that several kinds of discrepancies between the diagram and the current RefSeq RNAs may result. RefSeq RNAs can differ from the reference genomic sequence, either for biological reasons (variation or RNA editing) or because of some unresolved sequence discrepancy. As discussed in the section above, it is also possible that the sequence of the RefSeq RNA was updated after it was aligned to, and used to annotate, the reference sequence. This also might result in discrepancies between the annotation on the genomic sequence and the current RefSeq RNA.

At times, one gene record may be merged into another gene record. If genes are merged after an annotation is released, there may be more than one location reported on a genomic sequence per GeneID in the Summary report, each resulting from the annotation before the merge. The names are self-explanatory. If you find a difference in position information that is ‘off-by-one’, please review the conventions used in each file.

The zero-offset convention is used in the ASN.1. Reports designed for browsing use the one-offset convention. Please be aware of this when processing these files.

Links provided from the Links menu in the upper right-hand part of the Gene record are based on both types of MIM numbers.
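For the zero-offset and one-offset position conventions noted above, a small conversion helper can make ‘off-by-one’ differences easier to reconcile. The sketch below assumes that both start and end positions are inclusive in each convention; that is an assumption made for illustration, so please check the conventions documented for the specific file being processed.

```python
from typing import Tuple


def zero_to_one_offset(start: int, end: int) -> Tuple[int, int]:
    """Convert an inclusive zero-offset interval to inclusive one-offset."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return start + 1, end + 1


def one_to_zero_offset(start: int, end: int) -> Tuple[int, int]:
    """Convert an inclusive one-offset interval to inclusive zero-offset."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end - 1


# Hypothetical exon positions, for illustration only.
browse_positions = zero_to_one_offset(99, 199)
print(browse_positions)
```

Because both ends shift by the same amount, the length of the interval (end - start + 1) is unchanged by the conversion.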

Within the body of the record, the MIM number associated with the gene is reported in the See Related and Additional links sections; a MIM number associated with a disease may be reported in the Phenotypes section, along with the name of the condition. Both types of MIM numbers associated with Gene records are reported in the ftp file mim2gene.

Data are also provided by OMIM at http:

As sequence records are added to or updated in the Protein database, they are compared to records in the Conserved Domain Database (CDD) to identify likely domain content.


The results of these analyses for RefSeq proteins are indexed for retrieval in Gene, are displayed when a Gene record is retrieved from Entrez, and are integrated into the ASN.1. The sequence of events is therefore:

To extract domain information directly for any protein sequence, consider using E-utilities. The url to fetch domain data based on a protein gi follows a fixed pattern.
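As a rough illustration of the E-utilities approach mentioned above, the sketch below builds a link request from the Protein database to CDD. The exact url pattern is not preserved in the text, so the use of the ELink utility with dbfrom=protein and db=cdd is an assumption, and the identifier shown is a placeholder rather than a real record.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def cdd_link_url(protein_id: str) -> str:
    """Build an ELink url asking for CDD records linked to one protein.

    Assumption: ELink from 'protein' to 'cdd' is the intended request;
    the original url pattern is not shown in the text.
    """
    params = {"dbfrom": "protein", "db": "cdd", "id": protein_id}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# "12345" is a placeholder identifier, not a real protein gi.
url = cdd_link_url("12345")
print(url)
```

The function only constructs the url; fetching it (for example with urllib.request.urlopen) and parsing the XML response are left out to keep the sketch short.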


GeneRIFs are established by three primary methods: extraction from the published literature by staff of the National Library of Medicine; summary reports from HuGE Navigator; and user submissions. In the first case, the records are updated weekly. In the second case, Gene processes information about how a citation in PubMed is related to a GeneID, and converts that to a standard text. In the last case, RefSeq staff reviews the submission before release, and contacts the submitter if questions arise.

User-submitted data should be public within a week.

GeneRIFs are reported from the full report in the Bibliography section. A scrolling window provides unique text of a GeneRIF; the citation or citations that support that statement are available by clicking on the document icon at the left of the GeneRIF. Please be certain to note the number of records returned by the query, and scroll through the web page to review all the citations.

GeneRIFs are reported from this subdirectory:

In these files, each GeneRIF is reported separately.

If there are multiple records for the same gene with the same text, each will be reported from one line in the file. If there are multiple records for the same gene with different text but the same PubMed id, each will be reported from one line in the file. For human, the connection is made from common protein accessions.

Most current gaps in the human set, therefore, result from lags in matching protein accessions to GeneIDs. According to Gene’s current data flow, any association of a protein accession with more than one gene record must be reviewed by a curator. This multiplicity can occur frequently with gene families where multiple genes encode the same protein sequence.

Gene currently reports, and uses for indexed queries, only the explicit GO term or terms assigned to any gene.

It does not support querying at any node of the GO graph, nor retrieving all genes that match terms at more specific nodes based on a query at a higher node. Gene represents interaction data as pairs. Gene staff does not curate these data, but does validate identifiers supplied with the source files.

The full content of discontinued records is indexed for retrieval in Gene. Often, a comment is provided in the summary section indicating why a record was discontinued. If the record is now secondary to another, the link to the current record is provided.

To retrieve all discontinued records, use this query: all[filter] NOT alive[prop]

For recent records, it is possible that the record itself is public, but the indexing of that record is not yet complete so retrieval by Entrez search returns no results.
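The query for discontinued records given above can also be sent programmatically. The sketch below builds a search url for the Gene database using the ESearch utility of E-utilities; the choice of ESearch and the retmax value are assumptions made for illustration, since the text describes only the query string itself.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Query string taken from the text above.
DISCONTINUED_QUERY = "all[filter] NOT alive[prop]"


def gene_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch url for the Gene database.

    retmax (number of identifiers returned) is an illustrative default.
    """
    params = {"db": "gene", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


search_url = gene_search_url(DISCONTINUED_QUERY)
print(search_url)
```

As noted above for recent records, a search can return no results while a record is public but not yet indexed, so an empty result does not necessarily mean the record is absent.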