------------------------------------------------------------------------
Swiss-Prot Protein Knowledgebase
Release Notes
Release 41, February 2003
------------------------------------------------------------------------
Table of contents
1 Introduction
2 Description of the changes made to Swiss-Prot since release 40
3 Forthcoming changes
4 Status of the documentation files
5 The ExPASy World-Wide Web server
6 TrEMBL - a supplement to Swiss-Prot
7 FTP access to Swiss-Prot and TrEMBL
8 ENZYME and PROSITE
9 We need your help!
A Appendix A
1 Introduction
Release 41.0 of Swiss-Prot contains 122'564 sequence entries, comprising
44'986'459 amino acids abstracted from 103'486 references. This represents
an increase of 20% over release 40.0. The growth of the database is
summarized below.
Release  Date   Number of    Number of
                  entries  amino acids
2.0      09/86      3'939      900'163
3.0      11/86      4'160      969'641
4.0      04/87      4'387    1'036'010
5.0      09/87      5'205    1'327'683
6.0      01/88      6'102    1'653'982
7.0      04/88      6'821    1'885'771
8.0      08/88      7'724    2'224'465
9.0      11/88      8'702    2'498'140
10.0     03/89     10'008    2'952'613
11.0     07/89     10'856    3'265'966
12.0     10/89     12'305    3'797'482
13.0     01/90     13'837    4'347'336
14.0     04/90     15'409    4'914'264
15.0     08/90     16'941    5'486'399
16.0     11/90     18'364    5'986'949
17.0     02/91     20'024    6'524'504
18.0     05/91     20'772    6'792'034
19.0     08/91     21'795    7'173'785
20.0     11/91     22'654    7'500'130
21.0     03/92     23'742    7'866'596
22.0     05/92     25'044    8'375'696
23.0     08/92     26'706    9'011'391
24.0     12/92     28'154    9'545'427
25.0     04/93     29'955   10'214'020
26.0     07/93     31'808   10'875'091
27.0     10/93     33'329   11'484'420
28.0     02/94     36'000   12'496'420
29.0     06/94     38'303   13'464'008
30.0     10/94     40'292   14'147'368
31.0     02/95     43'470   15'335'248
32.0     11/95     49'340   17'385'503
33.0     02/96     52'205   18'531'384
34.0     10/96     59'021   21'210'389
35.0     11/97     69'113   25'083'768
36.0     07/98     74'019   26'840'295
37.0     12/98     77'977   28'268'293
38.0     07/99     80'000   29'085'965
39.0     05/00     86'593   31'411'114
40.0     10/01    101'602   37'315'215
41.0     02/03    122'564   44'986'459
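The 20% figure quoted above can be rechecked from the last two rows of the
table. The following Python sketch is only an illustration; the function
name is ours and is not part of any Swiss-Prot tool.

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures taken from the rows for releases 40.0 and 41.0 in the table above.
entries_growth = percent_increase(101_602, 122_564)
residues_growth = percent_increase(37_315_215, 44_986_459)

print(f"entries: {entries_growth:.1f}%, amino acids: {residues_growth:.1f}%")
```

Both values come out slightly above 20%, which is consistent with the
rounded figure given in the text.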
2 Description of the changes made to Swiss-Prot since release 40
2.1 Sequences and annotations
21'133 sequences have been added since release 40, the sequence data of
3'251 existing entries has been updated and the annotations of 57'525
entries have been revised.
2.2 The HPI project
The Human Proteomics Initiative (HPI) is a major effort to annotate all
known human sequences according to the quality standards of
Swiss-Prot. This means that, for each known protein, a wealth of
information is provided, which includes the description of its function,
its domain structure, subcellular location, post-translational
modifications (PTMs), variants, similarities to other proteins, etc. This
not only implies the annotation of newly detected proteins, but also the
integration of new research data into the existing entries by specialized
biologists, who are in close contact with experts all over the world.
There are currently 9'172 annotated human sequences in Swiss-Prot.
Up-to-date detailed statistics concerning the HPI project are available at:
http://www.expasy.org/sprot/hpi/hpi_stat.html
Simultaneously, two further efforts have been stepped up: the description
of human diseases associated with deficiency(ies) in the protein, and the
annotation of mammalian orthologs of human proteins at a level equivalent
to that of the cognate human sequences.
For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to fully participate
in this initiative by providing all the necessary information to define and
to speed up the comprehensive annotation of the human proteome.
For a detailed description of the HPI project please consult:
http://www.expasy.org/sprot/hpi/
2.3 The HAMAP project
The first complete microbial genome sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then, more
than 100 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic and nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes encode more
than 230'000 different proteins.
In order to handle the large amount of "raw" data coming from microbial
genome sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. The project aims to automatically
annotate a significant percentage of protein sequences, which originate
from microbial genome sequencing projects.
To maintain a high level of annotation quality, specific tools are
developed to deal with two completely separate subsets of bacterial and
archaeal proteins: proteins that have no recognizable similarity to any
other microbial or non-microbial proteins ("ORFans") and proteins that are
part of well-defined families or subfamilies. This is done by using a rule
system that describes the level and extent of annotations that can be
assigned by similarity with a prototype manually annotated entry. The
result is a curated entry whose quality is identical to that produced
manually by an expert annotator.
Programs under development are designed to recognize protein peculiarities,
and only proteins which match the defined criteria are processed
automatically. Protein sequences which fail to fit into the rule system are
further analyzed by Swiss-Prot expert annotators.
For a detailed description of the HAMAP project and its current status
please consult:
http://www.expasy.org/sprot/hamap/
and:
Gattiker A., Michoud K., Rivoire C., Auchincloss A.H., Coudert E., Lima T.,
Kersey P., Pagni M., Sigrist C.J.A., Lachaize C., Veuthey A.-L., Bairoch A.
Automatic annotation of microbial proteomes in Swiss-Prot.
Comput. Biol. Chem. 27:49-58(2003).
2.4 What's happening with the model organisms?
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
* be as complete as possible. All sequences available at a given time
should be immediately included in Swiss-Prot. This also includes
sequence corrections and updates;
* provide a higher level of annotation;
* provide cross-references to specialized database(s) that contain,
among other data, some information about the genes that code for these
proteins;
* provide specific indexes and documents.
From our efforts to annotate human sequence entries as completely as
possible arose the HPI project (see 2.2), and the bacterial model organisms
became the focus of the HAMAP project (see 2.3). Here is the current status
of the model organisms which are not covered by these two projects:
Organism        Database          Index file    Number of
                cross-references                sequences
--------------  ----------------  ------------  ---------
A.thaliana      None yet          arath.txt         1'952
C.albicans      None yet          calbican.txt        264
C.elegans       Wormpep           celegans.txt      2'291
D.discoideum    DictyDB           dicty.txt           316
D.melanogaster  FlyBase           fly.txt           1'764
M.musculus      MGD               mgdtosp.txt       6'169
S.cerevisiae    SGD               yeast.txt         4'892
S.pombe         GeneDB_SPombe     pombe.txt         2'116
2.5 'Nucleomorph' added to the OrGanelle (OG) line
The OG (OrGanelle) line indicates from which genome a gene for a protein
originates. Until now, the defined terms in the OG line were 'Chloroplast',
'Cyanelle', 'Mitochondrion' and 'Plasmid'. The term 'Nucleomorph' has been
added; a nucleomorph is the residual nucleus of an algal endosymbiont that
resides inside its host cell.
2.6 Progress in the conversion of Swiss-Prot to mixed-case
characters
We are gradually converting Swiss-Prot entries from all 'UPPER CASE' to
'MiXeD CaSe'. With this release the RC (Reference Comment) line topic
STRAIN and the CC line topic 'CATALYTIC ACTIVITY' have been converted.
As described in section 3.2, the process of converting all of Swiss-Prot to
mixed case continues.
2.7 Multiple RP lines
Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a Swiss-Prot entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
the type of molecule that has been sequenced, protein characterization, PTM
characterization, protein structure analysis, variation detection, etc.
As the number of experimental results per publication has increased over
the years, a single RP line per reference was no longer sufficient to hold
all the information while maintaining a consistent format. We therefore
decided to permit multiple RP lines.
Example:
RP SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP CHARACTERIZATION.
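For programs that read Swiss-Prot entries, the practical consequence is
that consecutive RP lines have to be joined before the reference position
is interpreted. A minimal Python sketch follows. It assumes that every
line starts with its two-letter line code followed by blanks; the function
name is hypothetical.

```python
def join_rp_lines(entry_lines):
    """Return one joined RP text per run of consecutive RP lines."""
    positions = []
    current = []
    for line in entry_lines:
        code, _, text = line.partition(" ")
        if code == "RP":
            current.append(text.strip())
        elif current:
            # Any other line code ends the current run of RP lines.
            positions.append(" ".join(current))
            current = []
    if current:
        positions.append(" ".join(current))
    return positions


example = [
    "RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND",
    "RP   CHARACTERIZATION.",
]
print(join_rp_lines(example))
```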
2.8 Changes concerning cross-references (DR line)
2.8.1 Schizosaccharomyces pombe GeneDB database
We have added cross-references to the Schizosaccharomyces pombe GeneDB
database (available at http://www.genedb.org/genedb/pombe/index.jsp), which
contains all S. pombe known and predicted protein coding genes, pseudogenes
and tRNAs. It is hosted by the Sanger Institute.
The identifiers of the appropriate DR line are:
Data bank identifier: GeneDB_SPombe
Primary identifier: GeneDB's unique identifier for a S. pombe gene.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR GeneDB_SPombe; SPAC9E9.12c; -.
2.8.2 Genew
We have added cross-references to the Human Gene Nomenclature Database
Genew (available at http://www.gene.ucl.ac.uk/nomenclature/searchgenes.pl),
which provides data for all human genes which have approved symbols. It is
managed by the HUGO Gene Nomenclature Committee (HGNC).
The identifiers of the appropriate DR line are:
Data bank identifier: Genew
Primary identifier: HGNC's unique identifier for a human gene
Secondary identifier: HGNC's approved gene symbol.
Example: DR Genew; HGNC:5217; HSD3B1.
2.8.3 Gramene
We have added cross-references to the Gramene database, a comparative
mapping resource for grains (available at http://www.gramene.org/). The
format of the explicit links is:
Data bank identifier: Gramene
Primary identifier: Unique identifier for a protein, which is identical
to the Swiss-Prot primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR Gramene; Q06967; -.
2.8.4 HAMAP
We have added cross-references to the collection of orthologous microbial
protein families, generated manually by expert curators of the HAMAP
(High-quality Automated and Manual Annotation of microbial Proteomes)
project in the framework of the Swiss-Prot protein knowledgebase. The data
is accessible at http://www.expasy.org/sprot/hamap/families.html.
The identifiers of the appropriate DR line are:
Data bank identifier: HAMAP
Primary identifier: HAMAP unique identifier for a microbial protein
                    family.
Secondary identifier: The values are either '-', 'fused', 'atypical' or
                    'atypical/fused'. The value '-' is a placeholder for
                    an empty field; the 'fused' value indicates that the
                    family rule does not cover the entire protein; the
                    value 'atypical' points out that the protein is
                    divergent in sequence or has mutated functional
                    sites, and should not be included in family datasets.
                    The value 'atypical/fused' indicates that both of
                    the latter conditions apply.
Tertiary identifier: Number of domains found in the protein, generally
                    '1', rarely '2' for the fusion of 2 identical
                    domains.
Example: DR HAMAP; MF_00012; -; 1.
2.8.5 Phosphorylation Site Database
We have added cross-references to the Phosphorylation Site Database,
PhosSite (available at http://vigen.biochem.vt.edu/xpd/xpd.htm), which
provides access to information from scientific literature concerning
prokaryotic proteins that undergo covalent phosphorylation on the hydroxyl
side chains of serine, threonine or tyrosine residues. The identifiers of
the appropriate DR line are:
Data bank identifier: PhosSite
Primary identifier: Unique identifier for a phosphoprotein, which is
identical to the Swiss-Prot primary AC number of
that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR PhosSite; P00955; -.
2.8.6 TIGRFAMs
We have added cross-references to TIGRFAMs, a protein family database
available at http://www.tigr.org/TIGRFAMs/. The identifiers of the
appropriate DR line are:
Data bank identifier: TIGRFAMs
Primary identifier: TIGRFAMs unique identifier for a protein family.
Secondary identifier: TIGRFAMs entry name for a protein family.
Tertiary identifier: Number of hits found in the sequence.
Example: DR TIGRFAMs; TIGR00630; uvra; 1.
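The DR lines introduced in sections 2.8.1 to 2.8.6 share one layout: a
data bank identifier followed by two or three further identifiers,
separated by semicolons and terminated by a period. The Python sketch
below splits such a line into its fields. It is an illustration under
stated assumptions (identifiers contain no semicolons, the line code is
followed by blanks), not an official parser, and the function name is ours.

```python
def parse_dr_line(line):
    """Split one DR line into (databank, primary, secondary, tertiary).

    Missing fields and the '-' placeholder are returned as None.
    """
    code, _, body = line.partition(" ")
    if code != "DR":
        raise ValueError(f"not a DR line: {line!r}")
    body = body.strip()
    # Remove only the final period that terminates the line; identifiers
    # such as 'SPAC9E9.12c' keep their internal periods.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) not in (3, 4):
        raise ValueError(f"unexpected number of fields: {line!r}")
    databank, primary, secondary = fields[:3]
    tertiary = fields[3] if len(fields) == 4 else None
    if secondary == "-":
        secondary = None
    return databank, primary, secondary, tertiary


for example in (
    "DR   GeneDB_SPombe; SPAC9E9.12c; -.",
    "DR   Genew; HGNC:5217; HSD3B1.",
    "DR   HAMAP; MF_00012; -; 1.",
    "DR   TIGRFAMs; TIGR00630; uvra; 1.",
):
    print(parse_dr_line(example))
```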
2.8.7 CarbBank
We have removed the Swiss-Prot cross-references to CarbBank.
2.8.8 GCRDb
We have removed the Swiss-Prot cross-references to GCRDb.
2.8.9 Mendel
We have removed the Swiss-Prot cross-references to Mendel.
2.8.10 YEPD
We have removed the Swiss-Prot cross-references to the yeast
electrophoresis protein database (YEPD).
2.9 Explicit links to dbSNP in FT VARIANT lines of human sequence
entries
In human protein sequence entries we have introduced explicit links to the
Single Nucleotide Polymorphism database (dbSNP) from the feature
description of FT VARIANT keys. The format of such links is:
FT VARIANT from to description (IN dbSNP:accession_number).
FT /FTId=VAR_number.
Example:
FT VARIANT 65 65 T -> I (IN dbSNP:1065419).
FT /FTId=VAR_012009.
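The dbSNP links can be extracted from the feature descriptions with a
regular expression. The Python sketch below assumes that dbSNP accession
numbers consist of digits only, as in the example above, and that the
word 'IN' may later appear in lower case once the feature table is
converted to mixed case. Both points are our assumptions.

```python
import re

# Matches '(IN dbSNP:1065419)' and captures the accession number.
DBSNP_LINK = re.compile(r"\((?:IN|in) dbSNP:(\d+)\)")


def dbsnp_accessions(ft_lines):
    """Return the dbSNP accession numbers found in a list of FT lines."""
    # Join the lines first, in case a description wraps over several lines.
    text = " ".join(
        line.partition(" ")[2].strip()
        for line in ft_lines
        if line.startswith("FT")
    )
    return DBSNP_LINK.findall(text)


example = [
    "FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).",
    "FT                                /FTId=VAR_012009.",
]
print(dbsnp_accessions(example))
```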
2.10 Feature key 'SIMILAR' became obsolete
The feature key 'SIMILAR' was used to describe the extent of a similarity
with another protein sequence. Nowadays, most domains with similarity to
other proteins are known regions described in domain and family databases,
which are annotated in Swiss-Prot with the feature key 'DOMAIN' or 'REPEAT'
and the comment (CC) line topic 'SIMILARITY'; thus the feature key
'SIMILAR' became obsolete and will not be used again.
2.11 Version of SP in XML format
A distribution version of Swiss-Prot and TrEMBL in XML format is being
developed. The first draft of the XML specification was released for public
review on February 21, 2002.
For more information see http://www.ebi.ac.uk/swissprot/SP-ML/.
Please send comments and suggestions by electronic mail to sp-ml@ebi.ac.uk.
3 Forthcoming changes
Please note that these are the last release notes in this format. In
the future, forthcoming changes and recent modifications will also be
announced to users between major Swiss-Prot releases. The distinct
sections of this document will move to the following sites:
* 2. Description of the changes made to Swiss-Prot since the last
release: http://www.expasy.org/sprot/relnotes/sp_news.html. This new
document contains all recent modifications in Swiss-Prot, including
minor changes with no impact on the work of software developers.
It therefore contains more information than the document
'sp_soon.html' (see below).
* 3. Forthcoming changes:
http://www.expasy.org/sprot/relnotes/sp_soon.html. All
modifications that have an impact on the Swiss-Prot format are
announced in this document.
* 4. Status of the documentation files:
http://www.expasy.org/sprot/userman.html#documentation
* 5. The ExPASy World-Wide Web server:
o Explicit general and continuously updated documentation:
http://www.expasy.org/doc/expasy.pdf
o History of changes, improvements and new features:
http://www.expasy.org/history.html
o Swiss-Flash, a service that reports news of databases, software
and service developments: http://www.expasy.org/swiss-flash/
* 6. TrEMBL - a supplement to Swiss-Prot:
ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt
* 7. FTP access to Swiss-Prot and TrEMBL:
http://www.expasy.org/sprot/userman.html#ftp and
http://www.expasy.org/sprot/download.html
* 8. ENZYME and PROSITE: ENZYME release notes (not yet available) and
http://www.expasy.org/prosite/psrelnot.html
* Appendix A (Release statistics):
http://www.expasy.org/sprot/relnotes/relstat.html
* Appendix B (Relationships between Swiss-Prot and some biomolecular
databases): http://www.expasy.org/sprot/userman.html#relship
3.1 Extension of the entry name format
We endeavor to assign meaningful entry names that facilitate the
identification of the proteins and the species of origin. Currently the
entry name consists of up to ten uppercase alphanumeric characters.
Swiss-Prot uses a general purpose naming convention that can be symbolized
as X_Y, where X is a mnemonic code of at most 4 alphanumeric characters
representing the protein name, the '_' sign serves as a separator, and the
Y is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the protein.
We are planning to extend the mnemonic code for the protein name from a
maximum of 4 characters to a maximum of 5 characters. For example, the
mnemonic code for the
meiotic recombination protein rec10 is currently 'RE10'. After the
introduction of extended entry names it could be modified to the 5-letter
code 'REC10'.
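The effect of the extension on programs that validate entry names can be
sketched as follows in Python. The patterns encode only what is stated
above (uppercase alphanumeric characters, at most 4 or 5 for the protein
part, at most 5 for the species part). The minimum length of 1 and the
species code 'SCHPO' used in the usage example are our assumptions for
illustration.

```python
import re


def entry_name_pattern(max_protein_chars):
    """Build a pattern for X_Y with X limited to max_protein_chars."""
    return re.compile(
        rf"[A-Z0-9]{{1,{max_protein_chars}}}_[A-Z0-9]{{1,5}}"
    )


CURRENT_FORMAT = entry_name_pattern(4)   # entry names as of release 41
EXTENDED_FORMAT = entry_name_pattern(5)  # planned extension


def is_valid(name, pattern):
    """True if the whole entry name matches the given pattern."""
    return pattern.fullmatch(name) is not None


for name in ("RE10_SCHPO", "REC10_SCHPO"):
    print(name, is_valid(name, CURRENT_FORMAT), is_valid(name, EXTENDED_FORMAT))
```

A program that hard-codes the 4-character limit would reject names such
as 'REC10_SCHPO' once the extension is introduced.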
3.2 Continuation of the conversion of Swiss-Prot to mixed-case
characters
We will continue to convert Swiss-Prot entries from all 'UPPER CASE' to
'MiXeD CaSe'. We are proceeding with the conversion of the CC (Comment)
lines and will start to convert the GN (Gene Name) lines to mixed case;
any other line type might also be affected.
3.3 Reference Comment (RC) line topics may span lines
The RC (Reference Comment) lines store comments relevant to the reference
cited, currently in 5 distinct topics: PLASMID, SPECIES, STRAIN, TISSUE and
TRANSPOSON. It is not always possible to list all information within one
line. Therefore we will allow multiple RC lines, in which a single topic
may span more than one line. Example:
RC STRAIN=Various strains;
could become
RC STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;
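Programs that currently read each RC line on its own will need to join
the lines of a reference before splitting them into topics. A minimal
Python sketch, assuming that topic values contain no semicolons and that
the line code is followed by blanks (the function name is hypothetical):

```python
def parse_rc_lines(rc_lines):
    """Join the RC lines of one reference and return {topic: value}."""
    text = " ".join(line.partition(" ")[2].strip() for line in rc_lines)
    topics = {}
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue  # nothing follows the final ';'
        name, _, value = token.partition("=")
        topics[name.strip()] = value.strip()
    return topics


example = [
    "RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,",
    "RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,",
    "RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;",
]
topics = parse_rc_lines(example)
print(topics["STRAIN"])
```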
3.4 New format of comment line (CC) topics
We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while remaining human-readable). We are therefore standardizing
the format of the topics.
3.4.1 ALTERNATIVE PRODUCTS
We are gradually restructuring the CC (comment) line topic ALTERNATIVE
PRODUCTS and introducing unique identifiers for each described isoform.
The qualifiers that will be introduced are described in the table below:
Topic           Description
Event           Biological process that results in the production of
                the alternative forms (Alternative promoter,
                Alternative splicing, Alternative initiation).
                Format: Event=controlled vocabulary;
                Example: Event=Alternative splicing;
Named isoforms  Number of isoforms listed in the topics 'Name' below
                the topic 'Event=Alternative splicing'.
                Format: Named isoforms=number;
                Example: Named isoforms=6;
Comment         Any comments concerning one or more isoforms; optional;
                may be longer than 1 line.
                Format: Comment=free text;
                Example: Comment=Experimental confirmation may be
                lacking for some isoforms;
Name            A common name for an isoform used in the literature or
                assigned by Swiss-Prot (currently only available for
                spliced isoforms).
                Format: Name=common name;
                Example: Name=Alpha;
Synonyms        Synonyms for an isoform as used in the literature;
                optional.
                Format: Synonyms=synonym_1[, synonym_n];
                Example: Synonyms=B, KL5;
IsoId           Unique identifier for an isoform, consisting of the
                Swiss-Prot accession number, followed by a dash and an
                identifier for this isoform.
                Format: IsoId=acc#-isoform_number[,acc#-isoform_number];
                Example: IsoId=P05067-1;
Sequence        Lists all FT VARSPLIC identifiers (VSP_#) that are
                needed to build the sequence for a specific isoform. If
                the accession number of the IsoId does not correspond
                to the accession number of the current entry, this
                topic contains the term 'External'.
                Format: Sequence=VSP_#[,VSP_#]|Displayed|External|Not described;
                Example: Sequence=Displayed;
                Example: Sequence=VSP_000013, VSP_000014;
Note            Notes concerning the current isoform; optional.
                Format: Note=free text;
                Example: Note=Predicted;
In the case of 'Alternative initiation' the topic 'Event' can be followed
by a 'Comment' of free text. Format:
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative initiation;
CC Comment=Optional free text with information on alternative
CC initiation or the products retrieved from this event. In the
CC case of alternative initiation there will be no other topics;
In the case of 'Alternative splicing' the topic 'Event' can be followed by
a 'Comment' of free text and a listing of all described isoforms. Format:
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing;
CC Comment=Optional free text with information on alternative
CC splicing or the products retrieved from this event;
CC Name=isoform_1; Synonyms=synonym_1[, synonym_n];
CC IsoId=isoform_identifier_1[, isoform_identifier_n];
CC Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC Note=Optional note concerning isoform_1;
CC Name=isoform_n; Synonyms=synonym_1[, synonym_n];
CC IsoId=isoform_identifier_1[, isoform_identifier_n];
CC Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC Note=Optional note concerning isoform_n;
Example of the new format of the CC lines and the corresponding FT lines
for an entry with alternative splicing:
...
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=9;
CC Comment=Additional isoforms seem to exist. APP695, APP751 and
CC APP770 are the major isoforms. The L-isoforms are referred to as
CC appicans. Experimental confirmation may be lacking for some
CC isoforms;
CC Name=APP770; Synonyms=Prea4 770;
CC IsoId=P05067-1; Sequence=Displayed;
CC Name=APP305;
CC IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;
CC Name=L-APP677;
CC IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;
CC Name=APP695; Synonyms=Prea4 695;
CC IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;
CC Name=L-APP696;
CC IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;
CC Name=APP714;
CC IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;
CC Name=L-APP733;
CC IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;
CC Name=APP751; Synonyms=Prea4 751;
CC IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;
CC Name=L-APP752;
CC IsoId=P05067-9; Sequence=VSP_000009;
...
FT VARSPLIC 289 289 E -> V (in isoform APP695, isoform
FT L-APP696, isoform L-APP677 and isoform
FT APP714).
FT /FTId=VSP_000002.
FT VARSPLIC 290 345 Missing (in isoform L-APP696 and isoform
FT APP714).
FT /FTId=VSP_000003.
FT VARSPLIC 290 364 Missing (in isoform APP695 and isoform
FT L-APP677).
FT /FTId=VSP_000004.
FT VARSPLIC 290 305 VCSEQAETGPCRAMIS -> KWYKEVHSGQARWLML (in
FT isoform APP305).
FT /FTId=VSP_000005.
FT VARSPLIC 306 770 Missing (in isoform APP305).
FT /FTId=VSP_000006.
FT VARSPLIC 345 345 M -> I (in isoform L-APP733 and isoform
FT APP751).
FT /FTId=VSP_000007.
FT VARSPLIC 346 364 Missing (in isoform L-APP733 and isoform
FT APP751).
FT /FTId=VSP_000008.
FT VARSPLIC 637 654 Missing (in isoform L-APP677, isoform
FT L-APP696, isoform L-APP733 and isoform
FT L-APP752).
FT /FTId=VSP_000009.
...
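Because every qualifier has the form 'key=value;', the structured block
can be read by a program without free-text heuristics. The Python sketch
below parses the CC lines of the example above into a dictionary. It is a
simplified illustration, not the parser used by Swiss-Prot: it assumes
that no value contains a semicolon, that the line code is followed by
blanks, and that each 'Name' qualifier starts a new isoform. The function
name is ours.

```python
def parse_alternative_products(cc_lines):
    """Parse a structured ALTERNATIVE PRODUCTS block into a dictionary."""
    # Drop the 'CC' line codes and join the continuation lines.
    text = " ".join(line.partition(" ")[2].strip() for line in cc_lines)
    _, _, text = text.partition("ALTERNATIVE PRODUCTS:")
    result = {"isoforms": []}
    for token in text.split(";"):
        key, sep, value = token.strip().partition("=")
        if not sep:
            continue  # nothing follows the final ';'
        key, value = key.strip(), value.strip()
        if key == "Name":
            result["isoforms"].append({"Name": value})
        elif result["isoforms"]:
            isoform = result["isoforms"][-1]
            if key in ("Synonyms", "IsoId", "Sequence"):
                isoform[key] = [item.strip() for item in value.split(",")]
            else:
                isoform[key] = value
        else:
            # Qualifiers before the first 'Name' apply to the whole block.
            result[key] = value
    return result


example_block = [
    "CC   -!- ALTERNATIVE PRODUCTS:",
    "CC       Event=Alternative splicing; Named isoforms=9;",
    "CC       Comment=Additional isoforms seem to exist. APP695, APP751 and",
    "CC       APP770 are the major isoforms. The L-isoforms are referred to as",
    "CC       appicans. Experimental confirmation may be lacking for some",
    "CC       isoforms;",
    "CC       Name=APP770; Synonyms=Prea4 770;",
    "CC         IsoId=P05067-1; Sequence=Displayed;",
    "CC       Name=APP305;",
    "CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;",
    "CC       Name=L-APP677;",
    "CC         IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;",
    "CC       Name=APP695; Synonyms=Prea4 695;",
    "CC         IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;",
    "CC       Name=L-APP696;",
    "CC         IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;",
    "CC       Name=APP714;",
    "CC         IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;",
    "CC       Name=L-APP733;",
    "CC         IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;",
    "CC       Name=APP751; Synonyms=Prea4 751;",
    "CC         IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;",
    "CC       Name=L-APP752;",
    "CC         IsoId=P05067-9; Sequence=VSP_000009;",
]
parsed = parse_alternative_products(example_block)
print(parsed["Event"], len(parsed["isoforms"]))
```

The list stored under 'Sequence' gives the FTId values of the FT VARSPLIC
features that have to be applied to the displayed sequence in order to
build the isoform.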
3.4.2 PATHWAY
We are gradually structuring the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:
CC -!- PATHWAY: biochemical pathway; nth step.[ Comment.]
Example:
CC -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.
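A line in this format can be split into its parts as sketched below in
Python. The sketch assumes that the pathway name contains no semicolon
and that the step description contains no period; the function name is
hypothetical.

```python
def parse_pathway(cc_line):
    """Return (pathway, step, comment) from a structured PATHWAY line."""
    _, _, text = cc_line.partition("PATHWAY:")
    pathway, _, rest = text.strip().partition(";")
    step, _, comment = rest.strip().partition(".")
    # The comment is optional; an empty string is reported as None.
    return pathway.strip(), step.strip(), comment.strip() or None


print(parse_pathway("CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step."))
```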
3.4.3 COFACTOR
The comment line topic COFACTOR is gradually being modified to the
following format:
CC -!- COFACTOR: cofactor1[, cofactor2 and cofactor3].[ Comment.]
Examples:
CC -!- COFACTOR: Magnesium.
CC -!- COFACTOR: Copper, Manganese and Nickel.
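The list of cofactors can be recovered from such a line as sketched below
in Python. The sketch assumes that cofactor names contain neither a
comma, a period, nor the word ' and '; the function name is hypothetical.

```python
import re


def parse_cofactor(cc_line):
    """Return (list of cofactors, optional comment) from a COFACTOR line."""
    _, _, text = cc_line.partition("COFACTOR:")
    names, _, comment = text.strip().partition(".")
    # Cofactors are separated by commas, the last one by ' and '.
    cofactors = [n.strip() for n in re.split(r",| and ", names) if n.strip()]
    return cofactors, comment.strip() or None


print(parse_cofactor("CC   -!- COFACTOR: Magnesium."))
print(parse_cofactor("CC   -!- COFACTOR: Copper, Manganese and Nickel."))
```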
3.5 Changes concerning cross-references (DR line)
We will add cross-references to the Gene Ontology (GO) database (available
at http://www.geneontology.org/), which provides controlled vocabularies
for the description of the molecular function, biological process and
cellular component of gene products.
The identifiers of the appropriate DR line are:
Data bank identifier: GO
Primary identifier: GO's unique identifier for a GO term.
Secondary identifier: A 1-letter abbreviation for one of the 3 ontology
aspects, separated from the GO term by a colon. If
the term is longer than 45 characters, the first 43
characters are indicated followed by 3 dots ('...').
The abbreviations for the 3 distinct aspects of the
ontology are P (biological Process), F (molecular
Function) and C (cellular Component).
Tertiary identifier: 3-character GO evidence code.
Example: DR GO; GO:0003677; F:DNA binding; TAS.
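The secondary identifier therefore carries two pieces of information, the
ontology aspect and the (possibly truncated) term. The Python sketch below
separates them. It assumes that GO terms contain no semicolons and that a
trailing '...' always marks a truncated term; the function name is ours.

```python
ASPECTS = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}


def parse_go_dr_line(line):
    """Split a 'DR GO' line into identifier, aspect, term and evidence."""
    code, _, body = line.partition(" ")
    if code != "DR":
        raise ValueError(f"not a DR line: {line!r}")
    body = body.strip()
    # Remove only the final period that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] != "GO":
        raise ValueError(f"not a GO cross-reference: {line!r}")
    _, go_id, secondary, evidence = fields
    aspect, _, term = secondary.partition(":")
    return {
        "id": go_id,
        "aspect": ASPECTS[aspect],
        "term": term,
        "truncated": term.endswith("..."),
        "evidence": evidence,
    }


print(parse_go_dr_line("DR   GO; GO:0003677; F:DNA binding; TAS."))
```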
3.6 Modifications concerning the feature table (FT line)
We are investing a major effort in the annotation of posttranslational
modifications, which affects various feature keys and feature
descriptions. Major format changes are described below.
3.6.1 New feature key 'CROSSLNK'
The feature key 'CROSSLNK' will be introduced to describe bonds between
amino acids, which are formed posttranslationally within a peptide or
between peptides, such as isopeptidic bonds, carbon-carbon linkages,
carbon-nitrogen linkages and backbone condensations. It will also include
the description of thioether bonds and thiolester bonds; the feature
keys 'THIOETH' and 'THIOLEST' will therefore be removed.
Note: Disulfide bonds occur so often in proteins that we will keep the
special feature key 'DISULFID' to describe this kind of linkage.
Format:
FT CROSSLNK from to Description.
3.6.2 Removal of the feature key 'THIOETH'
See section 3.6.1.
3.6.3 Removal of the feature key 'THIOLEST'
See section 3.6.1.
4 Status of the documentation files
Swiss-Prot is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indexes for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header in many documentation files has changed. The following table lists
all the documents that are currently available.
See also section 7.3 for information on how to access updated versions of
all documents between major releases.
userman.txt User manual
relnotes.txt Release notes for the current release (41)
shortdes.txt Short description of entries in Swiss-Prot
jourlist.txt List of cited journals
keywlist.txt List of keywords
plasmid.txt List of plasmids
speclist.txt List of organism (species) identification codes
tisslist.txt List of tissues
experts.txt List of on-line experts for PROSITE and Swiss-Prot
dbxref.txt List of databases cross-referenced in Swiss-Prot
submit.txt Submission of sequence data to Swiss-Prot
acindex.txt Accession number index
autindex.txt Author index
citindex.txt Citation index
keyindex.txt Keyword index
speindex.txt Species index
deleteac.txt Deleted accession number index
7tmrlist.txt List of 7-transmembrane G-linked receptor entries
aatrnasy.txt List of aminoacyl-tRNA synthetases
allergen.txt Nomenclature and index of allergen sequences
annbioch.txt Swiss-Prot annotation: how is biochemical information
assigned to sequence entries
arath.txt Index of Arabidopsis thaliana entries and their
corresponding gene designations [see 2]
bacsu.txt Index of Bacillus subtilis strain 168 chromosomal entries
and their corresponding SubtiList cross-references [see 1]
bloodgrp.txt Blood group antigen proteins
bucai.txt Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum)
entries [see 2]
bucap.txt Index of Buchnera aphidicola (subsp. Schizaphis graminum)
entries [see 2]
calbican.txt Index of Candida albicans entries and their corresponding
gene designations
cdlist.txt CD nomenclature for surface proteins of human leucocytes
celegans.txt Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep cross-references
dicty.txt Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB cross-references
ec2dtosp.txt Index of Escherichia coli Gene-protein database
(ECO2DBASE) entries referenced in Swiss-Prot
ecoli.txt Index of Escherichia coli strain K12 chromosomal entries
and their corresponding EcoGene cross-references
embltosp.txt Index of EMBL Nucleotide Sequence Database entries
referenced in Swiss-Prot
extradom.txt Nomenclature of extracellular domains
fly.txt Index of Drosophila entries and their corresponding
FlyBase cross-references
glycosid.txt Classification of glycosyl hydrolase families and index of
glycosyl hydrolase entries in Swiss-Prot
haein.txt Index of Haemophilus influenzae strain Rd chromosomal
entries [see 1]
helpy.txt Index of Helicobacter pylori strain 26695 chromosomal
entries [see 1]
hoxlist.txt Vertebrate homeotic Hox proteins: nomenclature and index
humchr01.txt Index of proteins encoded on human chromosome 1
humchr02.txt Index of proteins encoded on human chromosome 2
humchr03.txt Index of proteins encoded on human chromosome 3
humchr04.txt Index of proteins encoded on human chromosome 4
humchr05.txt Index of proteins encoded on human chromosome 5
humchr06.txt Index of proteins encoded on human chromosome 6
humchr07.txt Index of proteins encoded on human chromosome 7
humchr08.txt Index of proteins encoded on human chromosome 8
humchr09.txt Index of proteins encoded on human chromosome 9
humchr10.txt Index of proteins encoded on human chromosome 10
humchr11.txt Index of proteins encoded on human chromosome 11
humchr12.txt Index of proteins encoded on human chromosome 12
humchr13.txt Index of proteins encoded on human chromosome 13
humchr14.txt Index of proteins encoded on human chromosome 14
humchr15.txt Index of proteins encoded on human chromosome 15
humchr16.txt Index of proteins encoded on human chromosome 16
humchr17.txt Index of proteins encoded on human chromosome 17
humchr18.txt Index of proteins encoded on human chromosome 18
humchr19.txt Index of proteins encoded on human chromosome 19
humchr20.txt Index of proteins encoded on human chromosome 20
humchr21.txt Index of proteins encoded on human chromosome 21
humchr22.txt Index of proteins encoded on human chromosome 22
humchrx.txt Index of proteins encoded on human chromosome X
humchry.txt Index of proteins encoded on human chromosome Y
humpvar.txt Index of human proteins with sequence variants
initfact.txt List and index of translation initiation factors
intein.txt Index of intein-containing entries referenced in
Swiss-Prot
metallo.txt Classification of metallothioneins and index of the
entries in Swiss-Prot
metja.txt Index of Methanococcus jannaschii entries [see 1]
mgdtosp.txt Index of MGD entries referenced in Swiss-Prot
mimtosp.txt Index of MIM entries referenced in Swiss-Prot
mycge.txt Index of Mycoplasma genitalium strain G-37 chromosomal
entries [see 1]
mycpn.txt Index of Mycoplasma pneumoniae strain M129 chromosomal
entries [see 2]
ngr234.txt Table of predicted proteins in Rhizobium plasmid pNGR234a
nomlist.txt List of nomenclature related references for proteins
pdbtosp.txt Index of Protein Data Bank (PDB) entries referenced in
Swiss-Prot
peptidas.txt Classification of peptidase families and index of
peptidase entries in Swiss-Prot
plastid.txt List of chloroplast and cyanelle encoded proteins
pombe.txt Index of Schizosaccharomyces pombe entries and their
corresponding gene designations
restric.txt List of restriction enzyme and methylase entries
ribosomp.txt Index of ribosomal proteins classified by families on the
basis of sequence similarities
ricpr.txt Index of Rickettsia prowazekii strain Madrid E entries
[see 1]
salty.txt Index of Salmonella typhimurium strain LT2 chromosomal
entries and their corresponding StyGene cross-references
syny3.txt Index of Synechocystis sp. strain PCC 6803 entries [see 1]
upflist.txt List of UPF (Uncharacterized Protein Families) and index
of members
yeast.txt Index of Saccharomyces cerevisiae entries in Swiss-Prot
and their corresponding gene designations
yeast1.txt Yeast chromosome I entries
yeast2.txt Yeast chromosome II entries
yeast3.txt Yeast chromosome III entries
yeast5.txt Yeast chromosome V entries
yeast6.txt Yeast chromosome VI entries
yeast7.txt Yeast chromosome VII entries
yeast8.txt Yeast chromosome VIII entries
yeast9.txt Yeast chromosome IX entries
yeast10.txt Yeast chromosome X entries
yeast11.txt Yeast chromosome XI entries
yeast13.txt Yeast chromosome XIII entries
yeast14.txt Yeast chromosome XIV entries
Notes:
1) The indexes of microbe-specific entries have been renamed; each
filename is now composed of the 5-letter code used for the species in
the Swiss-Prot entry name and the extension 'txt'.
This modification concerns the following files:
'bacsu.txt' (formerly: 'subtilis.txt'), 'haein.txt' (formerly:
'haeinflu.txt'), 'helpy.txt' (formerly: 'hpylori.txt'), 'metja.txt'
(formerly: 'mjannasc.txt'), 'mycge.txt' (formerly: 'mgenital.txt'),
'ricpr.txt' (formerly: 'rprowaze.txt'), 'syny3.txt' (formerly:
'pcc6803.txt').
2) The files 'arath.txt', 'bucai.txt', 'bucap.txt' and 'mycpn.txt' are
new documents introduced since release 40.
We continue to include, in some Swiss-Prot documentation files, references
to Web sites relevant to the subject under consideration. There are now 89
documents that include such links.
5 New features of the ExPASy World-Wide Web server related to
Swiss-Prot
General, continuously updated documentation about the ExPASy server is
available at http://www.expasy.org/doc/expasy.pdf.
ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:
* Read the document 'History of changes, improvements and new features'
which is available at the address: http://www.expasy.org/history.html
* Subscribe to Swiss-Flash, a service that reports news of databases,
software and service developments. By subscribing to this service, you
will automatically get Swiss-Flash bulletins by electronic mail. To
subscribe, use the address: http://www.expasy.org/swiss-flash/.
Among all the improvements and the new features introduced since the last
Swiss-Prot release, here are those that we believe are specifically useful
to Swiss-Prot users:
1. The NiceProt view of Swiss-Prot has been further improved: access to
documentation has been facilitated by adding "mouse-over" hypertext
links from various sections in NiceProt to the corresponding
information in the user manual. Those hypertext links, which give
access to documentation rather than the data related to the protein
entry, are visually different from the ordinary hyperlinks. While they
are not immediately recognizable as such, the user can see that they
are clickable by moving the mouse pointer over the section headings
such as "References" or "Keywords". A short description of the linked
information appears at the bottom of the web browser, and when
clicked, a small additional window is opened with related information
extracted from the user manual.
Similarly, in the "Cross-references" section, the names of the
databases to which an entry is cross-referenced are linked to the
corresponding sections in the document dbxref.txt (List of databases
cross-referenced in Swiss-Prot).
2. Implicit links have been added to the resources AraC-XylS, Ensembl and
ModBase. We have removed the implicit links to DOMO, which is no
longer maintained.
For more details on Swiss-Prot cross-references, implicit and explicit
links, you can read:
Gasteiger E., Jung E., Bairoch A.
Swiss-Prot: connecting biological knowledge via a protein database.
Curr. Issues Mol. Biol. 3:47-55(2001)
3. A few improvements have been applied to the pages describing the Human
Proteomics Initiative (HPI). For each human chromosome a link is
provided to the corresponding index of Swiss-Prot entries, to relevant
information in the EBI Proteome database, in Ensembl, in the Human
Genome Resources at NCBI and in euGenes at Indiana University.
The HPI status report has been modified to include, for each of the
counted items (e.g. splice variants, variants, references) not only
the absolute number, but also the maximal and average number of
occurrences per entry, and the number of entries concerned by the
counted item.
6 TrEMBL - a supplement to Swiss-Prot
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
Swiss-Prot. Since we do not want to dilute the quality standards of
Swiss-Prot by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as quickly as possible, we introduced in 1996 a
computer-annotated supplement to Swiss-Prot. This supplement consists of entries in
Swiss-Prot-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except those already
included in Swiss-Prot.
This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of Swiss-Prot. This Swiss-Prot release
is supplemented by TrEMBL release 23.
TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'/databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS
servers. It is distributed with its own set of release notes.
7 FTP access to Swiss-Prot and TrEMBL
7.1 Generalities
Swiss-Prot is available for download on the following anonymous FTP
servers:
Organization   Swiss Institute of Bioinformatics (SIB)
Address        ftp.expasy.org, au.expasy.org, bo.expasy.org,
               ca.expasy.org, cn.expasy.org, kr.expasy.org,
               tw.expasy.org, us.expasy.org
Directory      /databases/swiss-prot/

Organization   European Bioinformatics Institute (EBI)
Address        ftp.ebi.ac.uk
Directory      /pub/databases/swissprot/
7.2 Non-redundant database
On the ExPASy and EBI FTP servers we distribute files that make up a
non-redundant and complete protein sequence database consisting of three
components:
1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as
TrEMBL_New)
Every week three files are completely rebuilt. These files are named:
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
produce ASCII files in Swiss-Prot format.
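As an illustration only, the following Python sketch streams one of these gzip-compressed files without decompressing it to disk and counts the entries it contains. It assumes that each entry in Swiss-Prot format ends with a line consisting of '//'; the function name count_entries is hypothetical and not part of any distributed tool.

```python
import gzip


def count_entries(path):
    """Count the entries in a gzip-compressed file in Swiss-Prot format.

    Assumption: every entry is terminated by a line containing only '//'.
    """
    count = 0
    # Read as text; the decompressed files are described as ASCII.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.rstrip() == "//":
                count += 1
    return count
```

For example, count_entries("sprot.dat.gz") would be expected to return the number of entries in the downloaded Swiss-Prot file.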
Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this stripped-down
format.
The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).
Additional notes:
* The Swiss-Prot file continuously grows as new annotated sequences are
added.
* The TrEMBL file decreases in size as sequences are annotated and moved
out of that section into Swiss-Prot. Four times a year a new release of
TrEMBL is built at EBI; at this point the TrEMBL file increases in size
as it then includes all of the new data (see next item) that has
accumulated since the last release.
* The TrEMBL_New file starts as a very small file and grows in size
until a new release of TrEMBL is available.
* Swiss-Prot and TrEMBL share the same system of accession numbers.
Therefore you will not find any primary accession number duplicated
between the two sections. A TrEMBL entry (and its associated accession
number(s)) can either move to Swiss-Prot as a new entry or be merged
with an existing Swiss-Prot entry. In the latter case, the accession
number(s) of that TrEMBL entry are added to those of the Swiss-Prot
entry.
* TrEMBL_New does not have real accession numbers. However it was
necessary to have an 'AC' line so as to be able to use it with
different software products. This AC line contains a temporary
identifier which consists of the protein_ID (protein sequence
identifier) of the coding sequence in the parent nucleotide sequence.
* TrEMBL_New is quite messy! You will of course find new sequence
entries but you will also encounter sequences that are going to be
used to update existing TrEMBL or Swiss-Prot entries. None of the
"cleaning" steps that are applied to produce a TrEMBL release are run
on TrEMBL_New nor are any of the computer-annotation software tools
that are used to enhance the information content of TrEMBL. TrEMBL_New
is provided only so that users can be sure not to miss any important
new sequences when they run similarity searches.
* While these three files allow you to build what we call a
'non-redundant' database, it must be noted that this description is
not entirely accurate. Without going into a long explanation, we can
say that this is currently the best attempt at providing a complete
selection of protein sequence entries while trying to eliminate
redundancies. While Swiss-Prot is completely (well, 99.994%!)
non-redundant, TrEMBL is far from being non-redundant and the
combination of Swiss-Prot + TrEMBL is even less so.
* To describe to your users the version of the non-redundant database
that you are providing them with, you should use a statement of the
form:
Swiss-Prot release 41.x of xx-yyy-2003;
TrEMBL release 23.x of xx-yyy-2003;
TrEMBL_New of xx-yyy-2003.
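The statement above that no primary accession number is duplicated between Swiss-Prot and TrEMBL can be checked locally. The sketch below is a minimal illustration, assuming that AC lines start with 'AC' followed by three spaces, that accession numbers on them are terminated by semicolons, that the first one listed in an entry is the primary accession number, and that entries end with a '//' line. The function names are hypothetical.

```python
def primary_accessions(lines):
    """Yield the primary (first-listed) accession number of each entry."""
    seen_ac = False  # True once the first AC line of the current entry is read
    for line in lines:
        if line.startswith("AC   ") and not seen_ac:
            # Assumption: accession numbers are semicolon-terminated.
            yield line[5:].split(";")[0].strip()
            seen_ac = True
        elif line.rstrip() == "//":
            seen_ac = False  # entry terminator: reset for the next entry


def shared_primary_accessions(lines_a, lines_b):
    """Return primary accession numbers that occur in both inputs."""
    return set(primary_accessions(lines_a)) & set(primary_accessions(lines_b))
```

Applied to the decompressed Swiss-Prot and TrEMBL files, the returned set would be expected to be empty. TrEMBL_New is a different case, since its AC lines carry temporary identifiers rather than real accession numbers.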
7.3 Weekly updates of Swiss-Prot documents
Until now, the ExPASy FTP server only provided the Swiss-Prot documents
and indexes in their versions at the time of the last full release. All
documents are now updated with every weekly release of Swiss-Prot. They
are available for FTP download from the directory
/databases/swiss-prot/updated_doc/.
7.4 Weekly updates of Swiss-Prot
Weekly updates of Swiss-Prot are available by anonymous FTP. Three files
are generated at each update:
new_seq.dat Contains all the new entries since the last full
release;
upd_seq.dat Contains the entries for which the sequence data has
been updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Important notes
* Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks may elapse
between two updates.
* Instead of using the above files, you can, every week, download an
updated copy of the Swiss-Prot database. This file is available in the
directory containing the non-redundant database (see section 7.2).
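The following Python sketch shows one possible way to combine a local copy of the last full release with the three update files. It rests on several assumptions: that the update files are cumulative with respect to the last full release, as the descriptions above suggest, so that they are applied to the release copy rather than to a previously updated copy; that entries can be keyed by their primary accession number; and that entries end with a '//' line. Deleted or merged entries are not handled. All function names are hypothetical.

```python
def read_entries(lines):
    """Map primary accession number -> list of lines, for each entry."""
    entries, current, accession = {}, [], None
    for line in lines:
        current.append(line)
        # Assumption: the first accession on the first AC line is primary.
        if accession is None and line.startswith("AC   "):
            accession = line[5:].split(";")[0].strip()
        if line.rstrip() == "//":  # end of entry
            if accession is not None:
                entries[accession] = current
            current, accession = [], None
    return entries


def apply_weekly_update(release, new_seq, upd_seq, upd_ann):
    """Return the release entries with new and updated entries applied."""
    merged = dict(read_entries(release))
    for update in (new_seq, upd_seq, upd_ann):
        # Entries from the update files replace or extend the release copy.
        merged.update(read_entries(update))
    return merged
```

Downloading the complete weekly file, as described in the second note above, avoids this merging step altogether.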
8 ENZYME and PROSITE
8.1 The ENZYME nomenclature database
Release 30.0 of the ENZYME nomenclature database is distributed with
release 41 of Swiss-Prot. ENZYME release 30.0 contains information relating
to 4'136 enzymes. In this release, we have added a significant number of
new entries and have also updated many existing entries.
8.2 The PROSITE database
PROSITE now comes with its own release notes.
9 We need your help!
We welcome feedback from our users. We would especially appreciate your
notifying us if you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be notified
about annotations to be updated, if, for example, the function of a protein
has been clarified or if new information about post-translational
modifications has become available. To facilitate this feedback we offer,
on the ExPASy WWW server, a form that allows the submission of updates
and/or corrections to Swiss-Prot:
http://www.expasy.org/sprot/update.html
It is also possible, from any entry in Swiss-Prot displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the address:
swiss-prot@expasy.org
Note that all update requests are assigned a unique identifier of the form
UR-Xnnnn (example: UR-A0123). This identifier is used internally by the
Swiss-Prot staff at SIB and EBI to track requests and is also used in
e-mail exchanges with the persons who have submitted a request.
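For users who keep track of their own requests, the identifier form can be checked with a short pattern. This is only a sketch: it assumes, from the example UR-A0123, that 'X' stands for a single uppercase letter and 'nnnn' for exactly four digits. The names below are hypothetical.

```python
import re

# Assumed reading of 'UR-Xnnnn': one uppercase letter, then four digits.
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """Return True if text has the assumed form of an update request ID."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None
```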
APPENDIX A: Some statistics
A.1 Amino acid composition
A.1.1 Composition in percent for the complete database
Ala (A) 7.72 Gln (Q) 3.92 Leu (L) 9.56 Ser (S) 6.98
Arg (R) 5.24 Glu (E) 6.54 Lys (K) 5.96 Thr (T) 5.51
Asn (N) 4.28 Gly (G) 6.90 Met (M) 2.36 Trp (W) 1.18
Asp (D) 5.27 His (H) 2.26 Phe (F) 4.06 Tyr (Y) 3.13
Cys (C) 1.60 Ile (I) 5.88 Pro (P) 4.88 Val (V) 6.66
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
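The classification in A.1.2 follows directly from the percentages in A.1.1. The short Python sketch below, using only the values of the twenty standard amino acids from that table, reproduces the ordering; the variable names are illustrative.

```python
# Composition in percent, copied from table A.1.1.
COMPOSITION = {
    "Ala": 7.72, "Gln": 3.92, "Leu": 9.56, "Ser": 6.98,
    "Arg": 5.24, "Glu": 6.54, "Lys": 5.96, "Thr": 5.51,
    "Asn": 4.28, "Gly": 6.90, "Met": 2.36, "Trp": 1.18,
    "Asp": 5.27, "His": 2.26, "Phe": 4.06, "Tyr": 3.13,
    "Cys": 1.60, "Ile": 5.88, "Pro": 4.88, "Val": 6.66,
}

# Sort from the most frequent to the least frequent amino acid.
ranked = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranked))
```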
A.2 Taxonomic origin
Total number of species represented in this release of Swiss-Prot: 7'778
The first twenty species represent 51'656 sequences: 42.1% of the total
number of entries.
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 3679
2x: 1206
3x: 619
4x: 403
5x: 273
6x: 251
7x: 192
8x: 146
9x: 120
10x: 66
11- 20x: 331
21- 50x: 250
51-100x: 84
>100x: 158
A.2.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 9172 Homo sapiens (Human)
2 6169 Mus musculus (Mouse)
3 4892 Saccharomyces cerevisiae (Baker's yeast)
4 4832 Escherichia coli
5 3442 Rattus norvegicus (Rat)
6 2402 Bacillus subtilis
7 2291 Caenorhabditis elegans
8 2116 Schizosaccharomyces pombe (Fission yeast)
9 1952 Arabidopsis thaliana (Mouse-ear cress)
10 1773 Haemophilus influenzae
11 1764 Drosophila melanogaster (Fruit fly)
12 1529 Methanococcus jannaschii
13 1485 Escherichia coli O157:H7
14 1389 Bos taurus (Bovine)
15 1371 Mycobacterium tuberculosis
16 1240 Salmonella typhimurium
17 1062 Gallus gallus (Chicken)
18 942 Shigella flexneri
19 919 Synechocystis sp. (strain PCC 6803)
20 914 Escherichia coli O6
21 876 Archaeoglobus fulgidus
22 839 Pseudomonas aeruginosa
23 838 Xenopus laevis (African clawed frog)
24 822 Sus scrofa (Pig)
25 771 Salmonella typhi
26 716 Aquifex aeolicus
27 704 Oryctolagus cuniculus (Rabbit)
28 687 Mycoplasma pneumoniae
29 670 Rhizobium meliloti (Sinorhizobium meliloti)
30 609 Vibrio cholerae
31 599 Treponema pallidum
32 581 Mycobacterium leprae
33 572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
34 560 Buchnera aphidicola (subsp. Schizaphis graminum)
35 536 Helicobacter pylori (Campylobacter pylori)
36 535 Rickettsia prowazekii
37 524 Yersinia pestis
38 519 Helicobacter pylori J99 (Campylobacter pylori J99)
39 519 Streptomyces coelicolor
40 494 Bacillus halodurans
41 491 Zea mays (Maize)
42 491 Methanobacterium thermoautotrophicum
43 486 Mycoplasma genitalium
44 480 Pasteurella multocida
45 454 Anabaena sp. (strain PCC 7120)
46 432 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
47 419 Thermotoga maritima
48 416 Oryza sativa (Rice)
49 405 Borrelia burgdorferi (Lyme disease spirochete)
50 404 Chlamydia trachomatis
51 403 Rhizobium sp. (strain NGR234)
52 393 Canis familiaris (Dog)
53 391 Chlamydia pneumoniae (Chlamydophila pneumoniae)
54 390 Neisseria meningitidis (serogroup B)
55 386 Neisseria meningitidis (serogroup A)
56 381 Chlamydia muridarum
57 366 Caulobacter crescentus
58 365 Pyrococcus horikoshii
59 359 Listeria monocytogenes
60 359 Clostridium acetobutylicum
61 357 Pyrococcus abyssi
62 354 Ralstonia solanacearum (Pseudomonas solanacearum)
63 352 Listeria innocua
64 352 Rhizobium loti (Mesorhizobium loti)
65 350 Streptococcus pneumoniae
66 346 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
67 341 Nicotiana tabacum (Common tobacco)
68 337 Xylella fastidiosa
69 335 Deinococcus radiodurans
70 332 Ovis aries (Sheep)
71 326 Xanthomonas campestris (pv. campestris)
72 325 Halobacterium sp. (strain NRC-1)
73 320 Staphylococcus aureus (strain N315)
74 320 Campylobacter jejuni
75 317 Staphylococcus aureus (strain Mu50 / ATCC 700699)
76 316 Dictyostelium discoideum (Slime mold)
77 311 Clostridium perfringens
78 299 Sulfolobus solfataricus
79 297 Staphylococcus aureus (strain MW2)
80 290 Corynebacterium glutamicum (Brevibacterium flavum)
81 288 Pisum sativum (Garden pea)
82 287 Xanthomonas axonopodis (pv. citri)
83 285 Streptococcus pyogenes
84 283 Aeropyrum pernix
85 278 Pyrococcus furiosus
86 278 Staphylococcus aureus
87 269 Brucella melitensis
88 268 Bacteriophage T4
89 266 Neurospora crassa
90 265 Triticum aestivum (Wheat)
91 264 Candida albicans (Yeast)
92 261 Rickettsia conorii
93 258 Hordeum vulgare (Barley)
94 254 Vaccinia virus (strain Copenhagen)
95 251 Glycine max (Soybean)
96 250 Lycopersicon esculentum (Tomato)
97 248 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
98 247 Thermoanaerobacter tengcongensis
99 246 Solanum tuberosum (Potato)
100 244 Pseudomonas putida
A.2.3 Taxonomic distribution of the sequences
Kingdom Sequences (% of the database)
Archaea 7119 ( 6%)
Bacteria 46344 ( 38%)
Eukaryota 60623 ( 49%)
Viruses 8478 ( 7%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 9172 ( 15%) ( 7%)
Other Mammalia 16041 ( 26%) ( 13%)
Other Vertebrata 5806 ( 10%) ( 5%)
Viridiplantae 9581 ( 16%) ( 8%)
Fungi 9337 ( 15%) ( 8%)
Insecta 3352 ( 6%) ( 3%)
Nematoda 2504 ( 4%) ( 2%)
Other 4830 ( 8%) ( 4%)
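The percentages in A.2.3 can be recomputed from the sequence counts and the total of 122'564 entries given in the introduction. The sketch below assumes that the figures are rounded to the nearest whole percent; the names are illustrative.

```python
TOTAL_ENTRIES = 122564  # entries in release 41.0

KINGDOMS = {
    "Archaea": 7119,
    "Bacteria": 46344,
    "Eukaryota": 60623,
    "Viruses": 8478,
}


def percent(part, whole):
    """Percentage of whole represented by part, as a whole number."""
    return round(100 * part / whole)


kingdom_percent = {name: percent(n, TOTAL_ENTRIES) for name, n in KINGDOMS.items()}

# Human entries as a share of Eukaryota and of the complete database.
human_in_eukaryota = percent(9172, KINGDOMS["Eukaryota"])
human_in_database = percent(9172, TOTAL_ENTRIES)
```

The four kingdom counts add up to the total number of entries, so every entry is counted exactly once in this table.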
A.3 Sequence size
A.3.1 Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 2283 1001-1100 1127
51- 100 8420 1101-1200 796
101- 150 12542 1201-1300 550
151- 200 11267 1301-1400 379
201- 250 11387 1401-1500 305
251- 300 10019 1501-1600 213
301- 350 10039 1601-1700 166
351- 400 9804 1701-1800 118
401- 450 7435 1801-1900 128
451- 500 6547 1901-2000 106
501- 550 5067 2001-2100 59
551- 600 3400 2101-2200 96
601- 650 2753 2201-2300 99
651- 700 2015 2301-2400 57
701- 750 1766 2401-2500 56
751- 800 1474 >2500 326
801- 850 1101
851- 900 1142
901- 950 817
951-1000 704
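The size classes used in this table are 50 residues wide up to 1000 residues, 100 residues wide from 1001 to 2500, with a single class above that. A minimal Python sketch of that binning is given below; the function name size_class is hypothetical, and the script actually used to build the table is not described in these notes.

```python
def size_class(length):
    """Return the label of the size class that a sequence length falls into."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    # Classes are 50 wide up to 1000 residues and 100 wide above that.
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return f"{low}-{low + width - 1}"
```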
A.3.2 Longest and shortest sequences
The shortest sequence is GRWM_HUMAN (P24272) : 3 amino acids.
The longest sequence is NEBU_HUMAN (P20929) : 6669 amino acids.
A.4 Journal citations
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of Swiss-Prot: 1'316
A.4.1 Table of the frequency of journal citations
Journals cited 1x: 496
2x: 167
3x: 84
4x: 61
5x: 46
6x: 47
7x: 26
8x: 25
9x: 22
10x: 11
11- 20x: 98
21- 50x: 98
51-100x: 39
>100x: 96
A.4.2 List of the most cited journals in Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 9138 Journal of Biological Chemistry
2 5013 Proceedings of the National Academy of Sciences of the U.S.A.
3 3631 Nucleic Acids Research
4 3612 Journal of Bacteriology
5 3381 Gene
6 2663 FEBS Letters
7 2598 Biochemical and Biophysical Research Communications
8 2429 European Journal of Biochemistry
9 2383 Biochemistry
10 2171 The EMBO Journal
11 2045 Nature
12 2024 Biochimica et Biophysica Acta
13 1821 Journal of Molecular Biology
14 1752 Genomics
15 1579 Cell
16 1542 Molecular and Cellular Biology
17 1243 Biochemical Journal
18 1146 Science
19 1123 Plant Molecular Biology
20 1117 Molecular and General Genetics
21 1068 Molecular Microbiology
22 855 Journal of Biochemistry
23 830 Virology
24 748 Human Molecular Genetics
25 693 Journal of Cell Biology
26 645 Nature Genetics
27 597 Journal of Virology
28 588 Plant Physiology
29 582 Human Mutation
30 579 Genes and Development
31 550 Oncogene
32 538 The American Journal of Human Genetics
33 530 Infection and Immunity
34 529 Yeast
35 516 Journal of Immunology
36 494 Journal of General Virology
37 469 Archives of Biochemistry and Biophysics
38 454 Structure
39 446 FEMS Microbiology Letters
40 433 Microbiology
41 394 Development
42 379 Human Genetics
43 376 Current Genetics
44 376 Nature Structural Biology
45 347 Genetics
46 343 Molecular and Biochemical Parasitology
47 335 Blood
48 317 Applied and Environmental Microbiology
49 313 Journal of Clinical Investigation
50 299 Molecular Endocrinology
51 283 DNA and Cell Biology
52 282 Protein Science
53 281 Journal of Molecular Evolution
54 276 Developmental Biology
55 276 Mammalian Genome
56 271 Biological Chemistry Hoppe-Seyler
57 251 Cancer Research
58 248 Journal of Experimental Medicine
59 246 Neuron
60 241 Immunogenetics
61 240 Mechanisms of Development
62 229 Journal of General Microbiology
63 228 Endocrinology
64 221 DNA Sequence
65 217 Acta Crystallographica, Section D
66 213 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
67 209 Molecular Biology of the Cell
68 207 The Plant Cell
69 203 Journal of Cell Science
70 191 Molecular Biology and Evolution
71 190 Brain Research. Molecular Brain Research
72 187 The Plant Journal
73 183 Journal of Neurochemistry
74 180 Journal of Neuroscience
75 160 Comparative Biochemistry and Physiology
76 158 Cytogenetics and Cell Genetics
77 156 DNA
78 154 Bioscience, Biotechnology, and Biochemistry
79 152 The Journal of Clinical Endocrinology and Metabolism
80 145 Toxicon
81 144 Molecular Pharmacology
82 143 Antimicrobial Agents and Chemotherapy
83 140 American Journal of Physiology
84 131 Biochimie
85 127 Bioorganicheskaia Khimiia
86 125 Virus Research
87 125 Proteins
88 122 DNA Research
89 121 Molecular Plant-Microbe Interactions
90 119 Hemoglobin
91 116 Peptides
92 114 Agricultural and Biological Chemistry
93 112 Current Biology
94 111 Journal of Investigative Dermatology
95 110 Molecular and Cellular Endocrinology
96 106 Genome Research
A.5 Statistics for some line types
The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 232571 1.90
Journal 195556 111991 1.60
Submitted to EMBL/GenBank/DDBJ 34500 27873 0.28
Unpublished observations 536 532 <0.01
Submitted to Swiss-Prot 464 462 <0.01
Plant Gene Register 463 453 <0.01
Book citation 460 450 <0.01
Thesis 190 188 <0.01
Submitted to other databases 190 189 <0.01
Unpublished results 123 121 <0.01
Patent 87 86 <0.01
Worm Breeder's Gazette 2 2 <0.01
Comments (CC) 405433 3.31
SIMILARITY 117866 103489 0.96
FUNCTION 77092 75796 0.63
SUBCELLULAR LOCATION 55038 55038 0.45
CATALYTIC ACTIVITY 39528 37138 0.32
SUBUNIT 33846 33846 0.28
PATHWAY 17449 16966 0.14
TISSUE SPECIFICITY 13626 13626 0.11
COFACTOR 12141 12141 0.10
MISCELLANEOUS 7816 7190 0.06
PTM 7140 6571 0.06
ALTERNATIVE PRODUCTS 3946 3946 0.03
INDUCTION 3558 3558 0.03
DOMAIN 3535 3241 0.03
DEVELOPMENTAL STAGE 3362 3362 0.03
CAUTION 3342 3172 0.03
DISEASE 2244 1868 0.02
ENZYME REGULATION 1753 1753 0.01
MASS SPECTROMETRY 893 810 0.01
DATABASE 818 751 0.01
POLYMORPHISM 343 334 <0.01
BIOTECHNOLOGY 50 50 <0.01
PHARMACEUTICAL 47 47 <0.01
Features (FT) 655938 5.35
DOMAIN 95401 28727 0.78
TRANSMEM 77067 16988 0.63
CONFLICT 47337 16661 0.39
CARBOHYD 45507 11138 0.37
DISULFID 41846 10872 0.34
TURN 39177 2956 0.32
METAL 36827 10004 0.30
STRAND 36304 2644 0.30
HELIX 27742 2845 0.23
ACT_SITE 24322 15216 0.20
CHAIN 23456 19176 0.19
VARIANT 23307 4423 0.19
REPEAT 22336 3704 0.18
NP_BIND 15500 10893 0.13
SIGNAL 14828 14826 0.12
MOD_RES 13336 7528 0.11
NON_TER 10321 7875 0.08
BINDING 8145 6285 0.07
ZN_FING 7821 2770 0.06
VARSPLIC 6951 3249 0.06
SITE 6265 4319 0.05
INIT_MET 5574 5545 0.05
PROPEP 4686 4026 0.04
MUTAGEN 4273 1337 0.03
DNA_BIND 4193 3949 0.03
CA_BIND 4049 1149 0.03
LIPID 2946 2395 0.02
TRANSIT 2582 2562 0.02
PEPTIDE 2517 1001 0.02
NON_CONS 804 411 0.01
UNSURE 290 123 <0.01
SE_CYS 111 73 <0.01
THIOETH 94 32 <0.01
THIOLEST 23 23 <0.01
Cross-references (DR) 999237 8.15
EMBL 230657 116257 1.88
InterPro 195677 104236 1.60
Pfam 133012 99557 1.09
PROSITE 105218 66696 0.86
PIR 47040 35736 0.38
PRINTS 39413 34822 0.32
SMART 38729 29473 0.32
HSSP 38069 38069 0.31
TIGRFAMs 31394 29063 0.26
ProDom 30120 28820 0.25
HAMAP 23868 23778 0.19
PDB 11737 3547 0.10
TIGR 11065 11020 0.09
MIM 8171 7086 0.07
Genew 7836 7788 0.06
MGD 5820 5805 0.05
SGD 4936 4882 0.04
EcoGene 4228 4226 0.03
MEROPS 3316 3222 0.03
TRANSFAC 2464 2214 0.02
WormPep 2413 2239 0.02
SubtiList 2362 2361 0.02
FlyBase 2236 2173 0.02
GeneDB_SPombe 2131 2101 0.02
TubercuList 1400 1363 0.01
StyGene 1196 1193 0.01
SWISS-2DPAGE 810 809 0.01
ListiList 712 658 0.01
Leproma 585 581 <0.01
Gramene 411 411 <0.01
MaizeDB 405 401 <0.01
HIV 370 354 <0.01
REBASE 358 353 <0.01
ECO2DBASE 351 299 <0.01
DictyDb 319 316 <0.01
GlycoSuiteDB 259 259 <0.01
ZFIN 225 225 <0.01
PHCI-2DPAGE 211 211 <0.01
MypuList 131 131 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 104 104 <0.01
HSC-2DPAGE 85 85 <0.01
PhosSite 53 53 <0.01
COMPLUYEAST-2DPAGE 50 50 <0.01
PMMA-2DPAGE 47 47 <0.01
Maize-2DPAGE 39 39 <0.01
SagaList 25 25 <0.01
ANU-2DPAGE 15 15 <0.01
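The 'Average per entry' column in the table above is the total number of lines divided by the total number of entries in the release (122'564). The sketch below reproduces a few of the values. The rule that averages rounding to 0.00 are shown as '<0.01' is inferred from the table and is an assumption; the function name is hypothetical.

```python
TOTAL_ENTRIES = 122564  # entries in release 41.0


def average_per_entry(total_lines, entries=TOTAL_ENTRIES):
    """Format the average number of lines per entry as in the table."""
    text = f"{total_lines / entries:.2f}"
    # Inferred display rule: values that round to 0.00 are shown as '<0.01'.
    return "<0.01" if text == "0.00" else text


for label, total in [("References (RL)", 232571), ("Comments (CC)", 405433),
                     ("Features (FT)", 655938), ("Cross-references (DR)", 999237)]:
    print(label, average_per_entry(total))
```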
A.6 Miscellaneous statistics
Total number of distinct authors cited in Swiss-Prot: 164'410
Total number of chloroplast-encoded sequences: 3'131
Total number of mitochondrial-encoded sequences: 2'385
Total number of cyanelle-encoded sequences: 145
Total number of plasmid-encoded sequences: 2'624
Number of additional sequences encoded in splice variants : 5'661
--End of document--