SWISS-PROT RELEASE 37.0 RELEASE NOTES
!! Important: do not forget to read section 10 of these release notes. It
contains an important announcement relevant to SWISS-PROT and PROSITE !!
1. INTRODUCTION
Release 37.0 of SWISS-PROT contains 77'977 sequence entries, comprising
28'268'293 amino acids abstracted from 62'513 references. This represents
an increase of 5.3% over release 36. The growth of the data bank is
summarized below.
Release  Date   Number of entries  Number of amino acids
2.0 09/86 3939 900 163
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
26.0 07/93 31808 10 875 091
27.0 10/93 33329 11 484 420
28.0 02/94 36000 12 496 420
29.0 06/94 38303 13 464 008
30.0 10/94 40292 14 147 368
31.0 02/95 43470 15 335 248
32.0 11/95 49340 17 385 503
33.0 02/96 52205 18 531 384
34.0 10/96 59021 21 210 389
35.0 11/97 69113 25 083 768
36.0 07/98 74019 26 840 295
37.0 12/98 77977 28 268 293
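The growth figure quoted in the introduction can be checked against the last two rows of the table. The short Python sketch below is only an illustration; the function name is our own and is not part of any SWISS-PROT software.

```python
def percent_increase(previous: int, current: int) -> float:
    """Return the relative growth from previous to current, in percent."""
    return (current - previous) / previous * 100.0

# Entry counts for releases 36.0 and 37.0, taken from the table above.
entries_growth = percent_increase(74019, 77977)
# Amino acid counts for the same two releases.
residues_growth = percent_increase(26840295, 28268293)

print(f"entries: {entries_growth:.1f}%, amino acids: {residues_growth:.1f}%")
```

Both values round to the 5.3% stated in the introduction.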
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 36
2.1 Sequences and annotations
3'988 sequences have been added since release 36, the sequence data of
667 existing entries has been updated and the annotations of 12'047
entries have been revised.
2.2 What's happening with the model organisms
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
o Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates;
o Provide a higher level of annotation;
o Provide cross-references to specialized database(s) that contain,
among other data, some genetic information about the genes that code
for these proteins;
o Provide specific indices or documents.
Here is the current status of the model organisms in SWISS-PROT:
Organism         Database           Index file       Number of
                 cross-referenced                    sequences
--------------   ----------------   --------------   ---------
A.thaliana       None yet           In preparation   792
B.subtilis       SubtiList          SUBTILIS.TXT     2046
C.albicans       None yet           CALBICAN.TXT     194
C.elegans        Wormpep            CELEGANS.TXT     1956
D.discoideum     DictyDB            DICTY.TXT        285
D.melanogaster   FlyBase            FLY.TXT          1064
E.coli           EcoGene            ECOLI.TXT        4476
H.influenzae     HiDB (TIGR)        HAEINFLU.TXT     1701
H.sapiens        MIM                MIMTOSP.TXT      5146
H.pylori         HpDB (TIGR)        HPYLORI.TXT      367
M.genitalium     MgDB (TIGR)        MGENITAL.TXT     470
M.musculus       MGD                MGDTOSP.TXT      3387
M.jannaschii     MjDB (TIGR)        MJANNASC.TXT     1307
M.tuberculosis   None yet           None yet         918
S.cerevisiae     SGD                YEAST.TXT        4806
S.typhimurium    StyGene            SALTY.TXT        723
S.pombe          None yet           POMBE.TXT        1406
S.solfataricus   None yet           None yet         84
We plan to finish as quickly as possible the annotation of the
Escherichia coli, Haemophilus influenzae, Methanococcus jannaschii and
yeast (S.cerevisiae) sequence entries which are not yet part of SWISS-
PROT.
2.3 Switch to the NCBI taxonomy
To contribute to the standardization of the taxonomies used in molecular
sequence databases we have changed our taxonomy with release 37. We have
switched to the NCBI taxonomy, which is already used by the
DDBJ/EMBL/GenBank nucleotide sequence databases. The taxonomic
classification maintained at the NCBI is available from the server
http://www.ncbi.nlm.nih.gov/Taxonomy.
This modification affects the OC (Organism Classification) lines. However,
it has no impact on the format of that line-type, only on its content.
For example, the OC lines for Homo sapiens (human) used to be:
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC EUTHERIA; PRIMATES.
and are now:
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC PRIMATES; CATARRHINI; HOMINIDAE; HOMO.
The switch to the new taxonomy indirectly brings along additional
changes. Most of these changes are subtle, yet they may have an impact on
some users and some specific usage of SWISS-PROT. We will describe here
some of these changes.
The NCBI taxonomy is much more detailed than that formerly used by SWISS-
PROT. The number of nodes listed in the OC lines is therefore generally
larger. For example, the taxonomic lineage for Pisum sativum (garden pea)
used to be:
OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC FABALES; FABACEAE.
It is now:
OC EUKARYOTA; VIRIDIPLANTAE; STREPTOPHYTA; EMBRYOPHYTA; TRACHEOPHYTA;
OC EUPHYLLOPHYTES; SPERMATOPHYTA; MAGNOLIOPHYTA; EUDICOTYLEDONS;
OC ROSIDAE; FABALES; FABACEAE; PAPILIONOIDEAE; PISUM.
The names of the taxonomic kingdoms at the root of the NCBI taxonomic
tree differ from the old SWISS-PROT taxonomy in the following manner:
NCBI Old SWISS-PROT
---------- --------------
Archaea Archaebacteria
Bacteria Prokaryota
Eukaryota Eukaryota
Viruses Viridae
This is important for users selecting a subset of the database based on a
particular taxonomic kingdom.
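For software that selects entries by kingdom, the renaming above amounts to a small lookup table applied to the first node of the OC lines. The following Python sketch illustrates the idea; the function and variable names are hypothetical, and the code assumes that each OC line starts with the two-letter line code followed by whitespace.

```python
# Old SWISS-PROT root node -> NCBI root node, as listed in the table above.
OLD_TO_NCBI_KINGDOM = {
    "ARCHAEBACTERIA": "ARCHAEA",
    "PROKARYOTA": "BACTERIA",
    "EUKARYOTA": "EUKARYOTA",
    "VIRIDAE": "VIRUSES",
}

def kingdom_of(oc_lines):
    """Return the first node of the lineage given by a list of OC lines."""
    # Drop the two-letter line code, then join the continuation lines.
    lineage = " ".join(line[2:].strip() for line in oc_lines)
    first_node = lineage.split(";")[0].strip().rstrip(".")
    return first_node.upper()

def to_ncbi_kingdom(name):
    """Translate an old kingdom name; NCBI names pass through unchanged."""
    return OLD_TO_NCBI_KINGDOM.get(name.upper(), name.upper())
```

Applying to_ncbi_kingdom() to the result of kingdom_of() lets the same selection code work on entries from releases before and after the switch.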
We also changed the names of the corresponding files in the
special_selections section of the anonymous FTP server (see section 7.1).
The files:
archaebacteria.seq.xxxxxx
eukaryota.seq.xxxxxx
prokaryota.seq.xxxxxx
viridae.seq.xxxxxx
(where 'xxxxxx' is the date the file was created) have been renamed:
archaea.seq.xxxxxx
eukaryota.seq.xxxxxx
bacteria.seq.xxxxxx
viruses.seq.xxxxxx
The format and content of the 'speclist.txt' documentation file (see
section 4) have changed. It no longer contains the section that used to
list the taxonomic nodes as it would now be too cumbersome to be included
in such a document. The SWISS-PROT taxonomic node code is replaced by the
NCBI taxonomy ID (TaxID). As the NCBI code does not convey any
information per se on which taxonomic kingdom a species belongs to, we
have followed each organism code by a letter that indicates the taxonomic
kingdom a species belongs to. It can be one of the following:
'A' for archaea (=archaebacteria);
'B' for bacteria (=prokaryota or eubacteria);
'E' for eukaryota;
'V' for viruses and phages (=viridae).
Example:
DROME E 007227: N=Drosophila melanogaster
C=Fruit fly
On the ExPASy WWW version (http://www.expasy.ch/cgi-bin/speclist) of this
document, the NCBI TaxID is an active link to the NCBI server, querying
the Taxonomic database on the lineage of the selected organism.
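A line of the new 'speclist.txt' format, such as the DROME example above, can be split into its parts with a regular expression. The Python sketch below is a minimal illustration; the names are hypothetical and the exact column spacing of the real file is an assumption, so the pattern accepts any amount of whitespace between fields.

```python
import re

# Organism code, kingdom letter, NCBI TaxID, then the official name (N=).
SPECLIST_RE = re.compile(
    r"^(?P<code>\w+)\s+(?P<kingdom>[ABEV])\s+(?P<taxid>\d+):\s*N=(?P<name>.+)$"
)

KINGDOMS = {"A": "archaea", "B": "bacteria", "E": "eukaryota", "V": "viruses"}

def parse_species_line(line):
    """Parse the first line of a species record; return None otherwise."""
    m = SPECLIST_RE.match(line.strip())
    if m is None:
        return None  # e.g. a continuation line such as 'C=Fruit fly'
    return {
        "code": m.group("code"),
        "kingdom": KINGDOMS[m.group("kingdom")],
        "taxid": int(m.group("taxid")),  # leading zeros are dropped
        "name": m.group("name").strip(),
    }
```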
While in the process of mapping the old SWISS-PROT taxonomy to that of
NCBI, we corrected more than 100 misspellings in species names. We also
updated many names to newer and more appropriate designations (but kept
the previous names as synonyms).
2.4 Introduction of the Reference Title (RT) line-type
In release 37 we have introduced a new line type, the RT (Reference
Title) line. This optional line is placed between the RA and RL lines. The
RT line gives the title of the paper (or other work) cited as exactly as
possible given the limitations of the computer character set. The format
of the RT line is:
RT "TITLE";
An example of the use of RT lines is shown below:
RT "Sequence analysis of the genome of the unicellular cyanobacterium
RT Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT region from map positions 64% to 92% of the genome.";
It should be noted that:
o The form used is that which would be used in a citation rather than
that displayed at the top of the published paper. For instance, where
journals capitalize major title words this is not preserved;
o The text of a title ends with either a period '.', a question mark '?'
or an exclamation mark '!';
o Double quotation marks '"' are not present in the text of the title;
they are replaced by single quotation marks;
o Titles of articles published in a language other than English have been
translated into English;
o Greek letters are spelled out (alpha, beta, etc.).
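The rules above are sufficient to extract a title from a block of RT lines and to check that it is well formed. The following Python sketch shows one possible way to do so; the function name is hypothetical, and the code assumes that every line begins with the 'RT' line code.

```python
def parse_title(rt_lines):
    """Join RT lines and return the bare title, or None if malformed."""
    # Remove the 'RT' line code from each line and join the pieces.
    text = " ".join(
        line[2:].strip() for line in rt_lines if line.startswith("RT")
    )
    # A well-formed title is wrapped in double quotes and closed by ';'.
    if not (text.startswith('"') and text.endswith('";')):
        return None
    title = text[1:-2]
    # No embedded double quotes, and the text must end with '.', '?' or '!'.
    if '"' in title or not title.endswith((".", "?", "!")):
        return None
    return title
```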
The RT lines were introduced in journal, book and patent references as
well as in some other types of references (Plant Gene Register, Worm
Breeder's Gazette). They have not yet been systematically introduced for
unpublished submissions. The RT lines were introduced using the following
sources of information:
o For all references linked to Medline, the titles were automatically
extracted from the relevant Medline abstracts;
o The EMBL DNA sequence database was then automatically scanned to
retrieve additional titles. We then searched for the remaining missing
titles in a variety of on-line resources:
o The LITDB bibliographic database from the Protein Research
Foundation in Japan;
o The AGRICOLA bibliographic database from NAL;
o The Web sites of various journals;
o The Korean journals abstract database;
o The PDB 3D-structure database;
o The MIM database;
o The Plant Gene Register;
o The NCBI Entrez protein search tool;
o The European Patent Office patent database;
o About 200 titles were typed-in by going to various libraries in Geneva
to find the relevant papers;
o Finally some authors, editors or publishers were contacted by email.
We want to thank all those who responded and sent us titles that
would otherwise have been very difficult to find.
Currently, out of more than 62000 references, we lack the title for
fewer than 50 (this corresponds to a coverage of more than 99.9%).
The RT line has been introduced in mixed-case, instead of the ALL UPPER-
CASE format used elsewhere in SWISS-PROT. As you will see in section 3.1,
we plan to gradually convert all of SWISS-PROT to mixed-case.
2.5 Changes affecting the accession numbers
With the creation of the TrEMBL database (see section 6) and the rapid
increase in the amount of sequence data, we are faced with a problem of
availability of accession numbers. Currently we use a system based on a
one-letter prefix followed by 5 digits. This system was also used by the
nucleotide sequence databases which had originally reserved for SWISS-
PROT the prefix letters 'P' and 'Q'. The nucleotide databases, having run
out of space (due mainly to ESTs), have been forced to start using a new
format based on a two-letter prefix followed by 6 digits.
We have used up all possible numbers with 'P' and 'Q' and the only letter
prefix which was not used by the nucleotide database is 'O'. As we
believe that changing the format of the accession numbers to that used
now by the nucleotide database would create havoc on the numerous
software packages using SWISS-PROT, we have decided to keep a system of
accession numbers based on a six-character code, but with the following
changes:
o We have started using 'O'. This extra letter should allow the
continuation of the present format (1 prefix letter + 5 digits) for
approximately one year.
o When we have finished using up 'O', we will introduce a system
based on the following format:
1       2     3          4          5          6
[O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]
What the above means is that we will keep a six-character code, but that
in positions 3, 4 and 5 of this code any combination of letters and
numbers can be present. This format allows a total of 14 million
accession numbers (up from 300'000 with the current system).
We only allow digits in positions 2 and 6 so that SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or ordinary words!
Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
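Software that validates accession numbers can be prepared for the extended format with a single regular expression. The Python sketch below is an illustration written for these notes, not an official validator; the names are hypothetical. It also recomputes the capacities quoted above.

```python
import re

# Current format: one prefix letter (O, P or Q) followed by five digits.
CURRENT_AC = re.compile(r"[OPQ][0-9]{5}")
# Planned format: positions 3, 4 and 5 may hold a letter or a digit.
EXTENDED_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")

def is_valid_accession(ac):
    """True if ac fits the extended six-character format."""
    # Every current-format number also fits the extended pattern.
    return EXTENDED_AC.fullmatch(ac) is not None

# 3 prefix letters times 10^5 digit combinations.
current_capacity = 3 * 10**5
# 3 prefixes, a digit, three letter-or-digit positions, a final digit.
extended_capacity = 3 * 10 * 36**3 * 10
```

The extended capacity evaluates to 13'996'800, which is the "14 million" mentioned above.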
2.6 Changes concerning the reference location line (RL)
The (IN) prefix is mainly used for book citations. We have slightly
changed the format of these book citations so that the format is now
similar to that used by the EMBL nucleotide sequence database. The new
format is:
RL (IN) EDITOR_1 I.[, EDITOR_2 I., EDITOR_X I.] (EDS.);
RL BOOK-NAME, PP.[VOL:]FIRST-LAST, PUBLISHER, CITY (YEAR).
So, what was before:
RL (IN) TRENDS IN QSAR AND MOLECULAR MODELING 92, WERMUTH C.G., ED.,
RL PP.485-486, ESCOM, LEIDEN, (1993).
is now:
RL (IN) WERMUTH C.G. (EDS.);
RL TRENDS IN QSAR AND MOLECULAR MODELLING 92, PP.485-486, ESCOM
RL SCIENCE PUBLISHERS, LEIDEN (1993).
Since release 36, the (IN) prefix has also been used for citations to the
electronic Plant Gene Register. In release 37 it can additionally be used
for references to the Worm Breeder's Gazette (see
http://elegans.swmed.edu/wli/). Example:
RL (IN) WORM BREEDER'S GAZETTE 15(3):34(1998).
2.7 Cleaning up of the SIMILARITY comment line (CC) topic
We are continuing a major overhaul of the SIMILARITY topic. We would like
the majority of the information stored in this topic to be usable by
computer programs (while being human-readable). We are therefore
standardizing the format of this topic using two different subformats.
One to describe to which family a protein belongs:
CC -!- SIMILARITY: BELONGS TO THE <Name1> FAMILY [OF <Name2>].
CC [<Name3> SUBFAMILY.]
Examples:
CC -!- SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
CC -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC FAMILY.
CC -!- SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
CC -!- SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
CC OXIDOREDUCTASES.
CC -!- SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC "DEFORMED" SUBFAMILY.
CC -!- SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN
CC SUBFAMILY.
And one to describe which domains are found in a given protein:
CC -!- SIMILARITY: CONTAINS n <Name> [DOMAIN|REPEAT][S].
Examples:
CC -!- SIMILARITY: CONTAINS 1 FHA DOMAIN.
CC -!- SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
CC -!- SIMILARITY: CONTAINS 2 SH3 DOMAINS.
CC -!- SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.
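Because the two subformats are regular, they can be read by a program. The following Python sketch shows one possible pair of regular expressions; the names are hypothetical. It assumes that the text of the topic has already been joined into a single string with the 'CC' line codes and the 'SIMILARITY: ' prefix removed.

```python
import re

# BELONGS TO THE <Name1> FAMILY [OF <Name2>]. [<Name3> SUBFAMILY.]
FAMILY_RE = re.compile(
    r"^BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?$"
)
# CONTAINS n <Name> [DOMAIN|REPEAT][S].
DOMAIN_RE = re.compile(
    r"^CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\.$"
)

def parse_similarity(text):
    """Return a dictionary describing the comment, or None if free text."""
    m = FAMILY_RE.match(text)
    if m:
        return {"type": "family", **m.groupdict()}
    m = DOMAIN_RE.match(text)
    if m:
        return {
            "type": "domain",
            "count": int(m.group("count")),
            "name": m.group("name"),
            "kind": m.group("kind"),
        }
    return None
```

Comments that have not yet been converted to one of the two subformats return None and can be left for human inspection.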
We have already updated many entries in this and the previous releases
and plan to complete this change for the next release.
2.8 Changes concerning cross-references (DR line)
We have added cross-references from SWISS-PROT to the Pfam protein domain
database (see http://www.sanger.ac.uk/Pfam/; reference: Bateman A.,
Birney E., Durbin R., Eddy S.R., Finn R.D. and Sonnhammer E.L.L.; Nucleic
Acids Res. 27:260-262(1999)). These cross-references are present in the
DR lines. The specific format for cross-references to the Pfam database
is almost identical to that used for the PROSITE database:
DR PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
Where 'ACCESSION_NUMBER' stands for the accession number of the Pfam HMM-
profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is
either 'n' or 'PARTIAL'. 'n' is the number of hits of the profile in
that particular protein sequence. The 'PARTIAL' status indicates that the
profile did not detect the sequence because that sequence is not complete
and lacks the region on which the profile is based. The difference
between the cross-references to Pfam and those to PROSITE is that the
PROSITE DR lines make use of two additional 'STATUS': 'FALSE_NEG' and
'UNKNOWN'.
Examples of Pfam cross-references:
DR PFAM; PF00017; SH2; 1.
DR PFAM; PF00008; EGF; 8.
DR PFAM; PF00595; PDZ; PARTIAL.
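A Pfam cross-reference can be read by splitting the DR line at the semicolons. The Python sketch below illustrates this under the format given above; the function name and the keys of the returned dictionary are our own choices.

```python
def parse_pfam_dr(line):
    """Parse 'DR PFAM; ACCESSION; NAME; STATUS.' or return None."""
    if not line.startswith("DR"):
        return None
    # Drop the line code and the final period, then split the fields.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "PFAM":
        return None
    accession, name, status = fields[1:]
    if status == "PARTIAL":
        # Incomplete sequence: the profile region is missing.
        return {"accession": accession, "name": name,
                "partial": True, "hits": None}
    if not status.isdigit():
        return None  # unexpected status value
    return {"accession": accession, "name": name,
            "partial": False, "hits": int(status)}
```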
In this release, we have also updated all the DR lines pointing to the
HSSP, Mendel and TRANSFAC databases.
3. PLANNED CHANGES
3.1 Conversion of SWISS-PROT to mixed-case characters
We are happy to announce that we will gradually start the conversion of
SWISS-PROT entries from all 'UPPER CASE' to 'MiXeD CaSe'. The first line-
type that follows the new format is the newly introduced RT line (see
section 2.4). In release 38 we are planning to convert the following line
types:
DT, OS, OG, OC, RL and KW
Further lines will be converted in release 39, and this process will
probably be completed by January 1, 2000. We can't enter the third
millennium with a carry over from the time of punched tapes and
teletypes!
Here is an example of what a SWISS-PROT entry will look like in release
38:
ID PETG_CYAPA STANDARD; PRT; 37 AA.
AC P14236;
DT 01-JAN-1990 (Rel. 13, Created)
DT 01-JAN-1990 (Rel. 13, Last sequence update)
DT 01-NOV-1997 (Rel. 35, Last annotation update)
DE CYTOCHROME B6-F COMPLEX SUBUNIT 5.
GN PETG.
OS Cyanophora paradoxa.
OG Cyanelle.
OC Eukaryota; Glaucocystophyceae; Cyanophoraceae; Cyanophora.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=LB555 / PRINGSHEIM;
RX MEDLINE; 90098772.
RA STIREWALT V.L., BRYANT D.A.;
RT "Molecular cloning and nucleotide sequence of the petG gene of the
RT cyanelle genome of Cyanophora paradoxa.";
RL Nucleic Acids Res. 17:10095-10095(1989).
RN [2]
RP SEQUENCE FROM N.A.
RC STRAIN=LB555 / PRINGSHEIM;
RA STIREWALT V.L., MICHALOWSKI C.B., LUFFELHARDT W., BOHNERT H.J.,
RA BRYANT D.A.;
RL Submitted (JUL-1995) to the EMBL/GenBank/DDBJ databases.
CC -!- FUNCTION: THE CYTOCHROME B6-F COMPLEX FUNCTIONS IN THE LINEAR
CC CROSS-MEMBRANE TRANSPORT OF ELECTRONS BETWEEN PHOTOSYSTEM II AND
CC I, AS WELL AS IN CYCLIC ELECTRON FLOW AROUND PHOTOSYSTEM I.
CC PETG IS REQUIRED FOR EITHER THE STABILITY OR ASSEMBLY OF THE
CC CYTOCHROME B6-F COMPLEX.
CC -!- SUBCELLULAR LOCATION: THYLAKOID MEMBRANE-ASSOCIATED.
CC -!- SIMILARITY: BELONGS TO THE PETG FAMILY.
DR EMBL; X16974; G12549; -.
DR EMBL; U30821; G1016164; -.
DR PIR; S06916; S06916.
DR MENDEL; 7879; CYApa;petG;1.
KW Electron transport; Respiratory chain; Cyanelle;
KW Thylakoid membrane; Transmembrane.
FT DOMAIN 1 4 LUMENAL (POTENTIAL).
FT TRANSMEM 5 25 POTENTIAL.
FT DOMAIN 26 37 STROMAL (POTENTIAL).
SQ SEQUENCE 37 AA; 4139 MW; 265A8973 CRC32;
MVEPLLSGIV LGLIPVTLIG LFVAAYLQYR RGNQFEF
//
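Programs that compare the content of converted line types against upper-case strings will need to become case-insensitive. The Python sketch below illustrates this for the KW lines of the example entry; the function names are hypothetical, and the code assumes that each line starts with the 'KW' line code.

```python
def keywords(kw_lines):
    """Return the list of keywords stored in a block of KW lines."""
    # Drop the line code and join the continuation lines.
    text = " ".join(line[2:].strip() for line in kw_lines)
    # Keywords are separated by ';' and the list ends with a period.
    return [k.strip() for k in text.rstrip(".").split(";") if k.strip()]

def has_keyword(kw_lines, wanted):
    """Case-insensitive test, valid for old and new style KW lines."""
    return wanted.casefold() in {k.casefold() for k in keywords(kw_lines)}
```

With this approach a query such as has_keyword(lines, "TRANSMEMBRANE") gives the same answer whether the entry comes from release 37 or from a later, mixed-case release.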
3.2 Extension of the accession number system
As already explained in detail under 2.5, we will extend the accession
number system when we have used up the 'O' series of accession
numbers. This can be anticipated for early 1999.
3.3 Introduction of a new CC line-type topic: MISCELLANEOUS
We will introduce in the next release a new 'topic' for the comments (CC)
line-type: 'MISCELLANEOUS'. This topic will be used for all comments
which do not belong to any other already defined topic. What this means
is that, starting with release 38, all comment lines will be assigned to
a topic. Example:
CC -!- BINDS TO BACITRACIN.
will become:
CC -!- MISCELLANEOUS: BINDS TO BACITRACIN.
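The effect of this change on the first line of a comment block can be sketched as follows. This Python example is only an illustration: the function name is hypothetical, the set of topics is a small subset limited to those that appear in these release notes, and no attempt is made to re-wrap lines that become too long.

```python
import re

# Illustrative subset only: topics that appear in these release notes.
KNOWN_TOPICS = {"FUNCTION", "SUBCELLULAR LOCATION", "SIMILARITY",
                "MISCELLANEOUS"}

# First line of a comment block: 'CC', whitespace, then the '-!-' marker.
COMMENT_START = re.compile(r"^CC(\s+)-!- (.*)$")

def add_default_topic(line):
    """Prefix a topic-less comment with the MISCELLANEOUS topic."""
    m = COMMENT_START.match(line)
    if m is None:
        return line  # continuation line, or not a CC line at all
    spacing, text = m.groups()
    topic = text.split(":", 1)[0]
    if ":" in text and topic in KNOWN_TOPICS:
        return line  # already assigned to a topic
    return f"CC{spacing}-!- MISCELLANEOUS: {text}"
```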
3.4 Introduction of a unique identifier in the VARIANT feature
description of human sequence entries
We plan to introduce in release 38 a unique identifier for all VARIANT
feature keys in human sequence entries. This change is the first step
toward providing a unique identifier to all SWISS-PROT features. Human
sequence variants were chosen as a prototype for this planned
improvement. It will be possible, as soon as these identifiers become
available, to directly link specific sequence variants to the relevant
entries in disease mutation databases as well as to provide these
databases with a method to implement reciprocal links.
The unique identifier will be of the form '/FTId=VAR_nnnnnn' and will
be added as the last part of the description field of a 'VARIANT' feature
key. Examples:
FT VARIANT 6 6 E -> V (IN S; SICKLE CELL ANEMIA).
FT VARIANT 11 11 V -> D (IN WINDSOR; O2 AFFINITY UP;
FT UNSTABLE).
will become:
FT VARIANT 6 6 E -> V (IN S; SICKLE CELL ANEMIA);
FT /FTId=VAR_000001.
FT VARIANT 11 11 V -> D (IN WINDSOR; O2 AFFINITY UP;
FT UNSTABLE); /FTId=VAR_000234.
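Programs that read the description of VARIANT features may want to separate the identifier from the rest of the text. The Python sketch below shows one way to do this; the names are hypothetical, and the code assumes that the continuation lines of the feature have already been joined into a single description string.

```python
import re

# The identifier follows a ';' and closes the description.
FTID_RE = re.compile(r";\s*/FTId=(?P<ftid>VAR_\d{6})\.$")

def split_variant_description(description):
    """Return (description_without_id, ftid); ftid is None if absent."""
    m = FTID_RE.search(description)
    if m is None:
        return description, None  # old style description
    # Restore the final period of the original description.
    return description[: m.start()] + ".", m.group("ftid")
```

Descriptions from release 37 and earlier pass through unchanged, so the same code can be used before and after the identifiers are introduced.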
3.5 Small change in the format of RL lines for submissions to the DNA
databases
Along with the conversion of the RL line to mixed-case (see 3.1) we will also
make a small change to the format of RL lines for submissions to the DNA
databases. What is now:
RL SUBMITTED (MMM-YEAR) TO EMBL/GENBANK/DDBJ DATA BANKS.
will be changed to:
RL Submitted (MMM-YEAR) to the EMBL/GenBank/DDBJ databases.
Such a change is made so as to follow more closely the format used by the
EMBL nucleotide sequence database.
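The rewriting of such a line can be sketched in a few lines of Python. The example below is an illustration only; the function name is hypothetical, and the pattern assumes a three-letter month followed by a four-digit year, as in 'JUL-1995'.

```python
import re

# Old style line; the whitespace after 'RL' and the date are captured.
OLD_SUBMISSION = re.compile(
    r"^RL(\s+)SUBMITTED \(([A-Z]{3}-\d{4})\) "
    r"TO EMBL/GENBANK/DDBJ DATA BANKS\.$"
)

def convert_submission_line(line):
    """Rewrite an old style submission line; other lines are unchanged."""
    m = OLD_SUBMISSION.match(line)
    if m is None:
        return line
    spacing, date = m.groups()
    return f"RL{spacing}Submitted ({date}) to the EMBL/GenBank/DDBJ databases."
```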
4. STATUS OF THE DOCUMENTATION FILES
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user manual,
release notes, the various indices for authors, citations, keywords,
etc.), but many have been created recently and we are continuously adding
new files. The following table lists all the documents that are currently
available.
USERMAN.TXT User manual
RELNOTES.TXT Release notes for current release (37)
OLDRLNOT.TXT Release notes for previous release (36)
SHORTDES.TXT Short description of entries in SWISS-PROT
JOURLIST.TXT List of abbreviations for journals cited [see 1]
KEYWLIST.TXT List of keywords in use [see 2]
SPECLIST.TXT List of organism identification codes [see 3]
TISSLIST.TXT List of tissues
EXPERTS.TXT List of on-line experts for PROSITE and SWISS-PROT
SUBMIT.TXT Submission of sequence data to SWISS-PROT
ACINDEX.TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
DELETEAC.TXT Deleted accession number index
7TMRLIST.TXT List of 7-transmembrane G-linked receptor entries
AATRNASY.TXT List of aminoacyl-tRNA synthetases
ALLERGEN.TXT Nomenclature and index of allergen sequences
BLOODGRP.TXT List of blood group antigen proteins
CALBICAN.TXT Index of Candida albicans entries and their
corresponding gene designations
CDLIST.TXT CD nomenclature for surface proteins of human
leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and Wormpep cross-references
DICTY.TXT Index of Dictyostelium discoideum entries and
their corresponding gene designations and DictyDB
cross-references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database
entries referenced in SWISS-PROT
ECOLI.TXT Index of Escherichia coli K12 chromosomal entries
and their corresponding EcoGene cross-references
EMBLTOSP.TXT Index of EMBL Database entries referenced in
SWISS-PROT
EXTRADOM.TXT Nomenclature of extracellular domains
FLY.TXT Index of Drosophila entries and FlyBase cross-
references
GLYCOSID.TXT Classification of glycosyl hydrolase families and
index of glycosyl hydrolase entries
HAEINFLU.TXT Index of Haemophilus influenzae RD chromosomal
entries
HOXLIST.TXT Vertebrate homeotic Hox proteins: nomenclature and
index
HPYLORI.TXT Index of Helicobacter pylori strain 26695
chromosomal entries
HUMCHR17.TXT Index of protein sequence entries encoded on human
chromosome 17
HUMCHR18.TXT Index of protein sequence entries encoded on human
chromosome 18
HUMCHR19.TXT Index of protein sequence entries encoded on human
chromosome 19
HUMCHR20.TXT Index of protein sequence entries encoded on human
chromosome 20
HUMCHR21.TXT Index of protein sequence entries encoded on human
chromosome 21
HUMCHR22.TXT Index of protein sequence entries encoded on human
chromosome 22
HUMCHRX.TXT Index of protein sequence entries encoded on human
chromosome X
HUMCHRY.TXT Index of protein sequence entries encoded on human
chromosome Y
HUMPVAR.TXT Index of human proteins with sequence variants
INITFACT.TXT List and index of translation initiation factors
MIMTOSP.TXT Index of MIM entries referenced in SWISS-PROT
METALLO.TXT Classification of metallothioneins and index of
entries in SWISS-PROT
MGDTOSP.TXT Index of MGD entries referenced in SWISS-PROT
MGENITAL.TXT Index of Mycoplasma genitalium chromosomal entries
MJANNASC.TXT Index of Methanococcus jannaschii entries
NGR234.TXT Table of putative genes in Rhizobium plasmid
pNGR234a
NOMLIST.TXT List of nomenclature related references for
proteins
PCC6803.TXT Index of Synechocystis strain PCC 6803 entries
PDBTOSP.TXT Index of X-ray crystallography Protein Data Bank
(PDB) entries referenced in SWISS-PROT
PEPTIDAS.TXT Classification of peptidase families and index of
peptidase entries
PLASTID.TXT List of chloroplast and cyanelle encoded proteins
POMBE.TXT Index of Schizosaccharomyces pombe entries in
SWISS-PROT and their corresponding gene
designations
RESTRIC.TXT List of restriction enzyme and methylase entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families
on the basis of sequence similarities
SALTY.TXT Index of Salmonella typhimurium LT2 chromosomal
entries and their corresponding StyGene cross-
references
SUBTILIS.TXT Index of Bacillus subtilis 168 chromosomal entries
and their corresponding SubtiList cross-references
UPFLIST.TXT UPF (Uncharacterized Protein Families) list and
index of members
YEAST.TXT Index of Saccharomyces cerevisiae entries and
their corresponding gene designations
YEAST1.TXT Yeast Chromosome I entries
YEAST2.TXT Yeast Chromosome II entries
YEAST3.TXT Yeast Chromosome III entries
YEAST5.TXT Yeast Chromosome V entries
YEAST6.TXT Yeast Chromosome VI entries
YEAST7.TXT Yeast Chromosome VII entries
YEAST8.TXT Yeast Chromosome VIII entries
YEAST9.TXT Yeast Chromosome IX entries
YEAST10.TXT Yeast Chromosome X entries
YEAST11.TXT Yeast Chromosome XI entries
YEAST13.TXT Yeast Chromosome XIII entries
YEAST14.TXT Yeast Chromosome XIV entries
Notes:
[1] The journal list ('jourlist.txt') has been extensively updated. This
document now lists for each journal the name of its publisher.
Journal subtitles, when they are available, have also been added.
This file can now be considered as a mini-database on life science
journals. It lists 1073 journals and contains more than 800 Web
links. Example of an entry in the journal list:
Abbrev: Allergy
Title : Allergy
[European Journal of Allergy and Clinical Immunology]
ISSN : 0105-4538
CODEN : LLRGDY
Publis: Munksgaard
Note : Replaces Acta Allergol., starts with vol. 33 in 1978.
Server: http://www.munksgaard.dk/allergy/
[2] The keyword list ('keywlist.txt') has been converted to mixed-case
characters.
[3] The species list ('speclist.txt') has been extensively updated due
to the switch to the NCBI taxonomy (see section 2.3); it also has
been converted to mixed-case characters.
We have continued to include in some SWISS-PROT document files the
references of Web sites relevant to the subject under consideration.
There are now 40 documents that include such links.
5. THE EXPASY WORLD-WIDE WEB SERVER
5.1 Background information
The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
server was made available to the public in September 1993 and is
reachable at the following address:
http://www.expasy.ch/
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases. And, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, Mendel, OMIM, PDB, HSSP, Pfam, ProDom, REBASE, SGD,
SubtiList/NRSub, TRANSFAC, YPD and Medline. ExPASy also offers many tools
for the analysis of protein sequences and 2D gels.
5.2 Swiss-Shop
We provide, on ExPASy, a service called Swiss-Shop. Swiss-Shop is an
automated sequence alerting system which allows users to obtain, by
email, new sequence entries relevant to their field(s) of interest.
Various criteria can be combined:
o By entering one or more words that should be present in the
description line;
o By entering one or more species name(s) or taxonomic division(s);
o By entering one or more keywords;
o By entering one or more author names;
o By entering the accession number (or entry name) of a PROSITE
pattern or a user-defined sequence pattern;
o By entering the accession number (or entry name) of an existing
SWISS-PROT entry or by entering a private sequence.
Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent by electronic mail.
5.3 What is new on ExPASy
ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:
o Read the document History of changes, improvements and new features
which is available at the address:
http://www.expasy.ch/www/history.html
o Subscribe to Swiss-Flash, a service that reports news of databases,
software and services developments. By subscribing to this service,
you will automatically get Swiss-Flash bulletins by electronic mail.
To subscribe use the address:
http://www.expasy.ch/www/swiss-flash.html
Among all the improvements and the new features introduced during the
last six months, there are at least three that we believe are
specifically useful to SWISS-PROT users:
o NiceProt is a tool that provides a user-friendly tabular view of SWISS-
PROT entries. The 'NiceProt View of SWISS-PROT' is accessible from the
top and bottom of each SWISS-PROT entry on ExPASy. You can use this
tool to link to any SWISS-PROT entry by using the following style of URL:
http://www.expasy.ch/cgi-bin/niceprot.pl?P01585 (where the last part of
the URL is a valid primary accession number).
o The SWISS-PROT/TrEMBL full text search tool has been improved. The
databases are now indexed using the Glimpse search engine, wildcards
can be used in query strings, more fields (line types) are indexed and
response times are much shorter than before. See:
http://www.expasy.ch/cgi-bin/sprot-search-ful
o Users who wish to save and retrieve all SWISS-PROT entries originating
from a species can do this via the SWISS-PROT 'speclist.txt' document.
By clicking on any of the species codes and specifying a file name, one
can save all corresponding entries to a file that can be retrieved
from the anonymous ExPASy FTP server.
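The NiceProt URL style described above lends itself to automatic link generation. The Python sketch below builds such a link from a primary accession number; the function and constant names are hypothetical, and only the URL pattern itself is taken from the text above.

```python
# Base address of the NiceProt tool, as given above.
NICEPROT_BASE = "http://www.expasy.ch/cgi-bin/niceprot.pl"

def niceprot_url(accession):
    """Return the NiceProt link for a primary accession number."""
    # Accession numbers are upper-case; tolerate stray spaces and case.
    return f"{NICEPROT_BASE}?{accession.strip().upper()}"
```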
6. TREMBL - A SUPPLEMENT TO SWISS-PROT
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-
PROT. Since we do not want to dilute the quality standards of SWISS-PROT
by incorporating sequences into the database without proper sequence
analysis and annotation, we cannot speed up the incorporation of new
incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer-annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except those
already included in SWISS-PROT.
We name this supplement TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT
release is supplemented by TrEMBL release 8. TrEMBL is split into two main
sections, SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (180'763 in release 8)
which should be incorporated into SWISS-PROT. SWISS-PROT accession
numbers have been assigned for all SP-TrEMBL entries.
REM-TrEMBL (REMaining TrEMBL) contains the entries (43'780 in release 8)
that we do not want to include in SWISS-PROT for a variety of reasons
(synthetic sequences, pseudogenes, translations of incorrect open reading
frames, fragments with fewer than eight amino acids, patent-derived
sequences, immunoglobulins and T-cell receptors, etc.).
TrEMBL is available by FTP from the EBI and ExPASy servers in the
directory 'databases/trembl'. It can be queried on WWW by the EBI and
ExPASy SRS servers. It is also searchable on the FASTA, BIC-SW and BLAST
servers of the EBI.
7. FTP ANONYMOUS ACCESS TO SWISS-PROT
7.1 Generalities
SWISS-PROT is available for download on the following anonymous FTP
servers:
Organization Swiss Institute of Bioinformatics (SIB)
Address ftp.expasy.ch
Directory /databases/swiss-prot/
Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk
Directory /pub/databases/swissprot/
We have reorganized the directory on the ExPASy FTP server where SWISS-
PROT is stored. The new organization is shown below.
+--swiss-prot-+
|
|--release              The files for the current release of
|                       SWISS-PROT
|
|--release_compressed   The files of the compressed version
|                       (*.Z) of the current release of
|                       SWISS-PROT
|
|--special_selections   Files storing SWISS-PROT entries either
|                       from a specific taxonomic subset or
|                       linked to a specific database
|
|--sw_old_releases      The compressed 'tar' (archive) files
|                       of previous releases of SWISS-PROT
|
|--updates              The files of the cumulative weekly
|                       updates
|
+--updates_compressed   The files of the compressed version
                        (*.Z) of the cumulative weekly updates
The main differences from the previous release are:
o The SWISS-PROT release files are now in a subdirectory
(swiss-prot/release) instead of the main directory, which no longer
contains data files.
o A new subdirectory (swiss-prot/sw_old_releases) was created. It
contains Unix compressed 'tar' (archive) files of previous releases of
SWISS-PROT. Each release is stored in a file with the name
sprotNN.tar.Z where NN is a release number. Such a file stores all the
documentation (*.txt) files and the data file (sprotNN.dat) of the
corresponding SWISS-PROT release. The release notes are renamed from
release.txt to release.NN. We have decided to provide these files to
answer two kinds of requests. The main one originates from users who
want to compare sequence analysis algorithms by benchmarking them on a
specific release of the database so as to compare their results with
those of a competing program. The second type of request originates
from legal departments of biotech companies that often want to be able
to check the state of knowledge on a particular sequence at a given
time.
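As a small illustration of the naming scheme just described, the following
Python sketch derives the file names associated with one archived release.
The function name and the dictionary keys are illustrative assumptions; only
the patterns sprotNN.tar.Z, sprotNN.dat and release.NN come from the text
above. The sketch also assumes that NN is written without zero padding.

```python
def old_release_files(release):
    """Return the file names used for an archived SWISS-PROT release.

    'release' is the release number NN. The dictionary keys are
    hypothetical labels chosen for this sketch.
    """
    nn = str(int(release))   # assumption: no zero padding of NN
    return {
        "archive": "sprot%s.tar.Z" % nn,   # Unix compressed 'tar' file
        "data": "sprot%s.dat" % nn,        # data file inside the archive
        "notes": "release.%s" % nn,        # release.txt renamed to release.NN
    }


print(old_release_files(36)["archive"])
```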
7.2 Weekly updates of SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:
new_seq.dat Contains all the new entries since the last full release;
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
!! Important notes !!
o Although we try to follow a regular schedule, we do not promise to
update these files every week. In most cases two weeks may elapse
between two updates.
o Instead of using the above files, you can, every week, download an
updated copy of the SWISS-PROT database. This file is available in the
directory containing the non-redundant database (see next section).
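One possible way of combining the three update files with the last full
release is sketched below in Python. It is only a sketch under stated
assumptions: entries are taken to end with a '//' line, the primary
accession number is taken to be the first item of the first AC line, and an
entry that appears in more than one update file is assumed to be identical
in each of them. The function names are hypothetical, and entries that were
deleted or merged since the release are not handled.

```python
def read_entries(lines):
    """Yield (primary_accession, entry_text) for each flat-file entry.

    Assumes that every entry ends with a '//' line and that its primary
    accession number is the first item of its first AC line.
    """
    entry, accession = [], None
    for line in lines:
        line = line.rstrip("\n")
        entry.append(line)
        if line.startswith("AC ") and accession is None:
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("//"):
            if accession is not None:
                yield accession, "\n".join(entry) + "\n"
            entry, accession = [], None


def apply_updates(release, new_seq, upd_seq, upd_ann):
    """Return a dict of entries: the release overlaid with the updates.

    Each argument is an iterable of lines. Because the update files are
    cumulative, they are applied to the last full release, never to a
    previously updated copy.
    """
    database = dict(read_entries(release))
    for update in (upd_ann, upd_seq, new_seq):
        database.update(read_entries(update))
    return database
```

In practice each argument would be an open file, for example
apply_updates(open("sprot37.dat"), open("new_seq.dat"), open("upd_seq.dat"),
open("upd_ann.dat")), where sprot37.dat follows the sprotNN.dat naming
pattern used for the release data file.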
7.3 Non-redundant database
About a year ago, we started to distribute, on the ExPASy and EBI FTP
servers, files that make up a complete and non-redundant (see the notes
below) protein sequence database consisting of three components:
1) SWISS-PROT
2) TrEMBL
3) New entries to be later integrated into TrEMBL (hereafter known as
TrEMBL_New)
Every week three files are completely rebuilt. These files are named
sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z. As indicated by their .Z
extension, these are Unix compress format files which, when decompressed,
produce ASCII files in SWISS-PROT format.
Three other files are also available (sprot.fas.Z, trembl.fas.Z and
trembl_new.fas.Z). These are compressed FASTA format sequence files useful
for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very primitive
format.
The files for the non-redundant database are stored in the directory
/databases/sp_tr_nrdb on the ExPASy FTP server (ftp.expasy.ch) and in the
directory /pub/databases/sp_tr_nrdb on the EBI FTP server
(ftp.ebi.ac.uk).
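The locations and file names given above can be gathered in a short Python
sketch that retrieves one of the weekly files by anonymous FTP using the
standard ftplib module. The host names and directories are those quoted in
these notes and may not remain valid; the dictionary, tuple and function
names are hypothetical. The fetch function is defined but not called here.

```python
from ftplib import FTP

# Host names and directories as given in the text above.
NRDB_MIRRORS = {
    "ExPASy": ("ftp.expasy.ch", "/databases/sp_tr_nrdb"),
    "EBI": ("ftp.ebi.ac.uk", "/pub/databases/sp_tr_nrdb"),
}
SECTIONS = ("sprot", "trembl", "trembl_new")


def nrdb_file_names(fasta=False):
    """Names of the three weekly files, in SWISS-PROT or FASTA format."""
    extension = "fas" if fasta else "dat"
    return ["%s.%s.Z" % (section, extension) for section in SECTIONS]


def fetch(mirror, file_name, destination):
    """Download one file by anonymous FTP into 'destination'."""
    host, directory = NRDB_MIRRORS[mirror]
    ftp = FTP(host)
    try:
        ftp.login()                      # anonymous login
        ftp.cwd(directory)
        with open(destination, "wb") as handle:
            ftp.retrbinary("RETR " + file_name, handle.write)
    finally:
        ftp.close()


print(nrdb_file_names())
```

Note that the Python standard library does not read the Unix compress (.Z)
format, so a downloaded file still has to be expanded with an external tool
such as 'uncompress' before it can be parsed.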
Additional notes
o The SWISS-PROT file continuously grows as new annotated sequences are
added.
o The TrEMBL file decreases in size as sequences are annotated and moved
out of that section into SWISS-PROT. Four times a year a new release of
TrEMBL is built at EBI; at this point the TrEMBL file increases in size
as it then includes all of the new data (see next item) that has
accumulated since the last release.
o The TrEMBL_New file starts as a very small file and grows in size until
a new release of TrEMBL is available.
o SWISS-PROT and TrEMBL share the same system of accession numbers.
Therefore you will not find any primary accession number duplicated
between the two sections. A TrEMBL entry (and its associated accession
number(s)) can either move to SWISS-PROT as a new entry or be merged with
an existing SWISS-PROT entry. In the latter case, the accession
number(s) of that TrEMBL entry are added to those of the SWISS-PROT
entry.
o TrEMBL_New does not have real accession numbers. However it was
necessary to have an AC line so as to be able to use it with different
software products. This AC line contains a temporary identifier which
consists of the pID (protein identifier) of the coding sequence in the
parent nucleotide sequence.
o While these three files allow you to build what we call a non-redundant
database, it must be noted that this description is not completely
accurate. Without going into a long explanation, we can say that this
is currently the best attempt at providing a complete selection of
protein sequence entries while trying to eliminate redundancies. While
SWISS-PROT is completely (well, 99.994%!) non-redundant, TrEMBL is far
from being non-redundant, and the combination of SWISS-PROT and TrEMBL
is even less so.
o To describe to your users the version of the non-redundant database
that you are providing them with, you should use a statement of the
form:
SWISS-PROT release 37 and updates until <current_date>;
TrEMBL release 8 minus data integrated into SWISS-PROT as of
<current_date>;
New preliminary TrEMBL entries created since release 8 of TrEMBL
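For sites that rebuild their copy every week, the statement above can be
generated automatically. The Python sketch below shows one way of doing so.
The function name is hypothetical, and the ISO date layout (YYYY-MM-DD) is
an assumption, since these notes do not prescribe a format for
<current_date>.

```python
import datetime


def nrdb_statement(sprot_release, trembl_release, as_of):
    """Build the three-line description of a non-redundant database copy."""
    date = as_of.isoformat()   # assumption: ISO layout for <current_date>
    return "\n".join([
        "SWISS-PROT release %d and updates until %s;"
        % (sprot_release, date),
        "TrEMBL release %d minus data integrated into SWISS-PROT as of %s;"
        % (trembl_release, date),
        "New preliminary TrEMBL entries created since release %d of TrEMBL"
        % trembl_release,
    ])


print(nrdb_statement(37, 8, datetime.date.today()))
```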
8. ENZYME and PROSITE
8.1 The ENZYME data bank
Release 24.0 of the ENZYME data bank is distributed with release 37 of
SWISS-PROT. ENZYME release 24.0 contains information relative to 3704
enzymes. It differs from the previous release (23 of July 1998) in that
we have converted the CA (Catalytic Activity) and DI (DIsease) lines to
mixed-case characters. The conversion of the ENZYME database from ALL
UPPER-CASE to mixed-case is therefore complete.
For example, what was previously:
ID 1.14.15.4
DE Steroid 11-beta-monooxygenase.
AN Steroid 11-beta-hydroxylase.
AN Steroid 11-beta/18-hydroxylase.
AN Cytochrome p450 XIB1.
CA A STEROID + REDUCED ADRENAL FERREDOXIN + O(2) = AN 11-BETA-
CA HYDROXYSTEROID + OXIDIZED ADRENAL FERREDOXIN + H(2)O.
CF Heme-thiolate.
CC -!- Also hydroxylates steroids at the 18-position, and converts
CC 18-hydroxycorticosterone into aldosterone.
DI ADRENAL HYPERPLASIA IV; MIM:202010.
PR PROSITE; PDOC00081;
DR P15150, CPN1_BOVIN; Q64408, CPN1_CAVPO; P15538, CPN1_HUMAN;
DR P97720, CPN1_MESAU; Q29527, CPN1_PAPHA; Q29552, CPN1_PIG ;
DR Q92104, CPN1_RANCA; P15393, CPN1_RAT ; P51663, CPN1_SHEEP;
DR P19099, CPN2_HUMAN; Q64658, CPN2_MESAU; P15539, CPN2_MOUSE;
DR P30099, CPN2_RAT ; P30100, CPN3_RAT ;
//
is now:
ID 1.14.15.4
DE Steroid 11-beta-monooxygenase.
AN Steroid 11-beta-hydroxylase.
AN Steroid 11-beta/18-hydroxylase.
AN Cytochrome p450 XIB1.
CA A steroid + reduced adrenal ferredoxin + O(2) = an 11-beta-
CA hydroxysteroid + oxidized adrenal ferredoxin + H(2)O.
CF Heme-thiolate.
CC -!- Also hydroxylates steroids at the 18-position, and converts
CC 18-hydroxycorticosterone into aldosterone.
DI Adrenal hyperplasia IV; MIM:202010.
PR PROSITE; PDOC00081;
DR P15150, CPN1_BOVIN; Q64408, CPN1_CAVPO; P15538, CPN1_HUMAN;
DR P97720, CPN1_MESAU; Q29527, CPN1_PAPHA; Q29552, CPN1_PIG ;
DR Q92104, CPN1_RANCA; P15393, CPN1_RAT ; P51663, CPN1_SHEEP;
DR P19099, CPN2_HUMAN; Q64658, CPN2_MESAU; P15539, CPN2_MOUSE;
DR P30099, CPN2_RAT ; P30100, CPN3_RAT ;
//
In this release, we have also updated and added a significant number of
DI (Disease) lines and added synonyms (AN lines) to a number of entries.
The WWW version of ENZYME on ExPASy now includes links to the BRENDA
database of enzymes. See:
http://www.uni-koeln.de/math-nat-fak/biochemie/ds/dsbren_e.htm
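The line-code layout of the ENZYME entry shown in this section lends itself
to simple parsing. The Python sketch below groups the lines of one entry by
their two-letter code, rejoins wrapped CA lines and extracts the SWISS-PROT
cross-references from the DR lines. It is a minimal sketch, not a reference
parser: the function names are hypothetical, the code is assumed to be
separated from its content by white space, and treating a trailing hyphen
as a word continued on the next line is a heuristic.

```python
from collections import defaultdict


def parse_enzyme_entry(text):
    """Group the lines of one ENZYME entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line == "//":     # '//' terminates an entry
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


def join_wrapped(lines):
    """Join continuation lines; a trailing hyphen continues the word."""
    out = ""
    for part in lines:
        if out and not out.endswith("-"):
            out += " "
        out += part
    return out


def swissprot_refs(dr_lines):
    """Return (accession number, entry name) pairs from DR lines."""
    pairs = []
    for line in dr_lines:
        for item in line.split(";"):
            if "," in item:
                accession, name = item.split(",", 1)
                pairs.append((accession.strip(), name.strip()))
    return pairs


sample = """ID 1.14.15.4
CA A steroid + reduced adrenal ferredoxin + O(2) = an 11-beta-
CA hydroxysteroid + oxidized adrenal ferredoxin + H(2)O.
DR P15150, CPN1_BOVIN; Q64408, CPN1_CAVPO; P15538, CPN1_HUMAN;
//
"""
fields = parse_enzyme_entry(sample)
print(join_wrapped(fields["CA"]))
print(swissprot_refs(fields["DR"]))
```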
8.2 The PROSITE data bank
Release 15.0 of the PROSITE data bank is distributed with release 36 of
SWISS-PROT. This release of PROSITE contains 1'014 documentation entries
that describe 1'352 different patterns, rules and profiles/matrices.
9. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate it if
you would notify us when sequences belonging to your field of expertise
are missing from the database. We would also like to be notified about
annotations to be updated, if, for example, the function of a protein
has been clarified or if new information about post-translational
modifications has become available. To facilitate this feedback we
offer, on the ExPASy WWW server, a form that allows the submission of
updates and/or corrections to SWISS-PROT:
http://www.expasy.ch/sprot/sp_update_form.html
It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the
address:
swiss-prot@expasy.ch
Note that from January 1999, all update requests will be assigned a
unique identifier of the form 'UR-Xnnnn' (example: UR-A0123). This
identifier will be used internally by the SWISS-PROT staff at SIB and EBI
to track the progress of requests and will also be used in email
exchanges with the persons who submitted a request.
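Users who file such identifiers in their own records may wish to check
their form. The Python sketch below does this with a regular expression.
It assumes, from the single example given above, that 'X' stands for one
upper-case letter and 'nnnn' for exactly four digits; the names used are
hypothetical.

```python
import re

# One upper-case letter followed by four digits, inferred from 'UR-A0123'.
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """True if 'text' has the form of an update request identifier."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None


print(is_update_request_id("UR-A0123"))
```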
10. IMPORTANT ANNOUNCEMENT
It has become obvious in recent years that the tremendous increase in data
flow has created a requirement for resources which cannot be met in
full by public funding. This is causing databases to fall behind the
research. We believe that the only solution to the resource shortfall is
to ask commercial users to participate by paying a license fee. No fee
will be charged to academic users, nor will any restriction be imposed on
their use or reuse of the data. Both SWISS-PROT and PROSITE are affected
by these changes; ENZYME is not.
A document fully describing the impact of this change on SWISS-PROT is
available with the SWISS-PROT distribution files on FTP
(sp_info.txt). You can also access the document as well as other relevant
ones from:
http://www.expasy.ch/announce/
http://www.ebi.ac.uk/news.html
If you do not have the time to read this document, the most important
take-home message is that these changes should not have any impact on the
way SWISS-PROT or PROSITE are accessed or redistributed. Academic users
will not be affected by these changes. Industrial end-users will also not
directly be affected as long as their employer pays the license fee. The
same holds true for bioinformatics companies. Academic software or
database developers as well as providers of database distribution
services will be only minimally affected by these changes. We hope to be
able to keep the spirit of SWISS-PROT and PROSITE alive and at the same
time ensure their long-term financial survival. We sincerely hope and
believe that in the next two years the only change that will matter will
be the increase in scope and timeliness of the databases.
----------------------------------------------------------------------------
SWISS-PROT is copyright. It is produced through a collaboration between the
Swiss Institute of Bioinformatics and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by non-profit
institutions as long as its content is in no way modified. Usage by and for
commercial entities requires a license agreement. For information about the
licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to
license@isb-sib.ch.
----------------------------------------------------------------------------
========================================================================
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.58    Gln (Q) 3.97    Leu (L) 9.42    Ser (S) 7.12
Arg (R) 5.16    Glu (E) 6.37    Lys (K) 5.95    Thr (T) 5.67
Asn (N) 4.45    Gly (G) 6.84    Met (M) 2.37    Trp (W) 1.23
Asp (D) 5.28    His (H) 2.24    Phe (F) 4.09    Tyr (Y) 3.18
Cys (C) 1.66    Ile (I) 5.81    Pro (P) 4.90    Val (V) 6.58
Asx (B) 0.001   Glx (Z) 0.001   Xaa (X) 0.01
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
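The classification above is simply the composition table of A.1.1 sorted by
decreasing percentage. The Python sketch below reproduces it from the twenty
standard amino acids (Asx, Glx and Xaa are left out). The variable and
function names are illustrative only.

```python
# Percentages copied from the table in A.1.1.
COMPOSITION = {
    "Ala": 7.58, "Gln": 3.97, "Leu": 9.42, "Ser": 7.12,
    "Arg": 5.16, "Glu": 6.37, "Lys": 5.95, "Thr": 5.67,
    "Asn": 4.45, "Gly": 6.84, "Met": 2.37, "Trp": 1.23,
    "Asp": 5.28, "His": 2.24, "Phe": 4.09, "Tyr": 3.18,
    "Cys": 1.66, "Ile": 5.81, "Pro": 4.90, "Val": 6.58,
}


def by_frequency(composition):
    """Return the amino acid codes ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


print(", ".join(by_frequency(COMPOSITION)))
```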
A.2 Distribution of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 6307
The first twenty species represent 36880 sequences: 47.3 % of the total
number of entries.
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2929
2x: 984
3x: 503
4x: 340
5x: 244
6x: 216
7x: 161
8x: 116
9x: 107
10x: 65
11- 20x: 297
21- 50x: 185
51-100x: 74
>100x: 86
A.2.2 Table of the most represented species
Number Frequency Species
1 5146 Human
2 4806 Baker's yeast (Saccharomyces cerevisiae)
3 4476 Escherichia coli
4 3387 Mouse
5 2550 Rat
6 2046 Bacillus subtilis
7 1956 Caenorhabditis elegans
8 1701 Haemophilus influenzae
9 1406 Fission yeast (Schizosaccharomyces pombe)
10 1307 Methanococcus jannaschii
11 1126 Bovine
12 1064 Fruit fly (Drosophila melanogaster)
13 918 Mycobacterium tuberculosis
14 862 Chicken
15 792 Arabidopsis thaliana (Mouse-ear cress)
16 723 Salmonella typhimurium
17 711 African clawed frog (Xenopus laevis)
18 670 Synechocystis sp. (strain PCC 6803)
19 651 Pig
20 582 Rabbit
21 490 Mycoplasma pneumoniae
22 470 Mycoplasma genitalium
23 428 Maize
24 403 Rhizobium sp. (strain NGR234)
25 367 Helicobacter pylori
26 363 Pseudomonas aeruginosa
27 332 Rice
28 296 Dog
29 295 Tobacco
30 285 Slime mold (Dictyostelium discoideum)
31 274 Treponema pallidum
32 272 Bacteriophage T4
33 268 Sheep
34 262 Mycobacterium leprae
35 260 Borrelia burgdorferi
36 256 Pea
37 253 Vaccinia virus (strain Copenhagen)
38 235 Methanobacterium thermoautotrophicum
235 Soybean
40 224 Neurospora crassa
41 222 Staphylococcus aureus
42 221 Barley
43 219 Porphyra purpurea
44 209 Wheat
45 203 Tomato
46 201 Rhodobacter capsulatus
47 199 Potato
48 198 Klebsiella pneumoniae
49 194 Candida albicans
50 193 Human cytomegalovirus (strain AD169)
51 192 Bacillus stearothermophilus
52 189 Archaeoglobus fulgidus
189 Pseudomonas putida
54 186 Vaccinia virus (strain WR)
55 170 Agrobacterium tumefaciens
56 169 Spinach
57 166 Guinea pig
58 159 Chlamydomonas reinhardtii
59 158 Rhizobium meliloti
60 154 Autographa californica nuclear polyhedrosis virus
61 150 Aspergillus nidulans
150 Marchantia polymorpha (Liverwort)
63 148 Streptomyces coelicolor
148 Guillardia theta (Cryptomonas phi)
65 147 Cyanophora paradoxa
66 146 Variola virus
67 144 Golden hamster
68 143 Horse
69 140 Lactococcus lactis (subsp. lactis)
70 139 Odontella sinensis
71 134 Orgyia pseudotsugata multicapsid polyhedrosis virus
72 132 Kluyveromyces lactis
73 127 Trypanosoma brucei brucei
74 126 Synechococcus sp. (strain PCC 7942)
75 125 Thermus aquaticus (subsp. thermophilus)
76 120 Alcaligenes eutrophus
77 115 Bombyx mori (Silk moth)
115 Anabaena sp. (strain PCC 7120)
79 114 Bradyrhizobium japonicum
80 109 Yersinia enterocolitica
81 107 Streptococcus pneumoniae
82 105 Brachydanio rerio (Zebrafish)
83 104 Oncorhynchus mykiss (Rainbow trout)
104 Brassica napus (Rape)
85 102 Rhodobacter sphaeroides
86 101 Cat
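The figures quoted at the start of A.2 for the first twenty species can be
rechecked from the first twenty rows of the table above together with the
total number of entries in this release (77'977). The short Python sketch
below performs the arithmetic; the variable names are illustrative only.

```python
TOTAL_ENTRIES = 77977   # number of sequence entries in release 37.0

# Frequencies of the twenty most represented species (table A.2.2).
TOP_TWENTY = [
    5146, 4806, 4476, 3387, 2550, 2046, 1956, 1701, 1406, 1307,
    1126, 1064, 918, 862, 792, 723, 711, 670, 651, 582,
]

covered = sum(TOP_TWENTY)
share = 100.0 * covered / TOTAL_ENTRIES
print(covered, round(share, 1))   # prints: 36880 47.3
```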
A.3 Distribution of the sequences by size
 From   To   Number        From   To   Number
    1-  50     3186        1001-1100      708
   51- 100     6584        1101-1200      537
  101- 150     9506        1201-1300      365
  151- 200     7467        1301-1400      246
  201- 250     7006        1401-1500      202
  251- 300     6508        1501-1600      127
  301- 350     6115        1601-1700      115
  351- 400     6164        1701-1800       86
  401- 450     4707        1801-1900       93
  451- 500     4450        1901-2000       62
  501- 550     3351        2001-2100       34
  551- 600     2258        2101-2200       68
  601- 650     1768        2201-2300       70
  651- 700     1292        2301-2400       35
  701- 750     1146        2401-2500       41
  751- 800      941        >2500          222
  801- 850      740
  851- 900      781
  901- 950      536
  951-1000      460
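The size classes above can be used for simple derived figures. The Python
sketch below checks that the classes add up to the total number of entries
in this release and computes the share of sequences of at most 500
residues. The variable names are illustrative, and None is used here to
mark the open-ended last class (>2500).

```python
# (from, to, number) for each size class; 'to' is None for the >2500 class.
SIZE_BINS = [
    (1, 50, 3186), (51, 100, 6584), (101, 150, 9506), (151, 200, 7467),
    (201, 250, 7006), (251, 300, 6508), (301, 350, 6115), (351, 400, 6164),
    (401, 450, 4707), (451, 500, 4450), (501, 550, 3351), (551, 600, 2258),
    (601, 650, 1768), (651, 700, 1292), (701, 750, 1146), (751, 800, 941),
    (801, 850, 740), (851, 900, 781), (901, 950, 536), (951, 1000, 460),
    (1001, 1100, 708), (1101, 1200, 537), (1201, 1300, 365),
    (1301, 1400, 246), (1401, 1500, 202), (1501, 1600, 127),
    (1601, 1700, 115), (1701, 1800, 86), (1801, 1900, 93),
    (1901, 2000, 62), (2001, 2100, 34), (2101, 2200, 68),
    (2201, 2300, 70), (2301, 2400, 35), (2401, 2500, 41),
    (2501, None, 222),
]

total = sum(count for _, _, count in SIZE_BINS)
up_to_500 = sum(count for _, high, count in SIZE_BINS
                if high is not None and high <= 500)
print(total, up_to_500, round(100.0 * up_to_500 / total, 1))
```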
A.4 Longest sequences
The longest sequences (>=4000 residues) are listed here:
HTS1_COCCA 5217
MUC2_HUMAN 5179
FAT_DROME 5147
RYNR_RABIT 5037
RYNR_PIG 5035
RYNR_HUMAN 5032
RYNC_RABIT 4969
LRP_CAEEL 4753
DYHC_DICDI 4725
PLEC_RAT 4687
LRP2_RAT 4660
LRP2_HUMAN 4655
DYHC_RAT 4644
DYHC_DROME 4639
DYHC_CAEEL 4568
DYHB_CHLRE 4568
APB_HUMAN 4563
APOA_HUMAN 4548
LRP1_HUMAN 4544
LRP1_CHICK 4543
DYHC_PARTE 4540
RRPA_CVMJH 4488
DYHG_CHLRE 4485
DYHC_ANTCR 4466
DYHC_TRIGR 4466
GRSB_BACBR 4451
PKSK_BACSU 4447
PKSL_BACSU 4427
PGBM_HUMAN 4393
YP73_CAEEL 4385
DYHC_NEUCR 4367
DYHC_NECHA 4349
DYHC_EMENI 4344
PKD1_HUMAN 4303
DYHC_SCHPO 4196
DYHC_YEAST 4092
RRPA_CVH22 4085
RRPL_DUGBV 4036
A.5 Statistics for journal citations
Total number of journals cited in this release of SWISS-PROT: 955
A.5.1 Table of the frequency of journal citations
Journals cited 1x: 351
2x: 130
3x: 79
4x: 43
5x: 33
6x: 26
7x: 15
8x: 17
9x: 15
10x: 12
11- 20x: 66
21- 50x: 67
51-100x: 25
>100x: 76
A.5.2 List of the most cited journals in SWISS-PROT
Nb Citations Journal abbreviation
-- --------- ----------------------------------
1 6476 J. Biol. Chem.
2 3931 Proc. Natl. Acad. Sci. U.S.A.
3 3418 Nucleic Acids Res.
4 2815 J. Bacteriol.
5 2606 Gene
6 2119 FEBS Lett.
7 1994 Eur. J. Biochem.
8 1843 Biochem. Biophys. Res. Commun.
9 1811 Biochemistry
10 1751 EMBO J.
11 1650 Nature
12 1484 Biochim. Biophys. Acta
13 1398 J. Mol. Biol.
14 1264 Cell
15 1214 Mol. Cell. Biol.
16 981 Mol. Gen. Genet.
17 973 Plant Mol. Biol.
18 941 Genomics
19 922 Biochem. J.
20 833 Science
21 811 Mol. Microbiol.
22 778 Virology
23 702 J. Biochem.
24 525 J. Virol.
25 482 Yeast
26 472 J. Cell Biol.
27 464 J. Gen. Virol.
28 452 Plant Physiol.
29 431 Hum. Mutat.
30 419 Genes Dev.
31 402 Hum. Mol. Genet.
32 355 J. Immunol.
33 344 Arch. Biochem. Biophys.
34 339 Infect. Immun.
35 324 Curr. Genet.
36 322 Oncogene
37 309 Mol. Biochem. Parasitol.
38 295 FEMS Microbiol. Lett.
39 291 Structure
40 269 Am. J. Hum. Genet.
41 264 Biol. Chem. Hoppe-Seyler
42 261 Nat. Genet.
43 250 Development
44 244 J. Clin. Invest.
45 238 Mol. Endocrinol.
46 238 Microbiology
47 225 J. Mol. Evol.
48 220 J. Gen. Microbiol.
49 220 Genetics
50 219 Nat. Struct. Biol.
51 213 Hoppe-Seyler's Z. Physiol. Chem.
52 200 DNA Cell Biol.
53 199 Hum. Genet.
54 196 Appl. Environ. Microbiol.
55 186 J. Exp. Med.
56 183 Blood
57 182 Dev. Biol.
58 176 Protein Sci.
59 175 Neuron
60 154 Immunogenetics
61 152 DNA
62 146 Endocrinology
63 146 DNA Seq.
64 136 Plant Cell
65 122 Cancer Res.
66 116 Plant J.
67 116 Hemoglobin
68 115 Bioorg. Khim.
69 115 Biochimie
70 112 Mol. Biol. Evol.
71 112 J. Neurochem.
72 109 Virus Res.
73 109 Agric. Biol. Chem.
74 107 Comp. Biochem. Physiol.
75 106 Brain Res. Mol. Brain Res.
76 101 Mech. Dev.
========================================================================
APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
DATABASES
The current status of the relationships (cross-references) between
SWISS-PROT and some biomolecular databases is summarized below. SWISS-PROT
(Protein Sequence Data Bank) is at the centre of these relationships and
is linked to each of the following databases:
   EMBL Nucleotide Sequence Database [EBI]
   FlyBase
   MGD [Mouse]
   SubtiList [B.subtilis]
   GCRDb [7TM recep.]
   EcoGene [E.coli]
   Mendel [Plant]
   SGD [Yeast]
   MaizeDb [Zea mays]
   DictyDB [D.disco.]
   WormPep [C.elegans]
   ENZYME [Nomencl.]
   REBASE [Restriction enzymes]
   OMIM [Human]
   ECO2DBASE [2D]
   StyGene [S.Typhimurium]
   Maize-2DPAGE [2D]
   TRANSFAC
   SWISS-2DPAGE [2D]
   Harefield [2D]
   Aarhus/Ghent [2D]
   PROSITE [Patterns and profiles]
   YEPD [Yeast] [2D]
   PDB [3D structures]
   HSSP [3D similar.]
HSSP is in addition linked to PDB.
=End=of=SWISS-PROT=release=37=notes=====================================