SWISS-PROT RELEASE 34.0 RELEASE NOTES
1. INTRODUCTION
Release 34.0 of SWISS-PROT contains 59'021 sequence entries, comprising
21'210'389 amino acids abstracted from 50'052 references. This
represents an increase of 14.5% over release 33. The growth of the data
bank is summarized below.
Release  Date   Number of entries  Number of amino acids
    2.0  09/86               3939                900 163
    3.0  11/86               4160                969 641
    4.0  04/87               4387              1 036 010
    5.0  09/87               5205              1 327 683
    6.0  01/88               6102              1 653 982
    7.0  04/88               6821              1 885 771
    8.0  08/88               7724              2 224 465
    9.0  11/88               8702              2 498 140
   10.0  03/89              10008              2 952 613
   11.0  07/89              10856              3 265 966
   12.0  10/89              12305              3 797 482
   13.0  01/90              13837              4 347 336
   14.0  04/90              15409              4 914 264
   15.0  08/90              16941              5 486 399
   16.0  11/90              18364              5 986 949
   17.0  02/91              20024              6 524 504
   18.0  05/91              20772              6 792 034
   19.0  08/91              21795              7 173 785
   20.0  11/91              22654              7 500 130
   21.0  03/92              23742              7 866 596
   22.0  05/92              25044              8 375 696
   23.0  08/92              26706              9 011 391
   24.0  12/92              28154              9 545 427
   25.0  04/93              29955             10 214 020
   26.0  07/93              31808             10 875 091
   27.0  10/93              33329             11 484 420
   28.0  02/94              36000             12 496 420
   29.0  06/94              38303             13 464 008
   30.0  10/94              40292             14 147 368
   31.0  02/95              43470             15 335 248
   32.0  11/95              49340             17 385 503
   33.0  02/96              52205             18 531 384
   34.0  10/96              59021             21 210 389
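The growth figure quoted in the introduction can be checked against the
last two rows of the table. The following Python sketch is an
illustration only (the names RELEASE_33, RELEASE_34 and percent_increase
are ours, not part of any SWISS-PROT software). Under this reading of
the table, the 14.5% figure appears to correspond to the number of amino
acids, while the number of entries grew by roughly 13.1%.

```python
# Figures copied from the last two rows of the growth table.
RELEASE_33 = {"entries": 52205, "amino_acids": 18531384}
RELEASE_34 = {"entries": 59021, "amino_acids": 21210389}


def percent_increase(old, new):
    """Return the relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# Growth of each quantity between releases 33 and 34, to one decimal.
growth = {
    key: round(percent_increase(RELEASE_33[key], RELEASE_34[key]), 1)
    for key in RELEASE_33
}
print(growth)
```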
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 33
2.1 Sequences and annotations
6'892 sequences have been added since release 33, the sequence data of
1'118 existing entries has been updated and the annotations of 10'629
entries have been revised.
2.2 What's happening with the model organisms
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates;
- Provide a higher level of annotation;
- Provide cross-references to specialized database(s) that contain,
among other data, some genetic information about the genes that code
for these proteins;
- Provide specific indices or documents.
What was done since the last release or in preparation for the next
release concerning model organisms:
- We have added Mycobacterium tuberculosis to the list of model
organisms. The genome of this important pathogenic bacterium is
currently being sequenced at the Sanger Genome Center in Hinxton. We
have already annotated 474 putative proteins from M.tuberculosis.
- We have continued our effort in catching up with the backlog of
sequences from eukaryotic model organisms. In particular we added 687
entries from yeast, 525 from human, 316 from S.pombe, 202 from
C.elegans, 62 from A.thaliana and 92 from Drosophila.
- We have added to SWISS-PROT all the sequences from yeast chromosomes
VII and XIV. We plan to integrate data from the remaining chromosomes
(IV, XII, XIII, XV and XVI) very soon so as to have a complete set of
annotated yeast sequences.
- We plan to finish, for the next release, the annotation of the
Haemophilus influenzae and Mycoplasma genitalium sequence entries
which are not yet part of SWISS-PROT.
Here is the current status of the model organisms:
Organism        Database               Index file      Number of
                cross-referenced                       sequences
--------------  ---------------------  --------------  ---------
A.thaliana      None yet               In preparation        562
B.subtilis      SubtiList              SUBTILIS.TXT         1783
C.albicans      None yet               CALBICAN.TXT          124
C.elegans       WormPep                CELEGANS.TXT         1208
D.discoideum    DictyDB                DICTY.TXT             265
D.melanogaster  FlyBase                In preparation        910
E.coli          EcoGene                ECOLI.TXT            3606
H.influenzae    None yet               HAEINFLU.TXT         1591
H.sapiens       MIM                    MIMTOSP.TXT          4000
M.genitalium    None yet               In preparation        425
M.tuberculosis  None yet               None yet              474
S.cerevisiae    LISTA/SGD              YEAST.TXT            4340
S.typhimurium   StyGene                SALTY.TXT             617
S.pombe         None yet               POMBE.TXT             956
S.solfataricus  None yet               None yet               42
Collectively the entries from the above model organisms represent 35.4%
of all SWISS-PROT entries.
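The percentage above follows directly from the table. As a quick check,
this Python sketch (the names MODEL_ORGANISMS and TOTAL_ENTRIES are
illustrative only) sums the per-organism counts and divides by the total
number of entries in release 34.

```python
# Sequence counts copied from the model organism table (release 34).
MODEL_ORGANISMS = {
    "A.thaliana": 562, "B.subtilis": 1783, "C.albicans": 124,
    "C.elegans": 1208, "D.discoideum": 265, "D.melanogaster": 910,
    "E.coli": 3606, "H.influenzae": 1591, "H.sapiens": 4000,
    "M.genitalium": 425, "M.tuberculosis": 474, "S.cerevisiae": 4340,
    "S.typhimurium": 617, "S.pombe": 956, "S.solfataricus": 42,
}
TOTAL_ENTRIES = 59021  # total number of entries in release 34

model_total = sum(MODEL_ORGANISMS.values())
model_share = round(100.0 * model_total / TOTAL_ENTRIES, 1)
print(model_total, model_share)
```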
2.3 Change in the GN line
Starting with release 34, we allow more than a single GN line to be
present in an entry. This small change was rendered necessary to allow
the representation of all gene names for a number of protein sequences
encoded by a multiplicity of genes or for genes with many synonyms.
Examples:
GN (MSP-31 OR R05F9.13) AND (MSP-40 OR C33F10.9) AND (MSP-142 OR
GN K05F1.2) AND C34F11.4 AND F58A6.8 AND K07F5.1 AND ZK1248.6.
GN (RPL44A OR RPL44 OR SCL41A OR RPL41A OR YNL162W OR N1722) AND
GN (RPL44B OR RPL44 OR SCL41B OR RPL41B OR MAK18 OR YHR141C).
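Software that reads GN lines must now join all the GN lines of an entry
before interpreting them. The Python sketch below shows one possible way
to do this; it is not an official parser. The function name
parse_gn_lines is hypothetical, and the sketch assumes that gene names
never contain parentheses or the words " AND " and " OR ".

```python
def parse_gn_lines(lines):
    """Parse the GN lines of one entry.

    Returns a list of genes; each gene is the list of its synonyms.
    """
    # Drop the two-letter line code and join the continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    # The gene name list is terminated by a period.
    text = text.rstrip(".")
    genes = []
    for group in text.split(" AND "):
        # Synonyms of one gene are enclosed in parentheses, separated by OR.
        group = group.strip().strip("()")
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes


example = [
    "GN   (RPL44A OR RPL44 OR SCL41A OR RPL41A OR YNL162W OR N1722) AND",
    "GN   (RPL44B OR RPL44 OR SCL41B OR RPL41B OR MAK18 OR YHR141C).",
]
for synonyms in parse_gn_lines(example):
    print(synonyms)
```

A gene written without parentheses is returned as a list holding a
single name.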
2.4 Changes concerning cross-references (DR line)
We have added cross-references from SWISS-PROT to the Maize genome 2D
Electrophoresis database. These cross-references are present in the DR
lines:
Data bank identifier: MAIZE-2DPAGE
Primary identifier: The protein spot unique identifier [1]
Secondary identifier: The tissue of origin [2]
Example: MAIZE-2DPAGE; P80607; COLEOPTILE.
[1] The Maize-2DPAGE database uses SWISS-PROT primary accession numbers
as the alphanumeric designation of spots that are linked to
SWISS-PROT entries.
[2] Currently only `COLEOPTILE' is used.
Small changes have been made to the syntax of cross-references to the
MIM and REBASE databases:
o In DR lines pointing to MIM, the secondary identifier, which used to
be the release number of that database, has been replaced by a '-'
(dash). This change became necessary because MIM is now updated on a
daily basis and no longer has release numbers.
o REBASE has recently introduced accession numbers. We therefore
changed the format of DR lines pointing to this database. The new
REBASE accession numbers are used as primary identifiers and the
names of the restriction systems as secondary identifiers.
Examples:
DR MIM; 249900; -.
DR REBASE; RB0005; ECORI.
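The DR lines shown in this section all have the same three-field layout
(data bank identifier, primary identifier, secondary identifier). As a
minimal sketch, assuming that no field contains a semicolon, such lines
could be split as follows in Python. The names CrossReference and
parse_dr_line are ours and do not belong to any existing library.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str   # data bank identifier, e.g. "MIM"
    primary: str    # primary identifier
    secondary: str  # secondary identifier ("-" when not applicable)


def parse_dr_line(line):
    """Split one DR line into its three fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # Each DR line is terminated by a period.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError("expected 3 fields in %r" % line)
    return CrossReference(*fields)


for text in ("DR   MIM; 249900; -.", "DR   REBASE; RB0005; ECORI."):
    print(parse_dr_line(text))
```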
3. PLANNED CHANGES
3.1 Accession numbers
With the creation of the TREMBL database (see section 6) and the rapid
increase in the amount of sequence data, we are faced with a problem of
availability of accession numbers. Currently we use a system based on a
one-letter prefix followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved the prefix
letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
out of space (mainly because of ESTs), have been forced to start using a
new format based on a two-letter prefix followed by 6 digits.
We will soon have used up all possible numbers with 'P' and 'Q' and the
only letter prefix which was not used by the nucleotide databases is
'O'. As we believe that changing the format of the accession numbers to
that now used by the nucleotide databases would wreak havoc on the
numerous software packages using SWISS-PROT, we have decided to keep a
system of accession numbers based on a six-character code, but with the
following planned changes:
1) As soon as we have used up all 'P' and 'Q' numbers, we will start
using 'O'. This extra letter should allow the continuation of the
present format (1 prefix letter + 5 digits) for at least a year.
2) When we have used up 'O', we will introduce a system based on the
following format:
   1        2         3            4            5          6
[O,P,Q]   [0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [0-9]
This means that we will keep a six-character code, but that any
combination of letters and digits can be present in positions 3, 4 and
5 of this code. This format allows a total of about 14 million
accession numbers (up from 300'000 with the current system).
We only allow digits in positions 2 and 6 so that SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or ordinary words.
Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
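For software developers, the planned format can be expressed as a
regular expression, and the two capacity figures quoted above can be
recomputed from the number of allowed characters per position. The
Python sketch below is only an illustration; the names ACCESSION_RE and
is_valid_accession are hypothetical. Note that accession numbers in the
present format (one prefix letter followed by 5 digits) also satisfy the
planned format.

```python
import re

# Planned format: [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(text):
    """Return True if text matches the planned six-character format."""
    return ACCESSION_RE.fullmatch(text) is not None


# Present scheme: one of three prefix letters followed by five digits.
current_capacity = 3 * 10**5
# Planned scheme: 3 letters x 10 digits x 36^3 alphanumerics x 10 digits.
planned_capacity = 3 * 10 * 36**3 * 10

for accession in ("P0A3S2", "Q2ASD4", "O13YX2", "P9B123"):
    print(accession, is_valid_accession(accession))
print(current_capacity, planned_capacity)
```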
3.2 Introduction of a new CC line-type topic (DATABASE)
There is an increasing number of databases that cater for a specific
protein or for a very limited number of proteins. Most of these
databases are mutation databases, reporting defects linked to a genetic
disease. We want to add cross-references to these databases when they
are available electronically, either by WWW or by FTP. We are therefore
adding, in the next release, a new comments (CC) line-type 'topic',
"DATABASE", whose syntax will be the following:
CC -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][;
FTP="Address"].
Where:
NAME is the name of the database;
NOTE is an optional free text note;
WWW is the WWW address (URL) of the database;
FTP is the anonymous FTP address (including the directory name) where
the database file(s) are stored.
Examples of its usage:
CC -!- DATABASE: NAME=CD40LBASE;
CC WWW="HTTP://www.expasy.ch/www/cd40lbase.html";
CC FTP="www.expasy.ch/databases/cd40lbase".
CC -!- DATABASE: NAME=HAEMB; NOTE=HAEMOPHILIA B DATABASE;
CC FTP="ftp.ebi.ac.uk/pub/databases/haemb/".
Please note that this is the first part of SWISS-PROT to allow lower
case characters (yes, we plan to go to mixed case soon!).
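To help developers prepare for the new topic, here is a minimal Python
sketch that turns one DATABASE comment block into a dictionary. It is
based only on the syntax and the examples given above: the function name
parse_database_comment is hypothetical, and the sketch assumes that
neither the free text nor the addresses contain a semicolon.

```python
def parse_database_comment(lines):
    """Parse the CC lines of one DATABASE topic into a dict of fields."""
    # Drop the two-letter line code and join the continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    prefix = "-!- DATABASE:"
    if not text.startswith(prefix):
        raise ValueError("not a DATABASE comment block")
    text = text[len(prefix):].strip()
    # The block is terminated by a period.
    if text.endswith("."):
        text = text[:-1]
    fields = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only; addresses are quoted.
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields


example = [
    "CC   -!- DATABASE: NAME=HAEMB; NOTE=HAEMOPHILIA B DATABASE;",
    'CC       FTP="ftp.ebi.ac.uk/pub/databases/haemb/".',
]
print(parse_database_comment(example))
```

Because addresses may contain lower case characters, the values are
returned exactly as written, without any case conversion.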
4. STATUS OF THE DOCUMENTATION FILES
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. Since release 33, we have added 6 new
document files. The following table lists all the documents that are
either currently available or that we plan to add in the next few
months.
USERMAN .TXT User manual
RELNOTES.TXT Release notes
SHORTDES.TXT Short description of entries in SWISS-PROT
JOURLIST.TXT List of abbreviations for journals cited
KEYWLIST.TXT List of keywords in use
SPECLIST.TXT List of organism identification codes
TISSLIST.TXT List of tissues (in RC line) [1]
EXPERTS .TXT List of on-line experts for PROSITE and SWISS-PROT
SUBMIT .TXT Submission of sequence data to SWISS-PROT
ACINDEX .TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
7TMRLIST.TXT List of 7-transmembrane G-linked receptor entries
AATRNASY.TXT List of aminoacyl-tRNA synthetases
ALLERGEN.TXT Nomenclature and index of allergen sequences
CALBICAN.TXT Index of Candida albicans entries and their corresponding
gene designations
CDLIST .TXT CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep cross-
references
DICTY .TXT Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB cross-
references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database entries
referenced in SWISS-PROT
ECOLI .TXT Index of Escherichia coli K12 chromosomal entries and
their corresponding EcoGene cross-references
EMBLTOSP.TXT Index of EMBL Database entries referenced in SWISS-PROT
EXTRADOM.TXT Nomenclature of extracellular domains
GLYCOSID.TXT Classification of glycosyl hydrolases families and index
of glycosyl hydrolase entries
HAEINFLU.TXT Index of Haemophilus influenzae RD chromosomal entries
HOXLIST .TXT Vertebrate homeotic Hox proteins: nomenclature and index
HUMCHR20.TXT Index of protein sequence entries encoded on human
chromosome 20 [1]
HUMCHR21.TXT Index of protein sequence entries encoded on human
chromosome 21
HUMCHR22.TXT Index of protein sequence entries encoded on human
chromosome 22
HUMCHRX .TXT Index of protein sequence entries encoded on human
chromosome X [1]
HUMCHRY .TXT Index of protein sequence entries encoded on human
chromosome Y
MIMTOSP .TXT Index of MIM entries referenced in SWISS-PROT
MYGENIT .TXT Index of Mycoplasma genitalium chromosomal entries [2]
NOMLIST .TXT List of nomenclature related references for proteins
PDBTOSP .TXT Index of X-ray crystallography Protein Data Bank (PDB)
entries referenced in SWISS-PROT
PEPTIDAS.TXT Classification of peptidase families and index of
peptidase entries
PLASTID .TXT List of chloroplast and cyanelle encoded proteins
POMBE .TXT Index of Schizosaccharomyces pombe entries in SWISS-PROT
and their corresponding gene designations
RESTRIC .TXT List of restriction enzyme and methylase entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families on the
basis of sequence similarities [1]
SALTY .TXT Index of Salmonella typhimurium LT2 chromosomal entries
and their corresponding StyGene cross-references
SUBTILIS.TXT Index of Bacillus subtilis 168 chromosomal entries and
their corresponding SubtiList cross-references
YEAST .TXT Index of Saccharomyces cerevisiae entries and their
corresponding gene designations
YEAST1 .TXT Yeast Chromosome I entries
YEAST2 .TXT Yeast Chromosome II entries
YEAST3 .TXT Yeast Chromosome III entries
YEAST5 .TXT Yeast Chromosome V entries
YEAST6 .TXT Yeast Chromosome VI entries
YEAST7 .TXT Yeast Chromosome VII entries [1]
YEAST8 .TXT Yeast Chromosome VIII entries
YEAST9 .TXT Yeast Chromosome IX entries
YEAST10 .TXT Yeast Chromosome X entries
YEAST11 .TXT Yeast Chromosome XI entries
YEAST13 .TXT Yeast Chromosome XIII entries [2]
YEAST14 .TXT Yeast Chromosome XIV entries [1]
Notes:
[1] New in release 34.
[2] Will be available starting with release 35 of February 1997.
We have continued to include in some SWISS-PROT document files the
references of World-Wide Web sites relevant to the subject under
consideration. There are now 12 documents that include such links.
5. THE EXPASY WORLD-WIDE WEB SERVER
5.1 Background information
The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. WWW is a
global information retrieval system merging the power of world-wide
networks, hypertext and multimedia. Through hypertext links, it gives
access to documents and information available on thousands of servers
around the world. To access a WWW server one needs a WWW browser.
Currently, the most popular browser is Netscape Navigator(TM) from
Netscape Communications Corp. (available from ftp.netscape.com). Using a
WWW browser, one has access to all the hypertext documents stored on the
ExPASy server as well as many other WWW servers.
The ExPASy server was made available to the public in September 1993. In
October 1996 a cumulative total of 8 million connections was reached.
It may be accessed through its Uniform Resource Locator (URL - the
addressing system defined in WWW), which is:
http://www.expasy.ch/
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, SubtiList/NRSub, OMIM, PDB, HSSP, ProDom, REBASE, SGD,
YEPD and Medline. Using a browser which is able to display images one
can also remotely access 2D gel image data from SWISS-2DPAGE. ExPASy
also offers many tools for the analysis of protein sequences and 2D
gels.
For more information on the ExPASy WWW server, you can read the
following article:
Appel R.D., Bairoch A., Hochstrasser D.F.
A new generation of information retrieval tools for biologists: the
example of the ExPASy WWW server.
Trends Biochem. Sci. 19:258-260(1994).
Or you can contact Dr. Ron Appel:
Email: ron.appel@dim.hcuge.ch
5.2 SWISS-SHOP
Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
Molecular Biology, we can provide, on ExPASy, a service called
SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate which
proteins he/she is interested in. This can be done using various
criteria that can be combined:
- By entering one or more words that should be present in the
description line;
- By entering one or more species name(s) or taxonomic division(s);
- By entering one or more keywords;
- By entering one or more author names;
- By entering the accession number (or entry name) of a PROSITE pattern
or a user-defined sequence pattern;
- By entering the accession number (or entry name) of an existing
SWISS-PROT entry or by entering a "private" sequence.
Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent to the user by electronic mail.
5.3 What is new on ExPASy
Since the last release, there have been a large number of new
developments on the ExPASy WWW server. Here are some highlights of these
changes:
- CD40Lbase, The European CD40L Defect Database prepared by Manuel
Peitsch, has been made accessible through the ExPASy WWW server. The
purpose of CD40Lbase is to collect clinical and molecular data on
CD40 ligand defects leading to X-linked Hyper-IgM syndrome.
- Two new tools are available from the "Tools" page:
PeptideMass: this program is designed to calculate the theoretical
masses of peptides generated by the chemical or enzymatic cleavage of
proteins, to assist in the interpretation of peptide mass
fingerprinting and peptide mapping experiments. When proteins of
interest are specified from SWISS-PROT, the program considers all
annotations for that protein in the database, and uses these in order
to generate the correct peptide masses and warn users about peptides
that are not likely to be found when undertaking peptide mass
fingerprinting. Many protein post-translational modifications which
affect the masses of peptides can thus be taken into consideration.
TagIdent: this is a protein identification tool which improves on and
supersedes the tool previously known as 'GuessProt'. The user can now
identify proteins from 2-D gels by giving protein pI and MW
estimates, a species or organism classification of interest, and a
short sequence tag of up to 6 amino acids. This tag can be derived
from the N-terminus, the C-terminus or from internal peptides of a
protein. The results are now sent to the user by e-mail, allowing
many searches to be done at the same time.
- In PROSITE and ENZYME, we have added the possibility to save all
referenced SWISS-PROT entries to a user-defined file in the "outgoing"
directory of our anonymous FTP server.
- At the end of each page displaying a SWISS-PROT entry we have added
links to some of our sequence analysis tools so as to allow users to
directly submit the displayed sequence to these tools.
- An email option has been added to the tool ScanProsite. If you want
to scan a pattern against SWISS-PROT, you now have the option of
having the results of your query sent by email, which should avoid
the previously frequent timeout problems and is particularly useful
for complex patterns.
- WWW links have been implemented between SWISS-PROT entries and
nucleotide entries from DDBJ, the DNA Data Bank of Japan (in addition
to the existing links to EMBL at EBI and GenBank at NCBI). We have
also added direct WWW links to: SubtiList, the Bacillus subtilis
genomic database (http://www.pasteur.fr/Bio/SubtiList.html); YPD, the
Yeast Protein Database (http://quest7.proteome.com/YPDhome.html) and
ECO2DBASE, the Escherichia coli 2DPAGE database
(http://pcsf.brcf.med.umich.edu/eco2dbase).
- Links have been established from most feature (FT) lines of SWISS-
PROT entries to pages that highlight the subsequence in question,
both in 1- and in 3-letter amino acid codes.
- 2D Hunt, a database created and continuously updated by the Marvin
(http://www.hon.ch/MedHunt/Marvin.html) robot contains sites related
to electrophoresis and more specifically to 2-D electrophoresis. It
is accessible from the SWISS-2DPAGE top page of ExPASy.
- We have continued to build a list of Biomolecular servers; this list
is available on the ExPASy top page or directly from:
http://www.expasy.ch/www/amos_www_links.html
- Many other changes have been made to all parts of the server.
6. TREMBL - A SUPPLEMENT TO SWISS-PROT
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-
PROT. Since we do not want to dilute the quality standards of SWISS-PROT
by incorporating sequences into the database without proper sequence
analysis and annotation, we cannot speed up the incorporation of new
incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer-annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except those
already included in SWISS-PROT. We name this supplement TREMBL
(TRanslation from EMBL). It can be considered as a preliminary section
of SWISS-PROT. TREMBL is split into two main sections, SP-TREMBL and
REM-TREMBL:
SP-TREMBL (SWISS-PROT TREMBL) contains the entries (86'040) which should
be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been
assigned for all SP-TREMBL entries.
REM-TREMBL (REMaining TREMBL) contains the entries (19'255) that we do
not want to include in SWISS-PROT for a variety of reasons (synthetic
sequences, pseudogenes, translations of incorrect open reading frames,
fragments with fewer than eight amino acids, patent-derived sequences,
immunoglobulins and T-cell receptors, etc.).
TREMBL is available by FTP from the EBI server (ftp.ebi.ac.uk) in the
directory '/pub/databases/trembl'. It can be queried on WWW by the EBI
SRS server (http://www.ebi.ac.uk/srs/srsc). It is also available on the
SWISS-PROT CD-ROM and is searchable on the FASTA and BLITZ email servers
of the EBI.
7. WEEKLY UPDATES OF SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:
new_seq.dat Contains all the new entries since the last full release;
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous ftp
servers:
Organization ExPASy (Geneva University Expert Protein Analysis System)
Address www.expasy.ch
Directory /databases/swiss-prot/updates
Organization National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov
Directory /repository/swiss-prot/updates
Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk
Directory /pub/databases/swissprot/new
Organization Bioinformatics Unit, Weizmann Institute of Science (WIS)
Address bioinformatics.weizmann.ac.il
Directory /pub/databases/swiss-prot/updates
!!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse
between two updates.
Due to the current mechanism used to build a release, the entries that
are provided in these updates are not guaranteed to be error free.
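Users who mirror the weekly updates may find it convenient to derive the
location of the three files from the server list given in this section.
The Python sketch below only builds FTP URLs and performs no network
access. It assumes that the files sit directly in the listed
directories under the names given above; the names UPDATE_FILES,
SERVERS and update_urls are illustrative only.

```python
# File names and server locations copied from this section.
UPDATE_FILES = ("new_seq.dat", "upd_seq.dat", "upd_ann.dat")
SERVERS = {
    "ExPASy": ("www.expasy.ch", "/databases/swiss-prot/updates"),
    "NCBI": ("ncbi.nlm.nih.gov", "/repository/swiss-prot/updates"),
    "EBI": ("ftp.ebi.ac.uk", "/pub/databases/swissprot/new"),
    "WIS": ("bioinformatics.weizmann.ac.il",
            "/pub/databases/swiss-prot/updates"),
}


def update_urls(server):
    """Return the FTP URLs of the three weekly update files on a server."""
    host, directory = SERVERS[server]
    return ["ftp://%s%s/%s" % (host, directory, name) for name in UPDATE_FILES]


for url in update_urls("EBI"):
    print(url)
```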
8. ENZYME AND PROSITE
8.1 The ENZYME data bank
Release 21.0 of the ENZYME data bank is distributed with release 34 of
SWISS-PROT. ENZYME release 21.0 contains information relative to 3646
enzymes.
8.2 The PROSITE data bank
Release 13.2 of the PROSITE data bank is distributed with release 34 of
SWISS-PROT. This release of PROSITE contains 889 documentation entries
that describe 1'167 different patterns, rules and profiles/matrices.
Release 13.2 does not really represent a new release; the only changes
between releases 13.0 and 13.2 are updates to the pointers to the
SWISS-PROT entries whose names have been modified between releases 32 and
34. The next release of PROSITE (14.0) will be distributed with release
35 of SWISS-PROT.
9. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate it
if you would notify us when you find that sequences belonging to your
field of expertise are missing from the data bank. We would also like to
be notified about annotations to be updated, for example if the function
of a protein has been clarified or if new post-translational information
has become available.
========================================================================
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.55 Gln (Q) 4.02 Leu (L) 9.33 Ser (S) 7.22
Arg (R) 5.15 Glu (E) 6.32 Lys (K) 5.93 Thr (T) 5.74
Asn (N) 4.52 Gly (G) 6.84 Met (M) 2.35 Trp (W) 1.25
Asp (D) 5.30 His (H) 2.24 Phe (F) 4.07 Tyr (Y) 3.19
Cys (C) 1.69 Ile (I) 5.72 Pro (P) 4.92 Val (V) 6.52
Asx (B) 0.001 Glx (Z) 0.001 Xaa (X) 0.01
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
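The classification in A.1.2 is simply the table of A.1.1 sorted by
decreasing percentage. As an illustration, this short Python sketch
(the name COMPOSITION is ours) reproduces the ordering for the twenty
standard amino acids; the ambiguity codes B, Z and X are left out.

```python
# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION = {
    "Ala": 7.55, "Gln": 4.02, "Leu": 9.33, "Ser": 7.22,
    "Arg": 5.15, "Glu": 6.32, "Lys": 5.93, "Thr": 5.74,
    "Asn": 4.52, "Gly": 6.84, "Met": 2.35, "Trp": 1.25,
    "Asp": 5.30, "His": 2.24, "Phe": 4.07, "Tyr": 3.19,
    "Cys": 1.69, "Ile": 5.72, "Pro": 4.92, "Val": 6.52,
}

# Sort the three-letter codes from most to least frequent.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```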
A.2 Distribution of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 5389
The first twenty species represent 28511 sequences: 48.3 % of the total
number of entries.
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2447
2x: 844
3x: 477
4x: 299
5x: 220
6x: 199
7x: 131
8x: 98
9x: 116
10x: 51
11- 20x: 229
21- 50x: 162
51-100x: 54
>100x: 62
A.2.2 Table of the most represented species
Number Frequency Species
1 4340 Baker's yeast (Saccharomyces cerevisiae)
2 4000 Human
3 3606 Escherichia coli
4 2429 Mouse
5 2121 Rat
6 1783 Bacillus subtilis
7 1591 Haemophilus influenzae
8 1208 Caenorhabditis elegans
9 956 Fission yeast (Schizosaccharomyces pombe)
10 910 Fruit fly (Drosophila melanogaster)
11 899 Bovine
12 709 Chicken
13 617 Salmonella typhimurium
14 582 African clawed frog (Xenopus laevis)
15 562 Arabidopsis thaliana (Mouse-ear cress)
16 502 Rabbit
17 474 Mycobacterium tuberculosis
18 446 Pig
19 425 Mycoplasma genitalium
20 381 Maize
21 276 Rice
22 275 Bacteriophage T4
23 265 Slime mold (Dictyostelium discoideum)
24 262 Pseudomonas aeruginosa
25 253 Vaccinia virus (strain Copenhagen)
26 229 Tobacco
27 219 Porphyra purpurea
28 217 Pea
29 207 Dog
30 193 Wheat
193 Human cytomegalovirus (strain AD169)
32 190 Barley
33 186 Staphylococcus aureus
186 Soybean
35 184 Vaccinia virus (strain WR)
36 183 Sheep
37 173 Neurospora crassa
38 172 Pseudomonas putida
39 171 Rhodobacter capsulatus
40 169 Mycobacterium leprae
41 161 Potato
42 157 Synechocystis sp. (strain PCC 6803)
157 Klebsiella pneumoniae
44 154 Tomato
154 Bacillus stearothermophilus
154 Autographa californica nuclear polyhedrosis virus
47 150 Marchantia polymorpha (Liverwort)
48 148 Spinach
49 146 Variola virus
50 142 Cyanophora paradoxa
51 139 Agrobacterium tumefaciens
52 138 Odontella sinensis
53 137 Rhizobium meliloti
54 127 Lactococcus lactis (subsp. lactis)
55 125 Chlamydomonas reinhardtii
56 124 Candida albicans
57 121 Guinea pig
58 116 Streptomyces coelicolor
59 109 Aspergillus nidulans
60 108 Horse
61 107 Trypanosoma brucei brucei
62 101 Anabaena sp. (strain PCC 7120)
A.3 Distribution of the sequences by size
From To Number From To Number
1- 50 2831 1001-1100 534
51- 100 5243 1101-1200 405
101- 150 7359 1201-1300 290
151- 200 5678 1301-1400 186
201- 250 5207 1401-1500 165
251- 300 4745 1501-1600 99
301- 350 4445 1601-1700 84
351- 400 4533 1701-1800 70
401- 450 3420 1801-1900 80
451- 500 3320 1901-2000 47
501- 550 2455 2001-2100 30
551- 600 1735 2101-2200 53
601- 650 1292 2201-2300 63
651- 700 971 2301-2400 27
701- 750 877 2401-2500 34
751- 800 721 >2500 176
801- 850 544
851- 900 570
901- 950 391
951-1000 341
A.4 Longest sequences
The longest sequences (>=4000 residues) are listed here:
HTS1_COCCA 5217
FAT_DROME 5147
RYNR_RABIT 5037
RYNR_PIG 5035
RYNR_HUMAN 5032
RYNC_RABIT 4969
LRP_CAEEL 4753
DYHC_DICDI 4725
PLEC_RAT 4687
LRP2_RAT 4660
DYHC_RAT 4644
DYHC_DROME 4639
APB_HUMAN 4563
APOA_HUMAN 4548
LRP1_HUMAN 4544
LRP1_CHICK 4543
RRPA_CVMJH 4488
DYHC_ANTCR 4466
DYHC_TRIGR 4466
GRSB_BACBR 4451
PKSK_BACSU 4447
PKSL_BACSU 4427
PGBM_HUMAN 4393
YP73_CAEEL 4385
DYHC_NEUCR 4367
DYHC_EMENI 4344
PKD1_HUMAN 4303
DYHC_YEAST 4092
RRPA_CVH22 4085
A.5 Statistics for journal citations
Total number of journals cited in this release of SWISS-PROT: 776
A.5.1 Table of the frequency of journal citations
Journals cited 1x: 295
2x: 97
3x: 64
4x: 31
5x: 29
6x: 23
7x: 9
8x: 8
9x: 13
10x: 10
11- 20x: 68
21- 50x: 42
51-100x: 23
>100x: 64
A.5.2 List of the most cited journals in SWISS-PROT
Citations Journal abbreviation
--------- ----------------------------------
5458 J. BIOL. CHEM.
3394 PROC. NATL. ACAD. SCI. U.S.A.
3266 NUCLEIC ACIDS RES.
2322 J. BACTERIOL.
2059 GENE
1825 FEBS LETT.
1713 EUR. J. BIOCHEM.
1540 EMBO J.
1526 BIOCHEM. BIOPHYS. RES. COMMUN.
1425 NATURE
1384 BIOCHEMISTRY
1235 BIOCHIM. BIOPHYS. ACTA
1090 J. MOL. BIOL.
1069 CELL
1043 MOL. CELL. BIOL.
860 MOL. GEN. GENET.
834 PLANT MOL. BIOL.
768 BIOCHEM. J.
736 VIROLOGY
677 SCIENCE
645 MOL. MICROBIOL.
613 J. BIOCHEM.
535 GENOMICS
486 J. VIROL.
423 J. GEN. VIROL.
378 J. CELL BIOL.
370 PLANT PHYSIOL.
349 YEAST
341 GENES DEV.
288 CURR. GENET.
286 HUM. MOL. GENET.
282 J. IMMUNOL.
267 ARCH. BIOCHEM. BIOPHYS.
259 BIOL. CHEM. HOPPE-SEYLER
256 INFECT. IMMUN.
252 MOL. BIOCHEM. PARASITOL.
231 ONCOGENE
214 MOL. ENDOCRINOL.
213 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
208 J. GEN. MICROBIOL.
201 AM. J. HUM. GENET.
198 FEMS MICROBIOL. LETT.
195 J. CLIN. INVEST.
179 DEVELOPMENT
165 NAT. GENET.
164 J. MOL. EVOL.
160 GENETICS
151 DNA
150 HUM. MUTAT.
148 J. EXP. MED.
143 BLOOD
140 DNA CELL BIOL.
138 HUM. GENET.
129 NEURON
128 DEV. BIOL.
123 APPL. ENVIRON. MICROBIOL.
114 PLANT CELL
109 IMMUNOGENETICS
109 HEMOGLOBIN
105 AGRIC. BIOL. CHEM.
103 DNA SEQ.
101 BIOCHIMIE
101 BIOORG. KHIM.
101 ENDOCRINOLOGY
========================================================================
APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
***********************
****************** * EMBL Nucleotide * **********************
* EPD [Euk.Prom] * <---> * Sequence Database * <---- * ECDC [E.coli map] *
****************** * [EBI] * **********************
***********************
^ ^ ^ ^ ^ ^ ^ ^
****************** | | | I | | | |
* FlyBase * <------+ | | I | | | | **********************
* [D.melanogas.] * | | | I | | | +--------> * GCRDb [7TM recep.] *
****************** | | | I | | | | **********************
| | | I | | | |
****************** | | | I | | | | **********************
* SubtiList * <---------+ | I | | +-----------> * EcoGene [E.coli] *
* [B.subtilis] * | | | I | | | | **********************
****************** | | | I | | | |
| | | I | | | | **********************
****************** | | | I +---------------> * SGD [Yeast] *
* MaizeDb * <-----------+ I | | | | **********************
* [Zea mays] * | | | I | | | |
****************** | | | I | | | | **********************
| | | I | +-------------> * DictyDB [D.disco.] *
****************** | | | I | | | | **********************
* WormPep * | | | I | | | |
* [C.elegans] * <----+ | | | I | | | | **********************
****************** | | | | I | | | | +------ * ENZYME [Nomencl.] *
| | | | I | | | | | **********************
****************** | v v v v v v v v v v
* REBASE * *********************** **********************
* [Restriction * <--- * SWISS-PROT * -----> * OMIM [Human] *
* enzymes] * * Protein Sequence * **********************
****************** * Data Bank *
*********************** **********************
****************** ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^ * ECO2DBASE [2D] *
* StyGene * | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
****************** | | | | | | | | | **********************
| | | | | | | | +----------> * Maize-2DPAGE [2D] *
****************** | | | | | | | | **********************
* Transfac * <------+ | | | | | | |
****************** | | | | | | | **********************
| | | | | | +------------> * SWISS-2DPAGE [2D] *
****************** | | | | | | **********************
* Harefield [2D] * <--------+ | | | | |
****************** | | | | | **********************
| | | | +--------------> * Aarhus/Ghent [2D] *
****************** | | | | **********************
* PROSITE * | | | |
* [Patterns and * <----------+ | | +----------------> **********************
* profiles] * | | * YEPD [Yeast] [2D] *
****************** | +----------------+ **********************
| v |
| *********************** +-> **********************
+--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
*********************** **********************
=End=of=SWISS-PROT=release=34=notes=====================================