by PDB, NDB, UniProt, PROSITE Code or Search Term(s)

(-) Preliminary Note

The primary resource for 3D structures of biological macromolecules is the Worldwide Protein Data Bank (wwPDB).
Comprehensive resources, which may or may not be wwPDB members, include the five compared below (JenaLib, MSD, OCA, PDBsum and RCSB/PDB). All of these resources have a QuickSearch or LiteSearch option. It is important to note that the results obtained may differ between them. This is mainly due to different search spaces or matching options, as illustrated by three searches performed on July 13, 2006.

Search for 'melanin':
  JenaLib      6 hits: 1F9B, 1IDP, 1OYO, 2B9L, 2STD, 3STD
  MSD         26 hits: 1A8R, 1A9C, 1AR0, 1DOH, 1DPT, 1F9B, 1G0N, 1G0O, 1GTP, 1IDP, 1JA9, 1OAA, 1OUN, 1OYO, 1STD, 1TVB, 1TVH, 1YBV, 2B9L, 2STD, 3STD, 4STD, 5STD, 6STD, 7STD
  OCA         12 hits: 1DOH, 1F9B, 1G0N, 1G0O, 1GTP, 1JA9, 1OYO, 1STD, 1YBV, 2B9L, 2STD, 3STD
  PDBsum       6 hits: 1F9B, 1IDP, 1OYO, 2B9L, 2STD, 3STD
  RCSB/PDB    14 hits: 1DPT, 1F9B, 1OYO, 1STD, 1TVB, 1TVH, 1YBV, 2B9L, 2STD, 3STD, 4STD, 5STD, 6STD, 7STD

Search for 'PYRR_BACSU':
  JenaLib      2 hits: 1A3C, 1A4X
  MSD          2 hits: 1A3C, 1A4X
  OCA          2 hits: 1A3C, 1A4X
  PDBsum       0 hits
  RCSB/PDB     0 hits

Search for 'genase':
  JenaLib   1942 hits
  MSD       2608 hits (In this case the search term has to be '*genase'. Otherwise you will get no hits.)
  OCA       2243 hits (Again, '*genase' has to be used for searching.)
  PDBsum    1952 hits
  RCSB/PDB     0 hits ('*genase' does not work in this case.)

The OCA and MSD searches are done as text queries. The larger number of hits found by the RCSB/PDB search in the first example, compared with the PDBsum and JenaLib results, is due to the fact that this search is based on mmCIF format files. These files also include information from other databases, such as UniProt keywords. On the other hand, the JenaLib and PDBsum searches are based on the original PDB format files. Note, however, that the JenaLib QuickSearch option includes a mapping of PDB, UniProt and PROSITE codes in its search space. Further differences can occur if different PDB file records are taken into account. PDBsum and JenaLib report the records in which the search strings occur.
By far the largest number of hits is obtained from an MSD search. The reason is that MSD also searches the PubMed abstracts of the primary citation and of all secondary citations.

For the second search it is not clear why the PDB does not return any hits, because in other cases searching for UniProt IDs gives results. In MSD and PDBsum, searching for UniProt codes does not seem to be possible in the simple search versions.

In the third search there is a dramatic difference in the number of hits between PDBsum/JenaLib on the one side and MSD/OCA/PDB on the other. The reason is that the latter require a complete word match, whereas in JenaLib/PDBsum a partial word match is sufficient. The difference disappears for both MSD and OCA if the wildcard sign is used in the search string. This does not work, however, in the RCSB/PDB case. The larger number of MSD hits may again be due to the additional search in PubMed abstracts. Finally, note that PDBsum includes superseded entries.
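
The difference between complete word matching and partial word matching can be illustrated with a minimal Python sketch. The function names and the sample title are assumptions made for illustration only; the actual matching code of the resources is not described here.

```python
import re

def partial_word_match(term: str, text: str) -> bool:
    """Substring match: 'genase' is found inside 'dehydrogenase'."""
    return term.lower() in text.lower()

def complete_word_match(term: str, text: str) -> bool:
    """Whole-word match; a '*' in the term stands for any run of word characters."""
    # Escape the term, then turn the escaped wildcard back into a regex fragment.
    pattern = re.escape(term).replace(r"\*", r"\w*")
    return re.search(rf"\b{pattern}\b", text, flags=re.IGNORECASE) is not None

title = "alcohol dehydrogenase from horse liver"  # hypothetical record text

print(partial_word_match("genase", title))    # True
print(complete_word_match("genase", title))   # False
print(complete_word_match("*genase", title))  # True
```

In this sketch the wildcard query '*genase' behaves like the partial match, which mirrors the behaviour reported above for MSD and OCA.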

So, the take-home message from these observations is that the best results can be obtained by using search options of different resources.
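
As a rough illustration of this point, the hit lists of the 'melanin' search can be combined with ordinary set operations. The sketch below uses the JenaLib, OCA and RCSB/PDB lists given above; the variable names are chosen for illustration only.

```python
# Hit lists for 'melanin' as reported above (July 13, 2006).
hits = {
    "JenaLib": {"1F9B", "1IDP", "1OYO", "2B9L", "2STD", "3STD"},
    "OCA": {"1DOH", "1F9B", "1G0N", "1G0O", "1GTP", "1JA9",
            "1OYO", "1STD", "1YBV", "2B9L", "2STD", "3STD"},
    "RCSB/PDB": {"1DPT", "1F9B", "1OYO", "1STD", "1TVB", "1TVH", "1YBV",
                 "2B9L", "2STD", "3STD", "4STD", "5STD", "6STD", "7STD"},
}

combined = set().union(*hits.values())     # entries found by at least one resource
common = set.intersection(*hits.values())  # entries found by every resource

print(len(combined), len(common))  # 20 5
for name, codes in hits.items():
    # Entries that this resource alone would have missed.
    print(name, "misses", sorted(combined - codes))
```

Only 5 of the 20 distinct entries are returned by all three resources, so no single hit list covers the combined result.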

(-) General Information

This QuickSearch option provides a simple search interface to the Jena Library of Biological Macromolecules (JenaLib).

PDB / NDB IDs and UniProt accession numbers are recognized automatically. In this case the search is performed only in the corresponding ID / accession number list and requires a complete match.
If a PDB / NDB ID was provided, the corresponding JenaLib atlas page will be shown directly. Otherwise a list of entries will be displayed.
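
The automatic recognition step might be sketched as follows. The regular expressions are assumptions (a common four-character PDB ID pattern and the six-character UniProt accession format), not the rules actually used by JenaLib; NDB IDs are left out because their pattern is not described here.

```python
import re

# Assumed patterns, for illustration only.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]"
)

def classify(query: str) -> str:
    """Decide whether a query is looked up as a code or treated as search terms."""
    q = query.strip()
    if PDB_ID.fullmatch(q):
        return "pdb_id"
    if UNIPROT_AC.fullmatch(q.upper()):
        return "uniprot_accession"
    return "search_terms"

print(classify("3CRO"))        # pdb_id
print(classify("P03036"))      # uniprot_accession
print(classify("RCRO_BP434"))  # search_terms
print(classify('"2005"'))      # search_terms (the quotes prevent ID recognition)
```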

Any other string, including UniProt IDs (entry names) and PROSITE IDs and accession numbers, is interpreted as one or more 'search terms'. The separation of these terms is indicated by blanks. So, the string 'arabinose isomerase' will be separated into the two search terms 'arabinose' and 'isomerase'.

A phrase can be used as one search term by putting the complete string in double quotes. So, "arabinose isomerase" will be used, for example, as the single search term 'arabinose isomerase'. A search term must be at least three characters long. Within a phrase, character strings may be shorter than three as in "factor h", for example. However, the total number of characters surrounded by the double quotes and including blanks has to be three or larger. Double quotes can also be used to prevent the recognition of a string as a PDB / NDB code or accession number.
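
The splitting rules described above can be sketched in a few lines of Python. This is a minimal illustration under the stated rules (blank-separated terms, double-quoted phrases, minimum length of three characters); the function name and the error handling are assumptions, not the JenaLib implementation.

```python
import re

MIN_TERM_LENGTH = 3

def split_query(query: str) -> list[str]:
    """Split a query into search terms; double-quoted phrases stay together."""
    terms = []
    # Each match is either a quoted phrase (group 1) or a blank-free word (group 2).
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        term = phrase if phrase else word
        if len(term) < MIN_TERM_LENGTH:
            raise ValueError(f"search term too short: {term!r}")
        terms.append(term)
    return terms

print(split_query('arabinose isomerase'))    # ['arabinose', 'isomerase']
print(split_query('"arabinose isomerase"'))  # ['arabinose isomerase']
print(split_query('"factor h"'))             # ['factor h']
```

Note that blanks inside the quotes count towards the length, so "factor h" is accepted even though 'h' alone would be rejected.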

A hit is returned if all search terms are found in a particular entry. This corresponds to a search term combination by a logical AND, see below.

(-) Search Space

In the following description 'complete match' means that the complete database code must match a search term.
In contrast, 'partial match' means that only a part of a field like 'Structure Title' must match a search term.
Fields are the database elements that contain parts of information from the PDB file or from other data sources, such as the TITLE or KEYWDS records of the PDB file.

The QuickSearch option queries:
PDB IDs: complete match (example: 3CRO)
NDB IDs: complete match (example: PDR001)
UniProt codes, including
- Primary accession number: complete match (example: P03036)
- Secondary accession number: complete match (example: P25982)
- ID / Entry name: partial match (example: RCRO_BP434)
PROSITE codes, including
- Accession number: complete match (example: PS01122)
- ID / Entry name: partial match (example: CASPASE_CYS)
Header: partial match (PDB record: HEADER)
Structure Title: partial match (PDB record: TITLE)
Keywords: partial match (PDB record: KEYWDS)
Method: partial match (PDB record: EXPDTA)
Hetero Component Name: partial match (PDB record: HETNAM, HET; only full name)
Reference: partial match (PDB record: JRNL; primary reference), including all sub-records such as
- auth
- titl
- ref
- refn
- ...
Compound: partial match (PDB record: COMPND), including all sub-records such as
- molecule
- synonym
- ec
- ...
Source: partial match (PDB record: SOURCE), including all sub-records such as
- organism_scientific
- organism_common
- cellular_location
- expression_system
- cell_line
- tissue
- ...

Only PDB information contained in the original PDB format files and cross-references between PDB, UniProt and PROSITE codes are taken into account. Additional information from mmCIF format files is not used.
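
The distinction between 'complete match' and 'partial match' fields can be sketched as a small lookup table. The field names, the sample entry and its title text are hypothetical; the entry merely combines the example codes listed above.

```python
# Match mode per field (a subset of the search space, names are hypothetical).
SEARCH_SPACE = {
    "pdb_id": "complete",
    "uniprot_accession": "complete",
    "uniprot_entry_name": "partial",
    "prosite_accession": "complete",
    "prosite_entry_name": "partial",
    "title": "partial",
}

def field_matches(mode: str, value: str, term: str) -> bool:
    """Complete match compares the whole code, partial match looks for a substring."""
    value, term = value.lower(), term.lower()
    return value == term if mode == "complete" else term in value

def matching_fields(entry: dict, term: str) -> list[str]:
    """Return the names of all fields of an entry in which the term matches."""
    return [field for field, mode in SEARCH_SPACE.items()
            if field in entry and field_matches(mode, entry[field], term)]

entry = {  # illustrative entry, not real database content
    "pdb_id": "3CRO",
    "uniprot_accession": "P03036",
    "uniprot_entry_name": "RCRO_BP434",
    "title": "example title text",
}

print(matching_fields(entry, "_BP434"))  # ['uniprot_entry_name']
print(matching_fields(entry, "P0303"))   # [] (accession numbers need a complete match)
```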

More information on the 'PDB Format' can be obtained from the Protein Data Bank Contents Guide.

(-) How to Create a Search Query

  • The search query can either be a single search term or multiple search terms, separated by blanks.
    (example: arabinose)
  • Multiple search terms are automatically combined by a logical AND.
  • The logical NOT ('!=') can also be used.
    Note that the search time increases substantially if only one search term with a logical NOT is used.
  • A phrase can be searched by surrounding it with double quotes.
  • Each individual search term, including phrases, must have three or more characters.
    (example: "factor h")
  • Sub-records such as 'auth', 'titl', 'ref', 'refn', 'molecule', 'cell_line' ... can be used as search terms.
    But be careful, they are not necessarily present in each PDB entry.
  • The search is NOT case-sensitive.
    (example: arABinOse)
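
The rules above (logical AND, logical NOT with '!=', no case sensitivity) can be summarized in a short Python sketch. The function names and the two sample texts are assumptions for illustration; the real search works on database fields rather than on a single string.

```python
def split_not_terms(terms: list[str]) -> tuple[list[str], list[str]]:
    """Separate required terms from terms excluded with a leading '!='."""
    required = [t for t in terms if not t.startswith("!=")]
    excluded = [t[2:] for t in terms if t.startswith("!=")]
    return required, excluded

def entry_matches(text: str, required: list[str], excluded: list[str]) -> bool:
    """All required terms must occur (AND), no excluded term may occur (NOT)."""
    haystack = text.lower()  # the search is not case-sensitive
    return (all(t.lower() in haystack for t in required)
            and not any(t.lower() in haystack for t in excluded))

required, excluded = split_not_terms("jcsg !=nmr".split())

# Hypothetical entry texts.
print(entry_matches("JCSG; X-RAY DIFFRACTION", required, excluded))  # True
print(entry_matches("JCSG; SOLUTION NMR", required, excluded))       # False
print(entry_matches("L-Arabinose isomerase", ["arABinOse"], []))     # True
```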

(-) Output

The search returns either an atlas page or an entry list.

In the latter case all search fields with occurrences of at least one of the search terms are displayed and the search terms are highlighted.
For up to 150 entries the output can be collapsed to only one line per entry.

It is also possible to generate code lists with user-selected separators such as new line, comma, semicolon, blank, tab.
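
A code list of this kind is essentially a join over the entry codes. The following sketch assumes a simple mapping from the separator names mentioned above to characters; the function name is hypothetical.

```python
SEPARATORS = {
    "new line": "\n",
    "comma": ",",
    "semicolon": ";",
    "blank": " ",
    "tab": "\t",
}

def code_list(codes: list[str], separator: str = "comma") -> str:
    """Join entry codes with the separator selected by name."""
    return SEPARATORS[separator].join(codes)

print(code_list(["1A3C", "1A4X"], "semicolon"))  # 1A3C;1A4X
```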

Example output:

Search Term(s): "arabinose isomerase"
1 of 38719 entries match the query
Reference : titl: crystal structure of l-arabinose isomerase from e.coli
Compound : molecule: l-arabinose isomerase
Structure Title : crystal structure of l-arabinose isomerase from e.coli
Generate entry code list:  codes separated by     include header  

(-) Example Queries

In the following examples double quotes ("....") but NOT single quotes ('....') are part of the search string.
The query strings are linked to the corresponding QuickSearch query.
Query
    Description

_HUMAN
    Search for all human proteins in the PDB that are cross-referenced to a UniProt entry.

DPOLB_
    Search for all DNA polymerase beta proteins in the PDB that are cross-referenced to a UniProt entry.

CASPASE_
    Search for all entries with the associated PROSITE IDs (entry names) CASPASE_CYS, CASPASE_HIS, CASPASE_P10, CASPASE_P20.

"refn: astm psfgey"
    Search for all entries that have been published in the journal Proteins.
    refn is a group of fields that contains encoded references to the citation such as the ASTM (American Society for Testing and Materials) code or the ISSN and ISBN numbers. 'psfgey' is the ASTM code for the journal Proteins. This search mode is especially useful if a journal has changed its name or if the journal name is rather unspecific (as in this case). To get the ASTM code, first search for the journal name and include the search string refn in the query. The search results will show the ASTM code, which can then be used for a more specific search.

"solid state nmr"
    Search for all entries that match exactly the phrase 'solid state NMR'.

caspase nmr
    Search for all NMR structures of the protein caspase.

haloarcula marismortui
    Search for all structures from the archaeon (archaebacterium) Haloarcula marismortui.

a.rich
    Search for all structures with A. Rich as an author.
    Note that the search string 'rich' returns many other hits containing, for example, the phrase 'leucine rich proteins' or all Escherichia coli proteins (because 'rich' is part of 'Escherichia').

"tb structural genomics consortium"
    Search for all structures deposited by the TB Structural Genomics Consortium.

jcsg !=nmr
    Search for all non-NMR structures deposited by the Joint Center for Structural Genomics (JCSG).

", rsgi"
    Search for all structures deposited by the Riken Structural Genomics/Proteomics Initiative (RSGI).
    A search for 'rsgi' alone would, for example, also yield entries authored by R.W. Pickersgill.
    Note also that searching for other author affiliations makes no sense, because this information is not included in the corresponding PDB record. The names of Structural Genomics Centers/Initiatives are an exception because they are used as author names.

gene:
    Search for all entries which contain a sub-record ending with 'gene' (gene, organism_gene, expression_system_gene).

"gene: thsa"
    Search for all structures for which the thsA gene is indicated in the sub-records ending with 'gene' (gene, organism_gene, expression_system_gene).

"cellular_location: cytoplasm"
    Search for all structures for which cytoplasm is indicated in the sub-records 'cellular_location' or 'expression_system_cellular_location'.

renin
    Search for all entries that contain the string 'renin'.
    The hits include both entries with 'renin' and with other 'renin'-containing strings such as 'prorenine' or 'kynurenine'.

" renin "
    Returns 'renin' hits only.

moglobin
    Search for all hemoglobin entries.
    The PDB uses the two different spellings 'hemoglobin' and 'haemoglobin'. Searching for 'moglobin' identifies both of them.

"organism_common: fungi"
    Search for all fungi proteins.
    Searching for 'fungi' alone would also return hits where fungi is part of a longer string such as in fungicide or sinefungin.

    Search for the enzyme classification (xylose isomerase).
    Note that the PDB files contain slightly different strings related to enzyme classification, e.g.: 'E.C.', 'E.C.' and 'EC:'. Also, be careful in interpreting the results. For example, a search for '' returns not only the ''-hits but also entries with '' or ''.

"to be published"
    Search for all entries with the reference information 'to be published'.
    On May 3, 2006 this query returned 7052 hits.

"2005"
    Search for all occurrences of '2005'.
    These occurrences may include the year of publication, page numbers, a specific CCDC/PDB confirmation code in the refn sub-record and possibly further cases, for example the occurrence of '2005' in large page numbers. So, this search will return structures published in 2005 but also further entries. Getting only 2005 structures is not possible with the QuickSearch option. In any case you have to use double quotes. Otherwise, the search string is interpreted as a PDB ID.

" 1996"
    Search for all occurrences of ' 1996' (note the leading blank).
    The leading blank prevents hits where '1996' is part of a larger string, but it does not exclude cases where 1996 is a page number. As 1996 is not used as a code in the refn sub-record, and 1996 has apparently never been used as a page number (as of June 2006), this search very likely gives all structures with 1996 primary citations. Note that the number of these hits is not identical to the number of entries released in 1996. The latter quantity also includes cases with 'to be published' references and, possibly, entries that were already published before 1996.

" 1973"
    Search for all occurrences of ' 1973'.
    In addition to 1973 structures you will also get other entries, for example, the one with the line 'fragment: nonstructural protein ns5a (p56)(residues 1973- 2003 of swiss-prot sequence p27958)'.
    So, for a more reliable and specific search in references you have to use the (upcoming) AdvancedSearch option.
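
Several of the example queries rely on the fact that a partial match is a plain substring test, so blanks inside a quoted phrase can be used to narrow the result. The sketch below illustrates this with hypothetical field texts; it is not the JenaLib search code.

```python
def contains(term: str, text: str) -> bool:
    """Case-insensitive substring test, as in a partial match."""
    return term.lower() in text.lower()

# Hypothetical field texts.
texts = [
    "human renin in complex with an inhibitor",
    "prorenine",
    "kynurenine aminotransferase",
]

print([contains("renin", t) for t in texts])    # [True, True, True]
print([contains(" renin ", t) for t in texts])  # [True, False, False]

# One fragment covers both spellings used in the PDB.
print([contains("moglobin", t) for t in ("hemoglobin", "haemoglobin", "myoglobin")])
# [True, True, False]
```

In this sketch a padded term such as ' renin ' would miss a text that begins or ends with 'renin', because no blank is present there.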

(-) AdvancedSearch vs. QuickSearch

Certainly, an AdvancedSearch option can query the database in a more versatile and specific manner than a simple QuickSearch. Currently, we are working on a new AdvancedSearch option. An old version is still available, however. It is required, for example, if one wants to search for SCOP, SMART, Pfam or Gene Ontology terms.

Our experience is, however, that a large fraction of database queries can be handled satisfactorily by QuickSearch. Another advantage of the QuickSearch option is that it can identify entries in which terms from emerging developments occur in reference titles and keywords, for example, but have not yet made it into the more formalized PDB records. One example is 'solid state NMR', which does not yet appear in the Method record.