Overview

pdbsearch is a Python library for searching for PDB structures using the RCSB web services.

Nodes and Queries

The pdbsearch.search() function is useful for simple queries, but it has some limitations.

  1. If you pass multiple queries, they are always combined with a logical AND.

  2. You can only provide one value per argument - if you have multiple protein sequences, you can’t search for them all at once.

It can sometimes be useful to access the underlying node system that pdbsearch.search() is built on for more complex queries. This solves both of the above limitations.

Nodes

Each of the search services has a function for creating a single search node.

>>> # Full text search node
>>> node = pdbsearch.full_text_node(term="thymidine kinase")
>>>
>>> # Text search node
>>> node = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
>>>
>>> # Text chem search node
>>> node = pdbsearch.text_chem_node(chem_comp__formula_weight__lt=1000)
>>>
>>> # Sequence search nodes
>>> node = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
>>> node = pdbsearch.sequence_node(dna="ATGCATGCATGC", identity=0.95, evalue=1e-10)
>>> node = pdbsearch.sequence_node(rna="AUGCAUGCAUGC", identity=0.95, evalue=1e-10)
>>>
>>> # Sequence motif search node
>>> node = pdbsearch.seqmotif_node(protein="C-X-C-X(2)-[LIVMYFWC]", pattern_type="prosite")
>>>
>>> # Structure search node
>>> node = pdbsearch.structure_node("4HHB-1", operator="relaxed_shape_match")
>>>
>>> # Structure motif search node
>>> node = pdbsearch.strucmotif_node("4HHB", residues=(("A", 10), ("A", 20)), rmsd=0.5, exchanges={("A", 10): ["ASP"], ("A", 20): ["HIS"]})
>>>
>>> # Chemical search nodes
>>> node = pdbsearch.chemical_node(smiles="CC(C)C", match_type="graph-relaxed-stereo")
>>> node = pdbsearch.chemical_node(inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2", match_type="graph-relaxed-stereo")
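The text and text_chem keywords above pack an attribute path, an optional negation, and an optional operator into one double-underscore name. The sketch below shows one way such a keyword could be split apart. It is not pdbsearch's actual parser: parse_keyword, the SUFFIXES table, and the exact_match default are all assumptions made for illustration, and only the operator names on the right-hand side come from the RCSB Search API.

```python
# Hypothetical helper, not part of pdbsearch.
# Assumed mapping from keyword suffix to RCSB Search API operator name.
SUFFIXES = {
    "lt": "less",
    "lte": "less_or_equal",
    "gt": "greater",
    "gte": "greater_or_equal",
    "contains": "contains_phrase",
}


def parse_keyword(keyword, value):
    """Split a double-underscore keyword into attribute, operator and negation."""
    parts = keyword.split("__")
    operator = "exact_match"  # assumed default when no suffix is present
    if len(parts) > 1 and parts[-1] in SUFFIXES:
        operator = SUFFIXES[parts.pop()]
    negation = False
    if len(parts) > 1 and parts[-1] == "not":
        parts.pop()
        negation = True
    return {
        "attribute": ".".join(parts),  # remaining parts form the dotted path
        "operator": operator,
        "negation": negation,
        "value": value,
    }


print(parse_keyword("chem_comp__formula_weight__lt", 1000))
print(parse_keyword("pdbx_struct_assembly__details__not__contains", "good"))
```

Single underscores inside a part (as in chem_comp) are left alone, so only the double underscore acts as a separator.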

You can execute any of these nodes individually using its query() method. This takes a return_type parameter, as well as all of the request option parameters.

>>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
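To make the role of return_type and the request options more concrete, here is a rough sketch of the kind of JSON body the RCSB Search API accepts for the call above. The build_request function is hypothetical and is not how pdbsearch is known to work internally. Reading a leading "-" in sort as descending order is an assumption, as is the mapping of return_all to return_all_hits.

```python
import json


def build_request(query, return_type, return_all=False, sort=None):
    """Assemble a Search API style request body (illustrative only)."""
    options = {}
    if return_all:
        options["return_all_hits"] = True
    if sort:
        # Assumption: a leading "-" requests descending order.
        descending = sort.startswith("-")
        options["sort"] = [{
            "sort_by": sort[1:] if descending else sort,
            "direction": "desc" if descending else "asc",
        }]
    request = {"query": query, "return_type": return_type}
    if options:
        request["request_options"] = options
    return request


# A full text terminal node in the Search API's JSON form.
full_text = {
    "type": "terminal",
    "service": "full_text",
    "parameters": {"value": "thymidine kinase"},
}

request = build_request(
    full_text, "entry", return_all=True, sort="-rcsb_accession_info.deposit_date"
)
print(json.dumps(request, indent=2))
```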

Combining Nodes

All node objects have and_ and or_ methods, which can be used to combine them with other nodes.

>>> node1 = pdbsearch.full_text_node(term="thymidine kinase")
>>> node2 = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
>>> node3 = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
>>> node4 = pdbsearch.sequence_node(dna="ATGCATGCATGC", identity=0.95, evalue=1e-10)
>>> node5 = pdbsearch.sequence_node(rna="AUGCAUGCAUGC", identity=0.95, evalue=1e-10)
>>> node = node1.and_(node2).or_(node3.and_(node4.or_(node5)))
>>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
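The chained call above builds a tree: each and_ or or_ wraps two nodes in a group. The following sketch mimics that behaviour with a hypothetical SketchNode class, using the group node shape from the RCSB Search API ("type", "logical_operator", "nodes"). The class, the terminal helper, and the "label" key are illustrative stand-ins, not pdbsearch internals.

```python
import json


class SketchNode:
    """Hypothetical stand-in for a pdbsearch node; holds one query fragment."""

    def __init__(self, body):
        self.body = body

    def _group(self, other, operator):
        # A group node wraps its two children under one logical operator.
        return SketchNode({
            "type": "group",
            "logical_operator": operator,
            "nodes": [self.body, other.body],
        })

    def and_(self, other):
        return self._group(other, "and")

    def or_(self, other):
        return self._group(other, "or")


def terminal(label):
    # Placeholder terminal node; real nodes carry a service and parameters.
    return SketchNode({"type": "terminal", "label": label})


node1, node2, node3, node4, node5 = (terminal(f"node{i}") for i in range(1, 6))
combined = node1.and_(node2).or_(node3.and_(node4.or_(node5)))
print(json.dumps(combined.body, indent=2))
```

In this sketch each method returns a new object rather than modifying the node it was called on, so the original nodes can be reused in other combinations.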

Schemas

The text and text_chem services have a schema that defines the attributes you can search on. These can be read here and here respectively.

They are also available as JSON schema objects, here and here respectively. This matters for two reasons: pdbsearch sometimes needs an attribute's type to decide which operator to use, and it needs to know which parameter names belong to which service when parsing a pdbsearch.search() function call.
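As a rough illustration of both uses of the schema, the sketch below looks up an attribute to find the service it belongs to and a type-appropriate default operator. The two schema entries, their types, the DEFAULT_OPERATOR table, and the resolve function are assumptions for illustration; they do not reproduce pdbsearch's stored schema or its actual logic.

```python
# Illustrative, hand-written entries; the real schemas hold many more attributes.
SIMPLIFIED_SCHEMA = {
    "text": {"pdbx_struct_assembly.details": "string"},
    "text_chem": {"chem_comp.formula_weight": "number"},
}

# Assumed defaults when a keyword carries no explicit operator suffix.
# The operator names are RCSB Search API operators.
DEFAULT_OPERATOR = {
    "string": "exact_match",
    "number": "equals",
    "integer": "equals",
}


def resolve(attribute):
    """Return (service, default_operator) for a known attribute."""
    for service, attributes in SIMPLIFIED_SCHEMA.items():
        if attribute in attributes:
            return service, DEFAULT_OPERATOR[attributes[attribute]]
    raise KeyError(f"unknown attribute: {attribute}")


print(resolve("chem_comp.formula_weight"))
print(resolve("pdbx_struct_assembly.details"))
```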

Because pdbsearch depends on this information, a simplified form of the schema (all attributes, but only the information about them that pdbsearch needs) is hardcoded into the library. To ensure the library always uses the most up-to-date information, it tries to update its own local copy of the schema from the RCSB API when the library is imported.

This can be disabled by setting the PDBSEARCH_NO_UPDATE environment variable.
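Since the update runs at import time, the variable has to be in place before pdbsearch is imported. A minimal sketch of doing this from Python follows; the value "1" is an assumption, as the text above only says the variable must be set.

```python
import os

# Assumption: any non-empty value works; "1" is a placeholder.
os.environ["PDBSEARCH_NO_UPDATE"] = "1"

# Set the variable before this import, since the update runs at import time.
try:
    import pdbsearch
except ImportError:
    pdbsearch = None  # library not installed in this environment
```

Setting the variable in the shell or process environment before starting Python achieves the same thing without touching the code.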

You can also download the full schema using a CLI command:

pdbsearch schema > schema.json
pdbsearch schema --chemical --indent 4 > chemical_schema.json

The downloaded schema information is cached locally, so that pdbsearch doesn’t fetch the schema every time it runs. To delete this local cache, you can run:

pdbsearch clearschema