Lilien Lab
Department of Computer Science
Centre for Cellular and Biomolecular Research
University of Toronto

The database is constructed using the following steps:
  1. The latest PDB release is filtered using the annotations provided by the PDBsum database. Complexes are selected if they meet all of the following criteria: the structure was determined by X-ray crystallography at a resolution better than 2 Å; at least one chain is longer than 50 amino acids; the bound ligand is correctly annotated as HETATM.
  2. Remove ligands with fewer than 7 (or 13, depending on the list) heavy atoms or with a molecular weight greater than 800 Da.
  3. Remove ligands that are covalently bound to the target protein. If a complex has multiple ligands, retain only the non-covalently bound ones.
  4. Calculate the sequence similarity between all pairs of proteins (using bl2seq). Calculate the Tanimoto coefficient of the Daylight fingerprints between all pairs of ligands.
  5. Construct similarity matrices (protein vs. protein and ligand vs. ligand) and convert them into binary matrices using a 25% sequence similarity threshold for proteins and a 0.85 Tanimoto coefficient threshold for ligands.
  6. Multiply the protein and ligand similarity matrices (dot product).
  7. Construct a graph in which every node corresponds to a complex, and an edge connects two nodes if the similarity between the corresponding complexes is higher than the threshold (i.e. there is a '1' in the corresponding entry of the binary similarity matrix).
  8. Select a maximal set of nodes (complexes) such that no two nodes in the set have a connecting edge. That is, select the maximal set of complexes where no two selected complexes are considered similar.
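
Steps 5 to 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the database: the function names are hypothetical, the multiplication in Step 6 is assumed to be element-wise (so that the result is still a binary matrix, as Step 7 expects), values equal to a threshold are assumed to count as similar, and the selection in Step 8 uses a simple greedy heuristic, which returns a maximal set but not necessarily the largest possible one.

```python
def binarize(sim, threshold):
    """Step 5: 1 where the similarity reaches the threshold, else 0.
    Assumption: a value equal to the threshold counts as similar."""
    return [[1 if value >= threshold else 0 for value in row] for row in sim]


def complex_similarity(protein_bin, ligand_bin):
    """Step 6 (assumed element-wise): two complexes are similar only if
    both their proteins and their ligands are similar."""
    return [[p * l for p, l in zip(p_row, l_row)]
            for p_row, l_row in zip(protein_bin, ligand_bin)]


def greedy_independent_set(adj):
    """Steps 7-8: treat adj as a graph and greedily pick nodes so that
    no two picked nodes share an edge (hypothetical heuristic)."""
    remaining = set(range(len(adj)))
    selected = []
    while remaining:
        def degree(i):
            # Number of still-available neighbours; the diagonal is ignored.
            return sum(adj[i][j] for j in remaining if j != i)

        # Pick the node with the fewest remaining neighbours (ties: lowest index).
        node = min(remaining, key=lambda i: (degree(i), i))
        selected.append(node)
        neighbours = {j for j in remaining if j != node and adj[node][j]}
        remaining -= neighbours | {node}
    return sorted(selected)


# Made-up similarity values for four complexes (illustrative only).
protein_sim = [
    [100, 40, 10, 30],
    [40, 100, 15, 20],
    [10, 15, 100, 60],
    [30, 20, 60, 100],
]
ligand_sim = [
    [1.0, 0.9, 0.2, 0.95],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.5],
    [0.95, 0.1, 0.5, 1.0],
]

adjacency = complex_similarity(binarize(protein_sim, 25), binarize(ligand_sim, 0.85))
non_redundant = greedy_independent_set(adjacency)
print(non_redundant)
```

In this toy example complex 0 is similar to complexes 1 and 3 in both protein and ligand, so it is dropped and the other three are kept. Complexes 2 and 3 have similar proteins but dissimilar ligands, so both remain in the set.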
On the Downloads page we provide non-redundant lists that combine the following parameters:
  • Minimum number of heavy atoms: 7 or 13 (see Step 2).
  • Protein sequence identity: 25%, 50% or 90% (see Step 5).
  • Tanimoto coefficient of fingerprint similarity: 0.85 or 0.7 (see Step 5).
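
As a quick illustration of how these parameters combine, the snippet below enumerates every combination. It assumes that a list is offered for each combination, which would give 2 x 3 x 2 = 12 lists; the variable names are hypothetical and do not reflect the actual file names on the Downloads page.

```python
from itertools import product

MIN_HEAVY_ATOMS = (7, 13)          # Step 2
SEQ_IDENTITY_PERCENT = (25, 50, 90)  # Step 5, proteins
TANIMOTO_THRESHOLD = (0.85, 0.7)     # Step 5, ligands

# One tuple per parameter combination (assumed: all combinations exist).
combinations = list(product(MIN_HEAVY_ATOMS, SEQ_IDENTITY_PERCENT, TANIMOTO_THRESHOLD))

for atoms, identity, tanimoto in combinations:
    print(f"heavy atoms >= {atoms}, identity {identity}%, Tanimoto {tanimoto}")
```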
Additional files: We separate all complexes used for the non-redundant selection (Step 8 above) into two files. One file contains the protein structure without the bound ligands, in PDB format. The second file contains all non-covalently bound ligands, in SDF format.
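
For readers who want to produce a ligand-free protein file from their own PDB entry, a rough sketch follows. It is not the procedure used to generate the files above: the function name is hypothetical, and it simply drops every HETATM record (and the CONECT records that describe hetero-atom bonds), which also removes waters, ions and modified residues. The two-atom sample is made up for illustration.

```python
def strip_hetero_records(pdb_text):
    """Return PDB text without HETATM and CONECT records.
    Assumption: all hetero atoms are to be removed, not only the ligand."""
    kept = [line for line in pdb_text.splitlines()
            if not line.startswith(("HETATM", "CONECT"))]
    return "\n".join(kept) + "\n"


# Made-up fragment: one protein atom, one ligand atom.
sample = (
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "HETATM  900  C1  LIG A 301      15.000  10.000   5.000  1.00 30.00           C\n"
    "CONECT  900  901\n"
    "END\n"
)

protein_only = strip_hetero_records(sample)
print(protein_only)
```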