The Cambridge Crystallographic Data Centre (CCDC).

How does the similarity search work?

Solution

The similarity search in WebCSD is based on molecular fingerprints, which are calculated from chemical features of the molecule such as atom types, bond types and bonded paths through the molecule. When a molecule is drawn in the similarity sketcher, its fingerprint is calculated and then compared with the pre-calculated fingerprints of all the structures in the CSD.

Fingerprints are compared using a simple mathematical calculation, the Tanimoto coefficient, which operates on the two molecular fingerprints as strings of binary variables. The calculation yields a single coefficient that measures the similarity between the molecules based on their fingerprints. The similarity value lies in the range 0 to 1, where 0 means completely dissimilar and 1 means identical, always in terms of fingerprints (i.e. a similarity value of 1 does not always mean that the structures are identical).
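As an illustration only, the Tanimoto calculation can be sketched in Python. This is a minimal sketch assuming the standard definition of the coefficient (bits set in both fingerprints divided by bits set in either) and assuming each fingerprint is stored as a Python integer. The function name `tanimoto` and the 8-bit example fingerprints are hypothetical and do not reflect how WebCSD actually stores or computes its fingerprints.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two binary fingerprints held as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    if total == 0:
        return 0.0  # convention chosen for this sketch: two empty fingerprints
    return common / total


# Hypothetical 8-bit fingerprints; real fingerprints are much longer.
query = 0b10110110
entry = 0b10110011

score = tanimoto(query, entry)       # 4 shared bits out of 6 set overall
self_score = tanimoto(query, query)  # identical fingerprints give 1.0
print(round(score, 3), self_score)
```

Because the score depends only on the bit strings, two different molecules that happen to produce the same fingerprint would also score 1, which is why a value of 1 does not guarantee identical structures.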

To produce a manageable set of similar structures, a cut-off value is applied to the similarity coefficient, and matches with values below the cut-off are discarded. The default cut-off for the Tanimoto coefficient is 0.7.
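The effect of the cut-off can be sketched as a simple filter over pre-calculated fingerprints. This is a hedged illustration, not the WebCSD implementation: the function `similarity_search`, the dictionary layout and the entry names are all hypothetical, and the fingerprints are toy 8-bit values. Only the default of 0.7 is taken from the text above.

```python
DEFAULT_CUTOFF = 0.7  # default Tanimoto cut-off quoted in the text


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Standard Tanimoto coefficient for fingerprints held as Python ints."""
    total = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / total if total else 0.0


def similarity_search(query_fp, database, cutoff=DEFAULT_CUTOFF):
    """Return (name, score) pairs at or above the cut-off, best match first."""
    hits = []
    for name, fp in database.items():
        score = tanimoto(query_fp, fp)
        if score >= cutoff:  # matches below the cut-off are discarded
            hits.append((name, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical pre-calculated fingerprints keyed by made-up entry names.
database = {
    "ENTRY_A": 0b11110000,
    "ENTRY_B": 0b11111000,
    "ENTRY_C": 0b00001111,
}

hits = similarity_search(0b11110000, database)
for name, score in hits:
    print(name, round(score, 2))
```

Raising the cut-off (for example `cutoff=0.9`) shrinks the hit list, while lowering it returns more, and less similar, structures.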

N.B. WebCSD v1 also provided a second option for calculating fingerprint-based molecular similarity: the Dice coefficient, for which the cut-off value was 0.975. The two coefficients are not directly comparable, so similarity values calculated with one cannot be compared quantitatively with values calculated with the other.
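To see why values from the two coefficients should not be mixed, the sketch below scores one pair of fingerprints both ways. It assumes the standard textbook definitions of the Tanimoto and Dice coefficients for binary fingerprints; whether WebCSD v1 used exactly these forms is an assumption, and the function names and toy fingerprints are hypothetical.

```python
def bit_count(fp: int) -> int:
    """Number of bits set in a fingerprint held as a Python int."""
    return bin(fp).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Shared bits divided by bits set in either fingerprint."""
    total = bit_count(fp_a | fp_b)
    return bit_count(fp_a & fp_b) / total if total else 0.0


def dice(fp_a: int, fp_b: int) -> float:
    """Twice the shared bits divided by the sum of both bit counts."""
    total = bit_count(fp_a) + bit_count(fp_b)
    return 2 * bit_count(fp_a & fp_b) / total if total else 0.0


# The same hypothetical pair of 8-bit fingerprints scored with each coefficient.
query = 0b10110110
entry = 0b10110011

t_score = tanimoto(query, entry)
d_score = dice(query, entry)
print(round(t_score, 3), round(d_score, 3))
```

The same pair receives a different number from each coefficient, so a score or cut-off is only meaningful alongside the coefficient that produced it.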