What are the strengths/weaknesses of similarity searching in WebCSD?
All fingerprint-based methods of similarity searching have strengths and weaknesses that are inherent in how the fingerprints are defined. The fingerprints used by WebCSD for similarity searching are created using atom types, bond types and bonded paths through the molecules. Because of this definition, a search will tend to find matches that contain closely related scaffolds. There are, however, a number of weaknesses associated with the fingerprints and similarity calculations as they are currently implemented.
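To make the idea of a path-based fingerprint concrete, here is a minimal Python sketch. It is not the WebCSD implementation: the function names `path_features` and `tanimoto`, the use of plain feature sets rather than hashed bit strings, the omission of hydrogen atoms, the path-length limit and the choice of the Tanimoto coefficient as the similarity measure are all assumptions made for illustration.

```python
def path_features(atoms, bonds, max_atoms=4):
    """Return the set of bonded paths (as tuples) with up to max_atoms atoms.

    atoms: list of element symbols, index = atom id (hydrogens omitted)
    bonds: dict mapping (i, j) index pairs to a bond symbol such as "-" or "="
    Hypothetical toy fingerprint, not the WebCSD definition.
    """
    # Adjacency list so that each path can be extended one atom at a time.
    neighbours = {i: [] for i in range(len(atoms))}
    for (i, j), symbol in bonds.items():
        neighbours[i].append((j, symbol))
        neighbours[j].append((i, symbol))

    features = set()

    def extend(path, label):
        # Store each path in a direction-independent form.
        features.add(min(label, label[::-1]))
        if len(path) == max_atoms:
            return
        for nxt, symbol in neighbours[path[-1]]:
            if nxt not in path:  # simple paths only, no atom revisited
                extend(path + [nxt], label + (symbol, atoms[nxt]))

    for start in range(len(atoms)):
        extend([start], (atoms[start],))
    return features


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: shared / total distinct."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Heavy-atom graphs of three small molecules.
ethanol = path_features(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"})
propanol = path_features(["C", "C", "C", "O"],
                         {(0, 1): "-", (1, 2): "-", (2, 3): "-"})
dimethyl_ether = path_features(["C", "O", "C"], {(0, 1): "-", (1, 2): "-"})

print(tanimoto(ethanol, propanol))        # shares 5 of 7 distinct paths
print(tanimoto(ethanol, dimethyl_ether))  # shares 3 of 6 distinct paths
```

In this toy model ethanol scores higher against propan-1-ol than against dimethyl ether, because the two alcohols share more bonded paths. This is the sense in which a path-based fingerprint favours closely related scaffolds.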
The first issue is that although the bond types are compared, cyclicity is not explicitly taken into account within the fingerprints. This means that cyclohexane will be indistinguishable from hexane in a similarity search. Second, molecules that contain fewer atoms are less well defined by their fingerprints and are therefore more prone to low similarity scores. Finally, no information is stored about chemically related elements, such as transition metals. This means that closely related metal complexes, for example, may not be listed with high similarity coefficients.
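The weaknesses described above can be reproduced with a small, self-contained Python sketch. As before, this is an illustration under stated assumptions rather than the WebCSD algorithm: the helper names `paths` and `tanimoto` are hypothetical, hydrogens are omitted, all bonds are treated as single bonds, fingerprints are plain sets of paths, and the Tanimoto coefficient is assumed as the similarity measure.

```python
def paths(atoms, bonds, max_atoms=6):
    """Toy fingerprint: set of atom paths, with no ring information stored."""
    nbrs = {i: [] for i in range(len(atoms))}
    for i, j in bonds:  # every bond is treated as a single bond here
        nbrs[i].append(j)
        nbrs[j].append(i)
    found = set()
    stack = [[i] for i in range(len(atoms))]
    while stack:
        p = stack.pop()
        label = tuple(atoms[k] for k in p)
        found.add(min(label, label[::-1]))  # direction-independent form
        if len(p) < max_atoms:
            stack.extend(p + [n] for n in nbrs[p[-1]] if n not in p)
    return found


def tanimoto(a, b):
    """Shared features divided by total distinct features."""
    return len(a & b) / len(a | b)


def chain(symbols):
    """Fingerprint of an unbranched chain of the given atoms."""
    return paths(symbols, [(i, i + 1) for i in range(len(symbols) - 1)])


# 1. Cyclicity: the ring and the chain yield the same set of paths.
hexane = chain(["C"] * 6)
cyclohexane = paths(["C"] * 6, [(i, (i + 1) % 6) for i in range(6)])
ring_score = tanimoto(hexane, cyclohexane)

# 2. Size: the same O -> N substitution costs a small molecule more.
small_score = tanimoto(chain(["C", "O"]), chain(["C", "N"]))
large_score = tanimoto(chain(["C"] * 6 + ["O"]), chain(["C"] * 6 + ["N"]))

print(ring_score)                 # 1.0 in this toy model
print(small_score, large_score)   # 0.2 versus about 0.33
```

The third weakness follows from the same mechanism: element symbols are compared as unrelated labels, so swapping one transition metal for a chemically similar one removes every path that passes through the metal, just as the O to N swap does here.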
For further information about the similarity search calculation, see the following open access publication: Thomas et al., 2010, J. Appl. Cryst., 43, 362-366.