I've recently tackled a similar problem, but I was only interested in two-component structures in the CSD. The approach I followed made use of the SubstructureSearch functionality and is summarised below:
- Search the CSD to find all two-component structures that are organic and have 3D coordinates, and store the resulting refcodes in a .gcd file for future use
- Split each entry into its heaviest component and its smallest component
- Create and save a Substructure Screen of all heaviest components. Please note that this functionality will be available with the November release of the CSD Python API; I used it because it helped to speed up the search.
- Create a dictionary of the smallest components (key: CSD identifier, value: Molecule object of the component). A Substructure Screen can be used here, too.
- Perform the substructure search:
- Start by using the heaviest component as a query
- Loop over all the “heavy_hits” and check that the smallest component is the same as the one in the query. I used the dictionary created above in this step.
N.B. I check that query and hit have the same number of heavy atoms. However, it’s likely that you will get false positives, as in some cases query and hit differ in stereochemistry rather than in the number of hydrogen atoms.
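To make the matching step more concrete, here is a minimal sketch of the loop in plain Python. It does not call the CSD Python API: `Component`, `matching_entries`, the identifiers and the labels are all hypothetical stand-ins. In a real script the components would be Molecule objects, `heavy_hits` would come from the substructure search on the heaviest component, and the label comparison would be replaced by a proper structural comparison of the smallest components. Applying the heavy-atom check to the heaviest component is also an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for a component. With the CSD Python API this would be
# a Molecule object; `label` is only a placeholder for a structural comparison.
Component = namedtuple("Component", ["label", "n_heavy_atoms"])


def matching_entries(query_id, heavy_hits, heavy_components, small_components):
    """Return identifiers of hits whose two components match those of query_id.

    heavy_hits is assumed to be the list of identifiers returned by a
    substructure search that used the heaviest component of query_id as query.
    """
    query_heavy = heavy_components[query_id]
    query_small = small_components[query_id]
    matches = []
    for hit_id in heavy_hits:
        if hit_id == query_id:
            continue  # the query entry always hits itself
        # A substructure hit can be a larger molecule that merely contains
        # the query, so require the same number of heavy atoms.
        if heavy_components[hit_id].n_heavy_atoms != query_heavy.n_heavy_atoms:
            continue
        # Look up the smallest component of the hit in the dictionary and
        # compare it with the smallest component of the query.
        if small_components[hit_id].label == query_small.label:
            matches.append(hit_id)
    return matches


# Made-up identifiers and components, for illustration only.
heavy_components = {
    "QUERY01": Component("host", 14),
    "HIT0001": Component("host", 14),
    "HIT0002": Component("host", 14),
    "HIT0003": Component("larger_host", 16),
}
small_components = {
    "QUERY01": Component("water", 1),
    "HIT0001": Component("water", 1),
    "HIT0002": Component("methanol", 2),
    "HIT0003": Component("water", 1),
}
heavy_hits = ["QUERY01", "HIT0001", "HIT0002", "HIT0003"]

print(matching_entries("QUERY01", heavy_hits, heavy_components, small_components))
# prints ['HIT0001']
```

Here HIT0002 is rejected because its smallest component differs, and HIT0003 because its heaviest component has more heavy atoms than the query. Note that a stereoisomer of the query would still pass both checks in this sketch, which is the false-positive case mentioned above.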
I hope that helps.