this did come as a bit of a surprise to me, but after looking at it a bit I think I am clear as to how it happened.
The symmetry operator returned by the substructure search returns a composite of the unit cell translations and the underlying symmetry operator. In this instance it represents the symmetry operator 1/3+y,2/3-x+y,2/3+z with a translation of (0, 0, 1). When putting the translation component in to the operator we end up with 1 + 2/3 +z, which is rounded to 1.66667. When applying the symetry operator to the crystal to create the symmetric_molecule, a check is made that the operator is one of the symmetry operators of the crystal, and 0.66667 is not exactly equal to 1/3.
The solution here is very simple: you can force the symmetry operator to be applied:
>>> crystal = hits.crystal
>>> symmop = hits.match_symmetry_operators()[-1]
>>> symmetric_molecule = crystal.symmetric_molecule(symmop, force=True)
I shall consider whether there may be a better representation of symmetry operators, with the translation kept separate from the operator for some future release.
Thank you for raising this interesting case.
this occurs because the search is performed over the all molecules of the crystal as can be seen by inspecting the symmetry operators of the match:
>>> print hits.match_symmetry_operators()[-1]
The atoms provided by match_atoms() are of the single, central molecule of the crystal. I do think that this is not ideal, and I would like to provide some method for capturing the actually matched atoms in some future release.
In the meantime you will can get the matched atoms in a rather roundabout way:
>>> sym_copy = hits.crystal.symmetric_molecule(hits.match_symmetry_operators()[-1])
>>> sym_Y = sym_copy.atom(Y.label)
>>> print ccdc.descriptors.MolecularDescriptors.atom_distance (X, sym_Y)
This will work as long as the atom labels of the molecule are distinct. If not the match will have to be done on the index of the atoms:
>>> sym_Y = [a for a in sym_copy.atoms if a.index == Y.index]
I think you may have put ConQuest as part of your startup system, so it gets started when you log in. If this is the case, a colleague suggests starting your Mac in safe mode, then removing ConQuest from the startup folder. You can then run ConQuest when you want to, rather than automatically when you log in.
Please let me know if this works.
I'm glad you've got it all sorted out.
You'll be pleased to hear that the forthcoming release of the CSD (due later this month) will contain ADPs for many structures - around a quarter of them. These will be accessible through the API as properties of the atoms of a molecule.
I worked out the SMARTS pattern by hand, since I've had a reasonable amount of experience with Daylight SMARTS. The best reference for it is:
There is a sketcher which I believe will display SMARTS, though I've not used it:
It appears to support setting variable atom and bond types.
I'm afraid the CSD Python API does not support pickling of classes and instances. Underneath the python is a set of bindings to native C++ objects and these cannot currently be serialised. If I wish to save the results of a CSD search I will normally save the identifiers in a GCD file:
with open('search_hits.gcd', 'w') as writer:
writer.write('\n'.join(h.identifier for h in hits))
Then if I wish to get the matching atoms, I will search the GCD file:
hits = searcher.search('search_hits.gcd')
Alternatively, you could use a CSV file, to record the identifier and atom numbers of each hit.
I think this is because the filters which we use to screen out structures which cannot match the query are not doing a particularly good job here, since the query structure contains nothing but carbon and hydrogen. This means a lot of graph searching needs to be done. I can speed up the search quite a lot - approximately seven-fold - by recasting the query using fewer atoms and some extra constraints:
s = SubstructureSearch()
cps_substructure = QuerySubstructure()
c1 = cps_substructure.add_atom('C')
c2 = cps_substructure.add_atom('C')
c3 = cps_substructure.add_atom('C')
c4 = cps_substructure.add_atom('C')
c5 = cps_substructure.add_atom('C')
b1 = cps_substructure.add_bond(QueryBond(), c1, c2)
b2 = cps_substructure.add_bond(QueryBond(), c2, c3)
b3 = cps_substructure.add_bond(QueryBond(), c3, c4)
b4 = cps_substructure.add_bond(QueryBond(), c4, c5)
b5 = cps_substructure.add_bond(QueryBond(), c5, c1)
b1.cyclic = b2.cyclic = b3.cyclic = b4.cyclic = b5.cyclic = True
c11 = cps_substructure.add_atom('C')
c12 = cps_substructure.add_atom('C')
c13 = cps_substructure.add_atom('C')
c14 = cps_substructure.add_atom('C')
c15 = cps_substructure.add_atom('C')
c11.num_hydrogens = c12.num_hydrogens = c13.num_hydrogens = c14.num_hydrogens = c15.num_hydrogens = 3
b11 = cps_substructure.add_bond('Single', c1, c11)
b12 = cps_substructure.add_bond('Single', c2, c12)
b13 = cps_substructure.add_bond('Single', c3, c13)
b14 = cps_substructure.add_bond('Single', c4, c14)
b15 = cps_substructure.add_bond('Single', c5, c15)
Slightly faster still is to use a SMARTS query:
smarts = SMARTSSubstructure('[#6]1(-[CH3])~[#6](-[CH3])~[#6](-[CH3])~[#6](-[CH3])~[#6](-[CH3])1')
I'm afraid in my last post I omitted the essential step of normalising hydrogen positions before testing the hits. This should be done by:
for h in hits:
before testing the distance.
Sorry about this.
I'm afraid the API does not currently support this feature. If your search has constraints or measurements which depend on hydrogen positions, these will be taken from the structure without applying hydrogen position normalisation, so may well not find hits of interest. If, on the other hand, your search is independent of the hydrogen bond positions, then the results of the search may be normalised after the search, when measurements may be taken, or constraints applied.
For example, you could perform a search with rather looser distance constraints to a hydrogen, then for each hit, normalise the molecule's hydrogens, and test with a more accurate measurement.
Here is a worked example, searching for COOH dimers:
from ccdc import search
searcher = search.SubstructureSearch()
query = search.SubstructureQuery()
sub1 = searcher.add_substructure(search.SMARTSSubstructure('C(=O)-O-H'))
sub2 = searcher.add_substructure(search.SMARTSSubstructure('C(=O)-O-H'))
Normally we would search for the hydrogen bonds using a vdw corrected distance of (-5, 0) but if we do not trust the positions of the hydrogens we can make the range larger:
searcher.add_distance_constraint('Dist1', sub1, 1, sub2, 3, (-5, 5))
searcher.add_distance_constraint('Dist1', sub2, 1, sub1, 3, (-5, 5))
hits = searcher.search(max_hit_structures=10, max_hits_per_structure=1)
This gives 10 hits.
[h.identifier for h in hits]
[u'AADAMC', u'ABADIS', u'ABADIS02', u'ABAFER', u'ABEKUO', u'ABENEB', u'ABEQUU', u'ABEXUD', u'ABIDAR', u'ABIDAR01']
Now we can filter using the correct distances:
from ccdc.descriptors import MolecularDescriptors
atoms = h.match_atoms()
o1, h2 = atoms, atoms
o2, h1 = atoms, atoms
MolecularDescriptors.atom_distance(o1, h2) - o1.vdw_radius - h2.vdw_radius < 0 and
MolecularDescriptors.atom_distance(o2, h1) - o2.vdw_radius - h1.vdw_radius < 0
hits = [h for h in hits if correct_distance(h)]
And now we have 6 hits:
[h.identifier for h in hits]
[u'AADAMC', u'ABADIS', u'ABEKUO', u'ABENEB', u'ABEQUU', u'ABEXUD']
I hope this is helpful.
I'm afraid the API does not support the 'Any' keyword to mean any bond. Instead this becomes a specification for an explicit 'Unknown' bond. This is an error, and will be rectified in the next release of the API. Instead, use the form
cps_substructure.add_bond(QueryBond(), c1, c2)
which will do the right search.
the simplest way to do this is to use the attribute 'heaviest_component'. This will work as long as the solvent molecules have molecular weights lower than that of the component of interest.
All discrete components of the molecule are accessible via the attribute 'components'. These will be sorted by molecular weight, so can be inspected to find the structure of interest.
Hope this is clear and helpful.