I currently work on identifying the dimensionality of crystal structures using python. The paper (http://pubs.acs.org/doi/full/10.1021/acs.nanolett.6b05229) introduces an algorithm that can screen and identify 1D, 2D, 3D structures. One of the important parameters needed for this algorithm is to find covalently bonded clusters in crystal structures. I already have the covalent bond length threshold needed to determine whether a pair of atoms is bonded or not for my structures of interest. I wonder whether there is a built-in function in CSD python API in helping to find covalently bonded clusters and identify the number of atoms in the clusters or even dimensionality if possible? Thanks a lot!

 

Xiwen

Dear Xiwen,

I'm not entirely sure what you mean by 'covalently bonded clusters', so forgive me if I have gone down the wrong track in what follows, and please post some explanation of the term.

In the API we have a number of methods for finding covalently bonded structures from a crystal.  For example,

from ccdc import io
csd = io.EntryReader('csd')
crystal = csd.crystal('ABACUF')
molecule = crystal.molecule

Here the molecule derived from the crystal is all covalently bonded, but in the presence of solvents, ions or co-crystals the resulting molecule will form more than one discretely bonded structure.  These can be separated using the molecule's component attribute:

components = molecule.components
heaviest = components[0]

It is also possible to extract the moiety lying in the asymetric unit, i.e the minimal representation of the crystal:

asymm_mol = crystal.asymmetric_unit_molecule

We can build larger representations of the molecule by applying the crystal's symmetry operators:

from ccdc.molecule import Molecule
shell = Molecule('SHELL')
for symmop in crystal.symmetry_operators:
    shell.add_molecule(crystal.symmetric_molecule(symmop))

And again, if these form non-covalently bonded structures, the heaviest component can be extracted from this shell molecule:

heaviest_of_shell = shell.heaviest_component

The bonds of these various molecules have been deduced from the crystallographic data and are generally considered to be very good.  If, however, you wish to use a bespoke distance range to define your clusters, then you will have to analyse the lengths of atom-atom distances according to your criterion.  If this is what you wish to do, please post again and I will post an example for you to work with.

I hope this is helpful.

Best wishes

Richard

Hi Richard:

      Thanks for replying! I am glad that API has built-in methods for finding covalently bonded structures. 

      I am sorry that I did not define 'covalently bonded structures' clear enough when I posted the question. Strongly bonded pairs of atoms are identified if the interatomic distance of the pair is less than the sum of their reference atomic bond lengths plus a tolerance of 0.45 Å to account for bond length variability. The reference values of bond radii can be found on page 11 of the link: http://pubs.acs.org/doi/suppl/10.1021/acs.nanolett.6b05229/suppl_file/nl6b05229_si_001.pdf

Therefore, I wish to use a bespoken distance range to define my clusters. In your last post, I tried the method---components and it was great. I am not interested in the whole molecule of the imported crystal but only on the inorganic layers. For example, in my case, I am interested in vanadium tellurite structures and therefore I only want V, O, Te ----these 3 types of atoms to be imported when the components function is used. 

In addition, I am indeed interested in building larger representations the molecule. I tried shell.add_molecule(crystal.symmetric_molecule(symbol))  method suggested by your post. However, I am little bit confused about the line: shell=Molecule('SHELL'). Molecule() doesn't seem to be defined on the API documentation. Also, is the line connected with crystal=csd.crystal('ABACUF') by any means? It'll be great if you can explain a bit about the line. 

Also, I am interest in finding covalently bonded structures in 2*2*2 supercell. The code that I have written is:

expansions=[

cif.crystal.symmetric_molecule(symmop,(i,j,k))

for symmop in cif.crystal.symmetry_operators

for i in range(0,2)

for j in range(0,2)

for k in range(0,2)

]

However, expansions is a list and therefore the method---components does not seem to be able to apply on. Therefore, I wonder what method is available for finding covalently bonded clusters on supercell.

Thanks so much! 

 

Best,

Xiwen

 

 Hi Xiwen,

I think the approach I would take here is to identify the atoms of interest in the unit cell, remove those not of interest and place the molecule back into the crystal:

For definiteness, I'll write an example from the CSD refcode AXOHOL, which is a Vanadium-Tellurium-Oxygen polymeric structure.  I've put extensive comments on the lines of code, so I hope everything will be clear.

from ccdc import io # import the input-output module
csd = io.EntryReader('csd') # This is the whole, current CSD
axohol = csd.crystal('AXOHOL') # This pulls out the crystal given its refcode.  

unit_mol = axohol.molecule
heaviest = unit_mol.heaviest_component # This strips out the solvent molecule

axohol.molecule = heaviest # Replace the crystal's molecule with the desolvated molecule

Now we can make the symmetry expanded representation:

from ccdc import molecule
shell = molecule.Molecule('SHELL') # Create a new, empty molecule

expansions = [
    axohol.symmetric_molecule(symmop,(i,j,k))
    for symmop in axohol.symmetry_operators
    for i in range(0,2)
    for j in range(0,2)
    for k in range(0,2)
] # This is exactly as you wrote

Now add the molecules to the shell:

for mol in expansions:
    shell.add_molecule(mol)

Now the shell contains all the atoms of the super-cell.  However, because there may be atoms lying in special postions, which is the case with AXOHOL, we must take care to remove duplicate atoms from the super-cell.  We will determine uniquness by atoms of the same element lying in the same position:

from collections import defaultdict
unique_dict = defaultdict(list)
for a in shell.atoms:
    unique_dict[(a.atomic_symbol, str(a.coordinates))].append(a) # I use str() on the coordinates to avoid rounding errors failing to catch duplicates

Now any we can remove the duplicates:

shell.remove_atoms(
    a for l in unique_dict.values() for a in l[1:]
)

Now we can make any extra bonds according to your criterion.  We could remove all bonds first, but I think what is more reasonable is to create a bond between the symmetrically related components, where there are two atoms close enough.

from ccdc.descriptors import MolecularDescriptors

to_be_bonded = [
    (a, b) for a in shell.atoms for b in shell.atoms
    if a.index < b.index
    and a not in b.neighbours
    and MolecularDescriptors.atom_distance(a, b) <= 2.21 # This is the ideal bond length for V-O bonds
]


shell.add_bonds(('Single', a, b) for a, b in to_be_bonded)

And now we can get the heaviest component of the super-cell:

super_heaviest = shell.heaviest_component

writer.write(shell)

Please let me know if this works for you, and do not hesitate to come back with further questions.

Best wishes
Richard


Hi Richard:

       Thanks for replying! Your code and comments are all super helpful. However, I still have a couple of questions for you:

1. For the purpose of importing only atoms of interest (in my case, it will be 'O', 'Te', 'V'), I modified your code but not sure whether it is right way to do so:

to_be_bonded_1=[

(a,b) for a in shell.atoms if 'O' in a.label and for b in shell.atoms if 'V' in b.label

if a.index<b.index and a not in b.neighbours and MD.atom_distance(a,b)<=2.6     

]

 

to_be_bonded_2=[

(a,c) for a in shell.atoms if 'O' in a.label and for b in shell.atoms if 'Te' in c.label

if a.index<c.index and a not in c.neighbours and MD.atom_distance(a,c) <=2.8

]

#I used 2.6 for V-O bond and 2.8 for Te-O bond for my structures of interest; other parts of the code remain the same

I wonder whether modifying the code in this way can achieve a correct import of atoms of interest. Or should I do it some other ways.

2. I am having a small trouble implementing the last line of your code---writer.write(shell). I checked the API documentation and there were a variety of writers. Are you referring to the MoleculeWriter.write( ) method? Please correct me if I interpret it the wrong way.

3. I am also interested in calculating the number of atoms in the covalently bonded clusters in the filled unit cell. (by 'filled', I mean that symmetry operators have been applied to the unit cell). I wonder whether I can just modify the 2*2*2 supercell code slightly to do the calculation:

changing: expansions = [
    axohol.symmetric_molecule(symmop,(i,j,k))
    for symmop in axohol.symmetry_operators
    for i in range(0,2)
    for j in range(0,2)
    for k in range(0,2)

into: expansions = [
    axohol.symmetric_molecule(symmop,(i,j,k))
    for symmop in axohol.symmetry_operators
    for i in range(0,1)
    for j in range(0,1)
    for k in range(0,1)

while keeping the other parts of the code the same.

 

 

4. I am interested in the maximum number of atoms and the minimum number of atoms in the covalently bonded clusters in both the filled unit cell and the 2*2*2 supercell. My current way of doing it is to use the components method on the newly written shell (in my following code, it will be shell_unitcell and shell_supercell):

components_unitcell=shell_unitcell.components

components_supercell=shell_supercell.components

if len(components_unitcell)==1:

       print "number of atoms in the unit cell is (max1==min1)", len(components_unitcell[0].atoms)

else:

        result_unitcell=[ ]

        for i in range(len(components_unitcell)):

                 result_unitcell.append(len(components_unitcell[i].atoms))

        if max(result_unitcell)==min(result_unitcell):

                 print "max1==min1 and the # of atoms in the clusters is", max(result_unitcell)

         else:

                   max_unitcell=max(result_unitcell)

                    min_unitcell=min(result_unitcell)

                    print max_unitcell, min_unitcell

#the code for extracting min# atoms and max# atoms in covalently bonded clusters in components_supercell will be similar.

 

I wonder whether using the components method on the written shell molecule will be a good way to extract the two parameters. 

Thanks a lot! I highly appreciate for your replies!

 

Best Regards,

Xiwen

 

 Dear Xiwen,

it's a pleasure for me to be able to help people like you to do interesting research with the API.  There are a couple of things that need to be changed in the code:

Firstly, we don't know in which order the bonds will be defined, so we have to check for both cases O-V and V-O.  This is most easily done with a set comparison.  It's probably worth writing a little function for this, to save copy and paste errors, and to make life easier if you wish to study other crystal structures:


def to_be_bonded(shell, *atom_types, distance=2.8):
    '''Pairs of atoms of the correct type with a distance criterion.'''
    return [
        (a, c) for a in shell.atoms for b in shell.atoms
        if set([a.atomic_symbol, c.atomic_symbol]) < set(atom_types)
        and a.index < c.index
        and a not in c.neighbours
        and MD.atom_distance(a, c) <= distance
    ]

to_be_bonded_1 = to_be_bonded(shell, 'V', 'O', distance=2.6)
to_be_bonded_2 = to_be_bonded(shell, 'O', 'Te', distance=2.8)

The last line was accidentally left in the code.  When I wrote and tested the code, to check that I had done everything correctly I wrote the molecules out so I could see them in mercury.  You are right, it was a MoleculeWriter:

with io.MoleculeWriter('/tmp/VOTe.mol2') as writer:
    writer.write(shell)

I find this useful when developing molecular or crystallographic code.


Your second expression for the expansions will not apply any translations to the symmetry operators, so is more concisely and clearly written as:

expansions = [axohol.symmetric_molecule(symmop) for symmop in axohol.symmetry_operators]

For counting the atoms in the unit cell and the supercell it is probably easier
and clearer to use a generator expression rather than a for loop:

max_unit_cell = max(len(c.atoms) for c in unit_cell.components)
min_unit_cell = min(len(c.atoms) for c in unit_cell.components)

This, though, is purely a matter of taste.

May I ask what is the nature of the research you are doing?  

Best wishes,
Richard


Hi Richard:

             Thanks so much for your replies! Your inputs are really helpful for my research. My research has been about coding in python with API to achieve an efficient extraction of chemical descriptors of interest, which include dimensionality, atom min distance, coordination number and so on. Chemical descriptors can then be mapped to crystals' structures. 

             In addition, when I tried to run the code on AXOHOL and other crystal structures (i.e. IKEOLA, YOTHUM, VOLMAL, VOLMEP [all of which are vanadium teullrites]), I was getting a error message of "add(HBondType, HAtomType, HAtomType) atom1 and atom2 are already in a different bond in this ChemicalGraph object" on the line "shell.add_bonds(('Single',a,c) for a,c in to_be_bonded_2)". I wonder what I can do on the code to avoid this. Thanks!

 

Best,

Xiwen

Hi Xiwen,

This message occurs when an attempt is made to add a bond that has already been made in the molecule, so either the to_be_bonded_2 list contains duplicates, or it contains a pair that is already bonded in the shell molecule.  You can check for either case when adding the bonds, by adding them one-by-one with an appropriate check:

for a, c in to_be_bonded_2:
    if a not in c.neighbours:
        shell.add_bond('Single', a, c)

but it would be better to eliminate the duplicates when constructing this list.

I'm afraid there was a bug in the function I wrote earlier,  which should read:

def to_be_bonded(shell, distance, *atom_types):
    '''Pairs of atoms of the correct type with a distance criterion.'''
    return [
        (a, b) for a in shell.atoms for b in shell.atoms
        if set([a.atomic_symbol, b.atomic_symbol]) == set(atom_types)
        and a.index < b.index
        and a not in b.neighbours
        and MolecularDescriptors.atom_distance(a, b) <= distance
    ]


to_be_bonded_1 = to_be_bonded(shell, 2.6, 'V', 'O')
to_be_bonded_2 = to_be_bonded(shell, 2.8, 'O', 'Te')

I have tested it this time!  This works correctly for AXOHOL, IKELOA, YOTHUM, VOLMAL and VOLMEP.

Have a good weekend.

Best wishes
Richard

Hi Richard:

   Thanks for the responses! 

   May I ask a follow-up question regarding this topic?

   After making all the revisions based on your replies, my code can run successfully and is able to extract the minimum and maximum number of atoms in the covalently bonded clusters for both the original unit cell (min1, max1) and the 2*2*2 supercell (min2,max2). However, the extracted number is quite weird. For example, for structure YOTHUM, min1=max1=min2=max2=18. However, YOTHUM is a 1D chain structure. From the original unit cell to the 2*2*2 supercell, max2/max1 should be 2 instead of 1. 

Another example is AXOHUR, a vanadium tellurite structure : min1=18 max1=25 min2=128 max2=181  max2/max1=7.24. With AXOHUR being a 2D layer, max2/max1 should equal to 4 while the code prints out 7.24.

One more example is  WEMBOG01: min1=max1=22  min2=max2=40   max2/max1=1.82 . However, WEMBOG01 is a 2D layer, max2/max1 should equal to 4 instead of 1.82.

There are quite a lot of structures that have incorrect min1/max1/min2/max2 number. I opened the cif in VESTA to get a rough idea of what max2/max1 ratio is when I expanded the unit cell into a 2*2*2 supercell. What I got from VESTA was inconsistent with the numbers extracted. I have been looking at my code for a couple of days but ended up with little ideas on how to revise it. Therefore, I post again, asking for help. I wonder whether the failure of the extraction has something to do with Molecule.components because the code prior to this method seems quite correct to me. Or maybe we miss some important steps? 

I am very grateful for your help in the past. Will appreciate your advices/suggestions!!   :)   Feel free to ask me more if I didn't state my question clearly. 

 

Have a great day,

Xiwen

HI Xiwen,

sorry about the delay in answering.  I didn't notice that you had replied.

Could you please post the code that you are running so I can try to see what's going wrong with it?

Thanks

Richard

Hi Richard:

             No worries and thanks for keeping track! I have revised the code so that now it seems to be able to extract right values. The revision that I did is : after occupying the cell and expanding the cell by using symmop, I delete atoms that are out of the 2*2*2 supercell and out of the 4*4*4 supercell by using atoms' fractional coordinates before proceeding to the next step. What I also did was that I make the range of i, j, k in cif.symmetric_molecule(symmop,(i,j,k)) large: "for i in range(0,5)   for j in range(0,5)   for k in range(0,5) " so that I can make sure all atoms in the 2*2*2 supercell and 4*4*4 supercell can be generated after applying symmetric_molecules method. I am able to visualize the outcomes of my code in mercury and they seem right to me. 

          The only problem that I have is that the code takes a long time to run: about 2 hours per structure. I wonder whether it's the problem with my algorithm or it is the problem with the old-fashioned computer where I install the CSD package. The cif.symmetric_molecule(symbol, (i,j,k)) for i in range(0,5)   for j in range(0,5)   for k in range(0,5) seems to take a long time.  Some examples of the structures that need a long time to run are:LAQNOH, QIFPAY, CAPBEB.

         

 

 

Here is my code:

def dimension_test(self):

cif=self.obj.crystal

unit_mol=cif.molecule
heaviest=unit_mol.heaviest_component

cif.molecule=heaviest

shell=molecule.Molecule('SHELL')

expansions=[
cif.symmetric_molecule(symmop,(i,j,k))
for symmop in cif.symmetry_operators
for i in range(0,5)
for j in range(0,5)
for k in range(0,5)
]

for mol in expansions:
shell.add_molecule(mol)


unique_dict=defaultdict(list)
for a in shell.atoms:
unique_dict[(a.atomic_symbol,str(a.coordinates))].append(a)

shell.remove_atoms(
[a for l in unique_dict.values() for a in l[1:]]
)

shell.normalise_labels()

fractionalcor=[
a.fractional_coordinates for a in shell.atoms
]

atoms_not_of_interest=[
a for a in shell.atoms
if a.fractional_coordinates.x<0 or a.fractional_coordinates.x>2
or a.fractional_coordinates.y<0 or a.fractional_coordinates.y>2
or a.fractional_coordinates.z<0 or a.fractional_coordinates.z>2
]


shell.remove_atoms(i for i in atoms_not_of_interest)


to_be_bonded_1=self.to_be_bonded(shell,2.6,'O','V')
to_be_bonded_2=self.to_be_bonded(shell,2.8,'O','Te')


shell.add_bonds(('Single',a,b) for a,b in to_be_bonded_1)
shell.add_bonds(('Single',a,b) for a,b in to_be_bonded_2) 

max_atoms_cell=max(len(c.atoms) for c in shell.components)
min_atoms_cell=min(len(c.atoms) for c in shell.components)
print "max_unit_cell is", max_atoms_cell
print "min_unit_cell is",min_atoms_cell


def to_be_bonded(self,shell,distance,*atoms_types):
return[
(a,c) for a in shell.atoms for c in shell.atoms
if set([a.atomic_symbol,c.atomic_symbol])==set(atoms_types)
and a.index<c.index
and a not in c.neighbours
and MD.atom_distance(a,c)<=distance
]

 

print descriptors('REF CODE').dimension_test()

 

Thanks so much and have a great weekend!

 

Best,

Xiwen

 Hi Xiwen,

I've had a look at the code, and found that it is the remove_atoms() calls that take a long time - around half an hour to remove 8000 atoms.  I think this is to do with reshuffling atom indices for every removal.  I have tried sorting the atoms in decreaising value if index, and that does make a huge difference (but is an ugly trick) so I've come up with a scheme to do the deletions on the smaller expansion molecules before appending to the shell:

    present = set()
    for mol in expansions:
        mol.remove_atoms(
            a for a in mol.atoms if (a.atomic_symbol, str(a.coordinates)) in present
        )
        atoms_not_of_interest=[
            a for a in mol.atoms
            if a.fractional_coordinates.x<0 or a.fractional_coordinates.x>2
            or a.fractional_coordinates.y<0 or a.fractional_coordinates.y>2
            or a.fractional_coordinates.z<0 or a.fractional_coordinates.z>2
        ]
        mol.remove_atoms(atoms_not_of_interest)
        present |= set((a.atomic_symbol, str(a.coordinates)) for a in mol.atoms)
        shell.add_molecule(mol)

For LAQNOH this fragment runs in less than four seconds.

Best wishes

Richard

Hi Richard,

     Thanks a lot! Your fragment definitely cuts the running time and it takes fewer than 5 mins to run one structure. 

     Only 2 structures are acting weird: VOLMAL and IXASUW. "none" will be printed out every time I call mol.atoms.fractional_coordinates. I have looked into their CIFs and there were "?" marks besides some atoms.  Therefore, I added one line to resolve this issue:

mol.remove_atoms(  a for a in mol.atoms if a.fractional_coordinates is None )

       The code was finally able to run on those two structures instead of encountering "nonetype object has no attribute" type issue. But the running time was quite long: each of those two structures takes about half an hour to get an result. At first, I wonder whether this has to do with remove_atoms() reshuffling atom index after every removal, an issue that you mention in the last post. So, I tried to sort mol.atoms based on its index: 

mol.atoms.sort(key=lambda x: x.index, reverse=True)

       Still, sorting based on atom index doesn't help. Do you have any idea of what's going on here? Is there any possibility that we can shorten the running time on those two structures? 

 

 

Many thanks,

Xiwen

 HI Xiwen,

You are right to remove the atoms with no fractional coordinates, a case I hadn't considered when writing the code.  This occurs where there is some disorder in the structure.

When I run the code with VOLMAL and IXASUW I get results in under a minute, with the longest step being the calculation of to_be_bonded (around 35 - 40 seconds).  Where did you put your extra line?  I put mine in the loop which adds expansions:

    for mol in expansions:
        mol.remove_atoms(
            a for a in mol.atoms if (a.atomic_symbol, str(a.coordinates)) in present
        )
        mol.remove_atoms(
            a for a in mol.atoms if a.fractional_coordinates is None
        )
        atoms_not_of_interest=[
            a for a in mol.atoms
            if a.fractional_coordinates.x<0 or a.fractional_coordinates.x>2
            or a.fractional_coordinates.y<0 or a.fractional_coordinates.y>2
            or a.fractional_coordinates.z<0 or a.fractional_coordinates.z>2
        ]
        mol.remove_atoms(atoms_not_of_interest)
        present |= set((a.atomic_symbol, str(a.coordinates)) for a in mol.atoms)
        shell.add_molecule(mol)

though it could be an extra case in the atoms_not_of_interest comprehension.

By the way, mol.atoms.sort() won't do anything useful.  The atoms property of the molecule is calculated afresh every time you call mol.atoms.  Better would be to sort a list of atoms, for example:

atoms_not_of_interest.sort(...)
mol.remove_atoms(atoms_not_of_interest)

I've attached the script I use for testing these things.  Can you have a look at it and see if I'm doing anything terribly wrong?

Best wishes

Richard

 Hi Richard:

      I put my extra line on the exact same spot as you. Your code attached is completely right.

      The reason why mine took such a long time was that I not only did the calculation for the 2*2*2 supercell but also did it for the 3*3*3 supercell. I need the max3/max2 ratio to find out the dimensionality of the structure. I am sorry that I did not make it clear at first. What I did was that I wrote a little function called dimension_test which covers from expansion to finding max atoms using shell.components (please see the attached code for more details). I set the upper limit for the atoms_not_of_interest list to be a parameter called 'threshold' and I called dimension_test in the main function so that I could run it twice to calculate max_atoms in 2*2*2 supercell and max_atoms in  3*3*3 supercell.  I timed each sections of my code and it was the to_be_bonded of 3*3*3 supercell that took the most time: 1796 seconds for VOLMAL and 1766 seconds for IXASUW. To_be_bonded of 2*2*2 supercell was: 144 seconds for VOLMAL and 149.72 seconds for IXASUW. 

      I have tried to put the to_be_bonded line and add_bond line under the "for mol in expansion" loop, hoping that the running time will decrease if we find to_be_bonded on the smaller scale. However, this change didn't seem to result in a shorter running time (I have tested the influence of the change on running time on VOLMAL).

       I don't understand whether there will be such a big difference on running time if there is none in the fractional coordinates. I have ran the code on a lot of vanadium tellurites structures (32 in total): except for IXASUW and VOLMAL, others' running time range from 15 seconds to 4 mins. Is it still possible to cut the running time of the to_be_bonded line for the structures that have atoms with no fractional coordinates?

      An idea that I have so far is that we can pre-process the CIFs, delete the lines of atoms which have no fractional coordinates and read in the revised CIFs into the code to avoid a long running time. 

       Many thanks!! Really appreciate your help over the past :) 

 

Best,

Xiwen

 

       

You must be signed in to post in this forum.