I am not a number – a potted history of the CSD refcode
December 21, 2012
Last time we told you a bit about the history of CCDC numbers and looked to their future (“Who Wants to be a Millionaire?”), but what about that other well-known CCDC identifier, the refcode?
Firstly, what IS a refcode? Put simply, it’s a six-letter code which corresponds to an entry in the Cambridge Structural Database (CSD). Refcodes have been used since the beginning of the CSD, so they pre-date CCDC numbers. There are also considerable benefits in using refcodes alongside CCDC numbers as identifiers.
The refcode system allows us to group structures together into families, something we can’t do until the data is actually added to the CSD and compared with the structures already there. The most obvious grouping is different data collections of the same crystal, for example at different temperatures or pressures. We also group determinations of the same structure by different research groups, as well as polymorphs of the same compound.
It is also useful to have this distinction between CCDC numbers and refcodes when papers discuss new structures in relation to previous compounds discovered by someone else. Work by other groups is generally referred to by the refcode, so for example an author may say ‘CCDC nnnnnn is isostructural with ABCDEF’, which clearly separates the two structures.
The refcode format means that data collected years apart, with very different CCDC numbers, will be found together as, say, ABCDEF01 and ABCDEF02. For many structures this makes little difference, but it matters for commonly studied compounds such as the amino acid glycine: there are currently 85 members of that family, from the very first determination in 1960 to CCDC 864814 (GLYCIN84), published in April 2012.
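To make the family idea concrete, here is a minimal Python sketch of how refcodes might be split into a family stem and a member suffix and then grouped. It assumes, based only on the examples above (ABCDEF01, GLYCIN84), that a refcode is six capital letters optionally followed by two digits. The function names `split_refcode` and `group_by_family` are ours for illustration and are not part of any CCDC software.

```python
import re
from collections import defaultdict

# Assumed format, inferred from the examples in this post:
# six capital letters, optionally followed by two digits.
REFCODE_PATTERN = re.compile(r"([A-Z]{6})(\d{2})?")


def split_refcode(refcode):
    """Return (family, suffix); suffix is None for a bare six-letter refcode."""
    match = REFCODE_PATTERN.fullmatch(refcode)
    if match is None:
        raise ValueError(f"not a recognised refcode: {refcode!r}")
    return match.group(1), match.group(2)


def group_by_family(refcodes):
    """Group refcodes by their six-letter stem, sorting the members of each family."""
    families = defaultdict(list)
    for refcode in refcodes:
        family, _suffix = split_refcode(refcode)
        families[family].append(refcode)
    return {family: sorted(members) for family, members in families.items()}
```

With this, `group_by_family(["ABCDEF02", "GLYCIN", "ABCDEF01"])` puts both ABCDEF entries under the same family key, however far apart their CCDC numbers might be.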
Early refcodes were selected manually, based on the compound name, so GLYCIN is the structure of glycine. As the compounds studied by crystallography became more complex, this became impossible, so we started using a refcode generator program. It generates six-letter refcodes which have not previously been assigned. Some letter combinations are omitted so as not to cause offence, but occasionally refcodes can appear which are a bit rude! Have you found any of these? Our apologies if so!
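For the curious, a generator of this kind could look something like the Python sketch below. It is only an illustration of the idea described above (pick six letters, skip anything already assigned, skip anything containing a blocked letter combination) and is not the CCDC’s actual program. The function name, the random-choice strategy and the `blocked_fragments` parameter are all our own assumptions.

```python
import random
import string


def generate_refcode(assigned, blocked_fragments, rng=None, max_attempts=10000):
    """Return a six-letter code that is unassigned and contains no blocked fragment.

    `assigned` is a set of codes already in use; `blocked_fragments` is a
    hypothetical collection of letter combinations to avoid.
    """
    rng = rng or random.Random()
    for _ in range(max_attempts):
        candidate = "".join(rng.choice(string.ascii_uppercase) for _ in range(6))
        if candidate in assigned:
            continue  # already used by an existing entry
        if any(fragment in candidate for fragment in blocked_fragments):
            continue  # contains a combination we want to omit
        return candidate
    raise RuntimeError("no free refcode found within max_attempts")
```

A blocklist like this can only catch the combinations someone thought to list, which is one way an occasional rude refcode could still slip through.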
There are some great refcodes in the CSD; we’ve been using some of our favourites as Featured Structure Friday structures on our Facebook page. Recently we had refcode AUTUMN to celebrate the coming of the new season. If you’ve not signed up to follow us, why not do it now? And if you send us your favourite refcodes, you might see your choice featured soon.
One final thing to tell you about refcodes: you may have noticed that in CSD-Xpress the refcodes have a slightly different format. They still have six letters, but all end in 00. As you may know, CSD-Xpress contains entries which have been validated by our programs but not yet by our team of scientific editors. Although we give these entries refcodes, we want to make sure they look different from the fully validated entries. It is not until a CCDC scientific editor has checked the structure, as part of our quality control processes, that the entry can be put into a refcode family with other entries. If the entry is part of a family, then its refcode will change completely from that in CSD-Xpress. If the new entry is not part of a family, then the CSD-Xpress entry will lose the 00, and the six letters remaining will be the entry’s refcode in the CSD.
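The rule for what happens to a CSD-Xpress refcode after editorial checking can be written down in a few lines of Python. This is a sketch of the rule as described above, not of any real CCDC tool. The function name `final_refcode` and the `family_refcode` argument (standing in for whatever refcode a scientific editor assigns when the entry joins an existing family) are hypothetical.

```python
import re

# CSD-Xpress refcodes as described above: six capital letters followed by 00.
XPRESS_PATTERN = re.compile(r"[A-Z]{6}00")


def final_refcode(xpress_refcode, family_refcode=None):
    """Return the refcode an entry would carry in the CSD after validation.

    `family_refcode` is the (hypothetical) refcode chosen by an editor when
    the entry belongs to an existing family; leave it as None otherwise.
    """
    if XPRESS_PATTERN.fullmatch(xpress_refcode) is None:
        raise ValueError(f"not a CSD-Xpress refcode: {xpress_refcode!r}")
    if family_refcode is not None:
        # Part of a family: the refcode changes completely.
        return family_refcode
    # Not part of a family: drop the trailing 00 and keep the six letters.
    return xpress_refcode[:-2]
```

So, using the placeholder code from earlier, an entry with the CSD-Xpress refcode ABCDEF00 that joins no family would become ABCDEF.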
We hope we’ve given you a bit more insight into our refcode system. Don’t forget to let us know what your favourites are!