700,000 high-quality crystal structures now at CSD users’ disposal!

I wrote a blog post early in 2012, noting that the Cambridge Structural Database (CSD) had grown to include over 600,000 entries. CSD users who visit our online portal to the CSD (known as WebCSD) will see that the latest update (released on 25th March) pushes the size of the CSD to over 700,000 entries.

What users may not be aware of is that the CSD has actually held over 700,000 entries since December 2013, but in a way that’s a bit harder to spot. WebCSD is frequently updated to include the additional crystal data as it reaches us here in Cambridge. The last update of 2013 brought the total number of structures in the CSD to over 687,000. Not a particularly round number you may think. However, the update also included the most recently published data available to the CCDC - over 19,000 structures - which we make available alongside the CSD as CSD X-Press. That brought the total number of structures available to CSD users to over 700,000!
 
You may be wondering why we’ve added the structures in two separate databases. Here at the CCDC we believe CSD users want two things: crystal data available as soon as possible, and a database that contains consistent, high-quality structural data. To achieve both, we provide two separate solutions, which can be used together or independently.
 
CSD X-Press, as the name suggests, contains the most recent data available. This includes structures that are available online but not yet formally published in a journal (sometimes referred to as ‘Advance Articles’, ‘Early View’ or ‘Articles ASAP’), and the most recently published structures. To enable us to make these data available so rapidly, structures are added to CSD X-Press automatically. By working closely with major journal publishers, such as the ACS, RSC and IUCr, we can ensure that data are made freely available to CSD users as quickly as possible via CSD X-Press. However, this approach comes with certain caveats, which is why crystal data treated this way are held in a separate CSD X-Press database, as shown in the screenshot below.
 
An example of a CSD X-Press database entry, with provisional CSD refcode LODVEH00, first published as an Early View article on 3rd March and currently without a full publication reference.
 
Automatic validation of crystallographic data is a challenging process. Structures may exhibit complex disorder, including cases where some atoms (especially hydrogen) may not be modelled at all. In situations like this, the automatic determination of the structure’s chemistry may be uncertain. Whilst the systems developed here at the CCDC can successfully handle the majority of structures, some entries in CSD X-Press will lack some of the data found in a normal CSD entry, such as a compound name or 2D diagram.
 
Structures added to the CSD have been assessed by an expert Scientific Editor here at the CCDC. This important quality control stage allows us to compare the newly added data with structures already in the CSD; by doing this we can add related structures (e.g. polymorphs) to refcode families, and include additional information, such as cross-references between stereoisomers. Using the expertise of our Scientific Editors we can also, in consultation with the published manuscript, accurately assign the chemistry of structures in complex cases. In practice this means that after publication crystallographic data flow first to CSD X-Press and then on to the CSD, as we undertake processes to assess and validate the chemistry and crystallography of each crystal structure.
 
CSD entry LOBSEC, one of the structures that takes the CSD beyond 700,000 entries. It is a compound extracted from the roots of Stemona tuberosa, a herb that is commonly used in traditional Chinese medicine.
 
By using a combination of automatic and manual validation procedures and regular CSD data updates, the CCDC delivers the highest quality crystal data to users as soon as possible. In the latest WebCSD update you’ll find our 700,000th example of this!