Enhancing the Discoverability and Value of Crystal Structural Data – The CSD Data Curation Process
January 31, 2023
The Cambridge Structural Database (CSD) contains over 1.2M experimental 3D structures with data from X-ray and neutron diffraction analyses. Researchers across the pharmaceutical, agrochemical and fine chemical industries use the database to predict and guide future discoveries.
The CSD is a trusted scientific resource that gives big-data insights using powerful algorithms for molecular analysis. This blog, based on the webinar ‘The CSD Data Curation Process’ (download the recording here), discusses how the experimental data is further curated to enhance its discoverability and value.
Why Curate the Data?
The automated and manual curation carried out by our experts enables the data to be used across manual and machine methods, including AI and ML approaches, maximising its value. Curation maintains the standards of the database:
Quality – the CSD is a trusted resource relied upon by industry and academia. It is vital that we perform checks on the accuracy and quality of the data deposited.
Consistent and readable – with over 1.2M structures it is important to maintain consistency, readability, and understandability of the data.
Accessible and discoverable – we annotate and enhance the data with metadata such as names, diagrams, and properties to make the data accessible and discoverable to all.
The CSD is a wealth of knowledge for researchers, helping to inform and progress their research. We also offer software and services that provide additional insights into a range of molecular interactions using the underlying CSD data.
Inside the CSD
The CSD contains over 1.2M organic and metal-organic experimental crystal structures, both single- and multi-component, including drugs, agrochemicals, explosives and metal–organic frameworks (MOFs).
In addition to structural data, the CSD includes polymorph families, melting points, crystal colours and shapes, bioactivity details, natural source data and oxidation states.
The CCDC adopts the FAIR data principles when curating the CSD: we strive to make the data Findable, Accessible, Interoperable, and Reusable. The data curation process described here is how these principles are achieved.
The Data Curation Process
Stage 1 – Getting Data into the CSD
Data is primarily received as .cif files from researchers, although it can sometimes be obtained from publishers or through manual addition by the CCDC.
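To give a feel for what a deposited .cif file contains, here is a minimal sketch in Python that pulls simple `_tag value` pairs out of CIF text. It is an illustration only, not the CCDC's deposition tooling: the function name `read_simple_cif_items` is hypothetical, the example values are made up, and the parser deliberately ignores loops, multi-line text fields and multiple data blocks. A real workflow would use a full CIF parser.

```python
def read_simple_cif_items(cif_text):
    """Collect single-line `_tag value` pairs from CIF text.

    Simplification (assumption for illustration): loop_ blocks,
    multi-line text fields and comments are not handled.
    """
    items = {}
    for raw_line in cif_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("_"):
            continue  # not a data name (e.g. data_ header, loop_ rows)
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # tag with no value on this line, e.g. a loop_ column header
        tag, value = parts
        # Drop surrounding quotes from quoted values such as 'P 21/c'.
        items[tag] = value.strip().strip("'\"")
    return items


# Made-up example content using standard core CIF data names.
example_cif = """\
data_example
_cell_length_a 10.123(4)
_cell_length_b 11.456(5)
_symmetry_space_group_name_H-M 'P 21/c'
_refine_ls_R_factor_gt 0.0412
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1234
"""

items = read_simple_cif_items(example_cif)
for tag, value in items.items():
    print(tag, "=", value)
```

The sketch keeps values as text, because CIF numbers often carry a standard uncertainty in parentheses, such as 10.123(4), that a plain float conversion would reject.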
Once data is deposited, workflows in place with all the major publishers allow us to update the publication details automatically. The benefit of this is that we can track the progress of the publication of the crystal structure, even if the paper gets rejected and re-submitted to another publisher or journal.
Journal Editors and peer-reviewers can access data pre-publication to aid in the peer-review process. Depositors are also reminded annually about unpublished datasets.
During deposition, depositors can add a range of additional information to make a CSD entry more comprehensive. Enhanced data such as experimental observations, physical properties and sensitivities can be submitted along with the structural data. Depositors can also add any information they think is relevant for the curation of the entry.
Stage 2 – Curation Begins
Once deposited in the CSD and on publication of the scientific paper, the data becomes publicly available in WebCSD, our online search and access tool (prior to publication of the paper the data is privately stored by us).
At this stage the structure has been through automated validation processes only. These include duplication checks, bond assignments, attempted disorder resolution and 2D diagram generation. This automation allows up to 100 structures a day to be processed by a single editor.
Human editing and enhancement by our Scientific Editorial Team now begin. The chemical connectivity will be checked, validated compound names and 2D chemical diagrams will be added, and the quality of the entry considered using the R-factor (a measure of the agreement between the crystallographic model and the experimental X-ray diffraction data).
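As an illustration of what the R-factor measures, the sketch below computes the conventional unweighted residual, R = Σ||Fo| − |Fc|| / Σ|Fo|, where Fo and Fc are the observed and calculated structure-factor amplitudes. This is the textbook definition, shown under the assumption that it is the kind of R-factor meant here; the function name `r_factor` and the amplitudes are hypothetical, and the CCDC's own checks are not described in this post.

```python
def r_factor(f_obs, f_calc):
    """Conventional unweighted R-factor: sum(||Fo| - |Fc||) / sum(|Fo|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


# Made-up amplitudes for three reflections.
observed = [100.0, 50.0, 25.0]
calculated = [98.0, 52.0, 24.0]

r_value = r_factor(observed, calculated)
print(f"R = {r_value:.4f}")
```

A lower value means the model reproduces the measured data more closely; a perfect match gives R = 0.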
A structure is normally curated into the CSD within 2 weeks of being published in a journal. While the CSD entry is enhanced with additional data and information, the original .cif file is also available to download from WebCSD if required.
Stage 3 – Beyond Curation
After a CSD entry has been looked at by an editor, the enhancement doesn’t necessarily stop there. There are annual initiatives that aim to improve entries further. These may include adding data to an entry, correcting mistakes or inconsistencies, or identifying and grouping molecules of interest. See, for example, the blog New Year, New Data Resolutions!
Following curation, the data is released to our desktop software through quarterly data updates.
Compliance with FAIR Principles
On deposition, the structure is given a deposition or CCDC number. Once the structure is published it becomes associated with a CSD refcode, a unique six-letter identifier. On publication the structure is also associated with a DOI (Digital Object Identifier) to ensure the data is easily sharable, citable and attributable, in compliance with the FAIR principles.
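Because identifiers are what make entries findable, it can be handy to check that a string at least looks like a refcode before searching with it. The sketch below assumes the commonly documented refcode form of six uppercase letters, optionally followed by two digits for a further determination of the same compound. The helper name `looks_like_refcode` is hypothetical, and a match says nothing about whether the entry actually exists in the CSD.

```python
import re

# Assumed format: six uppercase letters, optionally followed by two digits.
REFCODE_PATTERN = re.compile(r"[A-Z]{6}(?:\d{2})?")


def looks_like_refcode(text):
    """Return True if text has the assumed shape of a CSD refcode."""
    return REFCODE_PATTERN.fullmatch(text) is not None


# Illustrative candidates; only the shape is checked, not existence.
for candidate in ["ACSALA", "ACSALA01", "acsala", "123456", "ACSALA1"]:
    print(candidate, looks_like_refcode(candidate))
```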
Structures can also be published directly in the CSD as a CSD Communication, with authors getting full credit for their data. This allows the sharing of crystal structures that might otherwise not have been published. To date, over 50K CSD Communications structures have been published directly through the CSD. A recent paper in IUCrJ describes this innovative initiative to make more crystal structure data available for the public benefit.
The Benefits of Sharing Your Data in the CSD
There are many benefits to sharing your data in the CSD, from long-term data preservation and meeting funding and publication requirements to citations, credit, and easier data sharing with collaborators. In summary, the benefits include recognition, discoverability, security, compliance, and increased impact.
More about the benefits of data sharing in the CSD can be found here.