In (crystallographic) data we trust?
Data integrity investigations at the CCDC
The integrity of the data within the Cambridge Structural Database (CSD) is of great importance to the CCDC and, no doubt, to the many scientists across the world who use the CSD as part of their research. The CSD is a trusted repository of crystallographic data, and as such we are taking a proactive approach to ensuring that the data we store is both trustworthy and consistent, enabling scientists to find and utilise the best data for their research. Our overall aim is to help depositors, peer reviewers and wider users of the CSD assess the quality and integrity of the data and to ensure, as far as possible, that there are no cases of fraud or plagiarism within the CSD.
This leads nicely onto my role at the CCDC. I joined the Database Team in October 2018 to undertake a postdoctoral position investigating the integrity of the data within the CSD, having previously dabbled in the world of crystallographic data integrity alongside my PhD research. My work at the CCDC has initially focused on investigating the ‘completeness’ of data deposited to the CSD, as well as devising new methods to identify any plagiarised or fraudulent data. This has involved collaboration with the International Union of Crystallography (IUCr) to build on the good work they have already undertaken in this area.
A poster I authored at the ECM meeting in Croatia in 2015, prior to joining the CCDC, investigating the potential of producing fraudulent crystallographic datasets
A complete picture
By studying the completeness of the crystallographic information files (CIFs) deposited with us, we intend to investigate how we can work with scientists to ensure that the data we receive is as complete as possible. We aim to identify trends in the use of CIF terms, highlight where data is missing and work to enable this information to be captured during deposition, preventing the loss of metadata to science. This helps the data to remain useful and supports the FAIR principles (Findable, Accessible, Interoperable and Re-usable data). It has definitely made me extra careful in my own work; I have been paying closer attention to all of my CIFs to ensure that the information is as correct and complete as possible.
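To give a flavour of what a completeness survey might look like, here is a minimal Python sketch that counts how often a handful of CIF data names are missing across a set of files. It is only an illustration and not the tooling used at the CCDC: the list of expected items is an arbitrary choice for this example, the function names are hypothetical, and the simple line-by-line reading ignores loops and multi-line text fields, which a real CIF parser would need to handle.

```python
import re
from collections import Counter

# Assumption: an arbitrary, illustrative list of data names to look for.
EXPECTED_ITEMS = [
    "_cell_length_a",
    "_diffrn_ambient_temperature",
    "_exptl_crystal_colour",
    "_refine_ls_R_factor_gt",
]

# Matches a simple entry with the data name and its value on one line.
ITEM_PATTERN = re.compile(r"^(_\S+)\s+(\S.*)$")


def reported_items(cif_text):
    """Return the lowercased data names that carry a real value.

    The CIF placeholders '?' (unknown) and '.' (inapplicable) count as missing.
    """
    found = set()
    for line in cif_text.splitlines():
        match = ITEM_PATTERN.match(line.strip())
        if match and match.group(2).strip() not in {"?", "."}:
            found.add(match.group(1).lower())
    return found


def missing_item_counts(cif_texts):
    """Count, for each expected item, how many files do not report it."""
    counts = Counter()
    for text in cif_texts:
        present = reported_items(text)
        # Data names are case-insensitive, so compare in lowercase.
        counts.update(name for name in EXPECTED_ITEMS if name.lower() not in present)
    return counts


# Two made-up CIF fragments for demonstration only.
example_a = """data_a
_cell_length_a 10.123(4)
_diffrn_ambient_temperature 150(2)
_exptl_crystal_colour ?
_refine_ls_R_factor_gt 0.0345
"""
example_b = """data_b
_cell_length_a 7.456(2)
_refine_ls_R_factor_gt 0.0512
"""

counts = missing_item_counts([example_a, example_b])
for name in EXPECTED_ITEMS:
    print(f"{name}: missing in {counts[name]} of 2 files")
```

Run over a large collection of files, counts like these would show which terms are routinely left out and therefore where prompting the depositor at deposition time might help most.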
Current investigative tools and mechanisms
Although my investigations into the identification of fraudulent data are ongoing, I believe it is worth highlighting that the CCDC already has a number of mechanisms in place to ensure the integrity of data within the database. Since 2011 we have accepted the deposition of structure factors alongside structural data, in line with the best practice of the IUCr [1], which facilitates greater validation of structural models.
During the deposition process there are a number of tools available to the depositor. CIFs which are being deposited can be assessed using the IUCr’s checkCIF service [2]. We encourage users to take advantage of this service to check that CIFs are free from errors which could arise during data processing and to highlight any information about the experiment that is missing from the file. Any alerts produced by checkCIF can then be either investigated or commented on by the crystallographer. The syntax of CIFs is also checked during deposition, and the depositor is able to fix and enhance the file to include extra information prior to submission. The depositor is also required to identify the crystallographer involved in the production of the data, which makes it easier to trace the source of the data and gives crystallographers recognition for their work if they are not included as authors on any associated scientific articles.
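As a rough idea of what a syntax check involves, the Python sketch below applies two very basic tests to the text of a CIF. It is an assumption-laden illustration, not the check run during CSD deposition or by checkCIF: the function name is hypothetical and a real validator covers far more of the CIF syntax.

```python
def basic_cif_syntax_problems(cif_text):
    """Return a list of problems found by two simple, illustrative checks."""
    problems = []
    lines = cif_text.splitlines()

    # Check 1: a CIF needs at least one data block header, e.g. "data_mycrystal".
    if not any(line.lower().startswith("data_") for line in lines):
        problems.append("no data_ block header found")

    # Check 2: a semicolon in the first column opens or closes a multi-line
    # text field, so an odd number of such lines means a field was left open.
    semicolon_lines = sum(1 for line in lines if line.startswith(";"))
    if semicolon_lines % 2 != 0:
        problems.append("unbalanced semicolon-delimited text field")

    return problems


# Made-up fragments for demonstration only.
good_cif = "data_test\n_publ_section_comment\n;\nSome free text\n;\n"
bad_cif = "_cell_length_a 10.0\n_publ_section_comment\n;\nnever closed\n"

print(basic_cif_syntax_problems(good_cif))
print(basic_cif_syntax_problems(bad_cif))
```

Catching problems like these before submission means the depositor can correct the file while the experimental details are still to hand.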
After deposition, every structure is manually checked by a scientific editor to ensure that the information provided is consistent with the rest of the database. This allows users of the database to find structures more easily through common searches. We also undertake a series of CSD improvement projects every year to ensure that the information already in the database is as usable as possible.
As most of the data in the CSD is published as part of a scientific article, the research it accompanies should have been peer reviewed before the structure appears in the database, hopefully providing an additional measure of quality. The quality of data published directly as a CSD Communication has been the subject of a recent blog, which showed that the quality of the structural data was comparable to that published in peer-reviewed journals.
Next steps
As you can see, we take data integrity seriously. There will no doubt be more updates to come on the outcomes of this work as it progresses. If you have any comments or suggestions about the direction of this work, please do get in touch. With your help we want to ensure the integrity of crystallographic data for the good of the community and support our depositors to produce the best possible datasets.
[1] https://www.iucr.org/index.html/leading-article/2011/2011-06-02