Building a complete picture
December 3, 2019
Completeness is an important measure of data integrity and is essential to capture all relevant information about an experiment. This also helps ensure research data is FAIR (Findable, Accessible, Interoperable and Re-usable). With this in mind, CCDC is investigating the completeness of the crystallographic data we hold in our archive. The aim of this investigation is to identify the trends in the information submitted to us, highlight where data is missing and work to enable the capture of any absent information during deposition to prevent the loss of valuable metadata in the future. This blog will highlight some of our initial findings.
Crystallographic data is stored in Crystallographic Information File (CIF) format. The CIF was introduced in 1990 by the International Union of Crystallography (IUCr) as a standard and machine-readable way of representing crystallographic data. The format has a variety of fields where details about the data collection, experimental procedure and the model of the crystal structure can be provided. The information that should be supplied in each field is defined in one of the IUCr’s CIF dictionaries. In recent years > 99 % of structures deposited to CCDC were in CIF format (Figure 1).
Complete CIF metadata is very important. CIFs are usually accompanied by an article that may provide additional information, but if the CIF does not include a full description of the experimental process, it can be hard to match the text in the article with the dataset, and a complete analysis of the data becomes impossible. Additionally, details of the crystallographic experiment are sometimes not included in the paper, or the dataset may be published without any accompanying article, as a CSD Communication. Capturing experimental details in the CIF allows CCDC to provide additional information on structures, which in turn enables researchers to gain new scientific insights. One such project we have undertaken this year has been to use CIF metadata to identify studies collected at specialist facilities such as synchrotron, neutron and electron sources (more information on this project appears later in the blog).
The information in a selection of the CIF fields is exposed in the CSD. CCDC is currently working on a new database format, which will make it possible to show more information from the CIF within our software. For this to be useful, of course, the desired information needs to be included in the CIF in the first place. CCDC has already produced a list of guidelines detailing the information we would encourage depositors to include in their CIFs.
Figure 1. A graph showing the percentage of structures added to the CSD each year that have an accompanying CIF.
Reflection data (in the form of either .hkl or structure factor files) is required by several journals to be made available alongside or within CIFs. The data provides additional insight into the structural model and can also be used in integrity checking. Reflection data can be included directly in the CIF or uploaded during deposition to the CSD. There has been a dramatic increase in the number of structures that have accompanying reflection information since 2010 (Figure 2). This is most likely due to software changes in recent years; some refinement programs now include this data within CIFs automatically, making it easier for the user to provide the required information to journals.
Figure 2. A graph showing the percentage of structures added to the CSD each year that have some form of accompanying reflection information.
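As a rough illustration of how reflection data embedded in a CIF might be detected, the sketch below looks for two data names that typically carry it: `_shelx_hkl_file` (written by recent versions of SHELXL) and `_refln_index_h` (the start of a core-dictionary reflection loop). The choice of these two names and the function name are assumptions made for illustration; this is not a description of how CCDC performed its analysis. The check only tests for the presence of the data name, not whether any reflections follow it.

```python
import re

# Data names assumed (for illustration) to indicate embedded reflection data.
EMBEDDED_REFLECTION_TAGS = ("_shelx_hkl_file", "_refln_index_h")


def has_embedded_reflections(cif_text):
    """Return True if any of the assumed reflection data names appears in the CIF text."""
    for tag in EMBEDDED_REFLECTION_TAGS:
        # The data name must start a line and be followed by whitespace or end of line,
        # so that e.g. _shelx_hkl_checksum does not match _shelx_hkl_file.
        pattern = r"^\s*" + re.escape(tag) + r"(\s|$)"
        if re.search(pattern, cif_text, re.IGNORECASE | re.MULTILINE):
            return True
    return False
```

Reflection files uploaded separately during deposition would not be found this way, so a real survey would need to combine a check like this with deposition records.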
Refinement statistics can give a measure of how well the structural model describes the experimental data. The three indicators we investigated were the R factor, the wR factor and the goodness of fit (GooF). CCDC envisions that these values could also be used to provide additional filters for scientists to select the most appropriate data for their research. Presently only the R factor is exposed in the CSD software, where it can be used as a filter when searching for structures.
Figure 3 shows the proportion of CIFs that do not include these values. The R, wR and GooF are currently very well reported and almost always included in the CIF. This is most likely due to refinement software automatically calculating and including these figures during CIF creation, alongside journal requirements for the reporting of certain values. However, we can see this has not always been the case as the GooF was less commonly reported in the early days of the CIF format.
Figure 3. A graph showing the percentage of CIFs each year that did not provide values for R, wR or GooF.
There are many fields, defined in the CIF dictionary, that can be used to include information about the data collection, such as the experimental conditions and the equipment that was used. This information is useful because different experimental setups can produce different results, so it can help explain variations between models of the same structure. Some of the fields that have been assessed for completeness are the wavelength of the radiation, the collection temperature, the X-ray source, the monochromator and the detector.
Figure 4 shows there is great variation in the trends of information reported over the past 20 years. The temperature has always been well reported and providing the experimental wavelength is now commonplace. In recent years information about the X-ray detector used to measure the diffraction pattern has become more commonly reported. However, the X-ray source along with other experimental information such as the monochromator are now becoming less likely to be included.
Figure 4. A graph showing the percentage of CIFs each year that did not include details of the wavelength, temperature, X-ray source, monochromator or detector used.
Using CIF metadata to provide additional information about structures
At CCDC, we work on a number of projects throughout the year to improve our existing data, to ensure that structures are labelled consistently and to add extra information to existing entries (as well as curating newly published structures). One recent project has been to ensure the correct identification of structures measured at specialist facilities – such as synchrotron and neutron sources – using CIF metadata.
Synchrotron and neutron data are routinely marked in the CSD, but this currently relies on the presence of particular keywords in a CIF and more historically relied on information being present in the associated publication. Through a study of CIF metadata, additional CIF fields that contained ‘synchrotron identifying information’ were established along with additional keywords. This yielded a further 800 structures potentially collected at synchrotrons that were not labelled as such in the CSD. The database team has been working hard over the course of the year checking these structures to assess if they were measured using synchrotron radiation. We were able to confirm over 700 of these studies and in the next release of the database these structures should now be labelled correctly.
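The keyword-based flagging described above could look something like the following minimal sketch. The CIF data names searched and the keyword list are assumptions chosen for illustration; the actual fields and keywords used in the CCDC study are not listed in this post. The function takes a dictionary of already-parsed CIF items (data name to value) and returns candidate matches, which, as in the study, would still need to be checked by a person.

```python
import re

# Hypothetical choice of CIF data names that may describe the radiation source.
CANDIDATE_TAGS = (
    "_diffrn_source",
    "_diffrn_source_type",
    "_diffrn_radiation_type",
    "_diffrn_measurement_device_type",
)

# Hypothetical keyword list; a real study would refine this iteratively.
KEYWORDS = ("synchrotron", "beamline", "storage ring")
_KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(word) for word in KEYWORDS), re.IGNORECASE
)


def synchrotron_hits(items):
    """Return (data name, keyword) pairs for fields that mention a keyword.

    `items` maps lower-case CIF data names to their string values.
    An empty list means no keyword was found, not that the study was
    definitely collected on a laboratory source.
    """
    hits = []
    for tag in CANDIDATE_TAGS:
        match = _KEYWORD_PATTERN.search(items.get(tag, ""))
        if match:
            hits.append((tag, match.group(0).lower()))
    return hits
```

Returning the matching field and keyword, rather than a bare yes or no, makes it easier for a curator to see why a structure was flagged before confirming it.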
Additional information in the CIFs could also lead us to be able to identify the facility (and even the beamline) many of these studies were collected at. Further information on the analysis can be found on a poster that was presented at the British Crystallographic Association (BCA) Spring Meeting 2019.
We are now working to investigate why some fields are less routinely populated in CIFs. Incomplete experimental metadata will mean that data is less reusable as important details are missing. It can also make it harder to classify studies; in our synchrotron experiments we found that one common CIF field containing ‘synchrotron identifying information’ was the X-ray source. With the increasing likelihood of this field not being included in a CIF (Figure 4), it could become harder to detect such studies using CIF metadata in the future.
Investigating data completeness has already enabled us to make targeted improvements to the CSD and is the first step towards helping our depositors provide a complete dataset during deposition through the addition of new integrity checks. These investigations are also important as we start to think about how we expand the data fields available through the CSD and what new filters and flags would help you better identify which structures to use in your research. We would be keen to know which fields you think are most important for a CIF to contain and what additional data you would like to see in the CSD.
- The analysis used structures added to the CSD up to v540 (October 2018).
- Powder data and CIFs that were hand-typed at CCDC were excluded from the analysis of CIF completeness.
- Year is defined as the date the structure was added to the CSD.
- If multiple CIF fields could contain the required information, then a CIF was deemed to contain the information if at least one of these fields was populated.
- The contents of the CIF fields were not checked – just whether the field was populated. ‘?’ or ‘.’ or blank fields were not counted as the field being populated.
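The last two notes above can be sketched in code. The example below is a minimal illustration, not the analysis CCDC ran: the grouping of data names in `FIELD_GROUPS` is an assumption, and the parser is deliberately simplified (it reads single-value items and semicolon-delimited text blocks, but does not read values held in `loop_` constructs). A field counts as populated only if it holds something other than a blank, `?` or `.`, and a piece of information counts as present if any one of its candidate fields is populated.

```python
# Hypothetical grouping of CIF data names by the information they may carry.
FIELD_GROUPS = {
    "wavelength": ["_diffrn_radiation_wavelength"],
    "temperature": ["_diffrn_ambient_temperature", "_cell_measurement_temperature"],
    "source": ["_diffrn_source", "_diffrn_source_type"],
    "monochromator": ["_diffrn_radiation_monochromator"],
    "detector": ["_diffrn_detector", "_diffrn_detector_type"],
}

EXAMPLE_CIF = """
data_example
_diffrn_ambient_temperature 100(2)
_diffrn_radiation_wavelength 0.71073
_diffrn_source ?
_diffrn_radiation_monochromator .
_diffrn_detector
_refine_ls_R_factor_gt 0.0312
"""


def read_simple_items(cif_text):
    """Parse 'data name, value' pairs from CIF text (simplified: no loop_ values)."""
    items = {}
    lines = cif_text.splitlines()
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        i += 1
        if not stripped.startswith("_"):
            continue
        parts = stripped.split(None, 1)
        tag = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        if not value and i < len(lines) and lines[i].startswith(";"):
            # Semicolon-delimited text block: collect lines up to the closing ';'.
            block = [lines[i][1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                block.append(lines[i])
                i += 1
            i += 1  # step past the closing ';'
            value = "\n".join(block).strip()
        elif len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1].strip()  # drop matching surrounding quotes
        items[tag] = value
    return items


def is_populated(value):
    """Blank, '?' and '.' do not count as populated; the content is not validated."""
    return value.strip() not in ("", "?", ".")


def field_completeness(cif_text):
    """Map each information group to True if any of its candidate fields is populated."""
    items = read_simple_items(cif_text)
    return {
        group: any(is_populated(items.get(tag.lower(), "")) for tag in tags)
        for group, tags in FIELD_GROUPS.items()
    }
```

Calling `field_completeness(EXAMPLE_CIF)` reports the temperature and wavelength as present and the source, monochromator and detector as missing, because those three fields hold `?`, `.` and a blank respectively.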