Promoting Data Sharing

You may not realise it, but this is an exciting time for crystallographic data sharing and publishing! Inspired by recent changes in research and scholarly communications, the scientific community is transforming how research data is stored, shared and published, with the objective of making all data more openly accessible. This is most definitely the case for crystallographic data, which we at the Cambridge Crystallographic Data Centre (CCDC) work to make freely accessible to the scientific community via our Access Structures service. In this blog, I will outline the new frameworks and initiatives shaping the future landscape of open scientific research data, which will no doubt influence how crystallographic data is managed in the future.

What has triggered this move towards open research data? One answer is undoubtedly the technological advances of recent years, which have allowed scientists to conduct experiments faster and at lower cost. This has resulted in a surge in the amount of experimental data produced and available to be shared. This is evident in crystallography, where the time it now takes to determine large, complex structures would have been unimaginable 53 years ago when the Cambridge Structural Database (CSD) was established. The expansion of the internet has also had a major impact on the scientific publishing sector, as online open access journals have gained traction among researchers who wish to make all aspects of their work as accessible as possible. Nevertheless, beyond these technological changes, at the heart of this push for open data is the fundamental motivation for publishing research in the first place: to share knowledge and ideas among the members of the scientific community for the advancement of their field of study. Sharing datasets promotes collaboration by allowing others to validate the findings outlined in a research article, as well as triggering further research and discoveries.

Faced with a need for new frameworks and tools for data sharing and management, a number of groups and collaborations have emerged to facilitate open data practices. Notable amongst these are the FAIR Guiding Principles for scientific data management and stewardship, initially established in 2014 by a diverse group of stakeholders and subsequently refined by members of a FORCE11 working group. According to these principles, for datasets to be as reproducible and discoverable as possible, data and metadata should be Findable, Accessible, Interoperable and Reusable, not just by humans but also by machines. This vision of data sharing has been adopted by major funders of scientific research and by policy makers, and applied by global membership organisations such as the Research Data Alliance (RDA) and the ICSU World Data System when developing cross-discipline open data policies and recommendations.

In parallel with this vision of FAIR data, mechanisms for putting these principles into practice have also emerged. Most notably, DataCite and CrossRef’s services for minting DOIs and registering citation metadata have become essential methods for linking and discovering datasets within the current research publication framework. Building on such tools, the Scholix initiative is an important example of how scholarly data communication could develop in the future. This project, which the CCDC has been part of since its inception, is based on collaborations between journal publishers, data centres and global service providers who collect and exchange links between research data and literature. Because the work involved in publishing, citing and linking data is shared between the different parties, this more collaborative framework will benefit researchers.
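To make the idea of exchanging links between data and literature a little more concrete, here is a minimal Python sketch. It is not the Scholix schema itself: the `LinkRecord` class, its field names and the article DOI are hypothetical placeholders chosen for illustration, and the relationship label is borrowed from DataCite's relation vocabulary. Only the dataset DOI is taken from the CSD entry mentioned later in this post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkRecord:
    """Hypothetical, simplified data-literature link (not the official Scholix schema)."""
    dataset_doi: str    # DOI of the deposited dataset
    article_doi: str    # DOI of the article that reports it
    relationship: str   # how the dataset relates to the article
    provider: str       # organisation asserting the link


def datasets_by_article(links):
    """Group dataset DOIs under the article they are linked to."""
    index = defaultdict(list)
    for link in links:
        index[link.article_doi].append(link.dataset_doi)
    return dict(index)


# Example data: the article DOI is a made-up placeholder.
links = [
    LinkRecord(
        dataset_doi="10.5517/cc3v2vv",
        article_doi="10.1234/example-article",
        relationship="IsSupplementTo",
        provider="Example Data Centre",
    ),
]

index = datasets_by_article(links)
print(index)
```

Once each party contributes records of this kind, a publisher or data centre can look up every dataset connected to an article without the author having to maintain those links by hand.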

Initiatives for enabling the discovery and reuse of data are also being developed and adopted within the chemistry community. The International Union of Pure and Applied Chemistry (IUPAC) and other groups, such as the InChI Trust and the RDA Chemistry Research Data Interest Group, have been working to establish domain-specific standards for managing the array of data types within the various areas of the discipline. The DIGChem website provides a snapshot of current chemistry research data activity. The projects outlined there draw on input from the variety of professionals involved in producing, storing and publishing chemistry data: from researchers and librarians to publishers and database providers.

As for the CCDC, we believe in making crystallographic research data as accessible as possible while maintaining the sustainability of the CSD, and over the years we have established various initiatives which closely align with the FAIR principles. For instance, we have made crystallographic data more easily citable and discoverable by adding DOIs to deposited CSD datasets since 2014. Furthermore, we have helped scientists add to the crystallographic data available to the public by publishing data as CSD Communications at the request of depositors. As of today, CSD Communications have provided the community with access to more than 24,000 crystallographic structures which might otherwise never have been made public. In more recent years, we have collaborated with other data repositories, such as the Protein Data Bank (PDB), ChemSpider, DrugBank and PubChem, to add cross-references between the CSD and other databases, and we have most recently developed free joint data deposition and access services with FIZ Karlsruhe’s Inorganic Crystal Structure Database (ICSD). Thanks to these collaborations, we hope that crystallographic data will become more accessible and discoverable, reaching wider segments of the scientific community. Moving forward, we plan to continue developing and adopting initiatives which will make the crystallographic data stored at the CCDC even more FAIR.

Reciprocal linking between CSD entry and PubChem entry for Pentacene (CSD refcode PENCEN01 https://dx.doi.org/10.5517/cc3v2vv)
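Because a DOI can be written in several forms (a bare identifier, a `doi:` prefix, or a resolver URL such as the one above), tools that link datasets usually normalise it first. The sketch below shows one way this could be done in Python. The helper names `normalise_doi` and `resolver_url` are hypothetical, and the code assumes the DOI contains no characters that would need percent-encoding in a URL.

```python
# Common ways a DOI is prefixed when written as a link or citation.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(identifier: str) -> str:
    """Return the bare DOI in lower case (DOIs are case-insensitive)."""
    text = identifier.strip()
    lowered = text.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {identifier!r}")
    return text.lower()


def resolver_url(identifier: str) -> str:
    """Build a resolver link for the DOI using the doi.org resolver."""
    return "https://doi.org/" + normalise_doi(identifier)


# The DOI of the pentacene entry shown above.
print(resolver_url("https://dx.doi.org/10.5517/cc3v2vv"))
```

Comparing normalised DOIs rather than raw strings means that the same dataset is recognised regardless of how a database or article happens to write its identifier.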

These evolving frameworks for more efficient data management represent an exciting opportunity for all who have an interest in making scientific data open and accessible, especially researchers, who often report difficulties in making their data publicly available. These difficulties will hopefully be alleviated by an increasingly collaborative framework between the many stakeholders. However, there is evidently much work still to do. This is apparent even in crystallography, where, although it is the norm for crystallographers to share data, some members of the community estimate that as little as 15% of structural data is publicly shared! To ensure that this percentage increases, it is essential that the frameworks and principles for data publishing and management evolve with the input and support of researchers, who ultimately are the ones producing and providing the data. For that reason, here at the CCDC, we encourage you to be part of the data sharing movement by learning about and engaging with the initiatives discussed in this blog and, of course, by continuing to deposit your data at the CCDC and share your otherwise unpublished structures as CSD Communications!