Boosting Research Efficiency in Industry with FAIR Data
June 8, 2023
The Cambridge Structural Database (CSD) is a well-known platform used by academics and by researchers in the pharmaceutical, agrochemical and fine chemical industries. With its 1.2M+ experimental 3D structures derived from X-ray and neutron diffraction data, the CSD is a trusted scientific resource that is essential when performing molecular analysis.
In today’s blog, based on the webinar ‘Boosting Research Efficiency in Industry with FAIR Data’ (download the recording here), the FAIR data principles and their potential are discussed. In particular, the focus is on how these principles are applied at the CCDC, and how the chemical industry can benefit from them.
The FAIR Data Principles
The acronym FAIR stands for ‘Findable, Accessible, Interoperable, Reusable’, the characteristics that research data should have to ensure that both humans and machines can reliably use it.
In crystallography, some community standards were adopted before the FAIR data principles were conceived. One example is the CIF format, an IUCr standard for the semantic representation of a diffraction experiment. Similarly, standard identifiers such as DOIs for digital objects, ORCIDs for researchers and InChIs for chemical structures make it possible to describe various entities and support their findability and interoperability.
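To illustrate why a tagged format such as CIF is machine-readable, here is a minimal Python sketch that reads single-line `_tag value` items from one data block. It is not a full CIF parser: loops, multi-line text fields and save frames are ignored, the function name `parse_simple_cif` is hypothetical, and the sample block and its values are illustrative only. A maintained CIF library would be the right choice for real data.

```python
import shlex

# Illustrative sample block; the values are placeholders, not a CSD entry.
CIF_TEXT = """\
data_example
# unit cell
_cell_length_a    5.4310
_cell_length_b    5.4310
_cell_length_c    5.4310
_cell_angle_alpha 90
_symmetry_space_group_name_H-M 'F d -3 m'
"""


def parse_simple_cif(text):
    """Return (block_name, items) for single-line `_tag value` pairs.

    Assumption: only one data block, no loop_ constructs and no
    semicolon-delimited multi-line values are present.
    """
    block_name = None
    items = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.lower().startswith("data_"):
            block_name = line[len("data_"):]
            continue
        if line.startswith("_"):
            # shlex handles quoted values such as 'F d -3 m'
            parts = shlex.split(line)
            if len(parts) == 2:
                tag, value = parts
                items[tag] = value
    return block_name, items


name, items = parse_simple_cif(CIF_TEXT)
print(name)
print(items["_cell_length_a"])
print(items["_symmetry_space_group_name_H-M"])
```

Because every value is attached to a standard tag, software can look up the cell lengths or the space group by name rather than by position in the file.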
Within the CCDC software portfolio, the FAIR data principles underpin the transition from data, through knowledge, to applications: the CSD-Community portfolio aims to make the data available and discoverable; the CSD-Core tools enable the discovery of new science and novel insights with tools for search, analysis, and visualization; finally, CSD-Discovery and CSD-Materials target industrial applications through the discovery of new molecules and the engineering of new materials.
FAIR Data as an Opportunity for the CCDC and for Industry
A collaborative project was started with Nick Lynch and Richard Shute from Curlew Research, aiming to understand how CCDC services and data align with industry interest in FAIR, and how the use of FAIR data could represent an opportunity for both the CCDC and industrial R&D.
Phase 1: Voice of the Customer Exercise
The first step in this study was a voice of the customer exercise (Phase 1), in which representatives in various roles at a number of organisations were contacted to probe their awareness of FAIR, the activities they were undertaking to address it, and their views on CCDC services. The feedback showed that CCDC customers are broadly satisfied with the quality of the data, and that the CCDC has good FAIR maturity. However, a FAIR strategy on its own is not a priority for many drug discovery organisations; chemical interoperability remains the primary goal, which points to areas for improvement.
One interoperability development, present in the latest release of the CSD Python API, is the ability to access reliable InChIs (IUPAC International Chemical Identifiers) for CSD entry components and to generate InChIs for structures loaded into the API. A second improvement is BioChemGraph, a collaboration between the CCDC, PDBe and ChEMBL that aims to connect chemical and biological data across resources.
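To show why a shared identifier matters for interoperability, the sketch below links records from two hypothetical sources by their standard InChI string. It uses plain Python only. It does not call the CSD Python API and is not how BioChemGraph is implemented; the function name `link_by_inchi`, the entry identifiers and the record fields are all made up for illustration.

```python
# Hypothetical structural records keyed by a made-up entry identifier.
structural_records = [
    {"entry_id": "ENTRY_A", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"entry_id": "ENTRY_B", "inchi": "InChI=1S/CH4O/c1-2/h2H,1H3"},
]

# Hypothetical records from a second resource describing the same chemistry.
property_records = [
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "name": "ethanol"},
    {"inchi": "InChI=1S/H2O/h1H2", "name": "water"},
]


def link_by_inchi(left, right):
    """Join two lists of dicts on their 'inchi' field (inner join)."""
    # Index the right-hand records so each lookup is a dictionary access.
    index = {}
    for record in right:
        index.setdefault(record["inchi"], []).append(record)

    linked = []
    for record in left:
        for match in index.get(record["inchi"], []):
            linked.append({**record, **match})
    return linked


linked = link_by_inchi(structural_records, property_records)
for row in linked:
    print(row["entry_id"], row["name"])
```

Only the ethanol records share an identifier, so only that pair is linked. The same pattern applies whenever two resources expose the same standard identifier for a chemical structure.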
Phase 2: Scoping Follow-on Activities
Phase 2 of this study looked at the relevance of 3D structural chemistry data across the drug pipeline. Interest centred on further data interoperability challenges, such as being able to connect different determinations of the same structure from different sources and polymorphic forms, the desire to make powder diffraction data more findable and accessible, and the desire to access other physical and molecular properties alongside the 3D structural data.
Phase 3: FAIR-enabled R&D
Finally, Phase 3 consisted of further interviews with Big Pharma, diving deeper into the “I” and “R” of FAIR. On interoperability, the focus was once again the desire to see 3D structural data alongside other physical properties, such as thermal, spectroscopic, and morphological data. On reusability, the focus was the need for high-quality FAIR data. In this regard, the CSD is recognised globally as the gold standard database for 3D structural data, and its entries are thoroughly curated and checked for completeness, accuracy, integrity, etc. “Quality” therefore applies to all the dimensions reported below, in line with work in progress at the CCDC to develop and expose additional quality metrics for public data.
CCDC FAIR-enabling Activities
The FAIR-enabling activities that the CCDC undertakes therefore include collating, curating, and enriching published crystallographic data, alongside data management systems and services. The data curation services that the CCDC provides to industry are a fundamental tool that helps organisations manage, discover, share, and reuse the data they generate. This now also extends to predicted structures, which can be accessed via CSD-Theory, software that allows users to visualize and analyse predicted and experimental data side-by-side.
The FAIR principles have proved important in boosting research efficiency through structural chemistry. FAIR crystallographic data has been enabling aspects of R&D for many years through the adoption of standards. The curation services of the CCDC increase the value of public and in-house structural data, both experimental and predicted. Finally, interoperability and reusability emerged as key concepts in this study, opening the way to important developments for the CCDC.