What is ‘CSD One Million’?

CSD One Million marks the sharing of one million organic and metal-organic structures through the Cambridge Structural Database (CSD). The structures are deposited by scientists from around the globe, and each one is curated by our in-house database team prior to entry. We reached the one million mark in June 2019.

 

What is the CSD?

The Cambridge Structural Database (CSD) is the world's repository of highly curated experimentally determined organic and metal-organic crystal structures. It is used by scientists in over 70 countries to understand how molecules behave and interact in three dimensions in the solid form and ultimately how this affects physical properties. This rich data resource alongside advanced search, analysis and visualisation software from the CCDC enables scientists from both industry and academia to further their research and predict new outcomes. Furthermore, knowledge derived from the CSD underpins computational chemistry and molecular modelling, and is relied on by industry for the development of new drugs and within academia to teach chemistry.

 

Why does it matter how many structures are in the CSD?

Drawing insights from larger volumes of data yields new and more complete answers: as the amount and diversity of information increases, so does confidence in both the data and the insights.

‘Big data’ is becoming a frequently used term across many industries as companies recognise the value of large data sets that can be mined for information and used to inform decision making. Big data refers to data sets that are too large to be processed or analysed using conventional data processing techniques, as well as data that grows rapidly with time. The term also covers the mining, storage, analysis, visualisation and sharing of such data.

Near-exponential growth in the volume of data within the CSD, coupled with our intelligent software, enables scientists to visualise, understand and predict the outcomes of their research in life science and materials development, accelerating R&D and helping industry bring products to market.

The widespread use of structural data globally, the reliance on the CSD and associated software in drug discovery and development, and the thousands of research papers published using the CSD are testament to the fact that this highly curated database of structural information can enable new insights and knowledge. 

Below are some examples of how information derived from the CSD played a large part in research published in the Journal of Medicinal Chemistry. 

In this first paper by Ito et al. (2018), the authors used experimentally determined conformations within the CSD to aid the discovery of a promising lead compound for drug discovery [1].

In another study by Ohtake et al. (2012), the researchers used the CSD in the discovery of a molecule that subsequently led to the identification of a candidate for a novel therapeutic approach to treat type 2 diabetes [2].

With the emergence of new AI and machine learning techniques, both these insights and the reliance on the CSD are set to grow significantly.

 

How do you ensure the integrity of such large volumes of data?

One of our core objectives here at the CCDC is to ensure the integrity of the data so that it can be re-used with confidence.

Each dataset that is deposited undergoes extensive validation and cross-checking via automated workflows and through manual curation by our expert chemists and crystallographers. This means the data within the CSD is accurate, consistent and of the highest quality.

Furthermore, we enrich the data with bibliographic information, chemical representations and physical property information, adding further value to the original structural data, enabling both experts and non-experts to reuse, discover and interpret structural data in a chemically meaningful way.

You can find out more about how we curate, enhance and ensure the integrity of our data by reading the following blogs:

In (crystallographic) data we trust? Data integrity investigations at the CCDC 

CSD Data Curation – the challenge of a million structures

CSD Data Curation – the human touch

 

Is processing and curating data in this way worth the time and effort?

Yes. If the quality of the data is poor, then the size of the data set is irrelevant. The consequences of using poor-quality data are far-reaching: incorrect scientific conclusions, wasted investment and effort, a loss of trust, and ultimately poor business decisions. We put our expertise and time into the quality of our data so that our users can trust the data and be sure that the insights they generate from the CSD inform the right decisions in their research.

 

How did you decide on the one millionth structure?

The millionth structure was the millionth structure added to the CSD and made available worldwide through our website via our free Access Structures service and our more advanced online search system called WebCSD. The daily total in the CSD is displayed on our homepage.

Furthermore, CSD One Million was the millionth published structure; the count does not include data deposited pre-publication and held confidentially in trust for our depositors prior to release. It covers experimental organic and metal-organic structures.

Structures that are available on our website but are not added to the CSD (and therefore not eligible for CSD One Million) include inorganic structures that are added to the Inorganic Crystal Structure Database by our collaborators FIZ Karlsruhe, and structures that have been calculated or predicted rather than determined experimentally.

The CSD, and therefore CSD One Million, contains structures from a variety of publication sources. These include structures published in associated scientific journal articles, structures published in books and theses, structures published in patents and university repositories, as well as structures published solely through the database itself as CSD Communications.

For more information on the criteria for CSD One Million and what’s included in the CSD see:

www.ccdc.cam.ac.uk/Community/blog/countdown_to_1_million

www.ccdc.cam.ac.uk/solutions/csd-system/components/csd

 

What happens now we've reached one million structures?

Over the last 54 years, we have been collating, curating and validating data published and deposited by scientists from around the globe and we will continue to work to expand the set of data whilst maintaining the quality. 

The use of the CSD in the pharmaceutical and agrochemical industries is already well established. With our product development programme, coupled with this expanding data set, the CSD is also fast becoming a fundamental resource for research into new materials such as batteries, paints, pigments and dyes, and in particular for the development of gas storage frameworks and tailored catalysts. As environmental contamination and sustainability become increasingly important concerns, this work has considerable potential on a global scale.

Furthermore, we are now starting to draw out insights and trends to inform the direction of future research across different industries. Keep an eye out for the latest insights, or sign up to our mailing list to receive them directly in your inbox!

 

[1] https://pubs.acs.org/doi/10.1021/acs.jmedchem.8b00683

[2] https://pubs.acs.org/doi/10.1021/jm300884k