CSD One Million

Ball and stick 3D representation of the structure for Refcode XOPCAJ.

In 2019 the CSD Reached One Million Structures!

The Cambridge Structural Database (CSD) reached one million structures, leading the way in structural data to inform drug discovery and materials development.

CSD Annual Statistics

Since we reached one million structures in 2019, the CSD has continued to grow. Explore the latest statistics on the Cambridge Structural Database.

We Are Very Proud to Have Reached The One Million Milestone

The addition of the one millionth structure to the Cambridge Structural Database (CSD) was an incredible achievement for the whole community. The CSD is used in almost every chemistry department worldwide, and universities and pharmaceutical companies depend on it to drive drug discovery projects. Materials scientists use it to design ever more complex 3D network structures. Information from the database also underpins all molecular mechanics forcefields and interaction scoring functions that allow putative drug molecules to be docked to their protein receptors.

What Is ‘CSD One Million’?

CSD One Million marks the sharing of one million organic and metal-organic structures through the Cambridge Structural Database. The structures are deposited by scientists from around the globe and each one is curated by our in-house database team prior to entry. We reached one million structures in June 2019.

What Is the CSD?

The CSD is the world’s repository of highly curated experimentally determined organic and metal-organic crystal structures. It is used by scientists in over 70 countries to understand how molecules behave and interact in three dimensions in the solid form and ultimately how this affects physical properties. This rich data resource alongside advanced search, analysis, and visualization software from the CCDC enables scientists from both industry and academia to further their research and predict new outcomes. Furthermore, knowledge derived from the CSD underpins computational chemistry and molecular modeling and is relied on by industry for the development of new drugs and within academia to teach chemistry.

Why Does It Matter How Many Structures Are in the CSD?

Drawing insights from larger volumes of data yields new and more complete answers: as the amount and diversity of information increase, so does confidence in both the data and the insights.

Big data is becoming a frequently used term globally across many industries as companies recognize the value of large sets of data that can be mined for information and used to inform decision-making. Big data refers to large amounts of data that cannot be processed or analysed using conventional data processing techniques, as well as data that grows rapidly with time. It also refers to the mining, storage, analysis, visualization, and sharing of data.

Near exponential growth in the volume of data within the CSD, coupled with our intelligent software, enables scientists to visualize, understand and predict outcomes of their research within life science and materials development, accelerating R&D and pushing products to market within the industry.

The widespread use of structural data globally, the reliance on the CSD and associated software in drug discovery and development, and the thousands of research papers published using the CSD are testament to the fact that this highly curated database of structural information can enable new insights and knowledge.

Below are some examples of how information derived from the CSD played a large part in research published in the Journal of Medicinal Chemistry.

In this first paper by Ito et al. (2018), the authors used experimentally determined conformations within the CSD to aid the discovery of a promising lead compound for drug discovery [1].

In another study by Ohtake et al. (2012) the researchers used the CSD in the discovery of a molecule that subsequently led to the identification of a candidate for a novel therapeutic approach to treat type 2 diabetes [2].

With the emergence of new AI and machine learning techniques these insights and this reliance are set to grow significantly.

How Do You Ensure Data Integrity?

One of our core objectives here at the CCDC is to ensure the integrity of the data so that it can be re-used with confidence.

Each dataset that is deposited undergoes extensive validation and cross-checking via automated workflows and manual curation by our expert chemists and crystallographers. This means the data within the CSD is accurate, consistent, and of the highest quality.

Furthermore, we enrich the data with bibliographic information, chemical representations, and physical property information, adding further value to the original structural data and enabling both experts and non-experts to reuse, discover and interpret structural data in a chemically meaningful way.

Is Processing and Curating Data in This Way Worth the Time and Effort?

Yes. If the quality of the data is poor, the size of the data set is irrelevant. The consequences of using poor-quality data are far-reaching: incorrect scientific conclusions, wasted investment and effort, a loss of trust, and ultimately poor business decisions. We put our expertise and time into the quality of our data so that our users can trust the data and be sure that the insights they generate from the CSD inform the right decisions in their research.

How Did You Decide on the One-Millionth Structure?

CSD One Million is simply the millionth structure added to the CSD and made available worldwide through our website, via our free Access Structures service and our more advanced online search system, WebCSD. The daily total in the CSD is displayed on our homepage.

Furthermore, CSD One Million was the millionth published structure; the count does not include data deposited pre-publication and held confidentially in trust for our depositors prior to release. It covers experimental organic and metal-organic structures only.

Structures that are available on our website but are not added to the CSD (and therefore not eligible for CSD One Million) include inorganic structures that are added to the Inorganic Crystal Structure Database by our collaborators FIZ Karlsruhe and structures that have been calculated or predicted rather than determined experimentally.

The CSD, and therefore CSD One Million, contains structures from a variety of publications. These include structures published in associated scientific journal articles, books, theses, patents, and university repositories, as well as structures published solely through the database itself as CSD Communications.

One Million Structures and Beyond

Since 1965, we have been collating, curating, and validating data published and deposited by scientists from around the globe, and we will continue to expand the data set whilst maintaining its quality.

The use of the CSD in the pharmaceutical and agrochemical industries is already well established. With our product development programme and this expanding data set, the CSD is also fast becoming a fundamental resource for research into new materials such as batteries, paints, pigments, and dyes, and in particular for the development of gas storage frameworks and tailored catalysts. As environmental contamination and sustainability become increasingly important, this work has considerable potential on a global scale.

Furthermore, we are now starting to draw insights and trends to inform the direction of future research across different industries. Keep an eye out for the latest insights, or sign up for our mailing list to receive them directly in your inbox!

Ball and stick representation of the Nevirapine molecule, CSD REFCODE PABHJ01.

Structure Highlights and Trends from One Million Structures from Chemistry World

In (crystallographic) data we trust?

The CSD is a trusted repository of crystallographic data and as such we are taking a pro-active approach to ensure the data that we store is both trustworthy and consistent, enabling scientists to find and utilise the best data for their research.

CSD Data Curation – the challenge of a million structures

Research is often reported precisely because the results are novel and unusual. However, with a database of almost a million structures, we need to represent data in a consistent way to give meaningful search results.

CSD Data Curation – The Human Touch

Although the way in which data is deposited has changed significantly over this period, it may surprise you to know how much of a human touch is still involved with the production of the CSD.

Countdown to 1 million

The CSD has grown and evolved since its inception in 1965, and currently, an individual CSD entry refers to one published report of a crystal structure.

[1] Ito et al. (2018). Discovery of 3-Benzyl-1-(trans-4-((5-cyanopyridin-2-yl)amino)cyclohexyl)-1-arylurea Derivatives as Novel and Selective Cyclin-Dependent Kinase 12 (CDK12) Inhibitors. Journal of Medicinal Chemistry.

[2] Ohtake et al. (2012). Discovery of Tofogliflozin, a Novel C-Arylglucoside with an O-Spiroketal Ring System, as a Highly Selective Sodium Glucose Cotransporter 2 (SGLT2) Inhibitor for the Treatment of Type 2 Diabetes. Journal of Medicinal Chemistry.
