Who wants to be a millionaire? The CCDC issues number CCDC 900000.

In the last week or so we passed another milestone at the CCDC in the building of the Cambridge Structural Database (CSD), by issuing the reference number CCDC 900000. This type of reference number is probably familiar to many people from scientific papers describing X-ray data, and corresponds to a set of X-ray experimental data. The number is issued when data are first sent to us, and stays with the dataset even if it undergoes revisions before or during the publishing process. It’s also worth mentioning that structures in the database are not normally referred to by the CCDC number at all, but rather by the six-letter CCDC refcode. We’ll talk about refcodes in more detail in an upcoming blog!

Unfortunately we can’t publicise what the structure of CCDC 900000 is yet, since any data sent to us are confidential until the structure is published. Only then do we add the data to the CSD and make the original data freely available to the public. So instead of talking about the actual structure, I thought it might be of interest to explain a bit about the background of CCDC numbers.

First of all, just because we’ve issued the number CCDC 900000 doesn’t mean we have 900000 crystal structures. The reason why is more complicated than you might expect!

The CCDC was formed in 1965, at a time when crystallography was still a very complex and time-consuming process. Data weren’t deposited with the CCDC; rather, staff here searched the scientific literature for reported structures and painstakingly typed in the data to form a CSD entry. At that time the CSD was released as a series of books, and there simply wasn’t a need for any kind of external reference number.

As time went on and the database grew in size, the need for some quick reference, as opposed to an index of structures, became apparent. Initially CCDC numbers were allocated by journals to papers after they had been accepted for publication, and the publisher would then send the data to the CCDC (as boxes of paper in the post!) for entry into the CSD. The RSC journal Chem. Comm., for example, was given a journal code of 182, and would issue a CCDC number in the format 182/123. This would correspond to the 123rd Chem. Comm. paper that contained X-ray data.

It wasn’t until the early 1990s that the standard CIF format for reporting crystal structure data was established, with a paper co-authored by Frank Allen of the CCDC published in Acta Cryst. A in 1991. However, data still came to the CCDC in a variety of formats, such as hardcopy tables of data, SHELX files and CIFs.

It wasn’t until 1996, when CIF had become the standard format for X-ray data, that the current system of six-digit CCDC numbers began, with CCDC 100001. However, even then things were slightly different from today’s system: a reference was still given per paper rather than per structure as now. This caused a few headaches when the referee of a paper wanted the authors to revise or reject, say, the 9th of 15 structures, all of which had the same CCDC number. Therefore in 1998 we started to allocate one number per structure, and this began with CCDC 101495 (GOQHID). At this point you can start to see why the CCDC number doesn’t correspond to the size of the CSD. To give you a rough idea, there were about 210,000 structures in the CSD when CCDC 101495 was added.

The crystal structure of CCDC 101495, the first structure with one unique CCDC number.

The next big change came in March 2006: after CCDC 299945, the next number issued was CCDC 600000. This was to avoid confusion with CSD numbers allocated by the Inorganic Crystal Structure Database (ICSD). So, even though we’re up to CCDC 900000, we’re not quite up to those numbers in terms of crystal structures.

Hopefully that brief dip into the CCDC’s history helps to explain a bit about the CSD, but to end it’s probably worth thinking about the future. At current rates we’ll be issuing CCDC number 1,000,000 within the next couple of years; that is, of course, unless we change the format! The current six-digit reference number has several advantages: it’s a recognisable format, it’s simple, and it’s easy to quote ranges of numbers. Unfortunately the disadvantage of any string of digits is that it’s also quite easy to get wrong, through a simple typo or by accidentally swapping digits. There’s also no easy way of knowing at a glance which structure a CCDC number refers to, which means that very occasionally authors unintentionally reference the wrong structure! One idea is to issue numbers with a check letter or number that enables depositors to be sure the number they’ve typed is correct. This value could even be based on the crystal structure itself (say, a lattice parameter), giving a link to the individual structure. Unfortunately that means if you’d like to be the proud owner of the structure CCDC 1000000 you might be disappointed!
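
To make the check-character idea a little more concrete, here is a minimal sketch in Python. It is purely an illustration and not a proposal for what any future format would actually look like: the weighted mod-11 scheme (the same family of scheme used for ISBN-10 book numbers), the `900000-3` style of writing the result, and the function names `check_char`, `with_check` and `is_valid` are all assumptions made for this example. The check character here is computed from the digits alone, not from anything in the crystal structure.

```python
def check_char(number: int) -> str:
    """Return a check character for a reference number (hypothetical scheme).

    Each digit is multiplied by a weight (2, 3, 4, ... counting from the
    right-hand end) and the check value is chosen so that the weighted sum
    plus the check value is divisible by 11. A value of 10 is written as 'X'.
    """
    digits = [int(ch) for ch in str(number)]
    if number < 0 or len(digits) > 9:
        # Weights 2..10 must all be distinct modulo 11, so at most 9 digits.
        raise ValueError("this sketch supports non-negative numbers of up to 9 digits")
    total = sum(weight * digit
                for weight, digit in zip(range(2, 11), reversed(digits)))
    value = (-total) % 11
    return "X" if value == 10 else str(value)


def with_check(number: int) -> str:
    """Format a number together with its check character, e.g. '900000-3'."""
    return f"{number}-{check_char(number)}"


def is_valid(tagged: str) -> bool:
    """Check that a string like '900000-3' carries the right check character."""
    number, _, check = tagged.partition("-")
    return number.isdigit() and check_char(int(number)) == check


if __name__ == "__main__":
    print(with_check(900000))
    print(is_valid("900000-3"))   # the number as issued
    print(is_valid("900001-3"))   # one digit mistyped
    print(is_valid("090000-3"))   # two digits swapped
```

Because 11 is prime and every position has a different weight, a scheme like this catches any single mistyped digit and any swap of two different digits within the number, which are exactly the slips described above. The price is the occasional 'X' in place of a digit.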

This is something we’re already starting to think about at the CCDC, and it would be really interesting to hear your views and experiences too. Do you find the current system of CCDC numbers and refcodes easy to use? Why not send us a tweet (@ccdc_cambridge) or a Facebook message with your thoughts?