The story behind the SAVI project

A computer-generated database of a billion+ easily synthesizable molecules aimed at drug design

The idea for the SAVI project came to me during a conference in July 2013 in Moscow.

One of the perennial questions in drug development, especially in computer-aided drug design (CADD) when generating molecules de novo, is: once we've come up with a new structure we think may be active, where do we get an actual sample? Even in a low-throughput screening approach, tens to hundreds of molecules are needed, so the cost of synthesis becomes important.

The National Cancer Institute (NCI), unlike big pharma, has only small groups of medicinal chemists ready to tackle synthetic challenges. Previous efforts at NIH to farm out synthesis requests to external groups and companies met with decidedly mixed success. Many vendors were too optimistic in predicting whether, and with how much effort, they would be able to synthesize molecules for NCI.

So while listening to the presentations at the Moscow meeting and talking with other participants, I started thinking that perhaps we were asking the wrong question: maybe we shouldn't ask how to make a molecule we've previously designed with CADD approaches, but should instead first create a very large database of easily, and cheaply, synthesizable molecules, and only then apply our CADD approaches to this library to find new active molecules.

The numbers actually seemed to support this approach overwhelmingly. Even if we assume that just one out of a million molecules is easily synthesizable, and if we take the lower bound for the size of the chemical space of small molecules to be 10^40, there must be 10^34 such easily accessible molecules. While this consideration is somewhat naïve, it shows that there must be many orders of magnitude more such molecules than all compounds synthesized in individual syntheses so far (~10^8).

The path to creating such a data set, which we called the Synthetically Accessible Virtual Inventory (SAVI), seemed clear in principle: take a large number of readily available building blocks, a set of reliable chemical reaction rules for how to connect such reactants, and a chemoinformatics approach that combines the two to generate reaction products and their proposed synthetic routes.

The initial plan was to use the 58 Robust Organic Synthesis Reactions published just a couple of years earlier by Hartenfeller et al. and made available by the authors as reaction SMARTS expressions. Through our connections with ChemNavigator (acquired by Sigma-Aldrich in 2009), in particular its CEO Scott Hutton, we had access to a significant set of Sigma-Aldrich building blocks.
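To illustrate how a reaction SMARTS rule turns building blocks into a product, here is a minimal sketch using RDKit (an assumption for illustration only; SAVI itself used the CACTVS toolkit). The amide-coupling pattern below is a simplified example we wrote for this sketch, not one of the 58 published Hartenfeller rules:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# A simple amide-coupling rule written as reaction SMARTS:
# carboxylic acid + primary amine -> amide
amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

# Two toy "building blocks": acetic acid and benzylamine
acid = Chem.MolFromSmiles("CC(=O)O")
amine = Chem.MolFromSmiles("NCc1ccccc1")

# Enumerate all products of applying the rule to this reactant pair
for (product,) in amide_coupling.RunReactants((acid, amine)):
    Chem.SanitizeMol(product)
    print(Chem.MolToSmiles(product))  # N-benzylacetamide
```

Scaled up, the same idea is applied combinatorially: every rule is matched against every compatible pair of building blocks, and each successful match yields a product plus a record of the reaction that produced it.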

I contacted our long-term collaborator Wolf-Dietrich Ihlenfeldt, the author of the chemoinformatics toolkit CACTVS, and asked him whether he could implement a tool in CACTVS to apply the Hartenfeller rules to the Sigma-Aldrich building blocks. Wolf-Dietrich suggested, however, that we also investigate a system that was decades older: the LHASA program. The LHASA project, the computational embodiment of E.J. Corey's ground-breaking (and Nobel Prize-winning) development of retrosynthetic analysis, had its early beginnings in the late 1960s and its first publications in the 1970s. At more than 40 years old, it seemed more like a legend in the field than a practical option. But we kept asking around whether anyone knew if its components, in particular its expert-system-style transforms written in the FORTRAN-like language pair CHMTRN and PATRAN, were still available anywhere. At times, this reminded me of a sentence near the beginning of Tolkien's Lord of the Rings: "Much that once was is lost. For none now live who remember it." Well, it is not quite that bad: most of the people who worked on the LHASA project are still alive. Still, Philip Judson, one of the few people who can still write CHMTRN/PATRAN code and who became a member of our team, had to come out of retirement to join the SAVI project.

Curiously, despite the age of the LHASA project, little had been published about it, and even less about CHMTRN and PATRAN. Some written material was even inaccessible: when I called a university in the UK to ask whether we could get a copy of a Ph.D. thesis written in the 1970s in the context of the LHASA project, I was told that it was still embargoed.

Nevertheless, we were able to obtain both enough documentation and the existing CHMTRN/PATRAN rules for use in the SAVI project; these were graciously provided by the two companies that had inherited their respective parts of the LHASA knowledgebase.

While the initial SAVI runs used a handful of the existing LHASA transforms, it had been clear from the beginning that numerous widely used named reactions were not in the knowledgebase – simply because they were not yet known when the bulk of the knowledgebase was written. We therefore quickly moved to writing new CHMTRN/PATRAN rules ourselves. We made sure that the lhasa engine in CACTVS, which is a clean-room reimplementation without knowledge of any of the existing LHASA code, runs both the new and the original rules.

While the SAVI approach had seemed like a wild idea to us in 2013, we had apparently caught the zeitgeist. Several similar projects emerged in the following years (and we realized that pharma companies had been investigating this approach in-house since at least 2010, though of course without making any data sets publicly available).

After listening to Yurii Moroz's presentation at the 2018 Spring American Chemical Society National Meeting about Enamine's REAL database of virtual screening samples, I immediately walked up to him and proposed that we mutually share the chemistries used to generate the SAVI and REAL databases. I was certain that the overlap would be limited, which turned out to be the case. We therefore agreed that it would make sense to switch the building blocks used for SAVI to the Enamine set, ensuring ready availability of on-demand synthesis of SAVI compounds for SAVI users worldwide as well as for our own drug development projects.

With this, the stage was set for the generation of the SAVI data set described in our Scientific Data paper.