Tartarus

A realistic benchmarking platform for chemical inverse design.

Published 19/12/2023 in NeurIPS 2023

This work was done in collaboration with Akshat Nigam Kumar, Robert Pollice, Kjell Jorner, John Willes, Luca Thiede, Anshul Kundaje, and Alán Aspuru-Guzik.

Link to paper is here, and link to the GitHub repo is here.

Generative models for molecular design have shown great promise for the inverse design of chemicals: the optimization of chemical properties over chemical space. Chemical space is estimated to contain more than 10^60 possible molecules, far too many to enumerate in molecular discovery tasks or virtual screening campaigns. Generative models can instead learn the distribution of a subset of chemical space and generate molecules from that distribution while directly optimizing for relevant chemical properties.

Previous benchmarking platforms have focused on distribution matching, maximization of valid/unique/diverse compounds, or chemical rediscovery tasks. After years of development in cheminformatics and deep learning methods, many of these tasks have become trivial, and they are often not chemically interesting. TARTARUS aims to address this by providing a suite of simulated tasks based on quantum chemistry calculations.

All simulation workflows require only a SMILES string representing the molecular graph as input. TARTARUS handles the 3D embedding, conformer generation, energy relaxation, and quantum chemistry calculations. In all, we propose 4 classes of material discovery tasks:

  1. Organic photovoltaics: the design of molecules for organic solar cells;
  2. Organic emitters: the design of molecules for OLED, display, and laser applications;
  3. Drug molecule ligands: the design of small drug molecules for 3 target proteins, with known pockets and structures;
  4. Chemical reaction substrates: the design of motifs for a self-reacting chemical compound.

In the paper, we evaluate a variety of generative models on the benchmark, including genetic algorithms, variational autoencoders, language models, and reinforcement learning algorithms.
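
To illustrate how an optimizer can interact with a benchmark whose only input is a SMILES string, here is a minimal sketch of a genetic algorithm driving a black-box scoring function. Everything in it is an assumption made for illustration: `toy_oracle`, `FRAGMENTS`, `mutate`, and `optimize` are hypothetical names, and the score is a trivial string-based placeholder rather than a TARTARUS task or any real quantum chemistry calculation. It is not the genetic algorithm evaluated in the paper.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical building blocks. Joining them in a linear chain is intended
# to keep every candidate a syntactically valid SMILES string.
FRAGMENTS = ["C", "N", "O", "c1ccccc1"]
MAX_FRAGMENTS = 6  # cap on candidate size so the toy search space is finite


def toy_oracle(smiles: str) -> float:
    """Placeholder objective (NOT a real property calculation).

    Rewards aromatic carbons and penalizes string length. A real benchmark
    task would run a simulation workflow on the SMILES string instead.
    """
    return smiles.count("c") - 0.2 * len(smiles)


def mutate(genome: List[str], rng: random.Random) -> List[str]:
    """Return a copy of the genome with one fragment inserted, deleted, or replaced."""
    child = list(genome)
    op = rng.choice(["replace", "insert", "delete"])
    if op == "insert" and len(child) < MAX_FRAGMENTS:
        child.insert(rng.randrange(len(child) + 1), rng.choice(FRAGMENTS))
    elif op == "delete" and len(child) > 1:
        del child[rng.randrange(len(child))]
    else:
        # Fall back to replacement when the chosen operation is not allowed.
        child[rng.randrange(len(child))] = rng.choice(FRAGMENTS)
    return child


def optimize(
    oracle: Callable[[str], float],
    generations: int = 30,
    pop_size: int = 20,
    seed: int = 0,
) -> Tuple[str, float]:
    """Elitist genetic search: the optimizer only ever sees SMILES -> score."""
    rng = random.Random(seed)
    population = [[rng.choice(FRAGMENTS)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        # Parents compete with children, so the best score never decreases.
        pool = population + children
        pool.sort(key=lambda g: oracle("".join(g)), reverse=True)
        population = pool[:pop_size]
    best = "".join(population[0])
    return best, oracle(best)


if __name__ == "__main__":
    best_smiles, best_score = optimize(toy_oracle)
    print(best_smiles, best_score)
```

Because the optimizer touches the objective only through a `Callable[[str], float]`, swapping the placeholder for a real simulation-based objective would not require changing the search loop. That separation is the point of the sketch.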