Artificial intelligence (AI) is transforming our approach to biomolecular engineering. Driven by the goal of accelerating drug development, our aim is to develop AI-driven molecular engineering methods which will enhance our approach to molecular discovery, such as drug discovery, drug repurposing, and chemical probe identification. This entails the development of generative and predictive tools that can learn from biochemical data, such as molecular structures, chemical reactions, and biomedical data.
While AI can be applied to a range of molecular engineering tasks, one ideal area is de novo molecular design. De novo design – the concept of designing molecules with desired properties from scratch so as to minimize experimental screening – is poised to allow scientists to more efficiently traverse chemical space in search of optimal molecules, and delegate error-prone decisions to computers via the use of predictive and generative computational models. In drug development, de novo design methods can aid medicinal chemists in the design and selection of drug candidates, with the added advantage that they can learn from datasets of billions of molecules in minutes and be constantly updated with new data. Molecular deep generative models (DGMs) are a particular class of de novo design methods which use deep neural networks to build new molecules in silico, and work by proposing atom-by-atom (or fragment-by-fragment) modifications to an initial graph structure to generate compounds predicted to achieve a certain property profile. Such models can be applied to a range of therapeutic modalities, from small molecules to protein therapeutics (Figure 1).
Figure 1. A key research theme in our lab is the development of deep generative models (DGMs) for a variety of therapeutic modalities. From left to right: a small molecule (agonist), a protein therapeutic, a heterobifunctional degrader, a molecular glue, and an RNA therapeutic (aptamer).
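The atom-by-atom generation loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not our actual models: the `MolGraph` class, the three action types (`add_atom`, `connect`, `terminate`), and the simple valence table are hypothetical, and a uniform random choice stands in for the deep neural network that would normally score each candidate action.

```python
import random
from dataclasses import dataclass, field

# Simplified valence limits; an assumption made for illustration only.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


@dataclass
class MolGraph:
    atoms: list = field(default_factory=list)  # element symbols
    bonds: list = field(default_factory=list)  # (i, j) single bonds with i < j

    def degree(self, idx):
        return sum(1 for i, j in self.bonds if idx in (i, j))

    def free_valence(self, idx):
        return MAX_VALENCE[self.atoms[idx]] - self.degree(idx)


def valid_actions(graph, max_atoms):
    """Enumerate the chemically allowed modifications to the current graph."""
    actions = [("terminate",)]
    open_sites = [i for i in range(len(graph.atoms)) if graph.free_valence(i) > 0]
    if len(graph.atoms) < max_atoms:
        for site in open_sites:
            for element in MAX_VALENCE:
                actions.append(("add_atom", site, element))
    for a in open_sites:
        for b in open_sites:
            if a < b and (a, b) not in graph.bonds:
                actions.append(("connect", a, b))
    return actions


def apply_action(graph, action):
    """Apply one add_atom or connect action in place."""
    if action[0] == "add_atom":
        _, site, element = action
        graph.atoms.append(element)
        graph.bonds.append((site, len(graph.atoms) - 1))
    elif action[0] == "connect":
        _, a, b = action
        graph.bonds.append((a, b))


def generate(seed_atom="C", max_atoms=8, seed=0):
    """Grow a graph one action at a time until 'terminate' is sampled."""
    rng = random.Random(seed)
    graph = MolGraph(atoms=[seed_atom])
    while True:
        # Placeholder policy: a trained model would assign a probability to
        # each valid action instead of sampling uniformly.
        action = rng.choice(valid_actions(graph, max_atoms))
        if action[0] == "terminate":
            return graph
        apply_action(graph, action)
```

Calling `generate()` returns a small graph whose atoms never exceed the valence limits in the table. In a real model, the uniform sampling step is where a learned action distribution would be used.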
Crucial to deep generative models for de novo design is the use of optimization strategies which can guide the model towards promising areas of chemical space. Our group is highly interested in the application of reinforcement learning strategies and genetic algorithms to molecular optimization. A critical component of any optimization strategy is an accurate molecular property prediction model which can be used to score or rank molecules. In the cheminformatics literature, these types of models are commonly referred to as quantitative structure-property relationship (QSPR) models. In the case of policy-gradient reinforcement learning, a property prediction model can be used to reward or penalize the reinforcement learning agent; the reward determines the loss, which is used to update the agent parameters such that sampling actions which lead to desirable molecules becomes more likely. The development of improved molecular generative models is therefore inextricably tied to the development of better property prediction models. Properties of interest include biological activity (e.g., IC50, DC50), ADMET (e.g., bioavailability, toxicity), physico-chemical properties (e.g., pKa, solubility), and ternary complex formation.
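To make the policy-gradient idea concrete, here is a minimal REINFORCE-style sketch in plain Python. Everything in it is an assumption chosen for illustration: the four-token `FRAGMENTS` vocabulary, the tabular softmax policy (in place of a neural network agent), and `toy_property_score`, which stands in for a trained property prediction model by simply rewarding sequences rich in `"N"`.

```python
import math
import random

FRAGMENTS = ["C", "N", "O", "F"]  # hypothetical toy action vocabulary
SEQ_LEN = 5                       # number of actions per generated "molecule"


def toy_property_score(tokens):
    """Stand-in for a property predictor: fraction of 'N' tokens."""
    return tokens.count("N") / len(tokens)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits, rng):
    """Sample one action index per position from the current policy."""
    actions = []
    for pos in range(SEQ_LEN):
        probs = softmax(logits[pos])
        actions.append(rng.choices(range(len(FRAGMENTS)), weights=probs)[0])
    return actions


def reinforce_step(logits, rng, batch_size=32, lr=0.5):
    """One policy-gradient update; returns the mean reward of the batch."""
    batch = [sample_sequence(logits, rng) for _ in range(batch_size)]
    rewards = [toy_property_score([FRAGMENTS[a] for a in acts]) for acts in batch]
    baseline = sum(rewards) / len(rewards)  # batch mean reduces variance
    grads = [[0.0] * len(FRAGMENTS) for _ in range(SEQ_LEN)]
    for acts, reward in zip(batch, rewards):
        advantage = reward - baseline
        for pos, a in enumerate(acts):
            probs = softmax(logits[pos])
            for k in range(len(FRAGMENTS)):
                indicator = 1.0 if k == a else 0.0
                # d log pi(a) / d logit_k = indicator - probs[k]
                grads[pos][k] += advantage * (indicator - probs[k]) / batch_size
    for pos in range(SEQ_LEN):
        for k in range(len(FRAGMENTS)):
            logits[pos][k] += lr * grads[pos][k]  # gradient ascent on reward
    return baseline


def train(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(FRAGMENTS) for _ in range(SEQ_LEN)]
    history = [reinforce_step(logits, rng) for _ in range(steps)]
    return logits, history
```

Running `train()` should show the mean batch reward rising from roughly the uniform-policy level as actions that score well become more likely. The quality of the result depends entirely on the scoring function, which is why generative model performance is so closely tied to the accuracy of the property predictor.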
We are currently interested in applications of deep generative models to the following exciting molecular engineering tasks:
Small molecule drugs approved under a “New Drug Application (NDA)” by the FDA comprise ~80% of the 1,200 new molecular entities approved between 1985 and 2021, with the other 20% being new biological products. Small molecules (Figure 2) are generally designed to impede the function of biologically relevant target proteins; for instance, small molecule inhibitors interfere with their targets by binding strongly enough to a protein of interest that its behavior is affected and it can no longer carry out its function.
Figure 2. Typical mechanism of action for a small molecule protein binder.
However, it is estimated that ∼75% of the human proteome lacks deep binding sites and is considered “undruggable” by traditional small molecule inhibitors. Nonetheless, these so-called undruggable targets are implicated in a wide range of diseases, including cancer, autoimmune diseases, and cardiometabolic diseases, motivating the development of therapeutic modalities beyond small molecule inhibitors. To this end, our group is interested in the development of tools for the controlled design of multi-target therapeutic modalities such as PROteolysis TArgeting Chimeras (PROTACs; Figure 3) and molecular glues.
Figure 3. Mechanism of action for a PROteolysis TArgeting Chimera (PROTAC), a class of multi-target therapeutic modality for targeted protein degradation. POI: protein of interest.
The function of PROTACs is generally enabled by a three-component structure consisting of two binding domains and an organic linker (Figure 4). The two binding domains include a warhead designed to bind a protein of interest (POI) and an E3 ligand designed to bind an E3 ligase. Deep generative models can be used to optimize each component independently or to engineer all components of the structure simultaneously. In the ideal scenario, the linker anchors the two proteins together in a short-lived ternary complex, leading to ubiquitination of the POI. The ubiquitinated POI is eventually degraded by the proteasome, thus preventing it from carrying out its function in a cell.
Figure 4. (a) Illustration of the three-component PROTAC structure, and (b) an example PROTAC (PubChem CID: 155168919) from PROTAC-DB. The functionalities of each component are highly interdependent, such that rational design of PROTACs remains challenging.
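The difference between optimizing one PROTAC component at a time and engineering all three together can be illustrated with a small data-structure sketch. The `Protac` class, the helper functions, and the placeholder component names below are hypothetical and are not taken from PROTAC-DB or from our models; real components would be represented as molecular graphs or SMILES strings.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class Protac:
    warhead: str    # binds the protein of interest (POI)
    linker: str     # connects the two binding domains
    e3_ligand: str  # binds the E3 ligase


def vary_one_component(base, component, options):
    """Independent optimization: change one component, hold the others fixed."""
    return [replace(base, **{component: option}) for option in options]


def vary_all_components(warheads, linkers, e3_ligands):
    """Simultaneous design: every combination of the three components."""
    return [Protac(w, l, e) for w, l, e in product(warheads, linkers, e3_ligands)]


# Placeholder labels, not real chemical structures.
warheads = ["warhead_A", "warhead_B"]
linkers = ["linker_1", "linker_2", "linker_3"]
e3_ligands = ["e3_ligand_X", "e3_ligand_Y"]

base = Protac("warhead_A", "linker_1", "e3_ligand_X")
linker_scan = vary_one_component(base, "linker", linkers)
full_space = vary_all_components(warheads, linkers, e3_ligands)
```

Here the linker scan yields 3 candidates while the full combinatorial space yields 12, which shows how quickly the search space grows when the components are designed jointly. Because the functionalities of the components are interdependent, the independent scan can also miss combinations that only work well together.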
Not only do such therapeutic modalities expand the space of targets that can be drugged with small molecules, but they can also leverage weak binding interactions with proteins and have a catalytic mechanism of action, potentially requiring lower doses in patients. Polypharmacology-based therapeutic approaches are also relevant in the development of treatments for various central nervous system (CNS) disorders, such as Parkinson’s disease, schizophrenia, and substance abuse/addiction. Our group is interested in using computational tools to better understand the biology of these diseases and their potential treatment via multi-target therapeutics. By understanding how a small bound molecule affects a given protein’s interactome, the rational design of targeted protein degraders becomes more tractable.
Molecular simulations are not just a useful tool for gaining a better understanding of the molecular interactions and dynamics which govern a specific biological process, but also a useful tool for data generation. In cases where there is not sufficient data available for developing a predictive model, suitable training data can be generated via computational chemistry methods, such as ab initio calculations and classical molecular simulations. We are currently interested in the use of molecular dynamics simulations for predicting successful ternary complex formation when designing new multi-target therapeutics; such data can in turn be used to predict the extent of protein degradation in a new targeted protein degradation system.
Many targets used for deep molecular generation have been extremely well-studied and characterized, and it is unclear how well the models generalize to targets for which much less information is available (e.g., targets with few, if any, known actives, or targets lacking an experimental crystal structure). To add to the challenge, many measured properties are heavily interrelated, making it hard to tease out the effect a single small change will have on a potential new drug. Generalizing our models to maximize the amount of information extracted from available datasets is an active area of research in our group which will allow us to use our models in new drug discovery territory.