Research Themes
Artificial intelligence (AI) is transforming our approach to molecular engineering. To accelerate drug development and materials design, we develop AI-driven methods that enhance how we discover and optimize molecules. We organize our work into two layers. The first is a set of method development pillars, where we build generative and predictive tools that learn from both structured and unstructured data, spanning molecular structures, chemical reactions, physical simulations, and biomedical data such as single-cell transcriptomics and microscopy images. The second is a set of application domains, where we put those methods to work on concrete therapeutic and materials problems. Increasingly, our methods also draw on chemical language models, foundation models, and agentic workflows to make sense of the unstructured chemical literature and to scale data curation and design.
Our method development is organized around six pillars. Together they form a pipeline: from how molecules are represented and the data we learn from, through simulation and accurate property prediction, to the design of new molecules and the prediction of how to make them.
How a molecule is encoded shapes everything a model can learn from it. We study a wide range of molecular representations, from string and descriptor-based encodings to molecular graphs and higher-order topological representations that capture many-body structure through hypergraph and topological message passing. A recurring question is which representations generalize best across chemical space, whether the target is a small-molecule drug or a copolymer.
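As a minimal illustration of how one molecule can be encoded in several ways, the sketch below writes ethanol by hand as a string, as a graph, and as a simple count-based descriptor. It uses only the Python standard library and is not the group's tooling. The names `adjacency` and `degree_descriptor` and the toy descriptor itself are assumptions made for this example.

```python
from collections import defaultdict

# Ethanol, written by hand for illustration (no cheminformatics toolkit is used).
SMILES = "CCO"  # string encoding

# Graph encoding: heavy atoms as nodes, bonds as (i, j, bond_order) edges.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]


def adjacency(n_atoms, bonds):
    """Build an adjacency list from an undirected bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j, _order in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def degree_descriptor(atoms, bonds):
    """Toy descriptor (an assumption): counts of (element, heavy-atom degree) pairs."""
    adj = adjacency(len(atoms), bonds)
    counts = defaultdict(int)
    for idx, element in enumerate(atoms):
        counts[(element, len(adj[idx]))] += 1
    return dict(counts)


graph = adjacency(len(atoms), bonds)
descriptor = degree_descriptor(atoms, bonds)
```

The string is compact but hides connectivity, the graph makes neighbors explicit for message passing, and the descriptor discards connectivity in exchange for a fixed-size summary. Which of these trade-offs generalizes best is exactly the question raised above.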
Good models need good data, and much of chemistry’s most valuable information sits locked in unstructured text, figures, and tables. We develop methods to extract, structure, and curate chemical data at scale. This includes agentic literature-extraction workflows that augment sparse databases, for example in targeted protein degradation, and tools that identify and split molecular substructures from raw records. We also advocate for community standards that make reaction and assay data reusable in the first place.
Molecular simulations are both a way to understand the interactions and dynamics that govern a biological or chemical process and a way to generate training data where experimental data is scarce. We use classical molecular dynamics (MD) and ab initio methods such as density functional theory (DFT) to study systems of interest and to produce labelled data for predictive models. This includes unified datasets that combine experiment and simulation, which we pair with fast surrogate models. One current interest is using MD to predict ternary complex formation when designing multi-target therapeutics.
Accurate property prediction underpins both ranking candidate molecules and rewarding generative models. We build single- and multi-task models, often sharing information across related endpoints to improve data efficiency, for properties spanning biological activity (e.g., IC50, DC50), ADMET (e.g., permeability, efflux, bioavailability, toxicity), drug metabolism, physico-chemical properties (e.g., pKa, solubility), and ternary complex formation. We pay close attention to evaluation, since headline metrics often mask how a model behaves across chemical space.
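To make the point about headline metrics concrete, the sketch below compares an overall mean absolute error (MAE) with the MAE computed per group. The data are made up for illustration, and the grouping by "series" (standing in for a chemical series or scaffold cluster) is an assumption rather than a description of any specific benchmark.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group label, measured value, predicted value).
records = [
    ("series_A", 6.0, 6.1),
    ("series_A", 7.0, 6.9),
    ("series_A", 5.5, 5.6),
    ("series_A", 8.0, 8.1),
    ("series_B", 6.5, 8.0),
    ("series_B", 7.5, 6.0),
]


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return mean(abs(y_true - y_pred) for y_true, y_pred in pairs)


def mae_by_group(records):
    """Compute the MAE separately for each group label."""
    grouped = defaultdict(list)
    for group, y_true, y_pred in records:
        grouped[group].append((y_true, y_pred))
    return {group: mae(pairs) for group, pairs in grouped.items()}


overall = mae([(y, p) for _, y, p in records])
per_group = mae_by_group(records)
```

In this toy data the overall MAE is roughly 0.57, which looks acceptable, yet it averages an error of about 0.1 on the well-represented series with an error of about 1.5 on the sparse one. Reporting the per-group breakdown alongside the headline number exposes that gap.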
At the heart of de novo design is the ability to propose new molecules with a desired property profile and to optimize them efficiently, so as to minimize experimental screening. We develop molecular deep generative models (DGMs) that build molecules in silico, often atom-by-atom or fragment-by-fragment, across a range of therapeutic modalities (Figure 1).

Figure 1. Deep generative models (DGMs) can be developed for a variety of therapeutic modalities. From left to right: a small molecule (agonist), a protein therapeutic, a heterobifunctional degrader, a molecular glue, and an RNA therapeutic (aptamer).
To steer generation toward promising regions of chemical space, we apply reinforcement learning and genetic algorithms, scored by the property prediction models described above. We are also interested in conditional generation, steering models with auxiliary signals such as omics or phenotypic data, and in carefully measuring how well generative models actually cover useful chemical space.
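The loop of proposing candidates, scoring them, and keeping the best can be sketched with a toy genetic algorithm. This is a minimal sketch under strong assumptions: candidates are abstract token strings rather than valid molecules, and `score` is a hand-written stand-in for a learned property predictor. All names and parameters (`ALPHABET`, `evolve`, `n_elite`, and so on) are hypothetical.

```python
import random

ALPHABET = "CNOF"  # hypothetical building-block tokens, not real chemistry
CANDIDATE_LEN = 12


def score(candidate):
    """Stand-in for a property predictor: rewards N and O, penalizes F."""
    return candidate.count("N") + candidate.count("O") - 0.5 * candidate.count("F")


def mutate(candidate, rng):
    """Replace one randomly chosen token."""
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice(ALPHABET) + candidate[pos + 1:]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(generations=30, pop_size=20, n_elite=4, seed=0):
    """Run an elitist genetic algorithm; return the best candidate and score history."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(CANDIDATE_LEN))
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        elite = population[:n_elite]  # elites survive unchanged
        children = []
        while len(children) < pop_size - n_elite:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children
    population.sort(key=score, reverse=True)
    history.append(score(population[0]))
    return population[0], history


best, history = evolve()
```

Because the elites are carried over unchanged, the best score never decreases from one generation to the next. In a realistic setting the scoring function would be a trained property model and the mutation and crossover operators would act on valid molecular structures.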
A molecule is only useful if it can be made. We develop methods for synthesizability assessment and multi-step retrosynthetic planning, framing route prediction with approaches such as decision transformers and bottom-up synthesis trees. We also study how reliable these models are, including how well large language models agree with human experts on the quality of proposed synthesis plans.
We apply and stress-test these methods in domains where AI-driven design can have outsized impact. Three are central to the group today, alongside emerging directions in materials and life-science discovery.
Small molecule drugs approved under a “New Drug Application (NDA)” by the FDA comprise about 80% of the roughly 1,200 new molecular entities approved between 1985 and 2021, with the rest being new biological products. Small molecules (Figure 2) are generally designed to impede the function of biologically relevant target proteins. A small molecule inhibitor, for instance, binds a protein of interest strongly enough that the protein can no longer carry out its function.

Figure 2. Typical mechanism of action for a small molecule protein binder.
However, an estimated 75% of the human proteome lacks deep binding sites and is considered “undruggable” by traditional small molecule inhibitors. Many of these targets nonetheless drive cancer, autoimmune diseases, and cardio-metabolic diseases, which motivates therapeutic modalities beyond inhibition. We are interested in the controlled design of multi-target modalities such as PROteolysis TArgeting Chimeras (PROTACs; Figure 3) and molecular glues, which can act catalytically at low doses and exploit weak, transient interactions.

Figure 3. Mechanism of action for a PROteolysis TArgeting Chimera (PROTAC), a class of multi-target therapeutic modality for targeted protein degradation. POI: protein of interest.
A PROTAC uses a three-component structure (Figure 4): a warhead that binds the protein of interest (POI), an E3 ligand that recruits an E3 ligase, and a linker joining them. When the linker holds the two proteins together in a short-lived ternary complex, the POI is ubiquitinated and then degraded by the proteasome. Because the components are highly interdependent, rational design remains challenging, which is why this is one of our most active areas. Our work here draws on every method pillar above: mining and benchmarking degradation data, identifying PROTAC substructures, predicting degradation activity, modeling ternary complex formation with modern structure-prediction tools (such as AlphaFold3 and Boltz-1), and generatively designing degraders.

Figure 4. (a) Illustration of the three-component PROTAC structure, and (b) an example PROTAC (PubChem CID: 155168919) from PROTAC-DB. The functionalities of each component are highly interdependent, such that rational design of PROTACs remains challenging.
Safer, higher-performance batteries depend on better electrolytes. In collaboration with researchers at Uppsala University, we apply machine learning to electrolyte modeling and design, including generative design of solvents and the prediction of copolymer properties relevant to solid polymer electrolytes. The goal is to navigate a vast chemical space toward materials that balance ionic conductivity, electrochemical stability, and processability.
Many materials used in industrial processes such as semiconductor manufacturing rely on per- and polyfluoroalkyl substances (PFAS), which are persistent and toxic. In collaboration with Intel and EMD Electronics (Merck), we use multi-modal deep learning and generative modeling to discover PFAS alternatives, focusing on the surfactants and photoresists central to chip fabrication. Property prediction from combined experimental and simulated data helps prioritize safer candidates before they are ever synthesized.
Beyond molecular systems, we work on generative models for crystalline and inorganic materials, and on the benchmarks needed to tell whether such models actually produce useful, synthesizable structures. This includes crystal structure prediction directly from powder diffraction data using autoregressive language models, and contributions to community resources for materials discovery such as LeMaterial.
A growing line of applied work uses high-content biological data, such as microscopy images and single-cell transcriptomics, to guide discovery without relying on a known protein target. We study learned representations of cell images and transcriptomic profiles that support robust phenotypic readouts and, ultimately, phenotype-conditioned molecular design.