Building a CVAE for Novel Molecule Generation using SELFIES

Introduction
Generating novel molecules with desired properties is a key challenge in drug discovery and materials science. Traditional methods rely on combinatorial chemistry, but AI-driven approaches using generative models such as Variational Autoencoders (VAEs) have proven effective. In this post, we will explore how a Conditional Variational Autoencoder (CVAE) can generate molecules using SELFIES (SELF-referencing Embedded Strings), a more robust molecular representation than SMILES.
Why SELFIES over SMILES?
SMILES (Simplified Molecular Input Line Entry System) is widely used for molecular representation, but it suffers from validity issues: many randomly generated or mutated SMILES strings do not correspond to valid molecules. SELFIES, by contrast, guarantees by design that every string decodes to a syntactically valid molecule, making it a better fit for generative models.
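A SELFIES string is just a sequence of bracketed symbols, which is what makes it easy to tokenize for a sequence model. In practice the open-source `selfies` package handles conversion (`selfies.encoder` / `selfies.decoder`) and splitting; the stdlib sketch below only illustrates the symbol structure, and the benzene string shown is illustrative of what the package produces, not pulled from our dataset.

```python
import re

def split_selfies(selfies_str: str) -> list:
    """Split a SELFIES string into its bracketed symbols.
    (The `selfies` package provides its own split function; this
    regex version is a stdlib-only illustration.)"""
    return re.findall(r"\[[^\]]*\]", selfies_str)

# Benzene as a SELFIES string (illustrative):
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
symbols = split_selfies(benzene)
# Every symbol is a discrete token the encoder can consume,
# and any sequence of valid symbols decodes to a valid molecule.
```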
Understanding the CVAE Approach
A CVAE is an extension of the standard VAE that incorporates conditional information to guide the generation process. In our case, molecular properties (e.g., logP, QED, molecular weight) serve as conditioning factors, allowing us to generate molecules with desired characteristics.
The CVAE consists of:
Encoder: Converts a SELFIES sequence into a latent space representation.
Latent Space: A compressed, probabilistic representation of molecular features.
Decoder: Maps latent representations back to SELFIES strings, generating new molecules.
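The latent space is probabilistic: the encoder outputs a mean and log-variance per dimension, and a latent vector is drawn via the standard reparameterization trick so gradients can flow through the sampling step. A real model would do this with torch or JAX tensors; the plain-Python sketch below shows only the arithmetic.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.
    Writing sigma as exp(0.5 * log_var) keeps the variance positive
    without constraining the encoder's output."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

mu = [0.5, -1.0]
log_var = [0.0, 0.0]          # sigma = 1 in every dimension
z = reparameterize(mu, log_var)
```

As the predicted variance shrinks toward zero, the sampled z collapses onto the mean, which is why a KL penalty is needed to keep the posterior from degenerating.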
Dataset Preparation
We used the ZINC 250k and QM9 datasets, converting SMILES to SELFIES for training. We also normalized molecular property values to ensure effective conditioning in the CVAE.
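Normalizing the conditioning properties matters because logP, QED, and molecular weight live on very different scales. A minimal z-score normalization sketch (the property values below are hypothetical, not from ZINC or QM9):

```python
import statistics

def zscore_normalize(values):
    """Standardize property values to zero mean and unit variance so
    conditioning signals share a common scale. Returns the normalized
    values plus (mean, std) for de-normalizing user inputs later."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], (mean, std)

# Hypothetical molecular weights from a training set:
mw = [180.2, 250.7, 310.1, 150.9]
normed, (mw_mean, mw_std) = zscore_normalize(mw)
```

Keeping the (mean, std) pair is important at generation time: a user-requested property value must be normalized with the training statistics before it is fed to the decoder.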
Model Architecture
Our CVAE implementation consists of:
Tokenization: Converting SELFIES into one-hot encoded sequences.
Encoder (BiLSTM/GRU): Processes input SELFIES strings and encodes them into latent vectors.
Latent Representation: A mean and variance representation from which latent vectors are sampled.
Decoder (LSTM/GRU): Takes the latent representation and reconstructs the original molecule.
Property Conditioning: Molecular properties are concatenated with latent vectors to guide generation.
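The tokenization step above can be sketched as follows. Each SELFIES symbol maps to an index in a fixed alphabet, and sequences are padded to a common length with `[nop]` (the no-operation padding symbol used by the `selfies` package). The tiny alphabet here is illustrative, not the one extracted from our training data.

```python
def one_hot_encode(symbols, alphabet, max_len):
    """One-hot encode a SELFIES symbol sequence, padding with [nop].
    Returns a max_len x len(alphabet) matrix of 0/1 values."""
    index = {s: i for i, s in enumerate(alphabet)}
    padded = symbols + ["[nop]"] * (max_len - len(symbols))
    return [[1 if index[s] == j else 0 for j in range(len(alphabet))]
            for s in padded]

# Illustrative alphabet; a real one is built from the whole dataset:
alphabet = ["[nop]", "[C]", "[=C]", "[Ring1]", "[=Branch1]"]
encoded = one_hot_encode(["[C]", "[=C]"], alphabet, max_len=4)
# Rows 0-1 one-hot the two symbols; rows 2-3 are [nop] padding.
```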
Training the Model
Steps:
Convert SMILES to SELFIES and tokenize the sequences.
Train the CVAE with a reconstruction loss (cross-entropy) and KL divergence.
Regularize the latent space for smooth interpolation between molecular structures.
Optimize with the Adam optimizer and a controlled learning rate.
Evaluate validity, novelty, and diversity of generated molecules.
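The training objective in step 2 combines a token-level reconstruction loss with the closed-form KL divergence between the diagonal-Gaussian posterior and a standard normal prior. A plain-Python sketch of both terms (real training computes these over batched tensors):

```python
import math

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior:
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def cross_entropy(probs, target_idx):
    """Reconstruction loss: negative log-likelihood of the correct
    SELFIES symbol at each sequence position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_idx))

def cvae_loss(probs, target_idx, mu, log_var, beta=1.0):
    # beta weights the KL term; beta < 1 favors reconstruction.
    return cross_entropy(probs, target_idx) + beta * kl_divergence(mu, log_var)
```

When the posterior exactly matches the prior (mu = 0, log_var = 0), the KL term vanishes; it grows as the encoder pushes the posterior away from N(0, I), which is what keeps the latent space smooth enough for interpolation.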
Key Results:
Training loss decreased from ~8000 to ~10 after fine-tuning.
93% of generated molecules were valid and satisfied Lipinski’s Rule of Five.
Achieved high novelty and diversity in molecule generation.
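The Lipinski check used in the evaluation can be sketched as below. The four descriptor inputs would in practice be computed with a cheminformatics toolkit such as RDKit; this function only encodes the thresholds.

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Lipinski's Rule of Five: molecular weight <= 500 Da,
    logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10.
    Descriptor values are assumed precomputed (e.g. via RDKit)."""
    return (mw <= 500 and logp <= 5 and
            h_donors <= 5 and h_acceptors <= 10)

# Hypothetical generated molecule's descriptors:
ok = passes_lipinski(mw=320.4, logp=2.1, h_donors=2, h_acceptors=5)
```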
Deployment with Streamlit
To make our model user-friendly, we built a Streamlit app where users can:
Input desired molecular properties.
Generate molecules in real-time.
Visualize molecular structures.
Check Lipinski’s Rule of Five compliance.
Conclusion
Using SELFIES with a CVAE significantly improves the validity of generated molecules while allowing fine-grained control over molecular properties. This approach has promising applications in drug discovery, enabling AI-driven design of novel molecules with desired pharmacological traits.
Next Steps
Experiment with larger datasets for better generalization.
Implement reinforcement learning to further optimize generated molecules.
Extend the model to multi-property optimization for real-world drug design applications.