Molecular Generation Models¶
The generator models inherit from the torch_molecule.base.generator.BaseMolecularGenerator
class and share common methods for model training, generation and persistence.
Training and Generation
fit(X, **kwargs)
: Train the model on the given data, where X contains SMILES strings (y should be provided for conditional generation)
generate(n_samples, **kwargs)
: Generate new molecules and return a list of SMILES strings (y should be provided for conditional generation)
Model Persistence
Inherited from torch_molecule.base.base.BaseModel.
save_to_local(path)
: Save the trained model to a local file
load_from_local(path)
: Load a trained model from a local file
save_to_hf(repo_id)
: Push the model to Hugging Face Hub. Not implemented for:
- torch_molecule.generator.graph_ga.modeling_graph_ga.GraphGAMolecularGenerator
load_from_hf(repo_id, local_cache)
: Load a model from Hugging Face Hub and save it to a local file. Not implemented for:
- torch_molecule.generator.graph_ga.modeling_graph_ga.GraphGAMolecularGenerator
save(path, repo_id)
: Save the model to either local storage or Hugging Face
load(path, repo_id)
: Load a model from either local storage or Hugging Face
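The sketch below walks through this shared workflow end to end, using the unconditional DiGress generator as an example. It is a minimal sketch: the SMILES strings, epoch count, and file path are illustrative placeholders, not recommended settings.

```python
# Minimal sketch of the shared fit/generate/persistence API, using an
# unconditional generator (DiGress). Data and hyperparameters are toy values.
from torch_molecule.generator.digress.modeling_digress import DigressMolecularGenerator

train_smiles = ["CCO", "CCN", "CC(=O)O", "c1ccccc1"]  # illustrative training set

model = DigressMolecularGenerator(epochs=10, verbose=True)  # tiny run for demonstration
model.fit(train_smiles)                    # X is a list of SMILES strings
samples = model.generate(batch_size=8)     # returns a list of SMILES strings

model.save_to_local("digress_demo.pt")     # local persistence (path is hypothetical)
model.load_from_local("digress_demo.pt")   # restore the trained weights
```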
Modeling Molecules as Graphs with GNN / Transformer-based Generators¶
GraphDiT for Un/Multi-conditional Molecular Generation
- class torch_molecule.generator.graph_dit.modeling_graph_dit.GraphDITMolecularGenerator(device: torch.device | None = None, model_name: str = 'GraphDITMolecularGenerator', num_layer: int = 6, hidden_size: int = 1152, dropout: float = 0.0, drop_condition: float = 0.0, num_head: int = 16, mlp_ratio: float = 4, task_type: List[str] = <factory>, timesteps: int = 500, batch_size: int = 128, epochs: int = 10000, learning_rate: float = 0.0002, grad_clip_value: float | None = None, weight_decay: float = 0.0, lw_X: float = 1, lw_E: float = 5, guide_scale: float = 2.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularGenerator
This generator implements the Graph Diffusion Transformer (Graph DiT) for unconditional and multi-conditional molecular generation.
References
Graph Diffusion Transformers for Multi-Conditional Molecular Generation. NeurIPS 2024. https://openreview.net/forum?id=cfrDLD1wfO
Implementation: https://github.com/liugangcode/Graph-DiT
- Parameters:
num_layer (int, default=6) – Number of transformer layers
hidden_size (int, default=1152) – Dimension of hidden layers
dropout (float, default=0.0) – Dropout rate for transformer layers
drop_condition (float, default=0.0) – Dropout rate for condition embedding
num_head (int, default=16) – Number of attention heads in transformer
mlp_ratio (float, default=4) – Ratio of MLP hidden dimension to transformer hidden dimension
task_type (List[str], default=[]) – List specifying type of each task (‘regression’ or ‘classification’)
timesteps (int, default=500) – Number of diffusion timesteps
batch_size (int, default=128) – Batch size for training
epochs (int, default=10000) – Number of training epochs
learning_rate (float, default=0.0002) – Learning rate for optimization
grad_clip_value (Optional[float], default=None) – Value for gradient clipping (None = no clipping)
weight_decay (float, default=0.0) – Weight decay for optimization
lw_X (float, default=1) – Loss weight for node reconstruction
lw_E (float, default=5) – Loss weight for edge reconstruction
guide_scale (float, default=2.0) – Scale factor for classifier-free guidance during sampling
use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate on plateau
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced
verbose (bool, default=False) – Whether to display progress bars and logs
- fit(X_train: List[str], y_train: List | ndarray | None = None) GraphDITMolecularGenerator [source]¶
- generate(labels: List[List] | ndarray | Tensor | None = None, num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str] [source]¶
Generate molecules, optionally conditioned on specified properties and node counts.
- Parameters:
labels (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Target properties for the generated molecules. Can be provided as:
- a list of lists for multiple properties
- a numpy array of shape (batch_size, n_properties)
- a torch tensor of shape (batch_size, n_properties)
For a single property, a 1D array/tensor is also accepted. If None, generates unconditional samples.
num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch. If None, samples from the training distribution. Can be provided as:
- a list of lists
- a numpy array of shape (batch_size, 1)
- a torch tensor of shape (batch_size, 1)
batch_size (int, default=32) – Number of molecules to generate. Only used if labels is None.
- Returns:
List of generated molecules in SMILES format.
- Return type:
List[str]
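A minimal usage sketch for conditional generation with GraphDiT, assuming a single regression property; the dataset, property values, and epoch count below are illustrative only.

```python
# Sketch: multi-conditional GraphDiT with one regression property.
import numpy as np
from torch_molecule.generator.graph_dit.modeling_graph_dit import GraphDITMolecularGenerator

train_smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1O"]   # toy data
train_props = np.array([[0.1], [0.2], [0.3], [0.9]])   # shape (n_samples, n_properties)

model = GraphDITMolecularGenerator(task_type=["regression"], epochs=50)
model.fit(train_smiles, train_props)

# Conditional sampling: one target property vector per molecule to generate.
targets = np.array([[0.5]] * 4)
conditional_samples = model.generate(labels=targets)

# Unconditional sampling: omit labels and set batch_size instead.
unconditional_samples = model.generate(batch_size=4)
```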
DiGress for Unconditional Molecular Generation
- class torch_molecule.generator.digress.modeling_digress.DigressMolecularGenerator(device: device | None = None, model_name: str = 'DigressMolecularGenerator', hidden_size_X: int = 256, hidden_size_E: int = 128, num_layer: int = 5, n_head: int = 8, dropout: float = 0.1, timesteps: int = 500, batch_size: int = 512, epochs: int = 1000, learning_rate: float = 0.0002, grad_clip_value: float | None = None, weight_decay: float = 1e-12, lw_X: float = 1, lw_E: float = 5, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularGenerator
This generator implements DiGress for unconditional molecular generation.
References
DiGress: Discrete Denoising Diffusion for Graph Generation. International Conference on Learning Representations (ICLR) 2023. https://openreview.net/forum?id=UaAD-Nu86WX
- Parameters:
hidden_size_X (int, optional) – Hidden dimension size for node features, defaults to 256
hidden_size_E (int, optional) – Hidden dimension size for edge features, defaults to 128
num_layer (int, optional) – Number of transformer layers, defaults to 5
n_head (int, optional) – Number of attention heads, defaults to 8
dropout (float, optional) – Dropout rate for transformer layers, defaults to 0.1
timesteps (int, optional) – Number of diffusion timesteps, defaults to 500
batch_size (int, optional) – Batch size for training, defaults to 512
epochs (int, optional) – Number of training epochs, defaults to 1000
learning_rate (float, optional) – Learning rate for optimization, defaults to 0.0002
grad_clip_value (Optional[float], optional) – Value for gradient clipping (None = no clipping), defaults to None
weight_decay (float, optional) – Weight decay for optimization, defaults to 1e-12
lw_X (float, optional) – Loss weight for node reconstruction, defaults to 1
lw_E (float, optional) – Loss weight for edge reconstruction, defaults to 5
use_lr_scheduler (bool, optional) – Whether to use learning rate scheduler, defaults to False
scheduler_factor (float, optional) – Factor for learning rate scheduler (if use_lr_scheduler is True), defaults to 0.5
scheduler_patience (int, optional) – Patience for learning rate scheduler (if use_lr_scheduler is True), defaults to 5
verbose (bool, optional) – Whether to display progress bars and logs, defaults to False
- fit(X_train: List[str]) DigressMolecularGenerator [source]¶
- generate(num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str] [source]¶
Randomly generate molecules, optionally with specified node counts.
- Parameters:
num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch. If None, samples from the training distribution. Can be provided as:
- a list of lists
- a numpy array of shape (batch_size, 1)
- a torch tensor of shape (batch_size, 1)
batch_size (int, default=32) – Number of molecules to generate.
- Returns:
List of generated molecules in SMILES format.
- Return type:
List[str]
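The num_nodes argument controls the size of each generated molecule. A small sketch, again with toy data; the node counts below are illustrative.

```python
# Sketch: fixing the number of atoms per generated molecule via num_nodes.
import numpy as np
from torch_molecule.generator.digress.modeling_digress import DigressMolecularGenerator

model = DigressMolecularGenerator(epochs=10)   # tiny run for demonstration
model.fit(["CCO", "CCCO", "c1ccccc1", "CCN"])  # toy training set

sizes = np.array([[9], [12], [15], [20]])      # shape (batch_size, 1), one row per molecule
samples = model.generate(num_nodes=sizes)      # len(samples) == 4
```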
GDSS for Score-based Molecular Generation
- class torch_molecule.generator.gdss.modeling_gdss.GDSSMolecularGenerator(device: device | None = None, model_name: str = 'GDSSMolecularGenerator', num_layer: int = 3, hidden_size_adj: float = 8, hidden_size: int = 16, attention_dim: int = 16, num_head: int = 4, sde_type_x: str = 'VE', sde_beta_min_x: float = 0.1, sde_beta_max_x: float = 1, sde_num_scales_x: int = 1000, sde_type_adj: str = 'VE', sde_beta_min_adj: float = 0.1, sde_beta_max_adj: float = 1, sde_num_scales_adj: int = 1000, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.005, grad_clip_value: float | None = 1.0, weight_decay: float = 0.0001, use_loss_reduce_mean: bool = False, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, sampler_predictor: str = 'Reverse', sampler_corrector: str = 'Langevin', sampler_snr: float = 0.2, sampler_scale_eps: float = 0.7, sampler_n_steps: int = 1, sampler_probability_flow: bool = False, sampler_noise_removal: bool = True, verbose: bool = False)[source]¶
Bases:
BaseMolecularGenerator
This generator implements “Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations” (GDSS).
References
Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. ICML 2022. https://arxiv.org/abs/2202.02514
Official Implementation: https://github.com/harryjo97/GDSS
- Parameters:
num_layer (int, default=3) – Number of layers in the score networks.
hidden_size_adj (float, default=8) – Hidden dimension size for the adjacency matrix in the adjacency score network.
hidden_size (int, default=16) – Hidden dimension size of the latent representation.
attention_dim (int, default=16) – Dimension of attention layers.
num_head (int, default=4) – Number of attention heads.
sde_type_x (str, default='VE') – SDE type for node features. One of ‘VP’, ‘VE’, ‘subVP’.
sde_beta_min_x (float, default=0.1) – Minimum noise level for node features.
sde_beta_max_x (float, default=1) – Maximum noise level for node features.
sde_num_scales_x (int, default=1000) – Number of noise scales for node features.
sde_type_adj (str, default='VE') – SDE type for adjacency matrix. One of ‘VP’, ‘VE’, ‘subVP’.
sde_beta_min_adj (float, default=0.1) – Minimum noise level for adjacency matrix.
sde_beta_max_adj (float, default=1) – Maximum noise level for adjacency matrix.
sde_num_scales_adj (int, default=1000) – Number of noise scales for adjacency matrix.
batch_size (int, default=128) – Batch size for training.
epochs (int, default=500) – Number of training epochs.
learning_rate (float, default=0.005) – Learning rate for optimizer.
grad_clip_value (Optional[float], default=1) – Value for gradient clipping. None means no clipping.
weight_decay (float, default=1e-4) – Weight decay for optimizer.
use_loss_reduce_mean (bool, default=False) – Whether to use mean reduction for loss calculation.
use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when using scheduler (only used if use_lr_scheduler is True).
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced (only used if use_lr_scheduler is True).
sampler_predictor (str, default='Reverse') – Predictor method for sampling. One of ‘Euler’, ‘Reverse’.
sampler_corrector (str, default='Langevin') – Corrector method for sampling. One of ‘Langevin’, ‘None’.
sampler_snr (float, default=0.2) – Signal-to-noise ratio for corrector.
sampler_scale_eps (float, default=0.7) – Scale factor for noise level in corrector.
sampler_n_steps (int, default=1) – Number of corrector steps per predictor step.
sampler_probability_flow (bool, default=False) – Whether to use probability flow ODE for sampling.
sampler_noise_removal (bool, default=True) – Whether to remove noise in the final step of sampling.
verbose (bool, default=False) – Whether to display progress bars and logs.
- fit(X_train: List[str]) GDSSMolecularGenerator [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – List of training data in SMILES format.
- Returns:
self – The fitted model.
- Return type:
GDSSMolecularGenerator
- generate(num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str] [source]¶
Randomly generate molecules, optionally with specified node counts.
- Parameters:
num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch. If None, samples from the training distribution. Can be provided as:
- a list of lists
- a numpy array of shape (batch_size, 1)
- a torch tensor of shape (batch_size, 1)
batch_size (int, default=32) – Number of molecules to generate.
- Returns:
List of generated molecules in SMILES format.
- Return type:
List[str]
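A minimal sketch of configuring the SDEs and the predictor-corrector sampler; all values below are illustrative rather than recommended settings.

```python
# Sketch: GDSS with a VP-type SDE on node features and probability-flow sampling.
from torch_molecule.generator.gdss.modeling_gdss import GDSSMolecularGenerator

model = GDSSMolecularGenerator(
    sde_type_x="VP",                # one of 'VP', 'VE', 'subVP' (see parameter table)
    sampler_predictor="Reverse",    # reverse-diffusion predictor
    sampler_corrector="Langevin",   # Langevin corrector steps
    sampler_probability_flow=True,  # deterministic probability-flow ODE sampling
    epochs=20,                      # tiny run for demonstration
)
model.fit(["CCO", "CCN", "CCC", "c1ccccc1"])  # toy training set
samples = model.generate(batch_size=4)
```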
Modeling Molecules as Graphs with Heuristic-based Generators¶
Graph Genetic Algorithm for Un/Multi-conditional Molecular Generation
- class torch_molecule.generator.graph_ga.modeling_graph_ga.GraphGAMolecularGenerator(device: device | None = None, model_name: str = 'GraphGAMolecularGenerator', num_task: int = 0, population_size: int = 100, offspring_size: int = 50, mutation_rate: float = 0.0067, n_jobs: int = 1, iteration: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularGenerator
This generator implements the Graph Genetic Algorithm for molecular generation.
References
A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. Journal of Chemical Information and Computer Sciences. https://pubs.acs.org/doi/10.1021/ci034290p
Implementation: https://github.com/wenhao-gao/mol_opt
- Parameters:
num_task (int, default=0) – Number of properties to condition on. Set to 0 for unconditional generation.
population_size (int, default=100) – Size of the population in each iteration.
offspring_size (int, default=50) – Number of offspring molecules to generate in each iteration.
mutation_rate (float, default=0.0067) – Probability of mutation occurring during reproduction.
n_jobs (int, default=1) – Number of parallel jobs to run. -1 means using all processors.
iteration (int, default=5) – Number of genetic-algorithm iterations to run for each target label (or each random sample).
verbose (bool, default=False) – Whether to display progress bars and logs.
- fit(X_train: List[str], y_train: List | ndarray | None = None, oracle: List[Callable] | None = None) GraphGAMolecularGenerator [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training data, which will be used as the initial population.
y_train (Optional[Union[List, np.ndarray]]) – Training labels for conditional generation (required when num_task is not 0).
oracle (Optional[List[Callable]]) – Oracles used to score the generated molecules (see the sketch below). If not provided, default oracles based on sklearn.ensemble.RandomForestRegressor are trained on X_train and y_train. A customized oracle should be a callable, i.e., oracle(X, y), that takes two inputs: a list of rdkit.Chem.rdchem.Mol objects, and a (1, num_task) numpy array of target values that all molecules in the list aim to achieve (take care of NaN values, if any). Scores for different tasks should be aggregated (e.g., by mean or sum), and the return value should be a list of scores (float), where smaller scores mean closer to the target. Oracles are not needed for unconditional generation.
- Returns:
self – Fitted model.
- Return type:
GraphGAMolecularGenerator
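A sketch of conditional GraphGA with a customized oracle following the contract above. The logP-matching oracle, data, and target values are hypothetical examples, and the oracle is passed as a single-element list to match the List[Callable] signature.

```python
# Sketch: conditional GraphGA with a hypothetical custom oracle that scores
# molecules by how closely their logP matches the target value.
import numpy as np
from rdkit.Chem import Descriptors
from torch_molecule.generator.graph_ga.modeling_graph_ga import GraphGAMolecularGenerator

def logp_oracle(mols, targets):
    # mols: list of rdkit.Chem.rdchem.Mol; targets: (1, num_task) numpy array.
    # Returns one aggregated score per molecule; smaller = closer to the target.
    return [abs(Descriptors.MolLogP(m) - float(targets[0, 0])) for m in mols]

train_smiles = ["CCO", "CCCO", "c1ccccc1", "CCN"]  # initial population (toy)
train_y = np.array([[0.2], [0.6], [1.7], [0.1]])   # one conditioning property

model = GraphGAMolecularGenerator(num_task=1, iteration=3)
model.fit(train_smiles, train_y, oracle=[logp_oracle])
```

After fitting, generation follows the shared generate API described at the top of this page.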
Default Oracles in GraphGA
- class torch_molecule.generator.graph_ga.oracle.Oracle(models=None, num_task=1)[source]¶
Bases:
object
Oracle class for scoring molecules.
This class wraps predictive models (like RandomForestRegressor) to score molecules based on their properties. It handles conversion of SMILES to fingerprints.
- Parameters:
models (List[Any], optional) – List of trained models that implement a predict method. If None, RandomForestRegressors will be created when fit is called.
num_task (int, default=1) – Number of properties to predict.
- fit(X_train, y_train)[source]¶
Fit the underlying models with training data.
- Parameters:
X_train (List[str] or List[RDKit.Mol]) – Training molecules as SMILES strings or RDKit Mol objects.
y_train (np.ndarray) – Training labels with shape (n_samples, num_task).
- Returns:
self – Fitted oracle.
- Return type:
Oracle
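A minimal sketch of the default Oracle wrapper on toy data. Scoring the fitted oracle via a direct call follows the oracle(X, y) contract described in GraphGA's fit documentation and is an assumption here.

```python
# Sketch: fitting the default Oracle and scoring new molecules.
import numpy as np
from rdkit import Chem
from torch_molecule.generator.graph_ga.oracle import Oracle

oracle = Oracle(num_task=1)  # RandomForestRegressors are created on fit
oracle.fit(["CCO", "CCC", "CCN"], np.array([[0.1], [0.4], [0.3]]))

mols = [Chem.MolFromSmiles(s) for s in ["CCCO", "c1ccccc1"]]
scores = oracle(mols, np.array([[0.5]]))  # assumed oracle(X, y) call interface
```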
Modeling Molecules as Sequences with Transformer-based Generators¶
MolGPT for Unconditional Molecular Generation
- class torch_molecule.generator.molgpt.modeling_molgpt.MolGPTMolecularGenerator(device: device | None = None, model_name: str = 'MolGPTMolecularGenerator', num_layer: int = 8, num_head: int = 8, hidden_size: int = 256, max_len: int = 128, num_task: int = 0, use_scaffold: bool = False, use_lstm: bool = False, lstm_layers: int = 0, batch_size: int = 64, epochs: int = 1000, learning_rate: float = 0.0003, adamw_betas: Tuple[float, float] = (0.9, 0.95), weight_decay: float = 0.1, grad_norm_clip: float = 1.0, verbose: bool = False)[source]¶
Bases:
BaseMolecularGenerator
This generator implements the molecular GPT model for generating molecules.
The model uses a GPT-like architecture to learn the distribution of SMILES strings and generate new molecules. It supports conditional generation based on properties and/or molecular scaffolds.
References
MolGPT: Molecular Generation Using a Transformer-Decoder Model. Journal of Chemical Information and Modeling. https://pubs.acs.org/doi/10.1021/acs.jcim.1c00600
- Parameters:
num_layer (int, default=8) – Number of transformer layers in the model.
num_head (int, default=8) – Number of attention heads in each transformer layer.
hidden_size (int, default=256) – Dimension of the hidden representations.
max_len (int, default=128) – Maximum length of SMILES strings.
num_task (int, default=0) – Number of property prediction tasks for conditional generation. 0 for unconditional generation.
use_scaffold (bool, default=False) – Whether to use scaffold conditioning.
use_lstm (bool, default=False) – Whether to use LSTM for encoding scaffold.
lstm_layers (int, default=0) – Number of LSTM layers if use_lstm is True.
batch_size (int, default=64) – Batch size for training.
epochs (int, default=1000) – Number of training epochs.
learning_rate (float, default=3e-4) – Learning rate for optimizer.
adamw_betas (Tuple[float, float], default=(0.9, 0.95)) – Beta parameters for AdamW optimizer.
weight_decay (float, default=0.1) – Weight decay for optimizer.
grad_norm_clip (float, default=1.0) – Gradient norm clipping value.
verbose (bool, default=False) – Whether to display progress bars during training.
- fit(X_train, y_train=None, X_scaffold=None)[source]¶
Train the MolGPT model on SMILES strings.
- Parameters:
X_train (List[str]) – List of SMILES strings for training
y_train (Optional[List[float]]) – Optional list of property values for conditional generation
X_scaffold (Optional[List[str]]) – Optional list of scaffold SMILES strings for conditional generation
- Returns:
self – The fitted model
- Return type:
MolGPTMolecularGenerator
- generate(n_samples=10, properties=None, scaffolds=None, max_len=None, temperature=1.0, top_k=10, starting_token='C')[source]¶
Generate molecules using the trained model.
- Parameters:
n_samples (int, default=10) – Number of molecules to generate
properties (Optional[List[List[float]]]) – Property values for conditional generation
scaffolds (Optional[List[str]]) – Scaffold SMILES for conditional generation
max_len (Optional[int]) – Maximum length of generated SMILES
temperature (float, default=1.0) – Sampling temperature
top_k (int, default=10) – Top-k sampling parameter
starting_token (str, default='C') – Starting token for generation
- Returns:
List of generated SMILES strings
- Return type:
List[str]
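A minimal sketch of unconditional MolGPT training and sampling with temperature and top-k control; the data and hyperparameters are illustrative placeholders.

```python
# Sketch: unconditional MolGPT with temperature / top-k sampling.
from torch_molecule.generator.molgpt.modeling_molgpt import MolGPTMolecularGenerator

train_smiles = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O"]  # toy training set

model = MolGPTMolecularGenerator(num_layer=4, epochs=10)  # small config for demo
model.fit(train_smiles)

samples = model.generate(
    n_samples=8,
    temperature=0.9,  # <1.0 sharpens the token distribution
    top_k=20,         # sample only from the 20 most likely next tokens
)
```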