Molecular Generation Models

The generator models inherit from the torch_molecule.base.generator.BaseMolecularGenerator class and share common methods for model training, generation and persistence.

Training and Generation

  • fit(X, **kwargs): Train the model on given data, where X contains SMILES strings (y should be provided for conditional generation)

  • generate(n_samples, **kwargs): Generate new molecules and return a list of SMILES strings (target labels should be provided for conditional generation)

Model Persistence

Methods for saving and loading trained models are inherited from torch_molecule.base.base.BaseModel.
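
For illustration, a minimal sketch of the shared workflow, using DigressMolecularGenerator (documented below) as the example. The commented persistence calls use assumed method names, not a confirmed API; consult BaseModel for the exact interface:

    from torch_molecule.generator.digress.modeling_digress import DigressMolecularGenerator

    model = DigressMolecularGenerator(epochs=10, verbose=True)  # small run for illustration
    model.fit(["CCO", "c1ccccc1", "CC(=O)O", "CCN"])            # X: SMILES strings
    samples = model.generate(batch_size=4)                      # returns List[str] of SMILES

    # Persistence is inherited from BaseModel; the method names below are
    # illustrative assumptions:
    # model.save_to_local("digress_ckpt.pt")
    # model.load_from_local("digress_ckpt.pt")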

Modeling Molecules as Graphs with GNN / Transformer-based Generators

GraphDiT for Un/Multi-conditional Molecular Generation

class torch_molecule.generator.graph_dit.modeling_graph_dit.GraphDITMolecularGenerator(device: torch.device | None = None, model_name: str = 'GraphDITMolecularGenerator', num_layer: int = 6, hidden_size: int = 1152, dropout: float = 0.0, drop_condition: float = 0.0, num_head: int = 16, mlp_ratio: float = 4, task_type: List[str] = <factory>, timesteps: int = 500, batch_size: int = 128, epochs: int = 10000, learning_rate: float = 0.0002, grad_clip_value: float | None = None, weight_decay: float = 0.0, lw_X: float = 1, lw_E: float = 5, guide_scale: float = 2.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]

Bases: BaseMolecularGenerator

This generator implements the graph diffusion transformer for (multi-conditional and unconditional) molecular generation.

References

  • Graph Diffusion Transformers for Multi-Conditional Molecular Generation. NeurIPS 2024.

Parameters:
  • num_layer (int, default=6) – Number of transformer layers

  • hidden_size (int, default=1152) – Dimension of hidden layers

  • dropout (float, default=0.0) – Dropout rate for transformer layers

  • drop_condition (float, default=0.0) – Dropout rate for condition embedding

  • num_head (int, default=16) – Number of attention heads in transformer

  • mlp_ratio (float, default=4) – Ratio of MLP hidden dimension to transformer hidden dimension

  • task_type (List[str], default=[]) – List specifying type of each task (‘regression’ or ‘classification’)

  • timesteps (int, default=500) – Number of diffusion timesteps

  • batch_size (int, default=128) – Batch size for training

  • epochs (int, default=10000) – Number of training epochs

  • learning_rate (float, default=0.0002) – Learning rate for optimization

  • grad_clip_value (Optional[float], default=None) – Value for gradient clipping (None = no clipping)

  • weight_decay (float, default=0.0) – Weight decay for optimization

  • lw_X (float, default=1) – Loss weight for node reconstruction

  • lw_E (float, default=5) – Loss weight for edge reconstruction

  • guide_scale (float, default=2.0) – Scale factor for classifier-free guidance during sampling

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate on plateau

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced

  • verbose (bool, default=False) – Whether to display progress bars and logs

fit(X_train: List[str], y_train: List | ndarray | None = None) GraphDITMolecularGenerator[source]
generate(labels: List[List] | ndarray | Tensor | None = None, num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str][source]

Generate molecules with specified properties and optional node counts.

Parameters:
  • labels (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Target properties for the generated molecules: a list of lists for multiple properties, a numpy array of shape (batch_size, n_properties), or a torch tensor of the same shape. For a single property, a 1D array/tensor may also be provided. If None, generates unconditional samples.

  • num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch: a list of lists, a numpy array of shape (batch_size, 1), or a torch tensor of shape (batch_size, 1). If None, node counts are sampled from the training distribution.

  • batch_size (int, default=32) – Number of molecules to generate. Only used if labels is None.

Returns:

List of generated molecules in SMILES format.

Return type:

List[str]
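
A hedged usage sketch for GraphDIT; the dataset, property values, and epoch count are illustrative assumptions, not recommended settings:

    import numpy as np
    from torch_molecule.generator.graph_dit.modeling_graph_dit import GraphDITMolecularGenerator

    # Two conditioning tasks: one regression target, one classification target.
    model = GraphDITMolecularGenerator(
        task_type=["regression", "classification"],
        epochs=100,  # reduced from the 10000 default for a quick demo
    )

    X = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]               # training SMILES
    y = np.array([[0.5, 1], [1.2, 0], [0.3, 1], [0.8, 0]])  # (n_samples, n_tasks)
    model.fit(X, y)

    # Conditional sampling: one row of target properties per molecule.
    smiles = model.generate(labels=np.array([[0.7, 1], [0.4, 0]]))

    # Unconditional sampling: leave labels as None.
    smiles_uncond = model.generate(batch_size=8)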

DiGress for Unconditional Molecular Generation

class torch_molecule.generator.digress.modeling_digress.DigressMolecularGenerator(device: device | None = None, model_name: str = 'DigressMolecularGenerator', hidden_size_X: int = 256, hidden_size_E: int = 128, num_layer: int = 5, n_head: int = 8, dropout: float = 0.1, timesteps: int = 500, batch_size: int = 512, epochs: int = 1000, learning_rate: float = 0.0002, grad_clip_value: float | None = None, weight_decay: float = 1e-12, lw_X: float = 1, lw_E: float = 5, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]

Bases: BaseMolecularGenerator

This generator implements DiGress for unconditional molecular generation.

References

  • DiGress: Discrete Denoising Diffusion for Graph Generation. ICLR 2023. https://arxiv.org/abs/2209.14734

Parameters:
  • hidden_size_X (int, optional) – Hidden dimension size for node features, defaults to 256

  • hidden_size_E (int, optional) – Hidden dimension size for edge features, defaults to 128

  • num_layer (int, optional) – Number of transformer layers, defaults to 5

  • n_head (int, optional) – Number of attention heads, defaults to 8

  • dropout (float, optional) – Dropout rate for transformer layers, defaults to 0.1

  • timesteps (int, optional) – Number of diffusion timesteps, defaults to 500

  • batch_size (int, optional) – Batch size for training, defaults to 512

  • epochs (int, optional) – Number of training epochs, defaults to 1000

  • learning_rate (float, optional) – Learning rate for optimization, defaults to 0.0002

  • grad_clip_value (Optional[float], optional) – Value for gradient clipping (None = no clipping), defaults to None

  • weight_decay (float, optional) – Weight decay for optimization, defaults to 1e-12

  • lw_X (float, optional) – Loss weight for node reconstruction, defaults to 1

  • lw_E (float, optional) – Loss weight for edge reconstruction, defaults to 5

  • use_lr_scheduler (bool, optional) – Whether to use learning rate scheduler, defaults to False

  • scheduler_factor (float, optional) – Factor for learning rate scheduler (if use_lr_scheduler is True), defaults to 0.5

  • scheduler_patience (int, optional) – Patience for learning rate scheduler (if use_lr_scheduler is True), defaults to 5

  • verbose (bool, optional) – Whether to display progress bars and logs, defaults to False

fit(X_train: List[str]) DigressMolecularGenerator[source]
generate(num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str][source]

Randomly generate molecules with specified node counts.

Parameters:
  • num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch: a list of lists, a numpy array of shape (batch_size, 1), or a torch tensor of shape (batch_size, 1). If None, node counts are sampled from the training distribution.

  • batch_size (int, default=32) – Number of molecules to generate.

Returns:

List of generated molecules in SMILES format.

Return type:

List[str]
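
A minimal sketch for DiGress; training data and settings are illustrative:

    import numpy as np
    from torch_molecule.generator.digress.modeling_digress import DigressMolecularGenerator

    model = DigressMolecularGenerator(epochs=50)  # short run for illustration
    model.fit(["CCO", "c1ccccc1", "CC(=O)O", "CCN"])

    # Node counts sampled from the training distribution:
    smiles = model.generate(batch_size=16)

    # Or pin the node count of each generated molecule (shape (batch_size, 1)):
    smiles_fixed = model.generate(num_nodes=np.array([[8], [10], [12]]))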

GDSS for Score-based Molecular Generation

class torch_molecule.generator.gdss.modeling_gdss.GDSSMolecularGenerator(device: device | None = None, model_name: str = 'GDSSMolecularGenerator', num_layer: int = 3, hidden_size_adj: float = 8, hidden_size: int = 16, attention_dim: int = 16, num_head: int = 4, sde_type_x: str = 'VE', sde_beta_min_x: float = 0.1, sde_beta_max_x: float = 1, sde_num_scales_x: int = 1000, sde_type_adj: str = 'VE', sde_beta_min_adj: float = 0.1, sde_beta_max_adj: float = 1, sde_num_scales_adj: int = 1000, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.005, grad_clip_value: float | None = 1.0, weight_decay: float = 0.0001, use_loss_reduce_mean: bool = False, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, sampler_predictor: str = 'Reverse', sampler_corrector: str = 'Langevin', sampler_snr: float = 0.2, sampler_scale_eps: float = 0.7, sampler_n_steps: int = 1, sampler_probability_flow: bool = False, sampler_noise_removal: bool = True, verbose: bool = False)[source]

Bases: BaseMolecularGenerator

This generator implements “Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations” (GDSS).

References

  • Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. ICML 2022. https://arxiv.org/abs/2202.02514

Parameters:
  • num_layer (int, default=3) – Number of layers in the score networks.

  • hidden_size_adj (float, default=8) – Hidden dimension size for the adjacency matrix in the adjacency score network.

  • hidden_size (int, default=16) – Hidden dimension size for the latent representation.

  • attention_dim (int, default=16) – Dimension of attention layers.

  • num_head (int, default=4) – Number of attention heads.

  • sde_type_x (str, default='VE') – SDE type for node features. One of ‘VP’, ‘VE’, ‘subVP’.

  • sde_beta_min_x (float, default=0.1) – Minimum noise level for node features.

  • sde_beta_max_x (float, default=1) – Maximum noise level for node features.

  • sde_num_scales_x (int, default=1000) – Number of noise scales for node features.

  • sde_type_adj (str, default='VE') – SDE type for adjacency matrix. One of ‘VP’, ‘VE’, ‘subVP’.

  • sde_beta_min_adj (float, default=0.1) – Minimum noise level for adjacency matrix.

  • sde_beta_max_adj (float, default=1) – Maximum noise level for adjacency matrix.

  • sde_num_scales_adj (int, default=1000) – Number of noise scales for adjacency matrix.

  • batch_size (int, default=128) – Batch size for training.

  • epochs (int, default=500) – Number of training epochs.

  • learning_rate (float, default=0.005) – Learning rate for optimizer.

  • grad_clip_value (Optional[float], default=1.0) – Value for gradient clipping. None means no clipping.

  • weight_decay (float, default=1e-4) – Weight decay for optimizer.

  • use_loss_reduce_mean (bool, default=False) – Whether to use mean reduction for loss calculation.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when using scheduler (only used if use_lr_scheduler is True).

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced (only used if use_lr_scheduler is True).

  • sampler_predictor (str, default='Reverse') – Predictor method for sampling. One of ‘Euler’, ‘Reverse’.

  • sampler_corrector (str, default='Langevin') – Corrector method for sampling. One of ‘Langevin’, ‘None’.

  • sampler_snr (float, default=0.2) – Signal-to-noise ratio for corrector.

  • sampler_scale_eps (float, default=0.7) – Scale factor for noise level in corrector.

  • sampler_n_steps (int, default=1) – Number of corrector steps per predictor step.

  • sampler_probability_flow (bool, default=False) – Whether to use probability flow ODE for sampling.

  • sampler_noise_removal (bool, default=True) – Whether to remove noise in the final step of sampling.

  • verbose (bool, default=False) – Whether to display progress bars and logs.

fit(X_train: List[str]) GDSSMolecularGenerator[source]

Fit the model to the training data.

Parameters:

X_train (List[str]) – List of training data in SMILES format.

Returns:

self – The fitted model.

Return type:

GDSSMolecularGenerator

generate(num_nodes: List[List] | ndarray | Tensor | None = None, batch_size: int = 32) List[str][source]

Randomly generate molecules with specified node counts.

Parameters:
  • num_nodes (Optional[Union[List[List], np.ndarray, torch.Tensor]], default=None) – Number of nodes for each molecule in the batch: a list of lists, a numpy array of shape (batch_size, 1), or a torch tensor of shape (batch_size, 1). If None, node counts are sampled from the training distribution.

  • batch_size (int, default=32) – Number of molecules to generate.

Returns:

List of generated molecules in SMILES format.

Return type:

List[str]
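
A minimal sketch for GDSS, including a few sampler options; all values are illustrative, not tuned:

    from torch_molecule.generator.gdss.modeling_gdss import GDSSMolecularGenerator

    model = GDSSMolecularGenerator(
        sampler_predictor="Reverse",
        sampler_corrector="Langevin",
        sampler_snr=0.2,
        epochs=50,  # reduced from the 500 default for a quick demo
    )
    model.fit(["CCO", "c1ccccc1", "CC(=O)O", "CCN"])
    smiles = model.generate(batch_size=16)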

Modeling Molecules as Graphs with Heuristic-based Generators

Graph Genetic Algorithm for Un/Multi-conditional Molecular Generation

class torch_molecule.generator.graph_ga.modeling_graph_ga.GraphGAMolecularGenerator(device: device | None = None, model_name: str = 'GraphGAMolecularGenerator', num_task: int = 0, population_size: int = 100, offspring_size: int = 50, mutation_rate: float = 0.0067, n_jobs: int = 1, iteration: int = 5, verbose: bool = False)[source]

Bases: BaseMolecularGenerator

This generator implements the Graph Genetic Algorithm for molecular generation.

References

  • A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. Journal of Chemical Information and Computer Sciences. https://pubs.acs.org/doi/10.1021/ci034290p

  • Implementation: https://github.com/wenhao-gao/mol_opt

Parameters:
  • num_task (int, default=0) – Number of properties to condition on. Set to 0 for unconditional generation.

  • population_size (int, default=100) – Size of the population in each iteration.

  • offspring_size (int, default=50) – Number of offspring molecules to generate in each iteration.

  • mutation_rate (float, default=0.0067) – Probability of mutation occurring during reproduction.

  • n_jobs (int, default=1) – Number of parallel jobs to run. -1 means using all processors.

  • iteration (int, default=5) – Number of iterations for each target label (or random sample) to run the genetic algorithm.

  • verbose (bool, default=False) – Whether to display progress bars and logs.

fit(X_train: List[str], y_train: List | ndarray | None = None, oracle: List[Callable] | None = None) GraphGAMolecularGenerator[source]

Fit the model to the training data.

Parameters:
  • X_train (List[str]) – Training data, which will be used as the initial population.

  • y_train (Optional[Union[List, np.ndarray]]) – Training labels for conditional generation (required when num_task is not 0).

  • oracle (Optional[List[Callable]]) –

    Oracle(s) used to score the generated molecules. If not provided, default oracles based on sklearn.ensemble.RandomForestRegressor are trained on X_train and y_train.

    A customized oracle should be a Callable of the form oracle(X, y). Please wrap your oracle to take two inputs:

    • a list of rdkit.Chem.rdchem.Mol objects, and

    • a (1, num_task) numpy array of target values that all molecules in the list should achieve (handle NaN values if present).

    Scores across tasks should be aggregated (e.g., by mean or sum), and the return value should be a list of floats, one per molecule; smaller scores mean closer to the target. A sketch of a customized oracle is given after generate below.

    Oracles are not needed for unconditional generation.

Returns:

self – Fitted model.

Return type:

GraphGAMolecularGenerator

generate(labels: List[List] | ndarray | None = None, num_samples: int = 32) List[str][source]

Generate molecules using genetic algorithm optimization.
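
A sketch of conditional GraphGA with a hand-written oracle following the contract documented under fit. The logP-based scoring via RDKit is an illustrative assumption, and the oracle is passed as a single-element list to match the List[Callable] signature:

    import numpy as np
    from rdkit.Chem import Descriptors
    from torch_molecule.generator.graph_ga.modeling_graph_ga import GraphGAMolecularGenerator

    def logp_oracle(mols, targets):
        # mols: list of rdkit Mol objects; targets: (1, num_task) array of goals.
        # Returns one float per molecule; smaller means closer to the target.
        goal = float(targets[0, 0])
        return [abs(Descriptors.MolLogP(m) - goal) for m in mols]

    model = GraphGAMolecularGenerator(num_task=1, iteration=3)
    X = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]    # initial population
    y = np.array([[-0.3], [1.7], [0.1], [0.2]])  # illustrative logP labels
    model.fit(X, y, oracle=[logp_oracle])

    # Generation conditioned on each row of target labels:
    smiles = model.generate(labels=np.array([[1.0], [0.0]]))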

Default Oracles in GraphGA

class torch_molecule.generator.graph_ga.oracle.Oracle(models=None, num_task=1)[source]

Bases: object

Oracle class for scoring molecules.

This class wraps predictive models (like RandomForestRegressor) to score molecules based on their properties. It handles conversion of SMILES to fingerprints.

Parameters:
  • models (List[Any], optional) – List of trained models that implement a predict method. If None, RandomForestRegressors will be created when fit is called.

  • num_task (int, default=1) – Number of properties to predict.

fit(X_train, y_train)[source]

Fit the underlying models with training data.

Parameters:
  • X_train (List[str] or List[RDKit.Mol]) – Training molecules as SMILES strings or RDKit Mol objects.

  • y_train (np.ndarray) – Training labels with shape (n_samples, num_task).

Returns:

self – Fitted oracle.

Return type:

Oracle
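
A minimal sketch of constructing and fitting the default oracle directly; the data is illustrative, and how the fitted oracle is then queried for scores is left to its documented interface:

    import numpy as np
    from torch_molecule.generator.graph_ga.oracle import Oracle

    oracle = Oracle(num_task=2)  # RandomForestRegressors are created on fit
    X = ["CCO", "c1ccccc1", "CC(=O)O"]
    y = np.array([[0.5, 1.0], [1.2, 0.0], [0.3, 1.0]])  # (n_samples, num_task)
    oracle.fit(X, y)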

Modeling Molecules as Sequences with Transformer-based Generators

MolGPT for Unconditional Molecular Generation

class torch_molecule.generator.molgpt.modeling_molgpt.MolGPTMolecularGenerator(device: device | None = None, model_name: str = 'MolGPTMolecularGenerator', num_layer: int = 8, num_head: int = 8, hidden_size: int = 256, max_len: int = 128, num_task: int = 0, use_scaffold: bool = False, use_lstm: bool = False, lstm_layers: int = 0, batch_size: int = 64, epochs: int = 1000, learning_rate: float = 0.0003, adamw_betas: Tuple[float, float] = (0.9, 0.95), weight_decay: float = 0.1, grad_norm_clip: float = 1.0, verbose: bool = False)[source]

Bases: BaseMolecularGenerator

This generator implements the molecular GPT model for generating molecules.

The model uses a GPT-like architecture to learn the distribution of SMILES strings and generate new molecules. It supports conditional generation based on properties and/or molecular scaffolds.

References

  • MolGPT: Molecular Generation Using a Transformer-Decoder Model. Journal of Chemical Information and Modeling, 2021.

Parameters:
  • num_layer (int, default=8) – Number of transformer layers in the model.

  • num_head (int, default=8) – Number of attention heads in each transformer layer.

  • hidden_size (int, default=256) – Dimension of the hidden representations.

  • max_len (int, default=128) – Maximum length of SMILES strings.

  • num_task (int, default=0) – Number of property prediction tasks for conditional generation. 0 for unconditional generation.

  • use_scaffold (bool, default=False) – Whether to use scaffold conditioning.

  • use_lstm (bool, default=False) – Whether to use LSTM for encoding scaffold.

  • lstm_layers (int, default=0) – Number of LSTM layers if use_lstm is True.

  • batch_size (int, default=64) – Batch size for training.

  • epochs (int, default=1000) – Number of training epochs.

  • learning_rate (float, default=3e-4) – Learning rate for optimizer.

  • adamw_betas (Tuple[float, float], default=(0.9, 0.95)) – Beta parameters for AdamW optimizer.

  • weight_decay (float, default=0.1) – Weight decay for optimizer.

  • grad_norm_clip (float, default=1.0) – Gradient norm clipping value.

  • verbose (bool, default=False) – Whether to display progress bars during training.

fit(X_train, y_train=None, X_scaffold=None)[source]

Train the MolGPT model on SMILES strings.

Parameters:
  • X_train (List[str]) – List of SMILES strings for training

  • y_train (Optional[List[float]]) – Optional list of property values for conditional generation

  • X_scaffold (Optional[List[str]]) – Optional list of scaffold SMILES strings for conditional generation

Returns:

self – The fitted model

Return type:

MolGPTMolecularGenerator

generate(n_samples=10, properties=None, scaffolds=None, max_len=None, temperature=1.0, top_k=10, starting_token='C')[source]

Generate molecules using the trained model.

Parameters:
  • n_samples (int, default=10) – Number of molecules to generate

  • properties (Optional[List[List[float]]]) – Property values for conditional generation

  • scaffolds (Optional[List[str]]) – Scaffold SMILES for conditional generation

  • max_len (Optional[int]) – Maximum length of generated SMILES

  • temperature (float, default=1.0) – Sampling temperature

  • top_k (int, default=10) – Top-k sampling parameter

  • starting_token (Optional[str]) – Starting token for generation (default is ‘C’)

Returns:

List of generated SMILES strings

Return type:

List[str]
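
A minimal sketch for MolGPT, unconditional and scaffold-conditioned; the data, epoch counts, and one-scaffold-per-sample pairing are illustrative assumptions:

    from torch_molecule.generator.molgpt.modeling_molgpt import MolGPTMolecularGenerator

    # Unconditional generation
    model = MolGPTMolecularGenerator(epochs=10)  # reduced from the 1000 default for a quick demo
    model.fit(["CCO", "c1ccccc1", "CC(=O)O", "CCN"])
    smiles = model.generate(n_samples=5, temperature=1.0, top_k=10)

    # Scaffold-conditioned generation
    cond = MolGPTMolecularGenerator(use_scaffold=True, use_lstm=True, lstm_layers=2, epochs=10)
    cond.fit(X_train=["CCOc1ccccc1", "CCc1ccccc1"], X_scaffold=["c1ccccc1", "c1ccccc1"])
    smiles_cond = cond.generate(n_samples=4, scaffolds=["c1ccccc1"] * 4)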