Molecular Property Prediction Models

The predictor models inherit from the torch_molecule.base.predictor.BaseMolecularPredictor class and share common methods for model training, evaluation, prediction and persistence.

Training and Prediction

  • fit(X, y, **kwargs): Train the model on given data, where X contains SMILES strings and y contains target values

  • autofit(X, y, search_parameters, n_trials=50, **kwargs): Automatically search for optimal hyperparameters using Optuna and train the model

  • predict(X, **kwargs): Make predictions on new SMILES strings and return a dictionary containing predictions and optional uncertainty estimates

Model Persistence

inherited from torch_molecule.base.base.BaseModel

  • save_to_local(path): Save the trained model to a local file

  • load_from_local(path): Load a trained model from a local file

  • save_to_hf(repo_id): Push the model to Hugging Face Hub

  • load_from_hf(repo_id, local_cache): Load a model from Hugging Face Hub and save it to a local file

  • save(path, repo_id): Save the model to either local storage or Hugging Face

  • load(path, repo_id): Load a model from either local storage or Hugging Face

Modeling Molecules as Graphs with Graph Neural Networks

Graph Neural Networks

class torch_molecule.predictor.gnn.modeling_gnn.GNNMolecularPredictor(num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'GNNMolecularPredictor')[source]

Bases: BaseMolecularPredictor

This predictor implements a GNN model for molecular property prediction tasks.

Parameters:
  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor[source]

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor[source]

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray][source]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Graph Rationalization with Environment-based Data Augmentation

class torch_molecule.predictor.grea.modeling_grea.GREAMolecularPredictor(gamma: float = 0.4, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'GREAMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements GREA model from the paper “Graph Rationalization with Environment-based Augmentations”.

During model training, it learns the rationales (explainable subgraphs). Use them for data augmentation. During prediction, the model uses the rationales to make predictions and the rationales themselves can also explain the predictions.

References

Parameters:
  • gamma (float, default=0.4) – GREA-specific parameter that penalize the size of the rationales (ratio between the number of nodes in the rationales and the number of nodes in the original graph).

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to use for computation.

  • model_name (str, default="GREAMolecularPredictor") – Name of the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray | List[List]][source]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

  • ’variance’: Prediction variances (shape: [n_samples, n_tasks])

  • ’node_importance’: A nested list where the outer list has length n_samples and each inner list has length n_nodes for that molecule

Return type:

Dict[str, np.ndarray]

Semi-Supervised Graph Imbalanced Regression Models

class torch_molecule.predictor.sgir.modeling_sgir.SGIRMolecularPredictor(num_anchor: int = 10, warmup_epoch: int = 20, labeling_interval: int = 5, augmentation_interval: int = 5, top_quantile: float = 0.1, label_logscale: bool = False, lw_aug: float = 1, gamma: float = 0.4, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'SGIRMolecularPredictor')[source]

Bases: GREAMolecularPredictor

This predictor implements SGIR for semi-supervised graph imbalanced regression.

It trains the GREA model based on pseudo-labeling and data augmentation.

References

Parameters:
  • num_anchor (int, default=10) – Number of anchor points used to split the label space during pseudo-labeling.

  • warmup_epoch (int, default=20) – Number of epochs to train before starting pseudo-labeling and data augmentation.

  • labeling_interval (int, default=5) – Interval (in epochs) between two pseudo-labeling steps. It controls the update frequency of pseudo-labeling.

  • augmentation_interval (int, default=5) – Interval (in epochs) between two data augmentation steps. It controls the update frequency of data augmentation.

  • top_quantile (float, default=0.1) – Quantile threshold for selecting high confidence predictions during pseudo-labeling.

  • label_logscale (bool, default=False) – Whether to use log scale for the label space during pseudo-labeling and data augmentation.

  • lw_aug (float, default=1) – Loss weight for the augmented data.

  • gamma (float, default=0.4) – GREA-specific parameter that penalize the size of the rationales (ratio between the number of nodes in the rationales and the number of nodes in the original graph).

  • num_task (int, default=1) – Number of prediction tasks. SGIR currently only supports regression tasks with 1 task.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to run the model on. If None, will auto-detect GPU or use CPU.

  • model_name (str, default="SGIRMolecularPredictor") – Name identifier for the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) SGIRMolecularPredictor[source]

Fit the model to training data with optional validation set.

predict(X: List[str]) Dict[str, ndarray | List[List]]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

  • ’variance’: Prediction variances (shape: [n_samples, n_tasks])

  • ’node_importance’: A nested list where the outer list has length n_samples and each inner list has length n_nodes for that molecule

Return type:

Dict[str, np.ndarray]

Learning Repetition-Invariant Representations for Polymer Informatics

class torch_molecule.predictor.grin.modeling_grin.GRINMolecularPredictor(l1_penalty: float = 0.001, epochs_to_penalize: int = 100, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'GRINMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements GRIN for Max Spanning Tree algorithm aligned GNN.

The full name of GRIN is Graph Invariant Representation Learning.

References

Parameters:
  • l1_penalty (float, default=1e-3) – Weight for the L1 penalty.

  • epochs_to_penalize (int, default=100) – Number of epochs to train before starting L1 penalty.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to use for computation.

  • model_name (str, default="GRINMolecularPredictor") – Name of the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Graph neural networks extrapolate out-of-distribution for shortest paths

class torch_molecule.predictor.bfgnn.modeling_bfgnn.BFGNNMolecularPredictor(l1_penalty: float = 0.001, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'BFGNNMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements algorithm alignment of Bellman-Ford algorithm with GNN.

References

Parameters:
  • l1_penalty (float, default=1e-3) – L1 regularization penalty strength for feature selection.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Discovering Invariant Rationales for Graph Neural Networks

class torch_molecule.predictor.dir.modeling_dir.DIRMolecularPredictor(causal_ratio: float = 0.8, lw_invariant: float = 0.0001, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'DIRMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements the DIR for molecular property prediction tasks.

The full name of DIR is Discovering Invariant Rationales.

References

Parameters:
  • causal_ratio (float, default=0.8) – The ratio of causal edges to keep during training. A higher ratio means more edges are considered causal/important for the prediction. This controls the sparsity of the learned rationales.

  • lw_invariant (float, default=1e-4) – The weight of the invariance loss term. This loss encourages the model to learn rationales that are invariant across different environments/perturbations. A higher value puts more emphasis on learning invariant features.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to use for computation.

  • model_name (str, default="DIRMolecularPredictor") – Name of the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray | List[List]][source]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Invariant Risk Minimization with GNNs

class torch_molecule.predictor.irm.modeling_irm.IRMMolecularPredictor(IRM_environment: Tensor | ndarray | List | str = 'random', scale: float = 1.0, penalty_weight: float = 1.0, penalty_anneal_iters: int = 100, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'IRMMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements a Invariant Risk Minimization model with the GNN.

The full name of IRM is Invariant Risk Minimization.

References

Parameters:
  • IRM_environment (Union[torch.Tensor, np.ndarray, List, str], default="random") – Environment assignments for IRM. Can be a list of integers (one per sample), or “random” to assign environments randomly.

  • scale (float, default=1.0) – Scaling factor for the IRM penalty term.

  • penalty_weight (float, default=1.0) – Weight of the IRM penalty in the loss function.

  • penalty_anneal_iters (int, default=100) – Number of iterations for annealing the penalty weight.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to run the model on.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Relational Pooling for Graph Representations with GNNs

class torch_molecule.predictor.rpgnn.modeling_rpgnn.RPGNNMolecularPredictor(num_perm: int = 3, fixed_size: int = 10, num_node_feature: int = 9, num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'RPGNNMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements a GNN model based on Relational pooling.

The full name of RPGNN is Relational Pooling for Graph Representations.

References

Parameters:
  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to run the model on. If None, will auto-detect GPU or use CPU.

  • model_name (str, default="RPGNNMolecularPredictor") – Name identifier for the model.

  • num_perm (int, default=3) – Number of random permutations to use for relational pooling.

  • fixed_size (int, default=10) – Maximum number of nodes to consider in the graph.

  • num_node_feature (int, default=9) – Dimension of the input node features. This should match the number of atomic features used to represent each node in the molecular graph (e.g., atomic number, degree, hybridization, etc.).

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

SizeShiftReg: a Regularization Method for Improving Size-Generalization in GNNs

class torch_molecule.predictor.ssr.modeling_ssr.SSRMolecularPredictor(coarse_ratios: List[float] = [0.8, 0.9], cmd_coeff: float = 0.1, fine_grained: bool = True, n_moments: int = 5, coarse_pool: str = 'mean', num_task: int = 1, task_type: str = 'regression', num_layer: int = 5, hidden_size: int = 300, gnn_type: str = 'gin-virtual', drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', graph_pooling: str = 'sum', augmented_feature: list[Literal['morgan', 'maccs']] | None = None, batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, weight_decay: float = 0.0, grad_clip_value: float | None = None, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, verbose: bool = False, device: device | str | None = None, model_name: str = 'SSRMolecularPredictor')[source]

Bases: GNNMolecularPredictor

This predictor implements a SizeShiftReg model with the GNN.

References

Parameters:
  • coarse_ratios (List[float], default=[0.8, 0.9]) – List of ratios for graph coarsening. Each ratio determines the percentage of nodes to keep in the coarsened graph.

  • cmd_coeff (float, default=0.1) – Weight for CMD (Central Moment Discrepancy) loss. Controls the strength of the size-shift regularization.

  • fine_grained (bool, default=True) – Whether to use fine-grained CMD. When True, matches distributions at a more detailed level.

  • n_moments (int, default=5) – Number of moments to match in the CMD calculation. Higher values capture more complex distribution characteristics.

  • coarse_pool (str, default='mean') – Pooling method for coarsened graphs.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • num_layer (int, default=5) – Number of GNN layers.

  • hidden_size (int, default=300) – Dimension of hidden node features.

  • gnn_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].

  • drop_ratio (float, default=0.5) – Dropout probability.

  • norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].

  • graph_pooling (str, default="sum") – Method for aggregating node features to graph-level representations. One of [“sum”, “mean”, “max”].

  • augmented_feature (list or None, default=None) – Additional molecular fingerprints to use as features. It will be concatenated with the graph representation after pooling. Examples like [“morgan”, “maccs”] or None.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to use for computations.

  • model_name (str, default="SSRMolecularPredictor") – Name of the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) GNNMolecularPredictor

Automatically find the best hyperparameters using Optuna optimization.

fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, X_unlbl: List[str] | None = None) GNNMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

  • X_unlbl (List[str], optional) – Unlabeled set input molecular structures as SMILES strings.

Returns:

self – Fitted estimator

Return type:

GNNMolecularPredictor

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Modeling Molecules as Sequences with RNNs

LSTM models based on SMILES

class torch_molecule.predictor.lstm.modeling_lstm.LSTMMolecularPredictor(device: device | str | None = None, model_name: str = 'LSTMMolecularPredictor', num_task: int = 1, task_type: str = 'regression', input_dim: int = 54, output_dim: int = 15, LSTMunits: int = 60, max_input_len: int = 200, batch_size: int = 128, epochs: int = 500, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, learning_rate: float = 0.001, weight_decay: float = 0.0, patience: int = 50, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]

Bases: BaseMolecularPredictor

This predictor implements a LSTM model for molecular property prediction tasks.

References

Parameters:
  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • input_dim (int, default=54) – Size of vocabulary for SMILES tokenization.

  • output_dim (int, default=15) – Dimension of embedding vectors.

  • LSTMunits (int, default=60) – Number of hidden units in LSTM layers.

  • max_input_len (int, default=200) – Maximum length of input sequences. Longer sequences will be truncated.

  • batch_size (int, default=128) – Number of samples per batch for training.

  • epochs (int, default=500) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training. Defaults to MSELoss for regression.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.001) – Learning rate for optimizer.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=50) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

  • device (torch.device or str, optional) – Device to run the model on. If None, will auto-detect GPU or use CPU.

  • model_name (str, default="LSTMMolecularPredictor") – Name identifier for the model.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) LSTMMolecularPredictor[source]
fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None) LSTMMolecularPredictor[source]

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

predict(X: List[str]) Dict[str, ndarray][source]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]

Transformer models based on SMILES

class torch_molecule.predictor.smiles_transformer.modeling_transformer.SMILESTransformerMolecularPredictor(num_task: int = 1, task_type: str = 'regression', input_dim: int = 54, hidden_size: int = 128, n_heads: int = 4, num_layers: int = 3, dim_feedforward: int | None = 256, max_input_len: int = 200, dropout: float = 0.1, batch_size: int = 64, epochs: int = 200, loss_criterion: Callable | None = None, evaluate_criterion: str | Callable | None = None, evaluate_higher_better: bool | None = None, learning_rate: float = 0.0001, weight_decay: float = 0.0, patience: int = 20, use_lr_scheduler: bool = True, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False, device: device | str | None = None, model_name: str = 'SMILESTransformerMolecularPredictor')[source]

Bases: LSTMMolecularPredictor

This predictor implements a Transformer model for SMILES-based molecular property predictions.

Notes

This implementation uses a transformer encoder architecture to learn representations of molecular structures from SMILES strings.

Parameters:
  • device (torch.device or str, optional) – Device to run the model on. If None, will auto-detect GPU or use CPU.

  • model_name (str, default="SMILESTransformerMolecularPredictor") – Name identifier for the model.

  • num_task (int, default=1) – Number of prediction tasks.

  • task_type (str, default="regression") – Type of prediction task, either “regression” or “classification”.

  • input_dim (int, default=54) – Size of vocabulary for SMILES tokenization.

  • hidden_size (int, default=128) – Dimension of embedding vectors.

  • n_heads (int, default=4) – Number of attention heads in transformer layers.

  • num_layers (int, default=3) – Number of transformer encoder layers.

  • dim_feedforward (int, default=256) – Dimension of the feedforward network in transformer layers.

  • max_input_len (int, default=200) – Maximum length of input sequences. Shorter sequences will be padded.

  • dropout (float, default=0.1) – Dropout rate for transformer layers.

  • batch_size (int, default=64) – Number of samples per batch for training.

  • epochs (int, default=200) – Maximum number of training epochs.

  • loss_criterion (callable, optional) – Loss function for training. Defaults to MSELoss for regression.

  • evaluate_criterion (str or callable, optional) – Metric for model evaluation.

  • evaluate_higher_better (bool, optional) – Whether higher values of the evaluation metric are better.

  • learning_rate (float, default=0.0001) – Learning rate for optimizer.

  • weight_decay (float, default=0.0) – L2 regularization strength.

  • patience (int, default=20) – Number of epochs to wait for improvement before early stopping.

  • use_lr_scheduler (bool, default=True) – Whether to use learning rate scheduler.

  • scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.

  • scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool, default=False) – Whether to print progress information during training.

autofit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None, search_parameters: Dict[str, ParameterSpec] | None = None, n_trials: int = 10) LSTMMolecularPredictor
fit(X_train: List[str], y_train: List | ndarray | None, X_val: List[str] | None = None, y_val: List | ndarray | None = None) LSTMMolecularPredictor

Fit the model to the training data with optional validation set.

Parameters:
  • X_train (List[str]) – Training set input molecular structures as SMILES strings

  • y_train (Union[List, np.ndarray]) – Training set target values for property prediction

  • X_val (List[str], optional) – Validation set input molecular structures as SMILES strings. If None, training data will be used for validation

  • y_val (Union[List, np.ndarray], optional) – Validation set target values. Required if X_val is provided

predict(X: List[str]) Dict[str, ndarray]

Make predictions using the fitted model.

Parameters:

X (List[str]) – List of SMILES strings to make predictions for

Returns:

Dictionary containing:
  • ’prediction’: Model predictions (shape: [n_samples, n_tasks])

Return type:

Dict[str, np.ndarray]