Utility Functions

class torch_molecule.utils.checker.MolecularInputChecker

Bases: object

Class for validating input data used in molecular models.

static validate_inputs(X: List[str], y: List | ndarray | None = None, num_task: int = 0, num_pretask: int = 0, return_rdkit_mol: bool = True) → Tuple[List[str] | List[Mol], ndarray | None]

Validate a list of SMILES strings, and optionally validate a target array.

Parameters:

X (List[str]) – List of SMILES strings
y (Optional[Union[List, np.ndarray]], optional) – Optional target values, by default None
num_task (int, optional) – Total number of tasks; used to check dimensions of y, by default 0
num_pretask (int, optional) – Number of predefined (pseudo-)tasks in the model; used to check the dimensions of y. Primarily used in supervised pretraining, by default 0
return_rdkit_mol (bool, optional) – If True, convert SMILES to RDKit Mol objects, by default True

Returns:

A tuple containing:

The original SMILES strings, or RDKit Mol objects if return_rdkit_mol=True
The target array as a numpy array, or None if y was not provided

Return type:

Tuple[Union[List[str], List[Chem.Mol]], Optional[np.ndarray]]

Raises:

ValueError – If SMILES or target dimensions are invalid

static validate_smiles(smiles: str, idx: int) → Tuple[bool, str | None, Mol | None]

Validate a single SMILES string at a given index.

Parameters:

smiles (str) – The SMILES string to validate
idx (int) – The index of the SMILES string in the original list

Returns:

A tuple containing:

A boolean indicating whether the SMILES string is valid
A string describing the error if the SMILES is invalid, or None if valid
The RDKit Mol object if valid, or None if invalid

Return type:

Tuple[bool, Optional[str], Optional[Chem.Mol]]
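
The three-element return value lends itself to a loop that collects every failure before raising. The sketch below illustrates that calling pattern only. `toy_validate` is a hypothetical stand-in that checks for empty strings and unbalanced parentheses rather than parsing chemistry with RDKit, and raising a single `ValueError` that lists all failures is an assumption, not confirmed behavior of `validate_inputs`.

```python
from typing import List, Optional, Tuple


def toy_validate(smiles: str, idx: int) -> Tuple[bool, Optional[str], Optional[str]]:
    """Hypothetical stand-in for validate_smiles; it does NOT parse chemistry."""
    if not isinstance(smiles, str) or not smiles.strip():
        return False, f"Empty or non-string SMILES at index {idx}", None
    if smiles.count("(") != smiles.count(")"):
        return False, f"Unbalanced parentheses at index {idx}: {smiles!r}", None
    # A real validator would return an RDKit Mol here; we return the stripped string.
    return True, None, smiles.strip()


def collect_valid(smiles_list: List[str]) -> List[str]:
    """Validate every entry, then raise one ValueError listing all failures."""
    parsed: List[str] = []
    errors: List[str] = []
    for idx, smi in enumerate(smiles_list):
        ok, err, obj = toy_validate(smi, idx)
        if ok:
            parsed.append(obj)
        else:
            errors.append(err)
    if errors:
        raise ValueError("Invalid inputs:\n" + "\n".join(errors))
    return parsed
```

Reporting the index with each message lets the caller locate bad entries in the original list without re-running the validation.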

class torch_molecule.utils.checkpoint.HuggingFaceCheckpointManager

Bases: object

Handles saving and loading of models to and from the Hugging Face Hub.

static load_model_from_hf(model_instance, repo_id: str, path: str, config_filename: str = 'config.json') → None

Load a model from the Hugging Face Hub, saving it locally to path first.

static push_to_huggingface(model_instance, repo_id: str, task_id: str = 'default', metadata_dict: Dict[str, Any] | None = None, metrics: Dict[str, float] | None = None, commit_message: str = 'Update model', token: str | None = None, private: bool = False, config_filename: str = 'config.json') → None

Push a task-specific model checkpoint to the Hugging Face Hub.

class torch_molecule.utils.checkpoint.LocalCheckpointManager

Bases: object

Handles saving and loading of models to and from local paths.

static load_model_from_local(model_instance, path: str) → None

Load model weights and configuration from a local file.

static save_model_to_local(model_instance, path: str) → None

Save model weights and configuration to a local file.

torch_molecule.utils.format.sanitize_config(config_dict)

Recursively sanitize config dictionary for JSON serialization.

Handles nested structures and special cases.

Parameters:

config_dict (dict) – Configuration dictionary to sanitize

Returns:

Sanitized configuration dictionary that is JSON serializable

Return type:

dict

torch_molecule.utils.format.serialize_config(obj)

Helper function to make config JSON serializable.

Handles special cases like lambda functions, torch modules, and numpy arrays.

Parameters:

obj (Any) – The object to serialize

Returns:

JSON serializable representation of the object

Return type:

Any
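
To make the idea of recursive sanitization concrete, here is a minimal standard-library sketch. It is not the library's implementation: `to_jsonable` is a hypothetical name, and the specific fallbacks (callables become their `__name__`, objects with a `tolist()` method are expanded, everything else becomes `repr(obj)`) are assumptions chosen for illustration.

```python
import json
from typing import Any


def to_jsonable(obj: Any) -> Any:
    """Recursively convert obj into something json.dumps accepts (illustrative only)."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, dict):
        # JSON object keys must be strings.
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    if hasattr(obj, "tolist"):
        # Array-like objects (for example NumPy arrays) expose tolist().
        return to_jsonable(obj.tolist())
    if callable(obj):
        # Functions and lambdas are replaced by their name.
        return getattr(obj, "__name__", repr(obj))
    return repr(obj)


config = {"lr": 1e-3, "hidden_dims": (64, 32), "activation": lambda x: x}
print(json.dumps(to_jsonable(config)))
```

Handling containers before the fallbacks keeps nested dictionaries and sequences intact, so only the leaves that JSON cannot represent are replaced by strings.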

torch_molecule.utils.hf.create_model_card(model_class: str, model_name: str, tasks_config: Dict, model_config: Dict, repo_id: str, existing_readme: str = '') → str

Create a model card for multiple tasks.

Parameters:

model_class (str) – Class name of the model
model_name (str) – Name of the model
tasks_config (Dict) – Configuration for all tasks
model_config (Dict) – General model configuration
repo_id (str) – Repository ID
existing_readme (str) – Existing README content

Returns:

Generated model card content

Return type:

str

torch_molecule.utils.hf.get_existing_repo_data(repo_id: str, token: str | None = None) → Tuple[bool, Dict, str]

Get existing repository data from HuggingFace Hub.

Parameters:

repo_id (str) – Repository ID
token (Optional[str]) – HuggingFace token

Returns:

Tuple containing (repo_exists, existing_config, existing_readme)

Return type:

Tuple[bool, Dict, str]

torch_molecule.utils.hf.merge_task_configs(task_id: str, existing_config: Dict, new_task_config: Dict, num_params: int) → Dict

Merge task-specific configuration and maintain version history.

Parameters:

task_id (str) – Task identifier (e.g., ‘O2’, ‘N2’)
existing_config (Dict) – Existing configuration dictionary
new_task_config (Dict) – New task configuration to merge
num_params (int) – Number of model parameters

Returns:

Updated configuration with task history

Return type:

Dict

class torch_molecule.utils.search.ParameterSpec(param_type: ParameterType, value_range: Tuple[Any, Any] | List[Any])

Bases: NamedTuple

Specification for a hyperparameter including its type and valid range/options.

param_type: ParameterType

Alias for field number 0

value_range: Tuple[Any, Any] | List[Any]

Alias for field number 1

class torch_molecule.utils.search.ParameterType(value)

Bases: Enum

Enum defining types of hyperparameters for optimization.

Each type corresponds to a specific Optuna suggest method and parameter behavior.

CATEGORICAL = 'categorical'

FLOAT = 'float'

INTEGER = 'integer'

LOG_FLOAT = 'log_float'

torch_molecule.utils.search.parse_list_params(params_str)

torch_molecule.utils.search.suggest_parameter(trial: Any, param_name: str, param_spec: ParameterSpec) → Any

Suggest a parameter value using the appropriate Optuna suggest method.

Parameters:

trial (optuna.Trial) – The Optuna trial object
param_name (str) – Name of the parameter
param_spec (ParameterSpec) – Specification of the parameter type and range

Returns:

The suggested parameter value

Return type:

Any

Raises:

ValueError – If the parameter type is not recognized
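
The dispatch from ParameterType to an Optuna suggest method can be sketched as follows. To keep the example runnable without Optuna installed, ParameterType and ParameterSpec are re-declared from the fields documented above, and `StubTrial` is a hypothetical stand-in that mimics the `suggest_categorical`, `suggest_int`, and `suggest_float` methods of `optuna.Trial` with a seeded random generator. The exact mapping (in particular, LOG_FLOAT using `suggest_float(..., log=True)`) is an assumption about how the library behaves.

```python
import math
import random
from enum import Enum
from typing import Any, List, NamedTuple, Tuple, Union


class ParameterType(Enum):
    CATEGORICAL = "categorical"
    FLOAT = "float"
    INTEGER = "integer"
    LOG_FLOAT = "log_float"


class ParameterSpec(NamedTuple):
    param_type: ParameterType
    value_range: Union[Tuple[Any, Any], List[Any]]


class StubTrial:
    """Hypothetical stand-in for optuna.Trial, for illustration only."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def suggest_categorical(self, name, choices):
        return self.rng.choice(list(choices))

    def suggest_int(self, name, low, high):
        return self.rng.randint(low, high)

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space, then map back.
            return math.exp(self.rng.uniform(math.log(low), math.log(high)))
        return self.rng.uniform(low, high)


def suggest_sketch(trial: Any, param_name: str, spec: ParameterSpec) -> Any:
    """Pick the suggest method that matches spec.param_type."""
    if spec.param_type is ParameterType.CATEGORICAL:
        return trial.suggest_categorical(param_name, spec.value_range)
    if spec.param_type in (ParameterType.INTEGER, ParameterType.FLOAT, ParameterType.LOG_FLOAT):
        low, high = spec.value_range
        if spec.param_type is ParameterType.INTEGER:
            return trial.suggest_int(param_name, low, high)
        is_log = spec.param_type is ParameterType.LOG_FLOAT
        return trial.suggest_float(param_name, low, high, log=is_log)
    raise ValueError(f"Unrecognized parameter type: {spec.param_type!r}")


trial = StubTrial(seed=0)
lr_spec = ParameterSpec(ParameterType.LOG_FLOAT, (1e-5, 1e-2))
print(suggest_sketch(trial, "learning_rate", lr_spec))
```

With a real study, the `trial` argument would be the object Optuna passes to the objective function, and the stub class would not be needed.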

torch_molecule.utils.graph.features.atom_feature_vector_to_dict(atom_feature)

torch_molecule.utils.graph.features.atom_to_feature_vector(atom)

Converts an RDKit atom object to a feature list of indices.

Parameters:

atom – RDKit atom object

Returns:

list

torch_molecule.utils.graph.features.bond_feature_vector_to_dict(bond_feature)

torch_molecule.utils.graph.features.bond_to_feature_vector(bond)

Converts an RDKit bond object to a feature list of indices.

Parameters:

bond – RDKit bond object

Returns:

list

torch_molecule.utils.graph.features.get_atom_feature_dims()

torch_molecule.utils.graph.features.get_bond_feature_dims()

torch_molecule.utils.graph.features.getmaccsfingerprint(mol)

torch_molecule.utils.graph.features.getmorganfingerprint(mol)

torch_molecule.utils.graph.features.safe_index(l, e)

Return the index of element e in list l. If e is not present, return the last index.
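
The "fall back to the last index" behavior described for safe_index is typically paired with a vocabulary whose final entry is a catch-all bucket. The sketch below reimplements that behavior from the description alone. The function name `safe_index_sketch` and the `HYBRIDIZATION` list are hypothetical and are not taken from the library.

```python
from typing import Any, List


def safe_index_sketch(options: List[Any], value: Any) -> int:
    """Return the position of value in options, or the last index if it is absent."""
    try:
        return options.index(value)
    except ValueError:
        return len(options) - 1


# Hypothetical vocabulary; the final entry acts as the "unknown" bucket.
HYBRIDIZATION = ["SP", "SP2", "SP3", "misc"]

print(safe_index_sketch(HYBRIDIZATION, "SP2"))    # known value
print(safe_index_sketch(HYBRIDIZATION, "SP3D2"))  # unseen value maps to "misc"
```

Because unseen values share one index, a feature vector built this way always stays within the dimensions of the vocabulary.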

torch_molecule.utils.graph.graph_from_smiles.add_fingerprint_feature(mol, feature_type, get_fingerprint_fn)

torch_molecule.utils.graph.graph_from_smiles.get_augmented_property(mol, properties)

torch_molecule.utils.graph.graph_from_smiles.graph_from_smiles(smiles_or_mol, properties, augmented_features=None, augmented_properties=None)

Converts SMILES string or RDKit molecule to graph Data object

Parameters:

smiles_or_mol (Union[str, rdkit.Chem.rdchem.Mol]) – SMILES string or RDKit molecule object
properties (Any) – Properties to include in the graph
augmented_features (list, optional) – List of augmented features to include, by default None
augmented_properties (list, optional) – List of augmented properties to include

Returns:

Graph object dictionary

Return type:

dict

torch_molecule.utils.graph.graph_to_smiles.build_molecule_with_partial_charges(atom_types, edge_types, atom_decoder, verbose=False)

torch_molecule.utils.graph.graph_to_smiles.check_valency(mol)

torch_molecule.utils.graph.graph_to_smiles.check_valid(smiles)

torch_molecule.utils.graph.graph_to_smiles.connect_fragments(mol)

torch_molecule.utils.graph.graph_to_smiles.correct_mol(mol, connection=False)

torch_molecule.utils.graph.graph_to_smiles.get_mol(smiles_or_mol)

Loads a SMILES string or molecule into an RDKit Mol object.

torch_molecule.utils.graph.graph_to_smiles.graph_to_smiles(molecule_list: List[Tuple], atom_decoder: list) → List[str | None]

torch_molecule.utils.graph.graph_to_smiles.mol2smiles(mol)

torch_molecule.utils.graph.graph_to_smiles.select_atom_with_available_valency(frag)

torch_molecule.utils.graph.graph_to_smiles.select_atoms_with_available_valency(frag)

torch_molecule.utils.graph.graph_to_smiles.try_to_connect_fragments(combined_mol, frag, atom1, atom2)

torch_molecule.utils.generic.metrics.accuracy_score(y_true, logits, avergae=None, thresholds=None, task_weights=None, task_types=None)

Calculate accuracy for multiple tasks from prediction logits.

Parameters:

y_true (numpy.ndarray) – Ground truth labels with shape (n_samples, n_tasks)
logits (numpy.ndarray) – Prediction logits with shape (n_samples, n_tasks)
task_types (list or None, optional) – List of task types ('binary' or 'multiclass') for each task. If None, assumes all tasks are binary
thresholds (numpy.ndarray or None, optional) – Classification thresholds for binary tasks with shape (n_tasks,). If None, uses 0.5 for all binary tasks
task_weights (numpy.ndarray or None, optional) – Weights for each task with shape (n_tasks,). If None, all tasks are weighted equally

Returns:

A dictionary containing:

'task_accuracies': Accuracy for each individual task
'weighted_accuracy': Overall weighted accuracy across all tasks
'macro_accuracy': Simple average of all task accuracies
'predictions': Binary predictions after applying activation and thresholds

Return type:

dict

Raises:

ValueError – If input shapes don’t match or dimensions are incorrect

torch_molecule.utils.generic.metrics.mean_absolute_error(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) → float | ndarray

Calculate Mean Absolute Error for multi-task regression, handling NaN values.

Parameters:

y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average MAE across all valid tasks. If False, return individual MAE for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)

Returns:

If average=True, returns mean MAE across all valid tasks. If average=False, returns array of MAE scores with NaN for invalid tasks.

Return type:

Union[float, np.ndarray]
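
The NaN handling described above amounts to scoring each task column on its valid rows only and then averaging over the tasks that produced a score. The pure-Python sketch below illustrates that idea. `nan_aware_mae` is a hypothetical name, `sample_weight` is omitted, and treating a row as invalid when either the target or the prediction is NaN is an assumption.

```python
import math
from typing import List, Sequence, Union


def nan_aware_mae(
    y_true: Sequence[Sequence[float]],
    y_pred: Sequence[Sequence[float]],
    average: bool = True,
) -> Union[float, List[float]]:
    """Per-task MAE over non-NaN rows; tasks with no valid rows score NaN."""
    n_tasks = len(y_true[0])
    per_task: List[float] = []
    for t in range(n_tasks):
        errors = [
            abs(row_true[t] - row_pred[t])
            for row_true, row_pred in zip(y_true, y_pred)
            if not (math.isnan(row_true[t]) or math.isnan(row_pred[t]))
        ]
        per_task.append(sum(errors) / len(errors) if errors else math.nan)
    if not average:
        return per_task
    valid = [score for score in per_task if not math.isnan(score)]
    if not valid:
        raise ValueError("No valid tasks found")
    return sum(valid) / len(valid)


nan = math.nan
y_true = [[1.0, nan, nan], [2.0, 4.0, nan], [3.0, 6.0, nan]]
y_pred = [[1.5, 0.0, 0.0], [2.0, 5.0, 0.0], [2.5, 8.0, 0.0]]
print(nan_aware_mae(y_true, y_pred, average=False))
print(nan_aware_mae(y_true, y_pred))
```

In this example the third task has no labels at all, so it contributes NaN to the per-task list and is left out of the average.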

torch_molecule.utils.generic.metrics.mean_squared_error(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None, squared: bool = True) → float | ndarray

Calculate Mean Squared Error for multi-task regression, handling NaN values.

Parameters:

y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average MSE across all valid tasks. If False, return individual MSE for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)
squared (bool, default=True) – If True, returns MSE value. If False, returns RMSE value.

Returns:

If average=True, returns mean MSE/RMSE across all valid tasks. If average=False, returns array of MSE/RMSE scores with NaN for invalid tasks.

Return type:

Union[float, np.ndarray]
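
The effect of the squared flag can be shown on a single task column. This is a sketch under stated assumptions rather than the library's code: `task_mse` is a hypothetical helper, rows with a NaN target or prediction are skipped, and RMSE is taken as the square root of that task's MSE.

```python
import math
from typing import Sequence


def task_mse(y_true: Sequence[float], y_pred: Sequence[float], squared: bool = True) -> float:
    """MSE (or RMSE when squared=False) for one task, ignoring NaN rows."""
    pairs = [
        (t, p)
        for t, p in zip(y_true, y_pred)
        if not (math.isnan(t) or math.isnan(p))
    ]
    if not pairs:
        return math.nan  # no valid samples for this task
    mse = sum((t - p) ** 2 for t, p in pairs) / len(pairs)
    return mse if squared else math.sqrt(mse)


targets = [1.0, 2.0, math.nan, 4.0]
preds = [1.0, 4.0, 3.0, 2.0]
print(task_mse(targets, preds))                 # mean of squared errors on 3 valid rows
print(task_mse(targets, preds, squared=False))  # square root of the value above
```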

torch_molecule.utils.generic.metrics.r2_score(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) → float | ndarray

Calculate R² Score for multi-task regression, handling NaN values.

Parameters:

y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average R² across all valid tasks. If False, return individual R² for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)

Returns:

If average=True, returns mean R² across all valid tasks. If average=False, returns array of R² scores with NaN for invalid tasks.

Return type:

Union[float, np.ndarray]

torch_molecule.utils.generic.metrics.roc_auc_score(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) → float | ndarray

Calculate ROC AUC scores for multi-task binary classification, handling NaN values.

For each task dimension, computes AUC score using only the non-NaN samples. Tasks with insufficient valid samples or unique labels are masked in the output.

Parameters:

y_true (Union[np.ndarray, list]) – True binary labels. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted probabilities. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average ROC AUC score across all valid tasks. If False, return individual scores for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights for each instance. Shape should be (n_samples,)

Returns:

If average=True, returns mean ROC AUC score across all valid tasks. If average=False, returns array of ROC AUC scores with NaN for invalid tasks.

Return type:

Union[float, np.ndarray]

Raises:

ValueError – If input shapes don’t match or no valid tasks are found
TypeError – If inputs are not of correct type

Examples

>>> import numpy as np
>>> from torch_molecule.utils.generic.metrics import roc_auc_score
>>> y_true = np.array([[0, 1, np.nan], [1, 0, 1], [1, np.nan, 0], [0, 0, 1]])
>>> y_pred = np.array([[0.1, 0.8, 0.7], [0.9, 0.2, 0.8], [0.8, 0.7, 0.3], [0.2, 0.1, 0.9]])
>>> score = roc_auc_score(y_true, y_pred)
>>> print(f"Average ROC AUC across valid tasks: {score:.3f}")
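
For readers who want to see the masking logic spelled out, the following standard-library sketch computes a per-task AUC as the fraction of positive/negative pairs ranked correctly (ties count one half), which is equivalent to the area under the ROC curve. The helper names are hypothetical, `sample_weight` is omitted, and marking a task invalid when it lacks either class after NaN removal is an assumption about what "insufficient unique labels" means.

```python
import math
from typing import List, Sequence, Union


def task_auc(labels: Sequence[float], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUC for one task, skipping NaN labels."""
    pos = [s for y, s in zip(labels, scores) if not math.isnan(y) and y == 1]
    neg = [s for y, s in zip(labels, scores) if not math.isnan(y) and y == 0]
    if not pos or not neg:
        return math.nan  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multitask_auc(
    y_true: Sequence[Sequence[float]],
    y_pred: Sequence[Sequence[float]],
    average: bool = True,
) -> Union[float, List[float]]:
    """Apply task_auc to each column and optionally average the valid tasks."""
    n_tasks = len(y_true[0])
    per_task = [
        task_auc([row[t] for row in y_true], [row[t] for row in y_pred])
        for t in range(n_tasks)
    ]
    if not average:
        return per_task
    valid = [a for a in per_task if not math.isnan(a)]
    if not valid:
        raise ValueError("No valid tasks found")
    return sum(valid) / len(valid)


nan = math.nan
y_true = [[0, 1, nan], [1, 0, 1], [1, nan, 0], [0, 0, 1]]
y_pred = [[0.1, 0.8, 0.7], [0.9, 0.2, 0.8], [0.8, 0.7, 0.3], [0.2, 0.1, 0.9]]
print(multitask_auc(y_true, y_pred, average=False))
```

The pairwise form is quadratic in the number of samples, so it is suited to explanation and small checks rather than large datasets.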

torch_molecule.utils.generic.metrics.root_mean_squared_error(y_true, y_pred, average, sample_weight)

torch_molecule.utils.generic.metrics.sigmoid(x)

Numerically stable sigmoid function.
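
"Numerically stable" usually means the exponential is never evaluated on a large positive argument. One common way to achieve that is to branch on the sign of the input, as in the scalar sketch below. Whether the library uses this exact formulation is an assumption, and `stable_sigmoid` is a hypothetical name.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sigmoid that only ever calls exp() on a non-positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z lies in (0, 1) and cannot overflow
    return z / (1.0 + z)


print(stable_sigmoid(0.0))
print(stable_sigmoid(-1000.0))  # the naive 1 / (1 + exp(1000)) raises OverflowError
```

Both branches are algebraically the same function; the split only changes which form is evaluated so that intermediate values stay finite.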

torch_molecule.utils.generic.weights.init_weights(net, init_type='xavier', init_gain=0.02, verbose=False)

Initialize network weights.

Parameters:

net (network) – Network to initialize
init_type (str) – Initialization method: normal | xavier | kaiming | orthogonal
init_gain (float) – Gain used by the initialization method