Utility Functions
- class torch_molecule.utils.checker.MolecularInputChecker
Bases: object
Class for validating input data used in molecular models.
- static validate_inputs(X: List[str], y: List | ndarray | None = None, num_task: int = 0, num_pretask: int = 0, return_rdkit_mol: bool = True) -> Tuple[List[str] | List[Mol], ndarray | None]
Validate a list of SMILES strings, and optionally validate a target array.
- Parameters:
X (List[str]) – List of SMILES strings
y (Optional[Union[List, np.ndarray]], optional) – Optional target values, by default None
num_task (int, optional) – Total number of tasks; used to check dimensions of y, by default 0
num_pretask (int, optional) – Number of predefined (pseudo-)tasks in the model; used to check the dimensions of y. Primarily used in supervised pretraining, by default 0
return_rdkit_mol (bool, optional) – If True, convert SMILES to RDKit Mol objects, by default True
- Returns:
A tuple containing:
The original SMILES strings, or the corresponding RDKit Mol objects if return_rdkit_mol=True
The target array as a numpy array, or None if y was not provided
- Return type:
Tuple[Union[List[str], List[Chem.Mol]], Optional[np.ndarray]]
- Raises:
ValueError – If SMILES or target dimensions are invalid
- static validate_smiles(smiles: str, idx: int) -> Tuple[bool, str | None, Mol | None]
Validate a single SMILES string at a given index.
- Parameters:
smiles (str) – The SMILES string to validate
idx (int) – The index of the SMILES string in the original list
- Returns:
A tuple containing:
A boolean indicating whether the SMILES string is valid
A string describing the error if the SMILES is invalid, or None if valid
The RDKit Mol object if valid, or None if invalid
- Return type:
Tuple[bool, Optional[str], Optional[Chem.Mol]]
- class torch_molecule.utils.checkpoint.HuggingFaceCheckpointManager
Bases: object
Handles saving and loading of models to and from the Hugging Face Hub.
- static load_model_from_hf(model_instance, repo_id: str, path: str, config_filename: str = 'config.json') -> None
Load a model from the Hugging Face Hub, saving it locally to path first.
- static push_to_huggingface(model_instance, repo_id: str, task_id: str = 'default', metadata_dict: Dict[str, Any] | None = None, metrics: Dict[str, float] | None = None, commit_message: str = 'Update model', token: str | None = None, private: bool = False, config_filename: str = 'config.json') -> None
Push a task-specific model checkpoint to the Hugging Face Hub.
- class torch_molecule.utils.checkpoint.LocalCheckpointManager
Bases: object
Handles saving and loading of models to and from local paths.
- torch_molecule.utils.format.sanitize_config(config_dict)
Recursively sanitize config dictionary for JSON serialization.
Handles nested structures and special cases.
- Parameters:
config_dict (dict) – Configuration dictionary to sanitize
- Returns:
Sanitized configuration dictionary that is JSON serializable
- Return type:
dict
- torch_molecule.utils.format.serialize_config(obj)
Helper function to make config JSON serializable.
Handles special cases like lambda functions, torch modules, and numpy arrays.
- Parameters:
obj (Any) – The object to serialize
- Returns:
JSON serializable representation of the object
- Return type:
Any
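The kind of recursive conversion these two helpers describe can be sketched with the standard library alone. This is an illustrative sketch, not torch_molecule's implementation: sanitize_for_json is a hypothetical name, and the fallback rules (callables stored by name, array-like objects converted through tolist(), everything else stringified) are assumptions.

```python
import json
from typing import Any


def sanitize_for_json(obj: Any) -> Any:
    """Hypothetical helper: recursively convert obj into JSON-friendly types."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, dict):
        # JSON object keys must be strings.
        return {str(k): sanitize_for_json(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [sanitize_for_json(v) for v in obj]
    if callable(obj):
        # Functions and lambdas cannot be serialized; keep a readable name.
        return getattr(obj, "__name__", repr(obj))
    if hasattr(obj, "tolist"):
        # Array-like objects (for example numpy arrays) expose tolist().
        return sanitize_for_json(obj.tolist())
    # Assumed fallback: store the string representation.
    return str(obj)


config = {
    "hidden_size": 128,
    "dropout": 0.1,
    "layers": (64, 32),
    "activation": lambda x: x,
    1: {"nested": None},
}
clean = sanitize_for_json(config)
encoded = json.dumps(clean)  # succeeds because every value is now a plain type
```

After sanitizing, the tuple becomes a list, the lambda is replaced by its name, and the integer key becomes a string, so json.dumps no longer raises TypeError.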
- torch_molecule.utils.hf.create_model_card(model_class: str, model_name: str, tasks_config: Dict, model_config: Dict, repo_id: str, existing_readme: str = '') -> str
Create a model card for multiple tasks.
- Parameters:
model_class (str) – Class name of the model
model_name (str) – Name of the model
tasks_config (Dict) – Configuration for all tasks
model_config (Dict) – General model configuration
repo_id (str) – Repository ID
existing_readme (str) – Existing README content
- Returns:
Generated model card content
- Return type:
str
- torch_molecule.utils.hf.get_existing_repo_data(repo_id: str, token: str | None = None) -> Tuple[bool, Dict, str]
Get existing repository data from the Hugging Face Hub.
- Parameters:
repo_id (str) – Repository ID
token (Optional[str]) – Hugging Face token
- Returns:
Tuple containing (repo_exists, existing_config, existing_readme)
- Return type:
Tuple[bool, Dict, str]
- torch_molecule.utils.hf.merge_task_configs(task_id: str, existing_config: Dict, new_task_config: Dict, num_params: int) -> Dict
Merge task-specific configuration and maintain version history.
- Parameters:
task_id (str) – Task identifier (e.g., ‘O2’, ‘N2’)
existing_config (Dict) – Existing configuration dictionary
new_task_config (Dict) – New task configuration to merge
num_params (int) – Number of model parameters
- Returns:
Updated configuration with task history
- Return type:
Dict
- class torch_molecule.utils.search.ParameterSpec(param_type: ParameterType, value_range: Tuple[Any, Any] | List[Any])
Bases: NamedTuple
Specification for a hyperparameter including its type and valid range/options.
- param_type: ParameterType
Alias for field number 0
- value_range: Tuple[Any, Any] | List[Any]
Alias for field number 1
- class torch_molecule.utils.search.ParameterType(value)
Bases: Enum
Enum defining types of hyperparameters for optimization.
Each type corresponds to a specific Optuna suggest method and parameter behavior.
- CATEGORICAL = 'categorical'
- FLOAT = 'float'
- INTEGER = 'integer'
- LOG_FLOAT = 'log_float'
- torch_molecule.utils.search.suggest_parameter(trial: Any, param_name: str, param_spec: ParameterSpec) -> Any
Suggest a parameter value using the appropriate Optuna suggest method.
- Parameters:
trial (optuna.Trial) – The Optuna trial object
param_name (str) – Name of the parameter
param_spec (ParameterSpec) – Specification of the parameter type and range
- Returns:
The suggested parameter value
- Return type:
Any
- Raises:
ValueError – If the parameter type is not recognized
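The mapping from ParameterType to an Optuna suggest method might look like the sketch below. It is an assumption about how such a dispatch could be written, not the library's source. ParameterType and ParameterSpec are re-declared locally from the signatures documented here, suggest_parameter_sketch is a hypothetical name, and StubTrial is a hypothetical stand-in for optuna.Trial so the example runs without Optuna installed.

```python
from enum import Enum
from typing import Any, List, NamedTuple, Tuple, Union


class ParameterType(Enum):
    CATEGORICAL = "categorical"
    FLOAT = "float"
    INTEGER = "integer"
    LOG_FLOAT = "log_float"


class ParameterSpec(NamedTuple):
    param_type: ParameterType
    value_range: Union[Tuple[Any, Any], List[Any]]


def suggest_parameter_sketch(trial: Any, param_name: str, param_spec: ParameterSpec) -> Any:
    """Illustrative dispatch; not the library's actual implementation."""
    kind, values = param_spec
    if kind is ParameterType.CATEGORICAL:
        return trial.suggest_categorical(param_name, list(values))
    if kind is ParameterType.INTEGER:
        low, high = values
        return trial.suggest_int(param_name, low, high)
    if kind is ParameterType.FLOAT:
        low, high = values
        return trial.suggest_float(param_name, low, high)
    if kind is ParameterType.LOG_FLOAT:
        low, high = values
        # Sample on a log scale, e.g. for learning rates.
        return trial.suggest_float(param_name, low, high, log=True)
    raise ValueError(f"Unrecognized parameter type: {kind}")


class StubTrial:
    """Hypothetical stand-in for optuna.Trial: returns the lower bound or first choice."""

    def suggest_categorical(self, name, choices):
        return choices[0]

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high, log=False):
        return low


# Hypothetical search space; the parameter names are for illustration only.
search_space = {
    "num_layer": ParameterSpec(ParameterType.INTEGER, (2, 8)),
    "learning_rate": ParameterSpec(ParameterType.LOG_FLOAT, (1e-5, 1e-2)),
    "norm": ParameterSpec(ParameterType.CATEGORICAL, ["batch", "layer"]),
}
trial = StubTrial()
params = {name: suggest_parameter_sketch(trial, name, spec) for name, spec in search_space.items()}
```

In a real study, the trial argument would be the optuna.Trial passed to the objective function, and the same search_space dictionary could be reused across trials.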
- torch_molecule.utils.graph.features.atom_to_feature_vector(atom)
Converts an RDKit atom object to a feature list of indices.
- Parameters:
atom – RDKit atom object
- Returns:
Feature list of indices
- Return type:
list
- torch_molecule.utils.graph.features.bond_to_feature_vector(bond)
Converts an RDKit bond object to a feature list of indices.
- Parameters:
bond – RDKit bond object
- Returns:
Feature list of indices
- Return type:
list
- torch_molecule.utils.graph.features.safe_index(l, e)
Return index of element e in list l. If e is not present, return the last index.
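The behavior described for safe_index can be reproduced in a few lines. The sketch below uses a hypothetical name, safe_index_sketch, and a hypothetical vocabulary whose last entry acts as a catch-all bucket; whether the library's feature lists end with such an entry is an assumption.

```python
def safe_index_sketch(l, e):
    """Return the index of e in l, or the last index if e is absent."""
    try:
        return l.index(e)
    except ValueError:
        return len(l) - 1


# Hypothetical vocabulary whose final entry is a catch-all bucket.
allowed_symbols = ["C", "N", "O", "misc"]
known = safe_index_sketch(allowed_symbols, "N")     # 1
unknown = safe_index_sketch(allowed_symbols, "Xe")  # 3, the last index
```

Falling back to the last index means unseen values never raise during featurization; they all share one index instead.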
- torch_molecule.utils.graph.graph_from_smiles.add_fingerprint_feature(mol, feature_type, get_fingerprint_fn)
- torch_molecule.utils.graph.graph_from_smiles.graph_from_smiles(smiles_or_mol, properties, augmented_features=None, augmented_properties=None)
Converts SMILES string or RDKit molecule to graph Data object
- Parameters:
smiles_or_mol (Union[str, rdkit.Chem.rdchem.Mol]) – SMILES string or RDKit molecule object
properties (Any) – Properties to include in the graph
augmented_features (list, optional) – List of augmented features to include
augmented_properties (list, optional) – List of augmented properties to include
- Returns:
Graph object dictionary
- Return type:
dict
- torch_molecule.utils.graph.graph_to_smiles.build_molecule_with_partial_charges(atom_types, edge_types, atom_decoder, verbose=False)
- torch_molecule.utils.graph.graph_to_smiles.get_mol(smiles_or_mol)
Loads a SMILES string or molecule into an RDKit Mol object.
- torch_molecule.utils.graph.graph_to_smiles.graph_to_smiles(molecule_list: List[Tuple], atom_decoder: list) -> List[str | None]
- torch_molecule.utils.graph.graph_to_smiles.try_to_connect_fragments(combined_mol, frag, atom1, atom2)
- torch_molecule.utils.generic.metrics.accuracy_score(y_true, logits, avergae=None, thresholds=None, task_weights=None, task_types=None)
Calculate accuracy for multiple tasks from prediction logits.
- Parameters:
y_true (numpy.ndarray) – Ground truth labels with shape (n_samples, n_tasks)
logits (numpy.ndarray) – Prediction logits with shape (n_samples, n_tasks)
task_types (list or None, optional) – List of task types (‘binary’ or ‘multiclass’) for each task. If None, assumes all tasks are binary
thresholds (numpy.ndarray or None, optional) – Classification thresholds for binary tasks with shape (n_tasks,). If None, uses 0.5 for all binary tasks
task_weights (numpy.ndarray or None, optional) – Weights for each task with shape (n_tasks,). If None, all tasks are weighted equally
- Returns:
A dictionary containing:
‘task_accuracies’: Accuracy for each individual task
‘weighted_accuracy’: Overall weighted accuracy across all tasks
‘macro_accuracy’: Simple average of all task accuracies
‘predictions’: Binary predictions after applying activation and thresholds
- Return type:
dict
- Raises:
ValueError – If input shapes don’t match or dimensions are incorrect
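How the returned dictionary fits together for the all-binary case can be illustrated with a pure-Python sketch. It is written under stated assumptions and is not the library's implementation: binary_accuracy_sketch is a hypothetical name, the activation is assumed to be a sigmoid, the weighted accuracy is assumed to be normalized by the sum of the task weights, and multiclass tasks and NaN labels are not handled.

```python
import math
from typing import Dict, List, Optional


def binary_accuracy_sketch(
    y_true: List[List[int]],
    logits: List[List[float]],
    thresholds: Optional[List[float]] = None,
    task_weights: Optional[List[float]] = None,
) -> Dict[str, object]:
    n_samples, n_tasks = len(y_true), len(y_true[0])
    if len(logits) != n_samples or any(len(row) != n_tasks for row in logits):
        raise ValueError("y_true and logits must have the same shape")
    thresholds = thresholds or [0.5] * n_tasks
    task_weights = task_weights or [1.0] * n_tasks

    # Assumption: a sigmoid maps each logit to a probability before thresholding.
    predictions = [
        [int(1.0 / (1.0 + math.exp(-z)) >= thresholds[t]) for t, z in enumerate(row)]
        for row in logits
    ]
    task_accuracies = [
        sum(predictions[i][t] == y_true[i][t] for i in range(n_samples)) / n_samples
        for t in range(n_tasks)
    ]
    # Assumption: weights are normalized by their sum.
    weighted = sum(w * a for w, a in zip(task_weights, task_accuracies)) / sum(task_weights)
    return {
        "task_accuracies": task_accuracies,
        "weighted_accuracy": weighted,
        "macro_accuracy": sum(task_accuracies) / n_tasks,
        "predictions": predictions,
    }


y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
logits = [[2.0, -1.0], [-3.0, 0.5], [0.2, 1.5], [1.0, -0.4]]
result = binary_accuracy_sketch(y_true, logits, task_weights=[3.0, 1.0])
```

With these inputs the per-task accuracies are 0.75 and 0.5, so the macro accuracy is 0.625, while weighting the first task three times as heavily gives 0.6875.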
- torch_molecule.utils.generic.metrics.mean_absolute_error(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) -> float | ndarray
Calculate Mean Absolute Error for multi-task regression, handling NaN values.
- Parameters:
y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average MAE across all valid tasks. If False, return individual MAE for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)
- Returns:
If average=True, returns mean MAE across all valid tasks. If average=False, returns array of MAE scores with NaN for invalid tasks.
- Return type:
Union[float, np.ndarray]
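The NaN handling described for mean_absolute_error can be pictured as a per-task mask. The sketch below is a pure-Python illustration under assumptions, not the library's code: masked_mae_sketch is a hypothetical name, a task counts as invalid when it has no non-NaN labels, and sample_weight is left out for brevity.

```python
import math
from typing import List, Union


def masked_mae_sketch(
    y_true: List[List[float]], y_pred: List[List[float]], average: bool = True
) -> Union[float, List[float]]:
    n_tasks = len(y_true[0])
    per_task = []
    for t in range(n_tasks):
        # Keep only the samples whose label for task t is not NaN.
        errors = [
            abs(row_true[t] - row_pred[t])
            for row_true, row_pred in zip(y_true, y_pred)
            if not math.isnan(row_true[t])
        ]
        per_task.append(sum(errors) / len(errors) if errors else math.nan)
    if not average:
        return per_task
    # Average over valid tasks only, so an all-NaN task does not poison the mean.
    valid = [score for score in per_task if not math.isnan(score)]
    return sum(valid) / len(valid) if valid else math.nan


nan = math.nan
y_true = [[1.0, nan, nan], [2.0, 4.0, nan], [3.0, 6.0, nan]]
y_pred = [[1.5, 0.0, 0.0], [2.0, 5.0, 0.0], [2.5, 4.0, 0.0]]
per_task = masked_mae_sketch(y_true, y_pred, average=False)
overall = masked_mae_sketch(y_true, y_pred)
```

Here the third task has no labels at all, so it is reported as NaN when average=False and is skipped when averaging.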
- torch_molecule.utils.generic.metrics.mean_squared_error(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None, squared: bool = True) -> float | ndarray
Calculate Mean Squared Error for multi-task regression, handling NaN values.
- Parameters:
y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average MSE across all valid tasks. If False, return individual MSE for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)
squared (bool, default=True) – If True, returns MSE value. If False, returns RMSE value.
- Returns:
If average=True, returns mean MSE/RMSE across all valid tasks. If average=False, returns array of MSE/RMSE scores with NaN for invalid tasks.
- Return type:
Union[float, np.ndarray]
- torch_molecule.utils.generic.metrics.r2_score(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) -> float | ndarray
Calculate R² Score for multi-task regression, handling NaN values.
- Parameters:
y_true (Union[np.ndarray, list]) – Ground truth values. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted values. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average R² across all valid tasks. If False, return individual R² for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights. Shape should be (n_samples,)
- Returns:
If average=True, returns mean R² across all valid tasks. If average=False, returns array of R² scores with NaN for invalid tasks.
- Return type:
Union[float, np.ndarray]
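For a single task, the R² score is 1 minus the ratio of the residual sum of squares to the total sum of squares, computed over the samples with a non-NaN label. The sketch below illustrates that for one task column; masked_r2_sketch is a hypothetical name, and treating a constant target or fewer than two valid samples as an invalid task is an assumption, not documented library behavior.

```python
import math
from typing import List


def masked_r2_sketch(y_true: List[float], y_pred: List[float]) -> float:
    # Drop samples whose label is NaN.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    if len(pairs) < 2:
        return math.nan
    mean_true = sum(t for t, _ in pairs) / len(pairs)
    ss_res = sum((t - p) ** 2 for t, p in pairs)
    ss_tot = sum((t - mean_true) ** 2 for t, _ in pairs)
    # A constant target makes the score undefined, so flag the task as invalid.
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else math.nan


score = masked_r2_sketch([1.0, 2.0, math.nan, 3.0], [1.0, 2.0, 9.0, 4.0])
constant = masked_r2_sketch([2.0, 2.0], [1.0, 3.0])
```

In the first call the prediction paired with the NaN label is ignored, leaving a residual sum of 1 against a total sum of 2, so the score is 0.5.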
- torch_molecule.utils.generic.metrics.roc_auc_score(y_true: ndarray | list, y_pred: ndarray | list, average: bool = True, sample_weight: ndarray | None = None) -> float | ndarray
Calculate ROC AUC scores for multi-task binary classification, handling NaN values.
For each task dimension, computes the AUC score using only the non-NaN samples. Tasks with too few valid samples or fewer than two distinct labels are reported as NaN in the output.
- Parameters:
y_true (Union[np.ndarray, list]) – True binary labels. Shape should be (n_samples, n_tasks)
y_pred (Union[np.ndarray, list]) – Predicted probabilities. Shape should be (n_samples, n_tasks)
average (bool, default=True) – If True, return the average ROC AUC score across all valid tasks. If False, return individual scores for each task (NaN for invalid tasks).
sample_weight (Optional[np.ndarray], default=None) – Sample weights for each instance. Shape should be (n_samples,)
- Returns:
If average=True, returns mean ROC AUC score across all valid tasks. If average=False, returns array of ROC AUC scores with NaN for invalid tasks.
- Return type:
Union[float, np.ndarray]
- Raises:
ValueError – If input shapes don’t match or no valid tasks are found
TypeError – If inputs are not of correct type
Examples
>>> import numpy as np
>>> from torch_molecule.utils.generic.metrics import roc_auc_score
>>> y_true = np.array([[0, 1, np.nan], [1, 0, 1], [1, np.nan, 0], [0, 0, 1]])
>>> y_pred = np.array([[0.1, 0.8, 0.7], [0.9, 0.2, 0.8], [0.8, 0.7, 0.3], [0.2, 0.1, 0.9]])
>>> score = roc_auc_score(y_true, y_pred)
>>> print(f"Average ROC AUC across valid tasks: {score:.3f}")
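For intuition about the per-task masking, the AUC of a single task can be computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, after dropping NaN labels. The sketch below uses that pairwise definition in pure Python. masked_auc_sketch is a hypothetical name, sample weights are omitted, and this is not necessarily how roc_auc_score is implemented.

```python
import math
from typing import List


def masked_auc_sketch(y_true: List[float], y_score: List[float]) -> float:
    # Drop samples whose label is NaN.
    pairs = [(t, s) for t, s in zip(y_true, y_score) if not math.isnan(t)]
    positives = [s for t, s in pairs if t == 1]
    negatives = [s for t, s in pairs if t == 0]
    if not positives or not negatives:
        # Only one class present: the AUC is undefined for this task.
        return math.nan
    # Count correctly ranked (positive, negative) pairs; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


auc = masked_auc_sketch([1, 0, math.nan, 1, 0], [0.9, 0.4, 0.5, 0.3, 0.3])
single_class = masked_auc_sketch([1, 1, math.nan], [0.2, 0.7, 0.5])
```

The pairwise form is quadratic in the number of samples, so it is meant for understanding the metric rather than for large datasets.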