Molecular Encoder Models¶
The encoder models inherit from the torch_molecule.base.encoder.BaseMolecularEncoder
class and share common methods for model pretraining and encoding, as well as model persistence.
Training and Encoding
fit(X, **kwargs)
: Pretrain the model on the given data, where X is a list of SMILES strings
encode(X, **kwargs)
: Encode new SMILES strings and return a dictionary containing encoded representations
Model Persistence
inherited from torch_molecule.base.base.BaseModel
save_to_local(path)
: Save the trained model to a local file
load_from_local(path)
: Load a trained model from a local file
save_to_hf(repo_id)
: Push the model to the Hugging Face Hub
load_from_hf(repo_id, local_cache)
: Load a model from the Hugging Face Hub and save it to a local file
save(path, repo_id)
: Save the model to either local storage or the Hugging Face Hub
load(path, repo_id)
: Load a model from either local storage or the Hugging Face Hub
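A minimal sketch of this shared workflow, using GraphMAEMolecularEncoder (documented below) as the concrete class. The save path is illustrative, and the checkpoint is assumed to carry the model's hyperparameters:
>>> from torch_molecule import GraphMAEMolecularEncoder
>>> smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"]
>>> encoder = GraphMAEMolecularEncoder(hidden_size=128, epochs=10)
>>> encoder.fit(smiles)  # self-supervised pretraining on SMILES strings
>>> reps = encoder.encode(["CCO"])  # torch.Tensor by default
>>> encoder.save_to_local("graphmae_demo.pt")  # illustrative local path
>>> restored = GraphMAEMolecularEncoder()
>>> restored.load_from_local("graphmae_demo.pt")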
Self-supervised Molecular Representation Learning¶
MoAma for Molecular Representation Learning
- class torch_molecule.encoder.moama.modeling_moama.MoamaMolecularEncoder(device: device | None = None, model_name: str = 'MoamaMolecularEncoder', mask_rate: float = 0.15, lw_rec: float = 0.5, num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements Motif-aware Attribute Masking for Molecular Graph Pre-training.
References
Paper: Motif-aware Attribute Masking for Molecular Graph Pre-training
- Parameters:
mask_rate (float, default=0.15) – Fraction of nodes to mask in each graph
lw_rec (float, default=0.5) – Weight balancing between reconstruction loss and fingerprint loss. Higher values emphasize reconstruction, lower values emphasize fingerprint matching.
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="MoamaMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → MoamaMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
MoamaMolecularEncoder
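Usage sketch (the import follows the module path above; the small epoch count is for illustration, not a recommendation):
>>> from torch_molecule.encoder.moama.modeling_moama import MoamaMolecularEncoder
>>> # lw_rec balances node reconstruction against fingerprint matching
>>> encoder = MoamaMolecularEncoder(mask_rate=0.15, lw_rec=0.5, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"], return_type="np")  # numpy array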
Attribute Masking for Molecular Representation Learning
- class torch_molecule.encoder.attrmask.modeling_attrmask.AttrMaskMolecularEncoder(device: device | None = None, model_name: str = 'AttrMaskMolecularEncoder', mask_num: int = 0, mask_rate: float = 0.15, num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the attribute masking pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
mask_num (int, default=0) – Number of atom features to mask during pretraining. If set to 0, masking is determined by mask_rate.
mask_rate (float, default=0.15) – Proportion of atoms to mask randomly. Ignored if mask_num is nonzero.
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="AttrMaskMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → AttrMaskMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
AttrMaskMolecularEncoder
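Usage sketch (illustrative hyperparameters; the import follows the module path above):
>>> from torch_molecule.encoder.attrmask.modeling_attrmask import AttrMaskMolecularEncoder
>>> # mask_num=0, so the fraction given by mask_rate determines how many atoms are masked
>>> encoder = AttrMaskMolecularEncoder(mask_num=0, mask_rate=0.15, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])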
Graph masked autoencoder
- class torch_molecule.encoder.graphmae.modeling_graphmae.GraphMAEMolecularEncoder(device: device | None = None, model_name: str = 'GraphMAEMolecularEncoder', mask_rate: float = 0.15, mask_edge: bool = False, predictor_type: str = 'gin', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
GraphMAE: Self-Supervised Masked Graph Autoencoders
References
Paper: GraphMAE: Self-Supervised Masked Graph Autoencoders (KDD 2022) https://arxiv.org/abs/2205.10803
Code: https://github.com/THUDM/GraphMAE
- Parameters:
mask_rate (float, default=0.15) – Fraction of nodes to mask during training.
mask_edge (bool, default=False) – Whether to mask edges in addition to nodes.
predictor_type (str, default="gin") – Type of predictor network to use for reconstruction. Options: [“gin”, “gcn”, “linear”]
num_layer (int, default=5) – Number of message passing layers in the GNN.
hidden_size (int, default=300) – Dimension of hidden node representations.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization to use. Options: [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”]
encoder_type (str, default="gin-virtual") – Type of GNN encoder to use. Options: [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”]
readout (str, default="sum") – Pooling method to use for graph-level representations. Options: [“sum”, “mean”, “max”]
batch_size (int, default=128) – Batch size for training and inference.
epochs (int, default=500) – Number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (Optional[float], default=None) – Maximum norm of gradients for gradient clipping. No clipping if None.
weight_decay (float, default=0.0) – L2 regularization factor.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when using scheduler.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to display progress bars and logs.
model_name (str, default="GraphMAEMolecularEncoder") – Name of the model.
Examples
>>> from torch_molecule import GraphMAEMolecularEncoder
>>> encoder = GraphMAEMolecularEncoder(hidden_size=128, epochs=100)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> representations = encoder.encode(["CCO"])
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → GraphMAEMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
GraphMAEMolecularEncoder
Context Prediction
- class torch_molecule.encoder.contextpred.modeling_contextpred.ContextPredMolecularEncoder(device: device | None = None, model_name: str = 'ContextPredMolecularEncoder', mode: str = 'cbow', context_size: int = 2, neg_samples: int = 1, num_layer: int = 3, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the context prediction pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
mode (str, default="cbow") – Type of context prediction task. One of [“cbow”, “skipgram”].
context_size (int, default=2) – Size of the context window used for predicting node-level features.
neg_samples (int, default=1) – Number of negative samples used in the training objective.
num_layer (int, default=3) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for the optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler during training.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → ContextPredMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
ContextPredMolecularEncoder
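Usage sketch (the cbow/skipgram naming follows the word2vec analogy; values are illustrative):
>>> from torch_molecule.encoder.contextpred.modeling_contextpred import ContextPredMolecularEncoder
>>> encoder = ContextPredMolecularEncoder(mode="cbow", context_size=2, neg_samples=1, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])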
Edge Prediction
- class torch_molecule.encoder.edgepred.modeling_edgepred.EdgePredMolecularEncoder(device: device | None = None, model_name: str = 'EdgePredMolecularEncoder', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the edge prediction pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use.
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations.
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → EdgePredMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
EdgePredMolecularEncoder
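Usage sketch (illustrative values; the import follows the module path above):
>>> from torch_molecule.encoder.edgepred.modeling_edgepred import EdgePredMolecularEncoder
>>> encoder = EdgePredMolecularEncoder(num_layer=5, hidden_size=300, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"], return_type="pt")  # torch.Tensor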
InfoGraph
- class torch_molecule.encoder.infograph.modeling_infograph.InfoGraphMolecularEncoder(device: device | None = None, model_name: str = 'InfographMolecularEncoder', lw_prior: float = 0.0, embedding_dim: int = 160, num_layer: int = 5, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements InfoGraph for molecular representation learning.
References
Paper: InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization (ICLR 2020) https://arxiv.org/abs/1908.01000
- Parameters:
lw_prior (float, default=0.0) – Weight for prior loss term.
embedding_dim (int, default=160) – Dimension of final graph embedding. Must be divisible by num_layer.
num_layer (int, default=5) – Number of GNN layers.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="InfographMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
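Usage sketch (note the divisibility constraint on embedding_dim; values are illustrative):
>>> from torch_molecule.encoder.infograph.modeling_infograph import InfoGraphMolecularEncoder
>>> # embedding_dim must be divisible by num_layer (here 160 / 5 = 32 per layer)
>>> encoder = InfoGraphMolecularEncoder(embedding_dim=160, num_layer=5, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])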
Supervised Pretraining for Molecules¶
Pretraining with Supervised/Pseudolabeled Data
- class torch_molecule.encoder.supervised.modeling_supervised.SupervisedMolecularEncoder(device: device | None = None, model_name: str = 'SupervisedMolecularEncoder', num_task: int | None = None, predefined_task: List[str] | None = None, encoder_type: str = 'gin-virtual', readout: str = 'sum', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN model for supervised molecular representation learning with user-defined or predefined fingerprint/calculated property tasks.
- Parameters:
num_task (int, optional) – Number of user-defined tasks for supervised pretraining. If specified, the user must provide y_train in the fit function.
predefined_task (List[str], optional) – List of predefined tasks to use. Must be from the supported task list [“morgan”, “maccs”, “logP”]. If None and num_task is None, all predefined tasks will be used.
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str], y_train: List | ndarray | None = None) → SupervisedMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
y_train (Union[List, np.ndarray], optional) – Training set target values for representation learning. Required when num_task is specified.
- Returns:
self – Fitted estimator
- Return type:
SupervisedMolecularEncoder
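Usage sketch contrasting the two modes (random labels below are placeholders for real targets):
>>> from torch_molecule.encoder.supervised.modeling_supervised import SupervisedMolecularEncoder
>>> import numpy as np
>>> smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"]
>>> # Predefined pseudo-label tasks: no y_train needed
>>> encoder = SupervisedMolecularEncoder(predefined_task=["morgan", "logP"], epochs=10)
>>> encoder.fit(smiles)
>>> # User-defined tasks: y_train is required, one row per molecule
>>> encoder = SupervisedMolecularEncoder(num_task=2, epochs=10)
>>> encoder.fit(smiles, y_train=np.random.rand(3, 2))  # placeholder labels
>>> reps = encoder.encode(["CCO"])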
Pretrained Molecular Encoders¶
Sequence-based Pretrained Transformers from Hugging Face
- class torch_molecule.encoder.pretrained.modeling_pretrained.HFPretrainedMolecularEncoder(repo_id: str, max_length: int = 128, batch_size: int = 128, add_bos_eos: bool | None = None, model_name: str = 'PretrainedMolecularEncoder', verbose: bool = False, **kwargs)[source]¶
Implements Hugging Face pretrained transformer models as molecular encoders.
This class provides an interface to use pretrained transformer models from Hugging Face as molecular encoders. It handles tokenization and encoding of molecular representations.
Tested models include:
ChemGPT series (1.2B/19M/4.7M): GPT-Neo based models pretrained on PubChem10M dataset with SELFIES strings. Output dimension: 2048.
repo_id: “ncfrey/ChemGPT-1.2B” (https://huggingface.co/ncfrey/ChemGPT-1.2B)
repo_id: “ncfrey/ChemGPT-19M” (https://huggingface.co/ncfrey/ChemGPT-19M)
repo_id: “ncfrey/ChemGPT-4.7M” (https://huggingface.co/ncfrey/ChemGPT-4.7M)
GPT2-ZINC-87M: GPT-2 based model (87M parameters) pretrained on ZINC dataset with ~480M SMILES strings. Output dimension: 768.
repo_id: “entropy/gpt2_zinc_87m” (https://huggingface.co/entropy/gpt2_zinc_87m)
RoBERTa-ZINC-480M: RoBERTa based model (102M parameters) pretrained on ZINC dataset with ~480M SMILES strings. Output dimension: 768.
repo_id: “entropy/roberta_zinc_480m” (https://huggingface.co/entropy/roberta_zinc_480m)
ChemBERTa series: Available in multiple sizes (77M/10M/5M) and training objectives (MTR/MLM). Output dimension: 384.
repo_id: “DeepChem/ChemBERTa-77M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-77M-MTR)
repo_id: “DeepChem/ChemBERTa-77M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-77M-MLM)
repo_id: “DeepChem/ChemBERTa-10M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-10M-MTR)
repo_id: “DeepChem/ChemBERTa-10M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-10M-MLM)
repo_id: “DeepChem/ChemBERTa-5M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-5M-MLM)
repo_id: “DeepChem/ChemBERTa-5M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-5M-MTR)
unikei/bert-base-smiles: BERT model pretrained on SMILES strings. Output dimension: 768.
repo_id: “unikei/bert-base-smiles” (https://huggingface.co/unikei/bert-base-smiles)
ChemBERTa-zinc-base-v1: RoBERTa model pretrained on ZINC dataset with ~100k SMILES strings. Output dimension: 384.
repo_id: “seyonec/ChemBERTa-zinc-base-v1” (https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1)
Other models accessible through the transformers library have not been explicitly tested but may still be compatible with this interface.
- Parameters:
repo_id (str) – The Hugging Face repository ID of the pretrained model.
max_length (int, default=128) – Maximum sequence length for tokenization. Longer sequences will be truncated.
batch_size (int, default=128) – Batch size used when encoding multiple molecules.
add_bos_eos (Optional[bool], default=None) – Whether to add beginning/end of sequence tokens. If None, models in known_add_bos_eos_list will be set to True. The current known_add_bos_eos_list includes: [“entropy/gpt2_zinc_87m”].
model_name (str, default="PretrainedMolecularEncoder") – Name identifier for the model instance.
verbose (bool, default=False) – Whether to display progress information during encoding.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit() → HFPretrainedMolecularEncoder [source]¶
Load the pretrained model and tokenizer from the Hugging Face Hub; no training is performed.
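Usage sketch (per the model list above, ChemBERTa outputs 384-dimensional representations):
>>> from torch_molecule.encoder.pretrained.modeling_pretrained import HFPretrainedMolecularEncoder
>>> # fit() takes no data: it only downloads and loads the pretrained weights
>>> encoder = HFPretrainedMolecularEncoder(repo_id="DeepChem/ChemBERTa-77M-MTR")
>>> encoder.fit()
>>> reps = encoder.encode(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO"])  # shape (2, 384)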