Molecular Encoder Models¶
The encoder models inherit from the torch_molecule.base.encoder.BaseMolecularEncoder
class and share common methods for model pretraining and encoding, as well as model persistence.
Training and Encoding
fit(X, **kwargs)
: Pretrain the model on the given data, where X is a list of SMILES strings
encode(X, **kwargs)
: Encode new SMILES strings and return a dictionary containing encoded representations
Model Persistence
inherited from torch_molecule.base.base.BaseModel
save_to_local(path)
: Save the trained model to a local file
load_from_local(path)
: Load a trained model from a local file
save_to_hf(repo_id)
: Push the model to the Hugging Face Hub
load_from_hf(repo_id, local_cache)
: Load a model from the Hugging Face Hub and save it to a local file
save(path, repo_id)
: Save the model to either local storage or the Hugging Face Hub
load(path, repo_id)
: Load a model from either local storage or the Hugging Face Hub
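A minimal sketch of this shared workflow, using GraphMAEMolecularEncoder (documented below) as the concrete class. The save path is illustrative, and the checkpoint is assumed to carry the model's hyperparameters:
>>> from torch_molecule import GraphMAEMolecularEncoder
>>> smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"]
>>> encoder = GraphMAEMolecularEncoder(hidden_size=128, epochs=10)
>>> encoder.fit(smiles)  # self-supervised pretraining on SMILES strings
>>> reps = encoder.encode(["CCO"])  # torch.Tensor by default
>>> encoder.save_to_local("graphmae_demo.pt")  # illustrative local path
>>> restored = GraphMAEMolecularEncoder()
>>> restored.load_from_local("graphmae_demo.pt")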
Self-supervised Molecular Representation Learning¶
MoAma for Molecular Representation Learning
- class torch_molecule.encoder.moama.modeling_moama.MoamaMolecularEncoder(device: device | None = None, model_name: str = 'MoamaMolecularEncoder', mask_rate: float = 0.15, lw_rec: float = 0.5, num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements Motif-aware Attribute Masking for Molecular Graph Pre-training.
References
Paper: Motif-aware Attribute Masking for Molecular Graph Pre-training
- Parameters:
mask_rate (float, default=0.15) – Fraction of nodes to mask in each graph
lw_rec (float, default=0.5) – Weight balancing between reconstruction loss and fingerprint loss. Higher values emphasize reconstruction, lower values emphasize fingerprint matching.
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="MoamaMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → MoamaMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
MoamaMolecularEncoder
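Usage sketch (the import follows the module path above; the small epoch count is for illustration, not a recommendation):
>>> from torch_molecule.encoder.moama.modeling_moama import MoamaMolecularEncoder
>>> # lw_rec balances node reconstruction against fingerprint matching
>>> encoder = MoamaMolecularEncoder(mask_rate=0.15, lw_rec=0.5, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"], return_type="np")  # numpy array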
Attribute Masking for Molecular Representation Learning
- class torch_molecule.encoder.attrmask.modeling_attrmask.AttrMaskMolecularEncoder(device: device | None = None, model_name: str = 'AttrMaskMolecularEncoder', mask_num: int = 0, mask_rate: float = 0.15, num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the attribute masking pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
mask_num (int, default=0) – Number of atom features to mask during pretraining. If set to 0, masking is determined by mask_rate.
mask_rate (float, default=0.15) – Proportion of atoms to mask randomly. Ignored if mask_num is nonzero.
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="AttrMaskMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → AttrMaskMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
AttrMaskMolecularEncoder
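Usage sketch (illustrative hyperparameters; the import follows the module path above):
>>> from torch_molecule.encoder.attrmask.modeling_attrmask import AttrMaskMolecularEncoder
>>> # mask_num=0, so the fraction given by mask_rate determines how many atoms are masked
>>> encoder = AttrMaskMolecularEncoder(mask_num=0, mask_rate=0.15, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])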
Graph masked autoencoder
- class torch_molecule.encoder.graphmae.modeling_graphmae.GraphMAEMolecularEncoder(device: device | None = None, model_name: str = 'GraphMAEMolecularEncoder', mask_rate: float = 0.15, mask_edge: bool = False, predictor_type: str = 'gin', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
GraphMAE: Self-Supervised Masked Graph Autoencoders
References
Paper: GraphMAE: Self-Supervised Masked Graph Autoencoders (KDD 2022) https://arxiv.org/abs/2205.10803
Code: https://github.com/THUDM/GraphMAE
- Parameters:
mask_rate (float, default=0.15) – Fraction of nodes to mask during training.
mask_edge (bool, default=False) – Whether to mask edges in addition to nodes.
predictor_type (str, default="gin") – Type of predictor network to use for reconstruction. Options: [“gin”, “gcn”, “linear”]
num_layer (int, default=5) – Number of message passing layers in the GNN.
hidden_size (int, default=300) – Dimension of hidden node representations.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization to use. Options: [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”]
encoder_type (str, default="gin-virtual") – Type of GNN encoder to use. Options: [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”]
readout (str, default="sum") – Pooling method to use for graph-level representations. Options: [“sum”, “mean”, “max”]
batch_size (int, default=128) – Batch size for training and inference.
epochs (int, default=500) – Number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (Optional[float], default=None) – Maximum norm of gradients for gradient clipping. No clipping if None.
weight_decay (float, default=0.0) – L2 regularization factor.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when using scheduler.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to display progress bars and logs.
model_name (str, default="GraphMAEMolecularEncoder") – Name of the model.
Examples
>>> from torch_molecule import GraphMAEMolecularEncoder
>>> encoder = GraphMAEMolecularEncoder(hidden_size=128, epochs=100)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> representations = encoder.encode(["CCO"])
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → GraphMAEMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
GraphMAEMolecularEncoder
Context Prediction
- class torch_molecule.encoder.contextpred.modeling_contextpred.ContextPredMolecularEncoder(device: device | None = None, model_name: str = 'ContextPredMolecularEncoder', mode: str = 'cbow', context_size: int = 2, neg_samples: int = 1, num_layer: int = 3, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the context prediction pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
mode (str, default="cbow") – Type of context prediction task. One of [“cbow”, “skipgram”].
context_size (int, default=2) – Size of the context window used for predicting node-level features.
neg_samples (int, default=1) – Number of negative samples used in the training objective.
num_layer (int, default=3) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for the optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler during training.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → ContextPredMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
ContextPredMolecularEncoder
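Usage sketch (the cbow/skipgram naming follows the word2vec analogy; values are illustrative):
>>> from torch_molecule.encoder.contextpred.modeling_contextpred import ContextPredMolecularEncoder
>>> encoder = ContextPredMolecularEncoder(mode="cbow", context_size=2, neg_samples=1, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])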
Edge Prediction
- class torch_molecule.encoder.edgepred.modeling_edgepred.EdgePredMolecularEncoder(device: device | None = None, model_name: str = 'EdgePredMolecularEncoder', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN-based model for molecular representation learning using the edge prediction pretraining strategy.
References
Paper: Strategies for Pre-training Graph Neural Networks (ICLR 2020) https://arxiv.org/abs/1905.12265
Code: https://github.com/snap-stanford/pretrain-gnns/tree/master/chem
- Parameters:
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use.
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations.
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce learning rate when plateau is reached.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str]) → EdgePredMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
- Returns:
self – Fitted estimator
- Return type:
EdgePredMolecularEncoder
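Usage sketch (illustrative values; the import follows the module path above):
>>> from torch_molecule.encoder.edgepred.modeling_edgepred import EdgePredMolecularEncoder
>>> encoder = EdgePredMolecularEncoder(num_layer=5, hidden_size=300, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"], return_type="pt")  # torch.Tensor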
InfoGraph
- class torch_molecule.encoder.infograph.modeling_infograph.InfoGraphMolecularEncoder(device: device | None = None, model_name: str = 'InfographMolecularEncoder', lw_prior: float = 0.0, embedding_dim: int = 160, num_layer: int = 5, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', encoder_type: str = 'gin-virtual', readout: str = 'sum', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements InfoGraph for molecular representation learning.
References
Paper: InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization (ICLR 2020) https://arxiv.org/abs/1908.01000
- Parameters:
lw_prior (float, default=0.0) – Weight for prior loss term.
embedding_dim (int, default=160) – Dimension of final graph embedding. Must be divisible by num_layer.
num_layer (int, default=5) – Number of GNN layers.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
model_name (str, default="InfographMolecularEncoder") – Name of the encoder model.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
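Usage sketch (note the divisibility constraint on embedding_dim; values are illustrative):
>>> from torch_molecule.encoder.infograph.modeling_infograph import InfoGraphMolecularEncoder
>>> # embedding_dim must be divisible by num_layer (here 160 / 5 = 32 per layer)
>>> encoder = InfoGraphMolecularEncoder(embedding_dim=160, num_layer=5, epochs=10)
>>> encoder.fit(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"])
>>> reps = encoder.encode(["CCO"])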
Supervised Pretraining for Molecules¶
Pretraining with Supervised/Pseudolabeled Data
- class torch_molecule.encoder.supervised.modeling_supervised.SupervisedMolecularEncoder(device: device | None = None, model_name: str = 'SupervisedMolecularEncoder', num_task: int | None = None, predefined_task: List[str] | None = None, encoder_type: str = 'gin-virtual', readout: str = 'sum', num_layer: int = 5, hidden_size: int = 300, drop_ratio: float = 0.5, norm_layer: str = 'batch_norm', batch_size: int = 128, epochs: int = 500, learning_rate: float = 0.001, grad_clip_value: float | None = None, weight_decay: float = 0.0, use_lr_scheduler: bool = False, scheduler_factor: float = 0.5, scheduler_patience: int = 5, verbose: bool = False)[source]¶
Bases:
BaseMolecularEncoder
This encoder implements a GNN model for supervised molecular representation learning with user-defined or predefined fingerprint/calculated property tasks.
- Parameters:
num_task (int, optional) – Number of user-defined tasks for supervised pretraining. If specified, the user must provide y_train in the fit function.
predefined_task (List[str], optional) – List of predefined tasks to use. Must be from the supported task list [“morgan”, “maccs”, “logP”]. If None and num_task is None, all predefined tasks will be used.
encoder_type (str, default="gin-virtual") – Type of GNN architecture to use. One of [“gin-virtual”, “gcn-virtual”, “gin”, “gcn”].
readout (str, default="sum") – Method for aggregating node features to obtain graph-level representations. One of [“sum”, “mean”, “max”].
num_layer (int, default=5) – Number of GNN layers.
hidden_size (int, default=300) – Dimension of hidden node features.
drop_ratio (float, default=0.5) – Dropout probability.
norm_layer (str, default="batch_norm") – Type of normalization layer to use. One of [“batch_norm”, “layer_norm”, “instance_norm”, “graph_norm”, “size_norm”, “pair_norm”].
batch_size (int, default=128) – Number of samples per batch for training.
epochs (int, default=500) – Maximum number of training epochs.
learning_rate (float, default=0.001) – Learning rate for optimizer.
grad_clip_value (float, optional) – Maximum norm of gradients for gradient clipping.
weight_decay (float, default=0.0) – L2 regularization strength.
use_lr_scheduler (bool, default=False) – Whether to use a learning rate scheduler.
scheduler_factor (float, default=0.5) – Factor by which to reduce the learning rate when plateau is detected.
scheduler_patience (int, default=5) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool, default=False) – Whether to print progress information during training.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit(X_train: List[str], y_train: List | ndarray | None = None) → SupervisedMolecularEncoder [source]¶
Fit the model to the training data.
- Parameters:
X_train (List[str]) – Training set input molecular structures as SMILES strings
y_train (Union[List, np.ndarray], optional) – Training set target values for representation learning. Required when num_task is specified.
- Returns:
self – Fitted estimator
- Return type:
SupervisedMolecularEncoder
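Usage sketch contrasting the two modes (random labels below are placeholders for real targets):
>>> from torch_molecule.encoder.supervised.modeling_supervised import SupervisedMolecularEncoder
>>> import numpy as np
>>> smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "C1=CC=CC=C1"]
>>> # Predefined pseudo-label tasks: no y_train needed
>>> encoder = SupervisedMolecularEncoder(predefined_task=["morgan", "logP"], epochs=10)
>>> encoder.fit(smiles)
>>> # User-defined tasks: y_train is required, one row per molecule
>>> encoder = SupervisedMolecularEncoder(num_task=2, epochs=10)
>>> encoder.fit(smiles, y_train=np.random.rand(3, 2))  # placeholder labels
>>> reps = encoder.encode(["CCO"])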
Pretrained Molecular Encoders¶
Sequence-based Pretrained Transformers from Hugging Face
- class torch_molecule.encoder.pretrained.modeling_pretrained.HFPretrainedMolecularEncoder(repo_id: str, max_length: int = 128, batch_size: int = 128, add_bos_eos: bool | None = None, model_name: str = 'PretrainedMolecularEncoder', verbose: bool = False, **kwargs)[source]¶
Implements Hugging Face pretrained transformer models as molecular encoders.
This class provides an interface to use pretrained transformer models from Hugging Face as molecular encoders. It handles tokenization and encoding of molecular representations.
Tested models include:
ChemGPT series (1.2B/19M/4.7M): GPT-Neo based models pretrained on PubChem10M dataset with SELFIES strings. Output dimension: 2048.
repo_id: “ncfrey/ChemGPT-1.2B” (https://huggingface.co/ncfrey/ChemGPT-1.2B)
repo_id: “ncfrey/ChemGPT-19M” (https://huggingface.co/ncfrey/ChemGPT-19M)
repo_id: “ncfrey/ChemGPT-4.7M” (https://huggingface.co/ncfrey/ChemGPT-4.7M)
GPT2-ZINC-87M: GPT-2 based model (87M parameters) pretrained on ZINC dataset with ~480M SMILES strings. Output dimension: 768.
repo_id: “entropy/gpt2_zinc_87m” (https://huggingface.co/entropy/gpt2_zinc_87m)
RoBERTa-ZINC-480M: RoBERTa based model (102M parameters) pretrained on ZINC dataset with ~480M SMILES strings. Output dimension: 768.
repo_id: “entropy/roberta_zinc_480m” (https://huggingface.co/entropy/roberta_zinc_480m)
ChemBERTa series: Available in multiple sizes (77M/10M/5M) and training objectives (MTR/MLM). Output dimension: 384.
repo_id: “DeepChem/ChemBERTa-77M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-77M-MTR)
repo_id: “DeepChem/ChemBERTa-77M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-77M-MLM)
repo_id: “DeepChem/ChemBERTa-10M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-10M-MTR)
repo_id: “DeepChem/ChemBERTa-10M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-10M-MLM)
repo_id: “DeepChem/ChemBERTa-5M-MLM” (https://huggingface.co/DeepChem/ChemBERTa-5M-MLM)
repo_id: “DeepChem/ChemBERTa-5M-MTR” (https://huggingface.co/DeepChem/ChemBERTa-5M-MTR)
unikei/bert-base-smiles: BERT model pretrained on SMILES strings. Output dimension: 768.
repo_id: “unikei/bert-base-smiles” (https://huggingface.co/unikei/bert-base-smiles)
ChemBERTa-zinc-base-v1: RoBERTa model pretrained on ZINC dataset with ~100k SMILES strings. Output dimension: 384.
repo_id: “seyonec/ChemBERTa-zinc-base-v1” (https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1)
Other models accessible through the transformers library have not been explicitly tested but may still be compatible with this interface.
- Parameters:
repo_id (str) – The Hugging Face repository ID of the pretrained model.
max_length (int, default=128) – Maximum sequence length for tokenization. Longer sequences will be truncated.
batch_size (int, default=128) – Batch size used when encoding multiple molecules.
add_bos_eos (Optional[bool], default=None) – Whether to add beginning/end of sequence tokens. If None, models in known_add_bos_eos_list will be set to True. The current known_add_bos_eos_list includes: [“entropy/gpt2_zinc_87m”].
model_name (str, default="PretrainedMolecularEncoder") – Name identifier for the model instance.
verbose (bool, default=False) – Whether to display progress information during encoding.
- encode(X: List[str], return_type: Literal['np', 'pt'] = 'pt') → ndarray | Tensor [source]¶
Encode molecules into vector representations.
- Parameters:
X (List[str]) – List of SMILES strings
return_type (Literal["np", "pt"], default="pt") – Return type of the representations
- Returns:
representations – Molecular representations
- Return type:
ndarray or torch.Tensor
- fit() → HFPretrainedMolecularEncoder [source]¶
Load the pretrained model and tokenizer from the Hugging Face Hub; no training is performed.
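Usage sketch (per the model list above, ChemBERTa outputs 384-dimensional representations):
>>> from torch_molecule.encoder.pretrained.modeling_pretrained import HFPretrainedMolecularEncoder
>>> # fit() takes no data: it only downloads and loads the pretrained weights
>>> encoder = HFPretrainedMolecularEncoder(repo_id="DeepChem/ChemBERTa-77M-MTR")
>>> encoder.fit()
>>> reps = encoder.encode(["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO"])  # shape (2, 384)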