Anvil Reference

Anvil is a workflow engine integrated into openadmet_models that lets users define a human-readable model specification to reproducibly train and evaluate machine learning models. It is designed to facilitate large-scale, reproducible training and comparison across different datasets, models, and featurizations. Anvil also supports training model ensembles that can be used in downstream applications such as active learning.

Anvil is built around the concept of a “recipe”: a YAML file that specifies all the components of a machine learning workflow, including data loading, featurization, model architecture, training parameters, and evaluation metrics. By defining a recipe, users can easily reproduce experiments, share workflows with others, and systematically explore different modeling approaches. Anvil also makes life easier for us on the OpenADMET team by handling the boilerplate code associated with setting up and running machine learning experiments, saving us from a twisted jungle of scripts and configuration files.

A full list of available models, featurizers, trainers, and evaluators can be found in the OpenADMET API documentation. Additionally, we maintain a list of canonical recipes we use in production here.

How to Use Anvil

To start an Anvil workflow, you must provide a recipe YAML file. Many configuration options are available, but every workflow consists of four main sections: metadata, data, procedure, and report.
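As a minimal sketch of that top-level layout, the check below treats a parsed recipe as a plain Python mapping and reports which of the four sections are absent. The helper `missing_sections` and the stub recipe are hypothetical and not part of openadmet_models; only the four section names come from the text above.

```python
# Hypothetical helper: not part of openadmet_models.
REQUIRED_SECTIONS = ("metadata", "data", "procedure", "report")


def missing_sections(recipe: dict) -> list[str]:
    """Return the required top-level sections that the recipe lacks."""
    return [name for name in REQUIRED_SECTIONS if name not in recipe]


# A parsed recipe would normally come from a YAML loader; here it is a stub.
recipe = {"metadata": {}, "data": {}, "procedure": {}}
print(missing_sections(recipe))  # the stub has no report section
```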

This guide should help you navigate the anvil workflow and understand the parameters you can set, their types, and how they interact across models and trainers.

Metadata

The metadata section provides essential information about the workflow, such as authorship, versioning, and descriptive tags. This section ensures that workflows are well-documented and easily identifiable. Many of these fields are purely descriptive and do not affect the workflow’s execution.

Parameters

.. list-table::
   :header-rows: 1
   :widths: 20 25 55

   * - Name
     - Type
     - Description
   * - authors
     - str | list[str]
     - Author(s) of the workflow.
   * - email
     - str
     - Contact email.
   * - biotargets
     - list[str]
     - List of biological targets associated with the workflow.
   * - build_number
     - int
     - Iteration number of the workflow.
   * - description
     - str
     - Short description of the workflow.
   * - driver
     - str
     - Backend framework for the workflow (e.g., ``pytorch``, ``sklearn``, or ``pytorch_ensemble``).
   * - name
     - str
     - Workflow name.
   * - tag
     - str
     - Main tag for the workflow.
   * - tags
     - list[str]
     - Additional tags associated with the workflow description.
   * - version
     - str
     - Version of the metadata schema. Currently must be set to ``v1``.
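
A rough illustration of the field types in the table, assuming the YAML metadata block parses into a plain mapping. The checker `check_metadata` and all sample values are hypothetical. It encodes only what the table states (the listed types and the ``v1`` requirement) and is not the validation that openadmet_models itself performs.

```python
# Expected Python types per metadata field, taken from the table above.
FIELD_TYPES = {
    "authors": (str, list),
    "email": (str,),
    "biotargets": (list,),
    "build_number": (int,),
    "description": (str,),
    "driver": (str,),
    "name": (str,),
    "tag": (str,),
    "tags": (list,),
    "version": (str,),
}


def check_metadata(metadata: dict) -> list[str]:
    """Return human-readable problems found in a metadata mapping."""
    problems = []
    for field, value in metadata.items():
        expected = FIELD_TYPES.get(field)
        if expected is None:
            problems.append(f"unknown field: {field}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    if metadata.get("version") != "v1":
        problems.append("version must be 'v1'")
    return problems


# Placeholder values for illustration only.
metadata = {
    "version": "v1",
    "driver": "sklearn",
    "name": "example-workflow",
    "build_number": 0,
    "authors": ["A. Author"],
    "tags": ["example"],
}
print(check_metadata(metadata))  # an empty list means no problems were found
```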

Data

The data section defines how input data is loaded and which columns are used for modeling. You must specify the dataset location, input column, target columns, and optional preprocessing steps. The data loader can read from remote locations as well as local files.

Reading from a local file requires specifying the path to the dataset file in the resource field. Supported file types include CSV and Parquet. If you use resource, your dataset is split into training, validation, and test sets by the splitter specified in the procedure section.

Alternatively, you can also provide separate files for training, validation, and test sets by using the train_resource, val_resource, and test_resource fields, respectively.

A more advanced option is to use an Intake catalog to manage datasets. This is done by specifying a YAML file in the resource field and the catalog entry in the cat_entry field. This allows for more flexible dataset management, especially when dealing with multiple datasets or complex data sources.

Pulling data from a remote location is also possible by specifying a URL in the resource field.

An example of using train, validation, and test resources:
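A minimal sketch of how the ways of supplying data differ, with the data section written as the Python mapping the YAML would parse into. The function `describe_data_source`, the file names, and the column names are hypothetical; the field names are those documented in this section. Treating cat_entry as mandatory for a catalog, and recognizing a catalog by its file extension, are assumptions made for the illustration.

```python
def describe_data_source(data: dict) -> str:
    """Classify a data section by the fields it uses (illustrative only)."""
    split_fields = ("train_resource", "val_resource", "test_resource")
    if any(data.get(field) for field in split_fields):
        return "pre-split files"
    resource = data.get("resource", "")
    if resource.endswith((".yaml", ".yml")):
        # An Intake catalog needs cat_entry to pick the dataset (assumed).
        if not data.get("cat_entry"):
            raise ValueError("cat_entry is required for a catalog resource")
        return "intake catalog"
    return "single file, split by the procedure's splitter"


# Separate training, validation, and test files (placeholder paths).
data = {
    "type": "intake",
    "train_resource": "train.csv",
    "val_resource": "val.csv",
    "test_resource": "test.csv",
    "input_col": "smiles",
    "target_cols": ["activity"],
    "dropna": True,
}
print(describe_data_source(data))
```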

Parameters

.. list-table::
   :header-rows: 1
   :widths: 20 25 55

   * - Name
     - Type
     - Description
   * - resource
     - str
     - Path to dataset file. Allowed filetypes: YAML, CSV, parquet.
       Can also be a URL to a remote file.
   * - train_resource
     - Optional[str]
     - Path to training dataset file. Allowed filetypes: CSV, parquet.
       Can also be a URL to a remote file.
   * - val_resource
     - Optional[str]
     - Path to validation dataset file. Allowed filetypes: CSV, parquet.
       Can also be a URL to a remote file.
   * - test_resource
     - Optional[str]
     - Path to test dataset file. Allowed filetypes: CSV, parquet.
       Can also be a URL to a remote file.
   * - type
     - str, default: ``intake``
     - Loader type. Must be ``intake``. Uses the `Intake`_ data catalog
       system to read datasets.
   * - input_col
     - str
     - Column name containing molecular input.
   * - target_cols
     - Union[str, list[str]]
     - Name(s) of the target column(s) for the model to predict.
   * - dropna
     - Optional[bool]
     - Whether to drop rows with missing values (``NaN``) in the input or
       target columns.
   * - cat_entry
     - Optional[str]
     - Used when ``resource`` is a YAML file, to specify which
       catalog entry to load.
   * - anvil_dir
     - Optional[str]
     - Allows for ``resource`` to point to a directory path.
       Useful for flexible dataset locations.

Procedure

The procedure section is the core of the workflow, where the data is transformed, models are defined, data splits are configured, and training parameters are set.

  • Featurization: Defines how molecular data is transformed into numerical representations using various available featurizers, specified in the feat subsection.

  • Model: Specifies the model to be used, including loading from saved model weights, under the model subsection.

  • Splitting: Configures how the dataset is divided into training, validation, and test sets using the assigned splitter, defined in the split subsection.

  • Training: Sets up the training process, including the trainer type and training parameters, under the train subsection.

Each subsection provides examples and parameter descriptions to help you configure the workflow according to your requirements.

Featurization

The feat section selects a featurizer, which maps molecular data into an input format suitable for the specified model. Below are the available options. Each featurizer has its own set of parameters, which can be found in the linked OpenADMET API documentation.

In general, we follow the design pattern that featurizers for deep learning models return a PyTorch DataLoader, while featurizers for traditional machine learning models return a 2D NumPy array or pandas DataFrame.

.. list-table::
   :header-rows: 1

   * - Featurizer
     - Description
   * - ChemPropFeaturizer
     - Converts SMILES strings into a ChemProp compatible PyTorch DataLoader.
   * - DescriptorFeaturizer
     - Uses the molfeat library to compute molecular descriptors.
   * - FingerprintFeaturizer
     - Uses the molfeat library to compute molecular fingerprints.
   * - FeatureConcatenator
     - Combines multiple featurizers into a single feature array.

For example, featurization for a traditional machine learning model using fingerprints is easily done by specifying the FingerprintFeaturizer.

You can also combine multiple traditional ML featurizers using the FeatureConcatenator. Here we combine RDKit 2D descriptors and ECFP4 fingerprints.
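Conceptually, concatenation places each featurizer's output side by side, so every molecule ends up with one longer feature vector. The sketch below shows that idea with two toy stand-in featurizers in pure Python. It does not use molfeat or the actual FeatureConcatenator, and the function names and features are invented for illustration.

```python
def toy_descriptors(smiles: str) -> list[float]:
    """Stand-in for a descriptor featurizer: two trivial string statistics."""
    return [float(len(smiles)), float(smiles.count("C"))]


def toy_fingerprint(smiles: str, n_bits: int = 4) -> list[float]:
    """Stand-in for a fingerprint: set bit ord(ch) % n_bits per character."""
    bits = [0.0] * n_bits
    for ch in smiles:
        bits[ord(ch) % n_bits] = 1.0
    return bits


def concatenate(smiles_list, featurizers):
    """Build one row per molecule by joining each featurizer's output."""
    rows = []
    for smiles in smiles_list:
        row = []
        for featurize in featurizers:
            row.extend(featurize(smiles))
        rows.append(row)
    return rows


features = concatenate(["CCO", "c1ccccc1"], [toy_descriptors, toy_fingerprint])
print(len(features), len(features[0]))  # 2 molecules, 2 + 4 features each
```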

For deep learning models, architectures require specific featurizers to prepare the data in the correct format. As an example, the ChemPropFeaturizer is selected for ChemProp-family models.

Model

The model section specifies the model to be used in the workflow. It allows you to define the type of model, its parameters, and any additional configurations required for training and evaluation. Each model type has its own set of options, enabling customization to suit specific tasks and datasets. Refer to the linked OpenADMET API documentation for detailed information on each model’s implementation and usage.

.. list-table::
   :header-rows: 1

   * - Model Type
     - Description
   * - ChemPropModel
     - ChemProp Message Passing Neural Network. Also used when implementing Chemeleon.
   * - CatBoostClassifierModel
     - Gradient boosting on decision trees for classification using CatBoost.
   * - CatBoostRegressorModel
     - Gradient boosting on decision trees for regression using CatBoost.
   * - LGBMClassifierModel
     - LightGBM classifier.
   * - LGBMRegressorModel
     - LightGBM regressor.
   * - XGBClassifierModel
     - XGBoost classifier.
   * - XGBRegressorModel
     - XGBoost regressor.
   * - RFClassifierModel
     - scikit-learn Random Forest classifier.
   * - RFRegressorModel
     - scikit-learn Random Forest regressor.
   * - TabPFNClassifierModel
     - TabPFN classification model using the basic tabpfn implementation.
   * - TabPFNRegressorModel
     - TabPFN regression model using the basic tabpfn implementation.
   * - TabPFNPostHocClassifierModel
     - TabPFN classification model using tabpfn-extensions with posthoc ensembling.
   * - TabPFNPostHocRegressorModel
     - TabPFN regression model using tabpfn-extensions with posthoc ensembling.
   * - DummyClassifierModel
     - scikit-learn Dummy Classifier for baseline comparisons.
   * - DummyRegressorModel
     - scikit-learn Dummy Regressor for baseline comparisons.
   * - SVMClassifierModel
     - scikit-learn Support Vector Machine classifier (SVC).
   * - SVMRegressorModel
     - scikit-learn Support Vector Machine regressor (SVR).

Example

Deep learning models can also be configured to load pretrained weights during initialization, enabling finetuning on new datasets. Classic machine learning models do not have weights or state to load, requiring complete retraining with new data concatenated to the original training set. Paths to a saved model are specified with the serial_path and param_path fields in the model section. Note that any model parameters defined in the params field will be overridden by those loaded from the saved model.
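The precedence described above can be pictured as a dictionary merge in which saved values win. This is only an illustration of the stated rule, not how openadmet_models implements loading. The helper `effective_params` and the parameter names are hypothetical, and whether recipe-only parameters survive the merge is an assumption.

```python
def effective_params(recipe_params: dict, saved_params: dict) -> dict:
    """Merge parameters so that values from the saved model take priority."""
    merged = dict(recipe_params)  # copy, so the recipe values stay untouched
    merged.update(saved_params)   # saved values override recipe values
    return merged


recipe_params = {"depth": 3, "dropout": 0.1}  # from the recipe's params field
saved_params = {"depth": 5}                   # restored from the saved model
print(effective_params(recipe_params, saved_params))
```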

Finally, select components of the deep learning model can be frozen during training to prevent their weights from being updated. This is done by specifying the freeze_weights field in the model section. Here, the message passing layers are frozen, while batch normalization and feedforward network layers remain trainable.

Freezable components vary by model architecture. Please refer to the specific model documentation for details on which parts can be frozen.

Splitting

The split section defines how the dataset is divided into training, validation, and test sets. You can choose from different splitter types, each with its own parameters to control the splitting behavior.

.. list-table::
   :header-rows: 1

   * - Splitter
     - Description
   * - ShuffleSplitter
     - Randomly shuffles and splits the dataset into training, validation, and test sets based on specified proportions.
   * - ScaffoldSplitter
     - Splits the dataset based on molecular scaffolds to ensure that similar compounds are grouped together in the same set.
   * - MaxDissimilaritySplitter
     - Splits the dataset based on maximum dissimilarity between training, validation, and test sets, promoting diversity between sets.
   * - PerimeterSplitter
     - Splits the dataset by selecting compounds at the periphery of the chemical space, ensuring that edge cases are included in the test set.

Example
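To make the idea of a proportional random split concrete, here is a small pure-Python sketch of what a shuffle-based splitter does conceptually. It is not the ShuffleSplitter implementation; the function `shuffle_split` and its argument names are hypothetical, chosen only for this illustration.

```python
import random


def shuffle_split(items, train_size, val_size, test_size, seed=0):
    """Shuffle items and cut them into train/val/test by proportion."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(train_size * len(shuffled))
    n_val = round(val_size * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


train, val, test = shuffle_split(range(10), 0.7, 0.1, 0.2, seed=42)
print(len(train), len(val), len(test))
```

Scaffold-based and dissimilarity-based splitters replace the random shuffle with a grouping or ordering derived from the molecules themselves, which typically makes the test set harder than a random one.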

Training

The train section configures the training process for the selected model. It allows you to specify the trainer type and various training parameters to control the training workflow.

.. list-table::
   :header-rows: 1

   * - Trainer
     - Description
   * - LightningTrainer
     - Trainer for deep learning models using PyTorch Lightning.
   * - SKLearnBasicTrainer
     - Basic trainer for sklearn models.
   * - SKLearnGridSearchTrainer
     - Trainer that performs hyperparameter tuning for sklearn models using grid search (GridSearchCV).
   * - SKLearnSearchTrainer
     - Trainer that performs hyperparameter tuning for sklearn models using a specified search object.

Example
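Grid search evaluates every combination of the candidate hyperparameter values, which is why its cost grows multiplicatively with each added parameter. The sketch below only enumerates the combinations; scikit-learn's GridSearchCV additionally fits and cross-validates a model for each one. The helper `expand_grid` and the grid values are hypothetical, and the parameter names that apply depend on the chosen model.

```python
from itertools import product


def expand_grid(param_grid: dict) -> list[dict]:
    """List every combination of values in a hyperparameter grid."""
    names = sorted(param_grid)
    combos = product(*(param_grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


# Placeholder grid; real parameter names depend on the chosen model.
param_grid = {"max_depth": [3, 5], "n_estimators": [100, 200, 400]}
candidates = expand_grid(param_grid)
print(len(candidates))  # 2 * 3 = 6 combinations
print(candidates[0])
```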

Ensemble

There is also an optional ensemble section for training an ensemble of models. You can define the number of models in the ensemble and the calibration method to use. Currently, the only ensemble type offered is CommitteeRegressor, which measures disagreement among the models in the ensemble as the standard deviation of their predictions.

Models can also be calibrated after training using a scaling factor method to improve uncertainty estimates. This functionality is provided by the uncertainty_toolbox package. See UncertaintyMetrics for more details.

Example
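The committee idea can be sketched in a few lines: each model predicts every sample, the mean across models is the ensemble prediction, and the standard deviation across models is the uncertainty estimate. This is a conceptual illustration rather than the CommitteeRegressor code; the function name and the numbers are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def committee_predict(member_predictions):
    """Return (means, stds) per sample from per-model prediction lists."""
    per_sample = list(zip(*member_predictions))  # regroup by sample
    means = [mean(values) for values in per_sample]
    stds = [pstdev(values) for values in per_sample]
    return means, stds


# Three hypothetical models, each predicting the same two samples.
member_predictions = [
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 6.0],
]
means, stds = committee_predict(member_predictions)
print(means)  # ensemble prediction per sample
print(stds)   # zero where the models agree, larger where they disagree
```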

For deep learning ensembles, each model in the ensemble can load its own pretrained weights by specifying a list of paths in the serial_paths and param_paths fields in the ensemble section.

Report

The report section specifies the evaluations to be performed after training the model. You can choose from various evaluation types, each with its own parameters to customize the output. Regression models are only compatible with RegressionMetrics and similarly classification models only with ClassificationMetrics.

Importantly, the report section also allows for cross-validation to be performed to evaluate the robustness of the model. Note that cross-validation can be computationally expensive, especially for deep learning models.
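The cost comes from the number of model fits: repeated K-fold trains one model per fold, per repeat. The sketch below generates the index splits in pure Python to make that count visible. It is an illustration of the general technique, not the code used by the cross-validation evaluators listed below; the function and argument names are hypothetical (they mirror scikit-learn's RepeatedKFold naming).

```python
import random


def repeated_kfold(n_samples, n_splits, n_repeats, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated K-fold."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for each repeat
        for fold in range(n_splits):
            test_idx = indices[fold::n_splits]  # every n_splits-th index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


folds = list(repeated_kfold(n_samples=10, n_splits=5, n_repeats=2))
print(len(folds))  # 5 folds * 2 repeats = 10 model fits
```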

.. list-table::
   :header-rows: 1

   * - Evaluation
     - Description
   * - RegressionMetrics
     - Computes regression statistics.
   * - RegressionPlots
     - Generates plots of predicted vs true values for regression tasks.
   * - ClassificationMetrics
     - Computes classification statistics.
   * - ClassificationPlots
     - Generates plots such as ROC and precision-recall curves for classification tasks.
   * - SKLearnRepeatedKFoldCrossValidation
     - Performs repeated K-Fold cross-validation for sklearn models. Can be computationally expensive.
   * - PytorchLightningRepeatedKFoldCrossValidation
     - Performs repeated K-Fold cross-validation for PyTorch Lightning models. Can be computationally expensive.
   * - PosthocBinaryMetrics
     - Computes posthoc binary metrics. Intended for regression-based models, to calculate precision and recall at user-supplied cutoffs. Not intended for binary models.
   * - UncertaintyMetrics
     - Evaluates uncertainty metrics using uncertainty_toolbox.
   * - UncertaintyPlots
     - Generates uncertainty plots.
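
To illustrate what posthoc binary metrics mean for a regression model, the sketch below binarizes measured and predicted values at a cutoff and then computes precision and recall. It is not the PosthocBinaryMetrics implementation; the function name, the data, and the choice that values at or above the cutoff count as positive are all assumptions made for this example.

```python
def posthoc_precision_recall(y_true, y_pred, cutoff):
    """Binarize regression values at cutoff and compute precision/recall."""
    true_pos = [t >= cutoff for t in y_true]  # measured positives
    pred_pos = [p >= cutoff for p in y_pred]  # predicted positives
    tp = sum(t and p for t, p in zip(true_pos, pred_pos))
    fp = sum((not t) and p for t, p in zip(true_pos, pred_pos))
    fn = sum(t and (not p) for t, p in zip(true_pos, pred_pos))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Placeholder measured and predicted values on a continuous scale.
y_true = [5.2, 6.8, 7.1, 4.9, 6.1]
y_pred = [5.0, 6.5, 5.9, 6.2, 6.4]
print(posthoc_precision_recall(y_true, y_pred, cutoff=6.0))
```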