Data-Driven Splitting Strategies

Scaffold-based data splitting implementations.

class openadmet.models.split.scaffold.MaxDissimilaritySplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]

Bases: SplitterBase

Splits the data based on maximum dissimilarity.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

split(X, y)[source]

Split the data into train, validation, and test sets.

Parameters:
  • X (Iterable[str]) – List or iterable of SMILES strings to split.

  • y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.

Returns:

Tuple containing:
  • X_train – Training set SMILES strings.

  • X_val – Validation set SMILES strings (or None if val_size=0).

  • X_test – Test set SMILES strings (or None if test_size=0).

  • y_train – Training set target values.

  • y_val – Validation set target values (or None if val_size=0).

  • y_test – Test set target values (or None if test_size=0).

Return type:

tuple
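
This reference does not say how maximum dissimilarity is computed. As a rough illustration of the general idea only, the sketch below uses greedy MaxMin selection, a common diversity-picking heuristic, over toy feature sets compared by Jaccard distance. Whether MaxDissimilaritySplitter works this way is not stated here; the names jaccard_distance and maxmin_pick and the toy feature sets are hypothetical and not part of openadmet.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two feature sets (1.0 means nothing shared)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def maxmin_pick(features, n_pick, start=0):
    """Greedy MaxMin: repeatedly pick the item whose nearest picked neighbour is farthest."""
    n_pick = min(n_pick, len(features))
    picked = [start]
    while len(picked) < n_pick:
        best_idx, best_dist = None, -1.0
        for i in range(len(features)):
            if i in picked:
                continue
            # Distance from candidate i to the closest already-picked item.
            nearest = min(jaccard_distance(features[i], features[j]) for j in picked)
            if nearest > best_dist:
                best_idx, best_dist = i, nearest
        picked.append(best_idx)
    return picked


# Hypothetical feature sets standing in for molecular fingerprints.
features = [
    {1, 2, 3},
    {1, 2, 4},
    {7, 8, 9},
    {1, 2, 3, 4},
    {10, 11},
]

# Hold out two mutually dissimilar items; everything else is used for training.
test_idx = maxmin_pick(features, n_pick=2)
train_idx = [i for i in range(len(features)) if i not in test_idx]
```

With real molecules the feature sets would be replaced by fingerprints and a matching distance; the selection loop itself would not change.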

class openadmet.models.split.scaffold.PerimeterSplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]

Bases: SplitterBase

Splits the data based on the perimeter of the molecules.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

split(X, y)[source]

Split the data into train, validation, and test sets.

Parameters:
  • X (Iterable[str]) – List or iterable of SMILES strings to split.

  • y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.

Returns:

Tuple containing:
  • X_train – Training set SMILES strings.

  • X_val – Validation set SMILES strings (or None if val_size=0).

  • X_test – Test set SMILES strings (or None if test_size=0).

  • y_train – Training set target values.

  • y_val – Validation set target values (or None if val_size=0).

  • y_test – Test set target values (or None if test_size=0).

Return type:

tuple
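
Because the validation pair is None when val_size=0 (and likewise the test pair when test_size=0), callers need to guard against missing subsets after unpacking. The sketch below uses a hypothetical stand-in, toy_split, that only mimics the documented six-element return order with a plain positional cut. It is not how PerimeterSplitter chooses its subsets.

```python
def toy_split(X, y, train_size=0.8, val_size=0.0, test_size=0.2):
    """Hypothetical stand-in mimicking the documented return order (not the real splitter)."""
    X, y = list(X), list(y)
    n = len(X)
    n_train = round(n * train_size)
    n_val = round(n * val_size)
    bounds = {
        "train": (0, n_train),
        "val": (n_train, n_train + n_val),
        "test": (n_train + n_val, n),
    }
    sizes = {"train": train_size, "val": val_size, "test": test_size}

    def cut(data, name):
        # A subset whose requested size is 0 is reported as None, as documented.
        if sizes[name] == 0:
            return None
        lo, hi = bounds[name]
        return data[lo:hi]

    return (
        cut(X, "train"), cut(X, "val"), cut(X, "test"),
        cut(y, "train"), cut(y, "val"), cut(y, "test"),
    )


smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
targets = [0.1, 0.2, 0.3, 0.4, 0.5]

# Unpack in the documented order: X_train, X_val, X_test, y_train, y_val, y_test.
X_train, X_val, X_test, y_train, y_val, y_test = toy_split(smiles, targets)

# Guard against the None subsets before using them.
n_val = 0 if X_val is None else len(X_val)
```

The same unpacking order and None checks apply to the other splitters in this module, since they document the same return tuple.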

class openadmet.models.split.scaffold.ScaffoldSplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]

Bases: SplitterBase

Splits the data based on the scaffold of the molecules.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

split(X, y)[source]

Split the data into train, validation, and test sets.

Parameters:
  • X (Iterable[str]) – List or iterable of SMILES strings to split.

  • y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.

Returns:

Tuple containing:
  • X_train – Training set SMILES strings.

  • X_val – Validation set SMILES strings (or None if val_size=0).

  • X_test – Test set SMILES strings (or None if test_size=0).

  • y_train – Training set target values.

  • y_val – Validation set target values (or None if val_size=0).

  • y_test – Test set target values (or None if test_size=0).

Return type:

tuple
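
In general, the defining property of a scaffold split is that molecules sharing a scaffold land on the same side of the split. The sketch below shows only that grouping step, using precomputed, hypothetical scaffold labels (in practice these would come from a cheminformatics toolkit). The function group_split and its largest-group-first ordering are assumptions for illustration, not the ScaffoldSplitter implementation.

```python
from collections import defaultdict


def group_split(scaffolds, train_size=0.8):
    """Assign whole scaffold groups to train while they fit in the train quota."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by label so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n_train_target = train_size * len(scaffolds)
    train_idx, test_idx = [], []
    for _, members in ordered:
        # A group is never divided: it goes entirely to train or entirely to test.
        if len(train_idx) + len(members) <= n_train_target:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold label per molecule, in the same order as X.
scaffolds = [
    "benzene", "benzene", "benzene", "pyridine", "pyridine",
    "indole", "benzene", "furan", "pyridine", "indole",
]
train_idx, test_idx = group_split(scaffolds, train_size=0.8)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
```

Because groups are kept whole, the realised train fraction can differ from train_size when group sizes do not divide evenly.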

openadmet.models.split.scaffold.safe_index(data, idx)[source]

Index data by position, using the appropriate indexing method depending on whether X and y are numpy arrays or pandas Series/DataFrames.

Parameters:
  • data (np.ndarray, list, pd.Series, or pd.DataFrame) – X or y data

  • idx (list) – list of integers (positional indices)

Returns:

indexed data

Return type:

np.ndarray or pd.Series
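
The distinction matters mostly for pandas objects, where plain bracket indexing with a list of integers may be treated as label-based lookup rather than positional. A minimal sketch of the idea follows, written with duck typing so it runs without pandas or numpy installed. The name positional_index is hypothetical, and this is not the body of safe_index.

```python
def positional_index(data, idx):
    """Select items by integer position from pandas objects, numpy arrays, or sequences."""
    if hasattr(data, "iloc"):
        # pandas Series/DataFrame: .iloc is positional regardless of index labels.
        return data.iloc[idx]
    if hasattr(data, "shape"):
        # numpy arrays accept a list of integers directly (integer array indexing).
        return data[idx]
    # Plain lists and tuples do not accept a list of indices, so pick one by one.
    return [data[i] for i in idx]


values = ["a", "b", "c", "d"]
picked = positional_index(values, [2, 0])
```

For a pandas Series whose index is not 0..n-1 (for example after filtering rows), the .iloc branch is what keeps positions produced by a splitter aligned with the data.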