Data-Driven Splitting Strategies
Scaffold-based data splitting implementations.
- class openadmet.models.split.scaffold.MaxDissimilaritySplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]
Bases: SplitterBase
Splits the data based on maximum dissimilarity.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- split(X, y)[source]
Split the data into train, validation, and test sets.
- Parameters:
X (Iterable[str]) – List or iterable of SMILES strings to split.
y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.
- Returns:
Tuple containing:
- X_train: Training set SMILES strings.
- X_val: Validation set SMILES strings (or None if val_size=0).
- X_test: Test set SMILES strings (or None if test_size=0).
- y_train: Training set target values.
- y_val: Validation set target values (or None if val_size=0).
- y_test: Test set target values (or None if test_size=0).
- Return type:
tuple
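The docstring above does not say how "maximum dissimilarity" is computed. As a rough illustration of the general idea only, and not a description of what MaxDissimilaritySplitter does internally, the sketch below uses greedy MaxMin picking: it repeatedly selects the item farthest from everything already selected. The function names, the set-of-features representation, and the Jaccard distance are all assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def maxmin_pick(features, n_pick, seed_index=0):
    """Greedily pick indices so each new pick is far from all earlier picks.

    `features` is a list of sets (hypothetical stand-ins for molecular
    fingerprints). Returns the picked indices in selection order.
    """
    picked = [seed_index]
    # Distance from every item to its nearest already-picked item.
    nearest = [jaccard_distance(f, features[seed_index]) for f in features]
    while len(picked) < min(n_pick, len(features)):
        remaining = [i for i in range(len(features)) if i not in picked]
        candidate = max(remaining, key=lambda i: nearest[i])
        picked.append(candidate)
        for i, f in enumerate(features):
            nearest[i] = min(nearest[i], jaccard_distance(f, features[candidate]))
    return picked


features = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "z"}, {"a"}]
diverse = maxmin_pick(features, n_pick=3)
```

A dissimilarity-driven split could, for instance, hold out such a diverse subset so that evaluation covers regions of chemical space far from one another; whether the class does this is not stated here.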
- class openadmet.models.split.scaffold.PerimeterSplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]
Bases: SplitterBase
Splits the data based on the perimeter of the molecules.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- split(X, y)[source]
Split the data into train, validation, and test sets.
- Parameters:
X (Iterable[str]) – List or iterable of SMILES strings to split.
y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.
- Returns:
Tuple containing:
- X_train: Training set SMILES strings.
- X_val: Validation set SMILES strings (or None if val_size=0).
- X_test: Test set SMILES strings (or None if test_size=0).
- y_train: Training set target values.
- y_val: Validation set target values (or None if val_size=0).
- y_test: Test set target values (or None if test_size=0).
- Return type:
tuple
- class openadmet.models.split.scaffold.ScaffoldSplitter(*, train_size: float = 0.8, val_size: float = 0.0, test_size: float = 0.2, random_state: int = 42)[source]
Bases: SplitterBase
Splits the data based on the scaffold of the molecules.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- split(X, y)[source]
Split the data into train, validation, and test sets.
- Parameters:
X (Iterable[str]) – List or iterable of SMILES strings to split.
y (Iterable[float] or pd.Series) – List or iterable of target values corresponding to the SMILES strings.
- Returns:
Tuple containing:
- X_train: Training set SMILES strings.
- X_val: Validation set SMILES strings (or None if val_size=0).
- X_test: Test set SMILES strings (or None if test_size=0).
- y_train: Training set target values.
- y_val: Validation set target values (or None if val_size=0).
- y_test: Test set target values (or None if test_size=0).
- Return type:
tuple
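To make the idea of a scaffold split concrete, here is a minimal sketch of one common approach: molecules sharing a scaffold are kept together, and whole groups are assigned to train or test so that no scaffold appears in both. It is not taken from ScaffoldSplitter's source and may differ from it. The function name `group_split` is hypothetical, and the scaffold keys are assumed to be precomputed strings (in practice they might come from a cheminformatics toolkit such as RDKit).

```python
from collections import defaultdict


def group_split(scaffolds, train_size=0.8):
    """Assign whole scaffold groups to train or test.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed).
    Returns (train_indices, test_indices) as lists of positional indices.
    """
    groups = defaultdict(list)
    for i, key in enumerate(scaffolds):
        groups[key].append(i)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = train_size * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A group goes to train only if it fits under the size budget.
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


scaffolds = ["A", "A", "A", "B", "B", "C", "A", "D", "B", "A"]
train_idx, test_idx = group_split(scaffolds, train_size=0.8)
```

Because groups are never broken up, the realized train fraction can fall short of `train_size` when group sizes do not divide evenly.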
- openadmet.models.split.scaffold.safe_index(data, idx)[source]
Index data by position, using the appropriate method depending on whether X and y are numpy arrays or pandas Series/DataFrames.
- Parameters:
data (np.ndarray, list, pd.Series, or pd.DataFrame) – X or y data
idx (list) – list of integers (positional indices)
- Returns:
indexed data
- Return type:
np.ndarray or pd.Series
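The reason a helper like this is useful: indexing a pandas Series with a list of integers via `series[[0, 2]]` looks up labels rather than positions, whereas numpy arrays accept a list of integers as positions and plain Python lists reject list indices altogether. The sketch below shows one way to handle all three cases. It is an illustration under those assumptions, not the source of safe_index, and the name `take_positional` is hypothetical.

```python
def take_positional(data, idx):
    """Select items from `data` at the integer positions in `idx`.

    Duck-typed so it needs no imports: pandas objects expose `.iloc`,
    numpy arrays accept a list of ints directly, and plain sequences
    fall back to a list comprehension.
    """
    if hasattr(data, "iloc"):
        # pandas Series/DataFrame: .iloc is always positional.
        return data.iloc[idx]
    try:
        # numpy arrays support integer-list ("fancy") indexing.
        return data[idx]
    except TypeError:
        # Lists and tuples raise TypeError for list indices.
        return [data[i] for i in idx]


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
subset = take_positional(smiles, [3, 0])
```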