mlchem.ml.preprocessing.undersampling.undersample

undersample(train_set: DataFrame, test_set: DataFrame, class_column: str, desired_proportion_majority: float, add_dropped_to_test: bool = False, random_seed: int | None = 1) → tuple[DataFrame, DataFrame]

Undersample the majority class in a training set to achieve a desired class balance.

Parameters:
  • train_set (pandas.DataFrame) – The training dataset.

  • test_set (pandas.DataFrame) – The test dataset.

  • class_column (str) – Name of the column containing class labels.

  • desired_proportion_majority (float) – Target proportion of majority-class samples in the training set after undersampling.

  • add_dropped_to_test (bool, default=False) – Whether to add the samples dropped from the training set to the test set.

  • random_seed (int or None, default=1) – Random seed for reproducibility.

Returns:

The undersampled training set and the updated test set.

Return type:

tuple of pandas.DataFrame
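
Example (illustrative sketch):

The following is a minimal, standard-library sketch of the kind of logic described above. It is not the mlchem implementation. The helper name undersample_sketch is hypothetical, and plain lists of dicts stand in for the DataFrames. It also assumes that the proportion is a fraction strictly between 0 and 1, that the number of majority samples to keep is obtained by rounding, and that all non-majority classes are left untouched. The real function may differ on any of these points.

```python
import random
from collections import Counter


def undersample_sketch(train_rows, test_rows, class_key,
                       desired_proportion_majority,
                       add_dropped_to_test=False, random_seed=1):
    """Hypothetical stand-in for illustration only (not mlchem code)."""
    p = desired_proportion_majority
    if not 0 < p < 1:
        raise ValueError("desired_proportion_majority must be in (0, 1)")

    # Identify the majority class and count the remaining samples.
    counts = Counter(row[class_key] for row in train_rows)
    majority_label, n_majority = counts.most_common(1)[0]
    n_other = len(train_rows) - n_majority

    # Solve n_keep / (n_keep + n_other) = p for n_keep (assumed rounding),
    # and never keep more majority samples than actually exist.
    n_keep = min(n_majority, round(p * n_other / (1 - p)))

    # Randomly choose which majority samples stay in the training set.
    majority_idx = [i for i, row in enumerate(train_rows)
                    if row[class_key] == majority_label]
    rng = random.Random(random_seed)
    kept = set(rng.sample(majority_idx, n_keep))

    new_train, dropped = [], []
    for i, row in enumerate(train_rows):
        if row[class_key] != majority_label or i in kept:
            new_train.append(row)
        else:
            dropped.append(row)

    # Optionally move the dropped samples into the test set.
    new_test = list(test_rows) + (dropped if add_dropped_to_test else [])
    return new_train, new_test


# Toy data: 90 samples of class 0 (majority) and 10 of class 1.
train = ([{"id": i, "label": 0} for i in range(90)]
         + [{"id": 90 + i, "label": 1} for i in range(10)])
test = [{"id": 100 + i, "label": i % 2} for i in range(20)]

new_train, new_test = undersample_sketch(
    train, test, "label", 0.5, add_dropped_to_test=True
)
```

Under these assumptions, a target of 0.5 keeps as many majority samples as there are other samples, and with add_dropped_to_test=True the test set grows by the number of samples removed from the training set. Reusing the same integer seed reproduces the same selection.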