mlchem.ml.feature_selection package

Submodules

mlchem.ml.feature_selection.filters module

collinearity_filter(df: DataFrame, threshold: float, target_variable: str = None, method: Literal['pearson', 'kendall', 'spearman'] = 'pearson', numeric_only: bool = False) DataFrame

Filter features based on collinearity threshold.

Returns a subset of DataFrame columns whose squared correlation (R²) values are below the specified threshold. If a target variable is provided, the function retains the feature with the higher correlation to the target when multiple features are collinear.

Parameters:
  • df (pandas.DataFrame) – The input dataset.

  • threshold (float) – The maximum allowed squared correlation between features.

  • target_variable (str, optional) – The name of the target variable. If provided, it is used to resolve collinearity conflicts.

  • method ({'pearson', 'kendall', 'spearman'}, optional) – The correlation method to use. Default is ‘pearson’.

  • numeric_only (bool, optional) – Whether to include only numeric columns. Default is False.

Returns:

A DataFrame containing the filtered columns.

Return type:

pandas.DataFrame
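The underlying idea can be sketched with plain pandas: compute the squared correlation matrix and greedily drop one column from each pair whose R² exceeds the threshold. The function below is an illustrative stand-in, not the mlchem implementation, and it omits the target_variable tie-breaking for brevity:

```python
import numpy as np
import pandas as pd

def collinearity_sketch(df, threshold, method="pearson"):
    """Illustrative stand-in for collinearity_filter: drop one column
    from every pair whose squared correlation exceeds `threshold`."""
    r2 = df.corr(method=method) ** 2          # squared correlation matrix
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        for b in df.columns[i + 1:]:
            if a in keep and b in keep and r2.loc[a, b] > threshold:
                keep.remove(b)                # arbitrarily keep the first of the pair
    return df[keep]

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x,
                   "x_copy": x + rng.normal(scale=0.01, size=200),  # near-duplicate
                   "noise": rng.normal(size=200)})
filtered = collinearity_sketch(df, threshold=0.9)  # "x_copy" is dropped
```

When a target variable is supplied, collinearity_filter instead keeps whichever member of the pair correlates more strongly with the target, rather than the arbitrary first-of-pair choice above.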

diversity_filter(df: DataFrame, threshold: float, target_variable: str = None) DataFrame

Filter features based on diversity ratio using Shannon entropy.

Calculates the diversity ratio of each feature by comparing its Shannon entropy to that of an ideal uniform distribution. Retains features with diversity ratios above the specified threshold.

Parameters:
  • df (pandas.DataFrame) – The input dataset.

  • threshold (float) – The minimum diversity ratio required to retain a feature.

  • target_variable (str, optional) – The name of the target variable to retain regardless of its diversity score.

Returns:

A DataFrame containing the filtered columns with diversity higher than the threshold.

Return type:

pandas.DataFrame
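The diversity ratio can be illustrated with NumPy: the Shannon entropy of a feature's value distribution is divided by the entropy of an ideal uniform distribution over the same number of bins. The binning below is an assumption for illustration; the exact discretisation used by mlchem may differ:

```python
import numpy as np
import pandas as pd

def diversity_ratio(series, bins=10):
    """Illustrative diversity ratio: Shannon entropy of the binned value
    distribution divided by the entropy of a uniform distribution."""
    counts, _ = np.histogram(series, bins=bins)
    p = counts[counts > 0] / counts.sum()     # observed bin probabilities
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(bins)             # uniform entropy = log(bins)

rng = np.random.default_rng(1)
uniform_feat = pd.Series(rng.uniform(size=1000))   # well spread -> ratio near 1
constant_feat = pd.Series(np.ones(1000))           # degenerate -> ratio 0
```

A feature with a ratio below the threshold carries little information and is dropped; the target variable, if named, is retained regardless.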

mlchem.ml.feature_selection.wrappers module

class CombinatorialSelection

Bases: object

Combinatorial feature selection using a given estimator and metric.

This class performs a two-stage combinatorial feature selection process to identify optimal feature subsets based on model performance.

estimator

The machine learning estimator used to fit the data.

Type:

object

metric

A metric function to evaluate estimator performance. Must accept (y_true, y_pred).

Type:

callable

logic

Determines whether a higher or lower score is considered better.

Type:

{‘greater’, ‘lower’}

task_type

Specifies the type of task.

Type:

{‘classification’, ‘regression’}

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> from mlchem.metrics import get_geometric_S
>>> cs = CombinatorialSelection(estimator=LogisticRegression(),
...                              metric=get_geometric_S,
...                              logic='greater')
>>> X, y = make_classification(500, 10, n_informative=4)
>>> X_train, y_train = X[:350], y[:350]
>>> X_test, y_test = X[350:], y[350:]
>>> train_set = pd.DataFrame(X_train, columns=np.arange(X_train.shape[1]))
>>> test_set = pd.DataFrame(X_test, columns=np.arange(X_test.shape[1]))
>>> results_stage_1 = cs.fit_stage_1(train_set, y_train, test_set, y_test,
...                                  train_set.columns, training_threshold=0.7)
>>> results_stage_2 = cs.fit_stage_2(top_n_subsets=10, cv_iter=5)
__init__(estimator, metric, logic: Literal['lower', 'greater'] = 'greater', task_type: Literal['classification', 'regression'] = 'classification') None

Initialise the CombinatorialSelection object.

Parameters:
  • estimator (object) – The machine learning estimator used to fit the data.

  • metric (callable) – A metric function to evaluate estimator performance.

  • logic ({'greater', 'lower'}, optional) – Determines whether a higher or lower score is considered better. Default is ‘greater’.

  • task_type ({'classification', 'regression'}, optional) – Specifies the type of task. Default is ‘classification’.

display_best(row: int = 1) None

Display the best feature subset based on the specified row.

Parameters:

row (int, optional) – Row index of the best feature subset to display. Default is 1.

Return type:

None

Notes

  • Fits the estimator on the selected subset.

  • Displays training, cross-validation, and test scores.

fit_stage_1(train_set: DataFrame, y_train: Iterable, test_set: DataFrame, y_test: Iterable, features: list[str] = [], k: int = 2, training_threshold: float = 0.25, cv_train_ratio: float = 0.7, cv_iter: int = 5) DataFrame

Perform the first stage of combinatorial feature selection.

Parameters:
  • train_set (pandas.DataFrame) – The training dataset.

  • y_train (iterable) – Target values for the training dataset.

  • test_set (pandas.DataFrame) – The testing dataset.

  • y_test (iterable) – Target values for the testing dataset.

  • features (list of str, optional) – List of features to consider. Default is an empty list.

  • k (int, optional) – Number of features to combine. Default is 2.

  • training_threshold (float, optional) – Minimum training score required to consider a subset. Default is 0.25.

  • cv_train_ratio (float, optional) – Minimum ratio of cross-validation to training score. Default is 0.7.

  • cv_iter (int, optional) – Number of cross-validation iterations. Default is 5.

Returns:

A DataFrame containing the results of the first stage of feature selection.

Return type:

pandas.DataFrame

Notes

  • Generates all possible feature subsets of size k.

  • Evaluates each subset using training, cross-validation, and test scores.

  • Filters and ranks subsets based on geometric mean of scores.
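The enumeration and ranking steps above can be sketched with the standard library alone. The scores here are made-up placeholders, not values produced by mlchem:

```python
from itertools import combinations
from math import prod

# Stage 1, illustrated: enumerate every k-sized feature subset, then rank
# surviving subsets by the geometric mean of their scores.
features = ["a", "b", "c", "d"]
k = 2
subsets = list(combinations(features, k))     # all C(4, 2) = 6 subsets

def geometric_mean(scores):
    return prod(scores) ** (1 / len(scores))

# e.g. for one subset: training, cross-validation and test scores
gm = geometric_mean([0.80, 0.70, 0.75])
```

Subsets whose training score falls below training_threshold, or whose cross-validation/training ratio falls below cv_train_ratio, are filtered out before ranking.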

fit_stage_2(top_n_subsets: int = 10, cv_iter: int = 5) DataFrame

Perform the second stage of combinatorial feature selection.

Parameters:
  • top_n_subsets (int, optional) – Number of top feature subsets from stage 1 to consider. Default is 10.

  • cv_iter (int, optional) – Number of cross-validation iterations. Default is 5.

Returns:

A DataFrame containing the results of the second stage of feature selection.

Return type:

pandas.DataFrame

Notes

  • Identifies most recurrent features from top subsets.

  • Generates new combinations and evaluates them.

  • Filters and ranks based on geometric mean of scores.
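The recurrence count in the first step can be illustrated with collections.Counter; the subsets below are invented for demonstration and do not come from a real stage-1 run:

```python
from collections import Counter
from itertools import chain

# Stage 2, illustrated: count how often each feature appears among the
# top subsets from stage 1, then build new combinations from the most
# recurrent features.
top_subsets = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
counts = Counter(chain.from_iterable(top_subsets))
most_recurrent = [f for f, _ in counts.most_common(2)]  # e.g. ["a", "b"]
```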

class SequentialForwardSelection

Bases: object

Sequential Forward Feature Selection wrapper.

This class performs Sequential Forward Feature Selection by iteratively adding features that yield the highest gain in cross-validation score.
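The greedy loop described above can be sketched in a few lines. The score function below is a toy stand-in for the cross-validated metric, and the feature names are hypothetical:

```python
# Forward-selection sketch: at each step, add the candidate feature whose
# inclusion yields the highest score for the current subset.
def score(selected):
    ideal = {"x1", "x2"}                      # pretend these are the informative features
    return len(set(selected) & ideal) - 0.1 * len(selected)

candidates = ["x1", "x2", "noise1", "noise2"]
selected = []
for _ in range(2):                            # max_features = 2 in this toy run
    best = max((f for f in candidates if f not in selected),
               key=lambda f: score(selected + [f]))
    selected.append(best)
```

The real class records the training, cross-validation, and test scores at each step so that the trajectory can later be inspected with find_best and plot.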

estimator

The scikit-learn estimator used for feature selection.

Type:

object

estimator_string

A string representation of the estimator. If None, it is inferred from the estimator.

Type:

str, optional

metric

A function to evaluate model performance.

Type:

callable

max_features

Maximum number of features to select. Default is 25.

Type:

int, optional

cv_iter

Number of cross-validation iterations. Default is 5.

Type:

int, optional

logic

Whether to minimise or maximise the cross-validation score. Default is ‘greater’.

Type:

{‘lower’, ‘greater’}, optional

task_type

Type of task. Default is ‘classification’.

Type:

{‘classification’, ‘regression’}, optional

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> from mlchem.metrics import get_geometric_S
>>> sfs = SequentialForwardSelection(estimator=LogisticRegression(),
...                                  metric=get_geometric_S,
...                                  max_features=5,
...                                  cv_iter=3,
...                                  logic='greater')
>>> X, y = make_classification(300, 10, n_informative=5)
>>> train_size = 0.8
>>> train_samples = int(train_size * len(X))
>>> X_train, y_train = X[:train_samples], y[:train_samples]
>>> X_test, y_test = X[train_samples:], y[train_samples:]
>>> train_set = pd.DataFrame(X_train, columns=np.arange(X_train.shape[1]))
>>> test_set = pd.DataFrame(X_test, columns=np.arange(X_test.shape[1]))
>>> sfs.fit(train_set, y_train, test_set, y_test)
>>> sfs.plot()
__init__(estimator, estimator_string: str | None, metric: Callable, max_features: int = 25, cv_iter: int = 5, logic: Literal['lower', 'greater'] = 'greater', task_type: Literal['classification', 'regression'] = 'classification') None

Initialise the SequentialForwardSelection object.

Parameters:
  • estimator (object) – The scikit-learn estimator used for feature selection.

  • estimator_string (str, optional) – A string representation of the estimator. If None, it is inferred from the estimator.

  • metric (callable) – A function to evaluate model performance.

  • max_features (int, optional) – Maximum number of features to select. Default is 25.

  • cv_iter (int, optional) – Number of cross-validation iterations. Default is 5.

  • logic ({'lower', 'greater'}, optional) – Whether to minimise or maximise the cross-validation score. Default is ‘greater’.

  • task_type ({'classification', 'regression'}, optional) – Type of task. Default is ‘classification’.

find_best(which: int | None = None) dict

Find the best feature subset based on evaluation criteria.

Parameters:

which (int, optional) – If specified, returns the feature subset at the given index. If None, the best subset is determined automatically using a scoring algorithm.

Returns:

A dictionary containing:

  • ‘best_score’: float

  • ‘variability_contribution’: float

  • ‘geometric_contribution’: float

  • ‘train_test_difference’: float

  • ‘best_index’: int

  • ‘features’: list

Return type:

dict

Notes

The automatic algorithm works as follows:

1. Calculate the standard deviation of the training, cross-validation, and unseen test scores for each feature subset.

2. Initialise the best score and best index to zero.

3. Define coefficients for variability contribution, percentile, and contributions from training, cross-validation, and unseen test scores.

4. Iterate through each feature subset up to the maximum number of features:

  • Add a variability contribution if the standard deviation is below a certain percentile.

  • Add a geometric contribution based on the product of the training, cross-validation, and unseen test scores.

  • Add the absolute difference between the training and unseen test scores.

  • Update the best score and best index if the current total score is higher than the best score.
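The steps above can be sketched as follows. The coefficients, the percentile, and the sign of the train/test-gap term are illustrative assumptions (here the gap is penalised); the actual values are internal to the class:

```python
import numpy as np

# Toy per-subset scores for three candidate subsets (made-up values).
train = np.array([0.90, 0.85, 0.95])
cv    = np.array([0.80, 0.84, 0.70])
test  = np.array([0.82, 0.83, 0.65])

stds = np.std([train, cv, test], axis=0)      # step 1: per-subset spread
cutoff = np.percentile(stds, 50)              # step 3: percentile threshold (assumed)
c_var, c_geo, c_diff = 1.0, 1.0, -1.0         # step 3: illustrative coefficients

best_score, best_index = 0.0, 0               # step 2
for i in range(len(train)):                   # step 4
    total = 0.0
    if stds[i] <= cutoff:                     # low-variability bonus
        total += c_var
    total += c_geo * (train[i] * cv[i] * test[i])
    total += c_diff * abs(train[i] - test[i]) # train/test gap term (sign assumed)
    if total > best_score:
        best_score, best_index = total, i
```

With these toy numbers the middle subset wins: its scores are both high and tightly clustered, so it collects the variability bonus while paying almost no gap penalty.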

fit(train_set: DataFrame, y_train: Iterable, test_set: DataFrame, y_test: Iterable) None

Fit the Sequential Forward Selection model.

Parameters:
  • train_set (pandas.DataFrame) – Training dataset.

  • y_train (iterable) – Target values for the training set.

  • test_set (pandas.DataFrame) – Test dataset.

  • y_test (iterable) – Target values for the test set.

Return type:

None

plot(best_feature: int | None = None, figsize: tuple[int, int] = (10, 6), colours: list[str] = ['steelblue', 'orange', 'green'], title: str | None = None, title_size: int = 20, xlabel: str = '# of features', ylabel: str = 'Score', fontsize: int = 14, legendsize: int = 13, save: bool = False) None

Plot the performance of the Sequential Forward Selection process.

Parameters:
  • best_feature (int, optional) – Index of the best feature subset to highlight. If None, it is determined automatically.

  • figsize (tuple of int, optional) – Size of the plot. Default is (10, 6).

  • colours (list of str, optional) – Colours for training, validation, and test scores. Default is [‘steelblue’, ‘orange’, ‘green’].

  • title (str, optional) – Title of the plot.

  • title_size (int, optional) – Font size of the title. Default is 20.

  • xlabel (str, optional) – Label for the x-axis. Default is ‘# of features’.

  • ylabel (str, optional) – Label for the y-axis. Default is ‘Score’.

  • fontsize (int, optional) – Font size for axis labels. Default is 14.

  • legendsize (int, optional) – Font size for the legend. Default is 13.

  • save (bool, optional) – Whether to save the plot. Default is False.

Return type:

None

Notes

The automatic algorithm for determining the best feature subset is the same as described in find_best.