mlchem.ml.modelling package

Submodules

mlchem.ml.modelling.model_evaluation module

class ApplicabilityDomain

Bases: object

A class to calculate the leverage of data points in a dataset for applicability domain analysis.

The leverage is a measure of the influence of a data point in a regression model. It helps identify data points that have a significant impact on the model’s predictions. This class provides a method to calculate the leverage values for a given dataset and determine whether each data point is within the applicability domain based on a threshold.

Methods:

leverage(X: np.ndarray):

Calculates the leverage values for the given dataset and determines whether each data point is within the applicability domain based on a threshold.

static leverage(X: ndarray) dict[str, list[float] | list[bool] | float]

Calculate leverage values for a dataset and determine applicability domain.

Parameters:

X (numpy.ndarray) – Feature matrix of shape (n_samples, n_features).

Returns:

Dictionary containing:
  • ‘leverages’ (list of float) – Leverage values for each data point.

  • ‘results’ (list of bool) – Boolean flags indicating whether each point is within the domain.

  • ‘threshold’ (float) – Threshold used to determine domain inclusion.

Return type:

dict of str to list or float
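The underlying computation can be sketched as follows with plain NumPy: the leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ. The warning threshold h* = 3(p + 1)/n used below is the conventional choice from the QSAR literature and is an assumption here; the threshold actually used by ApplicabilityDomain.leverage may differ.

```python
import numpy as np

def leverage(X: np.ndarray) -> dict:
    """Hat-matrix leverages with a conventional warning threshold."""
    n, p = X.shape
    # Diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T
    leverages = np.diag(hat)
    # Conventional warning leverage h* = 3 (p + 1) / n (assumed here).
    threshold = 3 * (p + 1) / n
    return {
        "leverages": leverages.tolist(),
        "results": (leverages < threshold).tolist(),  # True = inside the domain
        "threshold": threshold,
    }

rng = np.random.default_rng(0)
# 19 clustered points plus one gross outlier in the last row.
X = np.vstack([rng.normal(0.0, 0.1, size=(19, 2)), [[5.0, 5.0]]])
result = leverage(X)
```

Because the trace of the hat matrix equals the rank of X, the leverages always sum to the number of independent features; the outlier row above receives a leverage near 1 and falls outside the domain.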

class MajorityVote

Bases: object

MajorityVote(train_set, test_set, y_train, y_test, task_type, estimator_list, column_list, estimator_names=[])

Ensemble model using majority voting (for classification) or averaging (for regression).

This class combines predictions from multiple estimators to improve model performance and robustness by leveraging the strengths of different models.

Parameters:
  • train_set (pandas.DataFrame) – The training dataset.

  • test_set (pandas.DataFrame) – The testing dataset.

  • y_train (iterable) – Target values for the training dataset.

  • y_test (iterable) – Target values for the testing dataset.

  • task_type ({'classification', 'regression'}) – The type of task to perform.

  • estimator_list (list) – A list of fitted scikit-learn estimators.

  • column_list (list of str) – A list of feature columns for each estimator.

  • estimator_names (list of str, optional) – A list of names for the estimators. Defaults to an empty list.

__init__(train_set: DataFrame, test_set: DataFrame, y_train: Iterable, y_test: Iterable, task_type: Literal['classification', 'regression'], estimator_list: list, column_list: list[str], estimator_names: list[str] = []) None
fit() None

Fit the estimators on the training data and store predictions.

For classification tasks, both hard (class labels) and soft (probabilities) predictions are stored. For regression tasks, predicted values are stored.

Return type:

None

predict(metric, metric_name: str, n_estimators_max: int = 5) None

Generate ensemble predictions and evaluate performance using a specified metric.

For classification, both hard and soft voting are evaluated. For regression, predictions are averaged. Results are stored for each combination of estimators up to a specified maximum.

Parameters:
  • metric (callable) – A scoring function that takes (y_true, y_pred) as input and returns a float.

  • metric_name (str) – Name of the metric used for evaluation.

  • n_estimators_max (int, optional) – Maximum number of estimators to consider in combinations. Default is 5.

Return type:

None
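The two voting schemes evaluated by predict can be sketched in isolation: hard voting takes the most frequent class label across estimators, while soft voting averages the predicted class probabilities before taking the argmax. The standalone functions and the toy predictions below are illustrative, not mlchem's API.

```python
import numpy as np

def hard_vote(label_predictions: list) -> np.ndarray:
    """Majority vote over per-estimator class labels (ties go to the lowest label)."""
    stacked = np.stack(label_predictions)  # shape (n_estimators, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, stacked)

def soft_vote(probability_predictions: list) -> np.ndarray:
    """Average class probabilities across estimators, then take the argmax."""
    mean_proba = np.mean(np.stack(probability_predictions), axis=0)
    return mean_proba.argmax(axis=1)

# Three hypothetical estimators, four samples, binary task.
labels = [np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]), np.array([1, 1, 1, 0])]
probas = [np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]),
          np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]),
          np.array([[0.4, 0.6], [0.1, 0.9], [0.3, 0.7], [0.8, 0.2]])]
hard = hard_vote(labels)
soft = soft_vote(probas)
```

For regression, the analogous operation is simply np.mean over the stacked per-estimator predictions.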

crossval(estimator, X: ndarray | DataFrame, y: ndarray | DataFrame, metric_function: Callable, n_fold: int = 5, task_type: Literal['classification', 'regression'] = 'classification', random_state: int | None = None) ndarray

Evaluate an estimator using cross-validation.

This function performs K-fold cross-validation on the given dataset using the specified estimator and metric function. It supports both classification and regression tasks.

Parameters:
  • estimator (object) – A scikit-learn compatible estimator.

  • X (numpy.ndarray or pandas.DataFrame) – Feature matrix of shape (n_samples, n_features).

  • y (numpy.ndarray or pandas.DataFrame) – Target vector of shape (n_samples,) or (n_samples, 1).

  • metric_function (callable) – A scoring function that accepts (y_true, y_pred) as arguments.

  • n_fold (int, optional (default=5)) – Number of folds for cross-validation. If equal to n_samples, this is equivalent to leave-one-out cross-validation.

  • task_type ({'classification', 'regression'}, optional (default='classification')) – Type of task to determine the cross-validation strategy.

  • random_state (int or None, optional (default=None)) – Random seed for reproducibility.

Returns:

An array of cross-validation scores.

Return type:

numpy.ndarray
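For the regression case, the behaviour of crossval is roughly equivalent to the following scikit-learn sketch. The fold-splitting shown here is plain K-fold; the actual function also adapts its strategy for classification via task_type, and setting n_fold to n_samples recovers leave-one-out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def crossval_sketch(estimator, X, y, metric_function, n_fold=5, random_state=None):
    """Plain K-fold cross-validation returning one score per fold."""
    kf = KFold(n_splits=n_fold, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, test_idx in kf.split(X):
        estimator.fit(X[train_idx], y[train_idx])
        scores.append(metric_function(y[test_idx], estimator.predict(X[test_idx])))
    return np.asarray(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.01, size=50)
scores = crossval_sketch(LinearRegression(), X, y, mean_absolute_error,
                         n_fold=5, random_state=42)
```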

y_scrambling(estimator, train_set: ndarray | DataFrame, y_train: Iterable, test_set: ndarray | DataFrame, y_test: Iterable, metric_function: Callable, n_iter: int, plot: bool = True) None

Perform y-scrambling to assess model performance due to chance.

This function evaluates the robustness of a model by randomly shuffling the target variable multiple times and measuring performance on the test set. It compares the distribution of scores from scrambled targets to the actual model performance. See https://doi.org/10.1021/ci700157b for further discussion of the technique.

Parameters:
  • estimator (object) – A scikit-learn compatible estimator.

  • train_set (numpy.ndarray or pandas.DataFrame) – Training feature matrix.

  • y_train (iterable) – Target values for training.

  • test_set (numpy.ndarray or pandas.DataFrame) – Testing feature matrix.

  • y_test (iterable) – Target values for testing.

  • metric_function (callable) – A scoring function that accepts (y_true, y_pred) as arguments.

  • n_iter (int) – Number of shuffling iterations.

  • plot (bool, optional (default=True)) – Whether to display a histogram of the scrambled scores.

Return type:

None
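The core of the procedure can be sketched as follows: refit the model on permuted training targets and score it on the untouched test set, then compare the scrambled scores with the true score. A plain least-squares fit stands in for the estimator here; the function name and R² helper are illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def y_scrambling_sketch(X_train, y_train, X_test, y_test, n_iter=20, seed=0):
    """Refit on shuffled targets and score each refit on the real test set."""
    rng = np.random.default_rng(seed)
    def fit_predict(y):
        coef, *_ = np.linalg.lstsq(X_train, y, rcond=None)
        return X_test @ coef
    true_score = r2(y_test, fit_predict(y_train))
    scrambled = np.array([r2(y_test, fit_predict(rng.permutation(y_train)))
                          for _ in range(n_iter)])
    return true_score, scrambled

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=80)
true_score, scrambled = y_scrambling_sketch(X[:60], y[:60], X[60:], y[60:], n_iter=30)
```

A real relationship survives the comparison: the true score sits far above the distribution of scrambled scores, which cluster near (or below) zero.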

mlchem.ml.modelling.model_interpretation module

class DescriptorExplainer

Bases: object

A class to perform feature selection and model evaluation using combinatorial selection methods.

This class provides methods to fit and evaluate machine learning models using combinatorial feature selection. It supports both classification and regression tasks, offering tools to identify the best feature subsets and visualise model performance.

df_train

The training dataset.

Type:

pandas.DataFrame

df_test

The testing dataset.

Type:

pandas.DataFrame

target_train

The target values for the training dataset.

Type:

pandas.DataFrame

target_test

The target values for the testing dataset.

Type:

pandas.DataFrame

target_name

The name of the target variable.

Type:

str

estimator

The machine learning model to be used for feature selection and evaluation.

Type:

object

metric

The metric function to evaluate the model performance.

Type:

callable

logic

The logic to determine whether to minimise or maximise the metric.

Type:

{‘lower’, ‘greater’}

task_type

The type of task to perform.

Type:

{‘classification’, ‘regression’}

__init__(df_train: DataFrame, df_test: DataFrame, target_train: DataFrame, target_test: DataFrame, estimator, metric, logic: Literal['lower', 'greater'] = 'greater', task_type: Literal['classification', 'regression'] = 'regression') None

Initialise the DescriptorExplainer with training/testing data, model, and evaluation settings.

Parameters:
  • df_train (pandas.DataFrame) – The training dataset.

  • df_test (pandas.DataFrame) – The testing dataset.

  • target_train (pandas.DataFrame) – The target values for the training dataset.

  • target_test (pandas.DataFrame) – The target values for the testing dataset.

  • estimator (object) – The machine learning model to be used.

  • metric (callable) – The evaluation metric function.

  • logic ({'lower', 'greater'}, optional (default='greater')) – Whether to minimise or maximise the metric.

  • task_type ({'classification', 'regression'}, optional (default='regression')) – The type of task to perform.

display(subset_index: int) None

Display the performance of the model using a selected feature subset.

This method visualises model performance using the selected feature subset. For regression tasks, it plots true vs. predicted values and prints model coefficients and adjusted R². For classification, it shows a confusion matrix and prints performance metrics.

Parameters:

subset_index (int) – The index of the feature subset to use for evaluation.

Return type:

None

fit_stage_1(k: int = 2, training_threshold: float = 1.5, cv_train_ratio: float = 0.8, cv_iter: int = 5) None

Perform the first stage of combinatorial feature selection.

This method evaluates combinations of features using cross-validation and filters subsets based on a training score threshold.

Parameters:
  • k (int, optional (default=2)) – The number of features to combine.

  • training_threshold (float, optional (default=1.5)) – The threshold for the training score to consider a subset.

  • cv_train_ratio (float, optional (default=0.8)) – The minimum accepted ratio of cross-validation score to training score.

  • cv_iter (int, optional (default=5)) – The number of cross-validation iterations.

Return type:

None

fit_stage_2(top_n_subsets: int = 10, cv_iter: int = 5) None

Perform the second stage of combinatorial feature selection.

This method evaluates the top feature subsets from stage 1 using cross-validation to identify the best-performing combinations.

Parameters:
  • top_n_subsets (int, optional (default=10)) – The number of top feature subsets to consider.

  • cv_iter (int, optional (default=5)) – The number of cross-validation iterations.

Return type:

None
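The combinatorial search behind fit_stage_1 can be sketched as: enumerate every k-feature subset, fit the estimator on each, and rank subsets by their score. The sketch below uses a least-squares fit and training-set R² for simplicity; the actual method additionally applies the training_threshold and cv_train_ratio filters described above.

```python
import itertools
import numpy as np

def combinatorial_selection_sketch(X, y, k=2):
    """Score every k-feature subset by least-squares training R^2, best first."""
    n, p = X.shape
    results = []
    for subset in itertools.combinations(range(p), k):
        Xs = X[:, subset]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        pred = Xs @ coef
        score = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
        results.append((subset, score))
    return sorted(results, key=lambda t: t[1], reverse=True)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only features 1 and 3 carry signal; the rest are noise.
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
ranking = combinatorial_selection_sketch(X, y, k=2)
best_subset, best_score = ranking[0]
```

With k = 2 the search recovers the informative pair; note the cost grows as C(p, k), which is why stage 2 only re-examines the top_n_subsets survivors.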

class ShapExplainer

Bases: object

A class to generate SHAP (SHapley Additive exPlanations) values for model interpretability.

This class provides methods to explain the output of machine learning models using SHAP values. It supports both tree-based models and general estimators, and offers a variety of visualisation tools to understand the impact of each feature on model predictions.

model_type

The type of model (‘tree’ or other).

Type:

str

data

The dataset used for generating SHAP values.

Type:

pandas.DataFrame

y

The target values corresponding to the dataset.

Type:

array-like

estimator

The machine learning model to be explained.

Type:

object

explain()

Generates SHAP values for the model.

load(base_values, shap_values)

Loads precomputed SHAP values.

save(path, filename)

Saves the SHAP values to the specified path.

force_plot()

Generates a force plot to visualise SHAP values.

force_plot_single(i)

Generates a force plot for a single instance.

dependence_plot(column)

Generates a dependence plot for a specified feature.

dependence_plot_all()

Generates dependence plots for all features.

summary_plot(plot_type='dot')

Generates a summary plot of SHAP values.

waterfall_plot(i)

Generates a waterfall plot for a single instance.

bar_plot()

Generates a bar plot of SHAP values.

feature_importance()

Calculates and returns feature importance based on SHAP values.

decision_plot(interval_lower, interval_upper)

Generates a decision plot for a specified probability interval.

__init__(estimator, data: DataFrame, y: Iterable, is_tree: bool = False) None

Initialise the ShapExplainer with a model, dataset, and target values.

Parameters:
  • estimator (object) – The machine learning model to be explained.

  • data (pandas.DataFrame) – The dataset used for generating SHAP values.

  • y (Iterable) – The target values corresponding to the dataset.

  • is_tree (bool, optional (default=False)) – Whether the model is a tree-based model (e.g., XGBoost, LightGBM).

bar_plot() None

Generate a bar plot of SHAP values.

Return type:

None

decision_plot(interval_lower: float, interval_upper: float) None

Generate a decision plot for a specified probability interval.

This method creates a SHAP decision plot to visualise how features contribute to model predictions for samples within a given probability interval. It also prints classification statistics for the selected interval.

Note

Currently only supported for tree-based models.

Parameters:
  • interval_lower (float) – The lower bound of the probability interval.

  • interval_upper (float) – The upper bound of the probability interval.

Return type:

None

dependence_plot(column: int) None

Generate a dependence plot for a specified feature.

Parameters:

column (int) – The index of the feature to visualise.

Return type:

None

dependence_plot_all() None

Generate dependence plots for all features.

Return type:

None

explain() None

Generate SHAP values for the model.

This method creates a SHAP explainer based on the model type and computes the SHAP values for the provided dataset.

Return type:

None

feature_importance() DataFrame

Calculate and visualise feature importance based on SHAP values.

This method computes feature importance by summing the absolute SHAP values for each feature. It also calculates the Spearman correlation between each feature and its SHAP values, and derives an overall impact score. The results are visualised using a color-coded bar plot and returned as a DataFrame.

Returns:

A DataFrame containing the feature importance weights, correlations, and overall impact scores for each feature.

Return type:

pandas.DataFrame
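Given a precomputed shap_values array, the aggregation can be sketched as below. The rank-based Spearman implementation and the sign-weighted "impact" combination are illustrative assumptions; the exact aggregation and plotting inside mlchem may differ.

```python
import numpy as np
import pandas as pd

def _rank(a):
    """Ordinal ranks (no tie averaging) for Spearman correlation."""
    order = a.argsort()
    r = np.empty_like(order)
    r[order] = np.arange(len(a))
    return r.astype(float)

def _spearman(a, b):
    ra, rb = _rank(a), _rank(b)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def feature_importance_sketch(data: pd.DataFrame, shap_values: np.ndarray) -> pd.DataFrame:
    """Weight = normalised sum of |SHAP|; correlation = Spearman(feature, SHAP)."""
    weights = np.abs(shap_values).sum(axis=0)
    weights = weights / weights.sum()
    corr = [_spearman(data.iloc[:, j].to_numpy(), shap_values[:, j])
            for j in range(data.shape[1])]
    out = pd.DataFrame({"weight": weights, "correlation": corr}, index=data.columns)
    # Illustrative impact score: importance signed by direction of effect.
    out["impact"] = out["weight"] * np.sign(out["correlation"])
    return out.sort_values("weight", ascending=False)

rng = np.random.default_rng(3)
data = pd.DataFrame({"A": rng.normal(size=50), "B": rng.normal(size=50)})
# Hypothetical SHAP values: A drives the model, B is near-irrelevant noise.
sv = np.column_stack([2.0 * data["A"].to_numpy(), 0.1 * rng.normal(size=50)])
imp = feature_importance_sketch(data, sv)
```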

force_plot() HTML

Generate a force plot to visualise SHAP values.

Returns:

An HTML object containing the SHAP force plot.

Return type:

IPython.core.display.HTML

force_plot_single(i: int) HTML

Generate a force plot for a single instance.

Parameters:

i (int) – The index of the instance to visualise.

Returns:

An HTML object containing the SHAP force plot for the specified instance.

Return type:

IPython.core.display.HTML

heatmap() None

Generate a heatmap of SHAP values.

This method creates a heatmap to visualise the SHAP values for the model’s predictions. It initialises the SHAP JavaScript visualisation and displays the plot.

Note

Heatmaps are not available for tree-based models.

Return type:

None

load(base_values: ndarray, shap_values: ndarray) None

Load precomputed SHAP values.

Parameters:
  • base_values (numpy.ndarray) – The precomputed base values.

  • shap_values (numpy.ndarray) – The precomputed SHAP values.

Return type:

None

save(path: str, filename: str) None

Save SHAP values to the specified path.

Parameters:
  • path (str) – The directory path where the SHAP values will be saved.

  • filename (str) – The base filename for the saved SHAP values.

Return type:

None

summary_plot(plot_type: Literal['dot', 'bar', 'violin', 'layered_violin', 'compact_dot'] = 'dot') None

Generate a summary plot of SHAP values.

Parameters:
  • plot_type ({'dot', 'bar', 'violin', 'layered_violin', 'compact_dot'}, optional (default='dot')) – The type of plot to generate.

Return type:

None

waterfall_plot(i: int) None

Generate a waterfall plot for a single instance.

Note

Not available for tree-based models.

Parameters:

i (int) – The index of the instance to visualise.

Return type:

None