mlchem.ml.preprocessing package

Submodules

mlchem.ml.preprocessing.dimensional_reduction module

class Compressor

Bases: object

A class for compressing dataframes using various dimensionality reduction techniques.

This class provides a unified interface to apply multiple dimensionality reduction algorithms such as PCA, t-SNE, UMAP, Spectral Embedding, MDS, LLE, and Isomap. It supports optional exclusion of initial columns from compression and allows passing custom parameters to each algorithm.

dataframe

The input dataframe to be compressed.

Type:

pandas.DataFrame

initial_columns_to_ignore

Number of initial columns to exclude from compression.

Type:

int, optional (default=0)

algorithm

The dimensionality reduction algorithm instance used for compression.

Type:

object

params_

Parameters of the compression algorithm after initialization.

Type:

dict or None

X_compressed

The compressed feature matrix.

Type:

numpy.ndarray

dataframe_compressed

The dataframe after applying dimensionality reduction.

Type:

pandas.DataFrame

compress_PCA(...)

Compress the dataframe using Principal Component Analysis.

compress_TSNE(...)

Compress the dataframe using t-Distributed Stochastic Neighbor Embedding.

compress_SE(...)

Compress the dataframe using Spectral Embedding.

compress_UMAP(...)

Compress the dataframe using Uniform Manifold Approximation and Projection.

compress_MDS(...)

Compress the dataframe using Multidimensional Scaling.

compress_LLE(...)

Compress the dataframe using Locally Linear Embedding.

compress_ISOMAP(...)

Compress the dataframe using Isomap.

Example

>>> from mlchem.chem.calculator.descriptors import get_rdkitDesc
>>> from mlchem.ml.preprocessing import scaling
>>> df = get_rdkitDesc(['CCCC', 'CCN', 'c1ccccc1', 'CF', 'CCO', 'CCCNC(OCCC)CCO'])
>>> df = scaling.scale_df_standard(df, last_columns_to_preserve=0)[0]
>>> c = Compressor(df)
>>> c.compress_PCA(n_components_or_variance=0.6)
>>> df_pca = c.dataframe_compressed
>>> c.compress_TSNE(dataframe=df_pca, random_state=1)
>>> compressed_df = c.dataframe_compressed
__init__(dataframe: DataFrame, initial_columns_to_ignore: int = 0)

Initialize the Compressor with a dataframe and optional column exclusion.

Parameters:
  • dataframe (pandas.DataFrame) – The input dataframe to be compressed.

  • initial_columns_to_ignore (int, optional (default=0)) – Number of initial columns to exclude from compression.

compress_ISOMAP(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, dict_params: dict | None = None) None

Compress the dataframe using Isomap.

This method applies Isomap to reduce the dimensionality of the dataframe by preserving geodesic distances between all points.

Parameters:
  • n_components (int) – Number of dimensions to reduce to.

  • neighbours_number_or_fraction (float or int) – Number of neighbors to use for neighborhood graph construction. An int is taken as an absolute neighbor count; a float as a fraction of the number of samples.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the Isomap constructor.

Return type:

None

compress_LLE(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using Locally Linear Embedding (LLE).

This method applies LLE to reduce the dimensionality of the dataframe by preserving local neighborhood relationships.

Parameters:
  • n_components (int) – Number of dimensions to reduce to.

  • neighbours_number_or_fraction (float or int) – Number of neighbors to use for local reconstruction. An int is taken as an absolute neighbor count; a float as a fraction of the number of samples.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the LLE constructor.

Return type:

None

compress_MDS(n_components: int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using Multidimensional Scaling (MDS).

This method applies MDS to reduce the dimensionality of the dataframe based on pairwise distances.

Parameters:
  • n_components (int) – Number of dimensions to reduce to.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the MDS constructor.

Return type:

None

compress_PCA(n_components_or_variance: int | float = 0.8, svd_solver: Literal['auto', 'full', 'arpack', 'randomized'] = 'full', dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using Principal Component Analysis (PCA).

This method reduces the dimensionality of the dataframe using PCA, either by specifying the number of components or the amount of variance to retain.

Parameters:
  • n_components_or_variance (int or float, optional (default=0.8)) – If an int, the number of components to keep; if a float between 0 and 1, the fraction of total variance to retain.

  • svd_solver ({'auto', 'full', 'arpack', 'randomized'}, optional (default='full')) – SVD solver to use for the decomposition.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the PCA constructor.

Return type:

None
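The int-vs-float behaviour of n_components_or_variance mirrors scikit-learn's PCA, which this method presumably wraps. A minimal standalone sketch on synthetic data (using sklearn directly, not the mlchem API):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))

# An int asks for an exact number of components ...
pca_int = PCA(n_components=3, svd_solver="full", random_state=1)
X3 = pca_int.fit_transform(X)

# ... while a float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca_var = PCA(n_components=0.8, svd_solver="full", random_state=1)
X80 = pca_var.fit_transform(X)

print(X3.shape[1])                                     # 3
print(pca_var.explained_variance_ratio_.sum() >= 0.8)  # True
```

Note that the float form requires a solver that computes all components, which is consistent with the svd_solver='full' default above.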

compress_SE(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using Spectral Embedding.

This method applies Spectral Embedding to reduce the dimensionality based on graph Laplacian eigenmaps.

Parameters:
  • n_components (int) – Number of dimensions to reduce to.

  • neighbours_number_or_fraction (float or int) – Number of neighbors to use for graph construction. An int is taken as an absolute neighbor count; a float as a fraction of the number of samples.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the SpectralEmbedding constructor.

Return type:

None

compress_TSNE(n_components: int = 2, neighbours_number_or_fraction: float | int = 0.9, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using t-Distributed Stochastic Neighbor Embedding (t-SNE).

This method applies t-SNE to reduce the dimensionality of the dataframe based on neighborhood probabilities.

Parameters:
  • n_components (int, optional (default=2)) – Number of dimensions to reduce to.

  • neighbours_number_or_fraction (float or int, optional (default=0.9)) – Number of neighbors used to set the perplexity. An int is taken as an absolute neighbor count; a float as a fraction of the number of samples.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the t-SNE constructor.

Return type:

None
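How a fractional neighbours_number_or_fraction is converted into t-SNE's perplexity is an assumption here (fraction of the sample count); the sketch below applies scikit-learn's TSNE directly to synthetic data to illustrate the presumed mapping:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))

# Presumed mapping for a float argument: 0.9 of the 60 samples,
# i.e. perplexity = 54 (sklearn requires perplexity < n_samples).
perplexity = 0.9 * len(X)
emb = TSNE(n_components=2, perplexity=perplexity,
           random_state=1).fit_transform(X)
print(emb.shape)  # (60, 2)
```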

compress_UMAP(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None

Compress the dataframe using Uniform Manifold Approximation and Projection (UMAP).

This method applies UMAP to reduce the dimensionality of the dataframe while preserving local and global structure.

Parameters:
  • n_components (int) – Number of dimensions to reduce to.

  • neighbours_number_or_fraction (float or int) – Number of neighbors to use for local connectivity. An int is taken as an absolute neighbor count; a float as a fraction of the number of samples.

  • dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.

  • random_state (int, optional (default=1)) – Random seed for reproducibility.

  • dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the UMAP constructor.

Return type:

None

mlchem.ml.preprocessing.feature_transformation module

polynomial_expansion(dataframe: DataFrame, degree: int) DataFrame

Expand features of a DataFrame to polynomial features of a given degree.

Parameters:
  • dataframe (pandas.DataFrame) – Input DataFrame containing the original features.

  • degree (int) – Degree of the polynomial expansion.

Returns:

DataFrame containing the expanded polynomial features.

Return type:

pandas.DataFrame

mlchem.ml.preprocessing.scaling module

scale_df_minmax(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, MinMaxScaler]

Scale a DataFrame using min-max scaling, preserving specified columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.

Returns:

The scaled DataFrame and the fitted MinMaxScaler.

Return type:

tuple of pandas.DataFrame and MinMaxScaler

scale_df_robust(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, RobustScaler]

Scale a DataFrame using robust scaling, preserving specified columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.

Returns:

The scaled DataFrame and the fitted RobustScaler.

Return type:

tuple of pandas.DataFrame and RobustScaler

scale_df_standard(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, StandardScaler]

Scale a DataFrame using standard scaling, preserving specified columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.

Returns:

The scaled DataFrame and the fitted StandardScaler.

Return type:

tuple of pandas.DataFrame and StandardScaler

transform_df(df: DataFrame, scaler: StandardScaler | MinMaxScaler | RobustScaler, last_columns_to_preserve: int) tuple[DataFrame, StandardScaler | MinMaxScaler | RobustScaler]

Transform a DataFrame using a provided scaler, preserving specified columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • scaler (StandardScaler or MinMaxScaler or RobustScaler) – The fitted scaler to use for transformation.

  • last_columns_to_preserve (int) – Number of columns at the end of the DataFrame to exclude from transformation.

Returns:

The transformed DataFrame and the scaler used.

Return type:

tuple of pandas.DataFrame and the fitted scaler
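The fit-then-reuse pattern behind scale_df_standard and transform_df can be sketched with scikit-learn directly; the column handling below is an assumption modeled on the last_columns_to_preserve parameter, not the mlchem implementation itself:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"x1": [1.0, 2.0, 3.0],
                      "x2": [10.0, 20.0, 30.0],
                      "label": [0, 1, 0]})
test = pd.DataFrame({"x1": [1.5, 2.5],
                     "x2": [15.0, 25.0],
                     "label": [1, 0]})

preserve = 1                          # keep the trailing 'label' column unscaled
features = train.columns[:-preserve]

# scale_df_standard analogue: fit on the training features only.
scaler = StandardScaler().fit(train[features])
train_scaled = train.copy()
train_scaled[features] = scaler.transform(train[features])

# transform_df analogue: reuse the *fitted* scaler on new data (no refitting).
test_scaled = test.copy()
test_scaled[features] = scaler.transform(test[features])

print(train_scaled["label"].tolist())  # [0, 1, 0] - preserved as-is
```

Reusing the training-set scaler on the test set, rather than refitting, is what prevents information from the test distribution leaking into preprocessing.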

mlchem.ml.preprocessing.undersampling module

check_class_balance(y_train: Iterable) None

Check and print the class distribution in training labels.

Parameters:

y_train (Iterable) – Training target values.

Return type:

None

undersample(train_set: DataFrame, test_set: DataFrame, class_column: str, desired_proportion_majority: float, add_dropped_to_test: bool = False, random_seed: int | None = 1) tuple[DataFrame, DataFrame]

Undersample the majority class in a training set to achieve a desired class balance.

Parameters:
  • train_set (pandas.DataFrame) – The training dataset.

  • test_set (pandas.DataFrame) – The test dataset.

  • class_column (str) – Name of the column containing class labels.

  • desired_proportion_majority (float) – Desired proportion of the majority class in the training set.

  • add_dropped_to_test (bool, default=False) – Whether to add the dropped samples to the test set.

  • random_seed (int or None, optional (default=1)) – Random seed for reproducibility; pass None for nondeterministic sampling.

Returns:

The undersampled training set and the updated test set.

Return type:

tuple of (pandas.DataFrame, pandas.DataFrame)
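A minimal pandas sketch of the undersampling arithmetic (the wrapper's add_dropped_to_test handling and exact rounding behaviour are not reproduced):

```python
import pandas as pd

train = pd.DataFrame({
    "feat": range(10),
    "cls": ["A"] * 8 + ["B"] * 2,   # 'A' is the majority class
})

desired = 0.6  # target share of the majority class after undersampling
counts = train["cls"].value_counts()
majority = counts.idxmax()
n_minority = counts.drop(majority).sum()

# Solve n_major / (n_major + n_minority) = desired for n_major.
n_keep = int(round(desired * n_minority / (1 - desired)))

majority_rows = train[train["cls"] == majority].sample(n=n_keep,
                                                       random_state=1)
balanced = pd.concat([majority_rows, train[train["cls"] != majority]])
print(balanced["cls"].value_counts().to_dict())  # {'A': 3, 'B': 2}
```

With 2 minority samples and a 0.6 target, 3 of the 8 majority rows are kept, giving 3 / (3 + 2) = 0.6 exactly.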