mlchem.ml.preprocessing package¶
Submodules¶
mlchem.ml.preprocessing.dimensional_reduction module¶
- class Compressor¶
Bases:
object
A class for compressing dataframes using various dimensionality reduction techniques.
This class provides a unified interface to apply multiple dimensionality reduction algorithms such as PCA, t-SNE, UMAP, Spectral Embedding, MDS, LLE, and Isomap. It supports optional exclusion of initial columns from compression and allows passing custom parameters to each algorithm.
- dataframe¶
The input dataframe to be compressed.
- Type:
pandas.DataFrame
- initial_columns_to_ignore¶
Number of initial columns to exclude from compression.
- Type:
int, optional (default=0)
- algorithm¶
The dimensionality reduction algorithm instance used for compression.
- Type:
object
- params_¶
Parameters of the compression algorithm after initialization.
- Type:
dict or None
- X_compressed¶
The compressed feature matrix.
- Type:
numpy.ndarray
- dataframe_compressed¶
The dataframe after applying dimensionality reduction.
- Type:
pandas.DataFrame
- compress_PCA(...)¶
Compress the dataframe using Principal Component Analysis.
- compress_TSNE(...)¶
Compress the dataframe using t-Distributed Stochastic Neighbor Embedding.
- compress_SE(...)¶
Compress the dataframe using Spectral Embedding.
- compress_UMAP(...)¶
Compress the dataframe using Uniform Manifold Approximation and Projection.
- compress_MDS(...)¶
Compress the dataframe using Multidimensional Scaling.
- compress_LLE(...)¶
Compress the dataframe using Locally Linear Embedding.
- compress_ISOMAP(...)¶
Compress the dataframe using Isomap.
Example
>>> from mlchem.chem.calculator.descriptors import get_rdkitDesc
>>> from mlchem.ml.preprocessing import scaling
>>> df = get_rdkitDesc(['CCCC', 'CCN', 'c1ccccc1', 'CF', 'CCO', 'CCCNC(OCCC)CCO'])
>>> df = scaling.scale_df_standard(df, last_columns_to_preserve=0)[0]
>>> c = Compressor(df)
>>> c.compress_PCA(n_components_or_variance=0.6)
>>> df_pca = c.dataframe_compressed
>>> c.compress_TSNE(dataframe=df_pca, random_state=1)
>>> compressed_df = c.dataframe_compressed
- __init__(dataframe: DataFrame, initial_columns_to_ignore: int = 0)¶
Initialize the Compressor with a dataframe and optional column exclusion.
- Parameters:
dataframe (pandas.DataFrame) – The input dataframe to be compressed.
initial_columns_to_ignore (int, optional (default=0)) – Number of initial columns to exclude from compression.
- compress_ISOMAP(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, dict_params: dict | None = None) None¶
Compress the dataframe using Isomap.
This method applies Isomap to reduce the dimensionality of the dataframe by preserving geodesic distances between all points.
- Parameters:
n_components (int) – Number of dimensions to reduce to.
neighbours_number_or_fraction (float or int) – Number or fraction of neighbors to consider for neighborhood graph construction.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the Isomap constructor.
- Return type:
None
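A minimal sketch of the Isomap reduction that compress_ISOMAP presumably wraps, using scikit-learn directly. Interpreting a fractional neighbours_number_or_fraction as a fraction of the sample count is an assumption about the wrapper, not documented behaviour:

```python
# Sketch of an Isomap reduction (sklearn call is real; how mlchem's
# wrapper maps its arguments onto it is an assumption).
import numpy as np
import pandas as pd
from sklearn.manifold import Isomap

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(30, 10)))

# Hypothetical reading: a value below 1 is a fraction of the samples,
# otherwise it is taken as the neighbour count directly.
frac_or_n = 0.3
n_neighbors = int(frac_or_n * len(df)) if frac_or_n < 1 else int(frac_or_n)

X_compressed = Isomap(n_components=2, n_neighbors=n_neighbors).fit_transform(df)
df_compressed = pd.DataFrame(X_compressed, index=df.index)
print(df_compressed.shape)  # (30, 2)
```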
- compress_LLE(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using Locally Linear Embedding (LLE).
This method applies LLE to reduce the dimensionality of the dataframe by preserving local neighborhood relationships.
- Parameters:
n_components (int) – Number of dimensions to reduce to.
neighbours_number_or_fraction (float or int) – Number or fraction of neighbors to consider for local reconstruction.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the LLE constructor.
- Return type:
None
- compress_MDS(n_components: int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using Multidimensional Scaling (MDS).
This method applies MDS to reduce the dimensionality of the dataframe based on pairwise distances.
- Parameters:
n_components (int) – Number of dimensions to reduce to.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the MDS constructor.
- Return type:
None
- compress_PCA(n_components_or_variance: int | float = 0.8, svd_solver: Literal['auto', 'full', 'arpack', 'randomized'] = 'full', dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using Principal Component Analysis (PCA).
This method reduces the dimensionality of the dataframe using PCA, either by specifying the number of components or the amount of variance to retain.
- Parameters:
n_components_or_variance (int or float, optional (default=0.8)) – Number of components to keep or the amount of variance to retain.
svd_solver ({'auto', 'full', 'arpack', 'randomized'}, optional (default='full')) – SVD solver to use for the decomposition.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the PCA constructor.
- Return type:
None
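The int-or-float dual meaning of n_components_or_variance mirrors scikit-learn's own PCA semantics, which can be sketched directly; note that a float target requires svd_solver='full', matching the documented default:

```python
# PCA with variance retention: a float in (0, 1) keeps however many
# components are needed to explain that fraction of the variance.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 20)))

pca = PCA(n_components=0.8, svd_solver='full', random_state=1)
X_compressed = pca.fit_transform(df)

print(pca.n_components_, 'components retain',
      round(pca.explained_variance_ratio_.sum(), 3), 'of the variance')
```

Passing an int instead (e.g. `n_components=5`) fixes the component count regardless of variance explained.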
- compress_SE(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using Spectral Embedding.
This method applies Spectral Embedding to reduce the dimensionality based on graph Laplacian eigenmaps.
- Parameters:
n_components (int) – Number of dimensions to reduce to.
neighbours_number_or_fraction (float or int) – Number or fraction of neighbors to consider for graph construction.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the SpectralEmbedding constructor.
- Return type:
None
- compress_TSNE(n_components: int = 2, neighbours_number_or_fraction: float | int = 0.9, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using t-Distributed Stochastic Neighbor Embedding (t-SNE).
This method applies t-SNE to reduce the dimensionality of the dataframe based on neighborhood probabilities.
- Parameters:
n_components (int, optional (default=2)) – Number of dimensions to reduce to.
neighbours_number_or_fraction (float or int, optional (default=0.9)) – Number or fraction of neighbors used to set the t-SNE perplexity.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the t-SNE constructor.
- Return type:
None
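A sketch of the t-SNE call behind compress_TSNE; mapping the default 0.9 neighbour fraction onto scikit-learn's perplexity parameter is an assumption about the wrapper:

```python
# t-SNE sketch: the neighbour fraction is assumed to set the perplexity
# as a share of the sample count (sklearn requires perplexity < n_samples).
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(40, 10)))

frac = 0.9
perplexity = frac * len(df) if frac < 1 else frac  # 36.0 here

X_compressed = TSNE(n_components=2, perplexity=perplexity,
                    random_state=1).fit_transform(df.to_numpy())
print(X_compressed.shape)  # (40, 2)
```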
- compress_UMAP(n_components: int, neighbours_number_or_fraction: float | int, dataframe: DataFrame | None = None, random_state: int = 1, dict_params: dict | None = None) None¶
Compress the dataframe using Uniform Manifold Approximation and Projection (UMAP).
This method applies UMAP to reduce the dimensionality of the dataframe while preserving local and global structure.
- Parameters:
n_components (int) – Number of dimensions to reduce to.
neighbours_number_or_fraction (float or int) – Number or fraction of neighbors to consider for local connectivity.
dataframe (pandas.DataFrame or None, optional) – DataFrame to compress. If None, uses the instance’s dataframe.
random_state (int, optional (default=1)) – Random seed for reproducibility.
dict_params (dict or None, optional) – Dictionary of parameters to pass directly to the UMAP constructor.
- Return type:
None
mlchem.ml.preprocessing.feature_transformation module¶
- polynomial_expansion(dataframe: DataFrame, degree: int) DataFrame¶
Expand features of a DataFrame to polynomial features of a given degree.
- Parameters:
dataframe (pandas.DataFrame) – Input DataFrame containing the original features.
degree (int) – Degree of the polynomial expansion.
- Returns:
DataFrame containing the expanded polynomial features.
- Return type:
pandas.DataFrame
mlchem.ml.preprocessing.scaling module¶
- scale_df_minmax(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, MinMaxScaler]¶
Scale a DataFrame using min-max scaling, preserving specified columns.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.
- Returns:
The scaled DataFrame and the fitted MinMaxScaler.
- Return type:
tuple of pandas.DataFrame and MinMaxScaler
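A sketch of min-max scaling that preserves trailing columns, assuming scale_df_minmax splits the frame, scales the left part, and reattaches the rest (e.g. a target column):

```python
# Min-max scale all but the last column, leaving the target untouched.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'x': [0.0, 5.0, 10.0],
                   'y': [1.0, 2.0, 3.0],
                   'target': [0, 1, 0]})
last_columns_to_preserve = 1

to_scale = df.iloc[:, :-last_columns_to_preserve]
preserved = df.iloc[:, -last_columns_to_preserve:]

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(to_scale),
                      columns=to_scale.columns, index=df.index)
df_scaled = pd.concat([scaled, preserved], axis=1)
print(df_scaled['x'].tolist())  # [0.0, 0.5, 1.0]
```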
- scale_df_robust(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, RobustScaler]¶
Scale a DataFrame using robust scaling, preserving specified columns.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.
- Returns:
The scaled DataFrame and the fitted RobustScaler.
- Return type:
tuple of pandas.DataFrame and RobustScaler
- scale_df_standard(df: DataFrame, last_columns_to_preserve: int = 0) tuple[DataFrame, StandardScaler]¶
Scale a DataFrame using standard scaling, preserving specified columns.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
last_columns_to_preserve (int, default=0) – Number of columns at the end of the DataFrame to exclude from scaling.
- Returns:
The scaled DataFrame and the fitted StandardScaler.
- Return type:
tuple of pandas.DataFrame and StandardScaler
- transform_df(df: DataFrame, scaler: StandardScaler | MinMaxScaler | RobustScaler, last_columns_to_preserve: int) tuple[DataFrame, StandardScaler | MinMaxScaler | RobustScaler]¶
Transform a DataFrame using a provided scaler, preserving specified columns.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
scaler (StandardScaler or MinMaxScaler or RobustScaler) – The fitted scaler to use for transformation.
last_columns_to_preserve (int) – Number of columns at the end of the DataFrame to exclude from transformation.
- Returns:
The transformed DataFrame and the scaler used.
- Return type:
tuple of pandas.DataFrame and scaler
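Accepting an already-fitted scaler suggests transform_df is meant for the usual train/test pattern: fit on training data, transform new data with the same statistics. A sketch under that assumption:

```python
# Fit a scaler on training data only, then apply it to unseen data.
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'x': [0.0, 2.0, 4.0]})
test = pd.DataFrame({'x': [2.0, 6.0]})

scaler = StandardScaler().fit(train)  # statistics come from train only
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)
print(test_scaled['x'].tolist())
```

The test set is centred with the training mean (2.0) and scaled by the training standard deviation, so 2.0 maps to 0.0 and 6.0 maps to about 2.45.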
mlchem.ml.preprocessing.undersampling module¶
- check_class_balance(y_train: Iterable) None¶
Check and print the class distribution in training labels.
- Parameters:
y_train (Iterable) – Training target values.
- Return type:
None
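A minimal sketch of a class-balance check; the real function's output format is unknown, so the print layout here is illustrative only:

```python
# Count and print the label distribution of a training target.
import pandas as pd

y_train = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])
counts = y_train.value_counts()
for label, count in counts.items():
    print(f'class {label}: {count} samples ({count / len(y_train):.0%})')
```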
- undersample(train_set: DataFrame, test_set: DataFrame, class_column: str, desired_proportion_majority: float, add_dropped_to_test: bool = False, random_seed: int | None = 1) tuple[DataFrame, DataFrame]¶
Undersample the majority class in a training set to achieve a desired class balance.
- Parameters:
train_set (pandas.DataFrame) – The training dataset.
test_set (pandas.DataFrame) – The test dataset.
class_column (str) – Name of the column containing class labels.
desired_proportion_majority (float) – Desired proportion of the majority class in the training set.
add_dropped_to_test (bool, default=False) – Whether to add the dropped samples to the test set.
random_seed (int, optional) – Random seed for reproducibility.
- Returns:
The undersampled training set and the updated test set.
- Return type:
tuple of pandas.DataFrame
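A pandas-only sketch of majority-class undersampling to a target proportion, assuming the semantics described above (keep all minority rows, drop majority rows until the majority makes up desired_proportion_majority of the training set):

```python
# Undersample a binary training set: 8 majority / 2 minority -> 60/40.
import pandas as pd

train = pd.DataFrame({'feat': range(10),
                      'label': [0] * 8 + [1] * 2})
desired_proportion_majority = 0.6
random_seed = 1

counts = train['label'].value_counts()
majority = counts.idxmax()
n_minority = counts.min()
# Solve n_maj / (n_maj + n_min) = p for n_maj; round to avoid float error.
n_majority_keep = round(desired_proportion_majority * n_minority
                        / (1 - desired_proportion_majority))

kept_majority = train[train['label'] == majority].sample(
    n=n_majority_keep, random_state=random_seed)
undersampled = pd.concat([kept_majority,
                          train[train['label'] != majority]])
print(undersampled['label'].value_counts().to_dict())  # {0: 3, 1: 2}
```

With `add_dropped_to_test=True`, the documented behaviour appends the dropped majority rows to the test set instead of discarding them.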