chem and ml subpackages

mlchem.helper module

add_inchi_to_dataframe(df: DataFrame, loc: int, smiles_column_name: str) DataFrame

Add an InChI column to a DataFrame by converting SMILES strings.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • loc (int) – Column index to insert the InChI column.

  • smiles_column_name (str) – Name of the column containing SMILES strings.

Returns:

DataFrame with the added InChI column.

Return type:

pandas.DataFrame

assign_sign(x: float | int) str

Return the sign of a number as ‘+’ or ‘-’.

Parameters:

x (float or int) – Input value.

Returns:

‘+’ if x is non-negative, ‘-’ otherwise.

Return type:

str

bokeh_plot(p: figure, classnames: list[str], dict_datatables: dict[str, DataTable]) None

Display a Bokeh plot with associated DataTables.

Parameters:
  • p (bokeh.plotting.Figure) – Bokeh plot to display.

  • classnames (list of str) – List of class names.

  • dict_datatables (dict of str to DataTable) – Mapping of class names to DataTables.

Return type:

None

compute_alpha(size: int) float

Compute transparency value based on sample size.

Parameters:

size (int) – Sample size.

Returns:

Computed alpha value.

Return type:

float

convert_rgb(rgb_tuple: tuple[int, int, int], mode: Literal['normalise', 'denormalise']) tuple[float, float, float] | tuple[int, int, int]

Convert RGB values between 0-255 and 0-1 ranges.

Parameters:
  • rgb_tuple (tuple of int) – RGB values in the form (R, G, B).

  • mode ({'normalise', 'denormalise'}) – Conversion mode. ‘normalise’ converts from 0-255 to 0-1, ‘denormalise’ converts from 0-1 to 0-255.

Returns:

Converted RGB values.

Return type:

tuple of float or tuple of int
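A minimal sketch of the two conversion modes described above. The function name convert_rgb_sketch is hypothetical, and rounding to the nearest integer when denormalising is an assumption; the library may truncate instead.

```python
from typing import Literal


def convert_rgb_sketch(rgb_tuple, mode: Literal["normalise", "denormalise"]):
    """Hypothetical stand-in for convert_rgb (rounding rule is assumed)."""
    if mode == "normalise":
        # 0-255 integers -> 0-1 floats
        return tuple(channel / 255 for channel in rgb_tuple)
    if mode == "denormalise":
        # 0-1 floats -> 0-255 integers, rounded to the nearest value (assumption)
        return tuple(round(channel * 255) for channel in rgb_tuple)
    raise ValueError(f"unknown mode: {mode!r}")


normalised = convert_rgb_sketch((255, 0, 51), "normalise")
restored = convert_rgb_sketch(normalised, "denormalise")
```

Under these assumptions a round trip through both modes returns the original integer triple.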

convert_size(size: tuple[float, float] | None = None, pixel_size: tuple[int, int] | None = None, dpi: int = 100) tuple[int, int] | tuple[float, float]

Convert between size in inches and size in pixels.

Parameters:
  • size (tuple of float, optional) – Size in inches as (width, height).

  • pixel_size (tuple of int, optional) – Size in pixels as (width, height).

  • dpi (int, default=100) – Dots per inch used for conversion.

Returns:

Converted size in pixels if size is provided, or in inches if pixel_size is provided.

Return type:

tuple of int or tuple of float
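The inch/pixel relationship is pixels = inches × dpi. The sketch below illustrates it; convert_size_sketch is a hypothetical name, and both the rounding of pixel values and the error raised when neither argument is given are assumptions.

```python
def convert_size_sketch(size=None, pixel_size=None, dpi=100):
    """Hypothetical stand-in for convert_size."""
    if size is not None:
        # inches -> pixels (rounded to whole pixels; assumption)
        return tuple(int(round(value * dpi)) for value in size)
    if pixel_size is not None:
        # pixels -> inches
        return tuple(value / dpi for value in pixel_size)
    # Behaviour with no input is not documented; raising is an assumption.
    raise ValueError("provide either size or pixel_size")


in_pixels = convert_size_sketch(size=(6.4, 4.8))
in_inches = convert_size_sketch(pixel_size=(640, 480))
```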

count_features(list_features: Iterable[str]) int

Count the total number of features, including interaction terms.

Parameters:

list_features (iterable of str) – List of feature names, possibly including interaction terms.

Returns:

Total number of features.

Return type:

int

Example

>>> count_features(['a', 'b', 'a b'])
4
>>> count_features(['a', 'a b', 'c^2'])
5
>>> count_features(['a', 'b', 'c', 'c a', 'c b^2', 'a^3'])
11
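The three outputs above are consistent with summing the degree of every term: each space-separated factor counts once, or as many times as its exponent. This rule is inferred from the examples only and is not confirmed by the source; count_features_sketch is a hypothetical name.

```python
def count_features_sketch(list_features):
    """Sum the degree of each term (rule inferred from the documented examples)."""
    total = 0
    for term in list_features:
        for factor in term.split():
            # 'c^2' -> exponent '2'; a bare name has an implicit exponent of 1
            _, _, exponent = factor.partition("^")
            total += int(exponent) if exponent else 1
    return total


first = count_features_sketch(["a", "b", "a b"])
second = count_features_sketch(["a", "a b", "c^2"])
third = count_features_sketch(["a", "b", "c", "c a", "c b^2", "a^3"])
```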

create_mask(array: ndarray, lower: float | int, upper: float | int) ndarray

Create a boolean mask for values within a specified range.

Parameters:
  • array (numpy.ndarray) – Input array.

  • lower (float or int) – Lower bound.

  • upper (float or int) – Upper bound.

Returns:

Boolean mask array.

Return type:

numpy.ndarray
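A range mask of this kind is usually built from two element-wise comparisons. The sketch below assumes both bounds are inclusive, which the text above does not state; create_mask_sketch is a hypothetical name.

```python
import numpy as np


def create_mask_sketch(array, lower, upper):
    """Hypothetical stand-in for create_mask (inclusive bounds are assumed)."""
    return (array >= lower) & (array <= upper)


values = np.array([0.5, 1.0, 2.5, 4.0])
mask = create_mask_sketch(values, lower=1, upper=3)
selected = values[mask]  # boolean indexing keeps only the in-range values
```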

create_progressive_column_names(serial_name: str, n: int) list[str]

Generate a list of sequential column names.

Parameters:
  • serial_name (str) – Base name for the columns.

  • n (int) – Number of columns to generate.

Returns:

List of column names.

Return type:

list of str

create_smooth_gradient_circle(radius: int, color: tuple[int, int, int], alpha: float) Image

Create a smooth gradient circle with transparency.

Parameters:
  • radius (int) – Radius of the circle.

  • color (tuple of int) – Base RGB color in the range [0, 255].

  • alpha (float) – Transparency level in the range [0, 1].

Returns:

PIL Image object containing the gradient circle.

Return type:

Image.Image

create_structure_files(df: DataFrame, structure_column_name: str, folder_name: str) None

Create PNG structure files for molecules in a DataFrame.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • structure_column_name (str) – Column containing molecular structures.

  • folder_name (str) – Folder to save the PNG images.

Return type:

None

dfs_to_excel(file_name: str, dfs: Iterable[DataFrame], sheet_names: Iterable[str]) None

Write multiple DataFrames to an Excel file, each on a separate sheet.

Parameters:
  • file_name (str) – Name of the Excel file.

  • dfs (iterable of pandas.DataFrame) – DataFrames to write.

  • sheet_names (iterable of str) – Names of the sheets.

Return type:

None

find_all_occurrences(text: str, substring: str) list[int]

Find all starting indices of a substring in a text.

Parameters:
  • text (str) – Text to search.

  • substring (str) – Substring to find.

Returns:

List of starting indices where the substring occurs.

Return type:

list of int
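One way to collect every starting index is to call str.find repeatedly. This sketch reports overlapping matches, which is an assumption about the intended behaviour; find_all_occurrences_sketch is a hypothetical name.

```python
def find_all_occurrences_sketch(text, substring):
    """Hypothetical stand-in for find_all_occurrences (overlaps included)."""
    indices = []
    start = text.find(substring)
    while start != -1:
        indices.append(start)
        # Advance by one character so overlapping matches are also found.
        start = text.find(substring, start + 1)
    return indices


hits = find_all_occurrences_sketch("banana", "ana")
misses = find_all_occurrences_sketch("banana", "xyz")
```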

flatten(args: Any) tuple[Any, ...]

Flatten a nested structure into a single tuple.

Parameters:

args (Any) – The nested structure to flatten.

Returns:

A flattened tuple containing all elements.

Return type:

tuple
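A recursive sketch of flattening. Which container types count as nested is not documented; here only lists and tuples are unpacked and strings are treated as single elements, both of which are assumptions. flatten_sketch is a hypothetical name.

```python
def flatten_sketch(args):
    """Hypothetical stand-in for flatten (only lists and tuples are unpacked)."""
    if not isinstance(args, (list, tuple)):
        return (args,)  # a leaf value becomes a one-element tuple
    flat = []
    for item in args:
        flat.extend(flatten_sketch(item))
    return tuple(flat)


flattened = flatten_sketch([1, (2, [3, "ab"]), 4])
```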

generate_combination_cascade(elements: Iterable, n: int) Iterable[Iterable]

Generate all combinations of elements from size 1 to n.

Parameters:
  • elements (iterable) – Input elements to combine.

  • n (int) – Maximum size of combinations.

Returns:

List of combinations of elements.

Return type:

list of list
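The cascade can be sketched with itertools.combinations, looping over sizes 1 to n. Returning one flat list of lists, ordered by size, is an assumption based on the stated return type; generate_combination_cascade_sketch is a hypothetical name.

```python
from itertools import combinations


def generate_combination_cascade_sketch(elements, n):
    """Hypothetical stand-in: all combinations of size 1..n, smallest first."""
    pool = list(elements)
    cascade = []
    for size in range(1, n + 1):
        cascade.extend(list(combo) for combo in combinations(pool, size))
    return cascade


cascade = generate_combination_cascade_sketch(["a", "b", "c"], 2)
```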

generate_random_rgb() tuple[float, float, float]

Generate a random RGB color.

Returns:

A tuple of three floats representing an RGB color, with each component in the range [0, 1].

Return type:

tuple of float

identify_df_duplicates(df: DataFrame, column_name: str, keep: Literal['first', 'last'] = 'last') tuple[DataFrame, DataFrame]

Identify and separate duplicate rows in a DataFrame based on a column.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • column_name (str) – Column to check for duplicates.

  • keep ({'first', 'last'}, default='last') – Which duplicate to keep.

Returns:

Cleaned DataFrame and DataFrame of removed duplicates.

Return type:

tuple of pandas.DataFrame

insert_string_piece(text: str, substring: str, index: int) str

Insert a substring into a string at a specified index.

Parameters:
  • text (str) – Original string.

  • substring (str) – Substring to insert.

  • index (int) – Index at which to insert the substring.

Returns:

Modified string.

Return type:

str

Raises:

ValueError – If index is less than 0.
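The documented behaviour maps naturally onto string slicing. The sketch below uses the hypothetical name insert_string_piece_sketch; the error message wording is an assumption.

```python
def insert_string_piece_sketch(text, substring, index):
    """Hypothetical stand-in for insert_string_piece."""
    if index < 0:
        raise ValueError("index must be non-negative")
    # Everything before `index`, then the new piece, then the remainder.
    return text[:index] + substring + text[index:]


modified = insert_string_piece_sketch("CCO", "=", 1)
```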

loadingbar(count: int, total: int, size: int) None

Display a loading bar to indicate progress.

Parameters:
  • count (int) – Current iteration.

  • total (int) – Total number of iterations.

  • size (int) – Length of the loading bar.

Return type:

None

make_rgb_transparent(fg_rgb: tuple[float, float, float], bg_rgb: tuple[float, float, float] = (1.0, 1.0, 1.0), alpha: float = 0.5) tuple[float, float, float]

Apply transparency to a foreground RGB color over a background color.

Parameters:
  • fg_rgb (tuple of float) – Foreground RGB color in the range [0, 1].

  • bg_rgb (tuple of float, optional) – Background RGB color in the range [0, 1]. Default is white (1.0, 1.0, 1.0).

  • alpha (float, default=0.5) – Transparency level, where 0 is fully transparent and 1 is fully opaque.

Returns:

RGB color after applying transparency.

Return type:

tuple of float
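Applying transparency over an opaque background is commonly done with the alpha-compositing rule result = alpha × fg + (1 − alpha) × bg, applied per channel. The sketch assumes the helper follows this rule; make_rgb_transparent_sketch is a hypothetical name.

```python
def make_rgb_transparent_sketch(fg_rgb, bg_rgb=(1.0, 1.0, 1.0), alpha=0.5):
    """Hypothetical stand-in: per-channel alpha blend of fg over bg."""
    return tuple(
        alpha * fg + (1 - alpha) * bg
        for fg, bg in zip(fg_rgb, bg_rgb)
    )


pale_red = make_rgb_transparent_sketch((1.0, 0.0, 0.0))             # red over white
opaque_red = make_rgb_transparent_sketch((1.0, 0.0, 0.0), alpha=1)  # unchanged
```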

merge_dicts_with_duplicates(dict1: dict[str, Any], dict2: dict[str, Any]) dict[str, Any]

Merge two dictionaries, renaming duplicate keys from the second dictionary.

Parameters:
  • dict1 (dict of str to Any) – First dictionary.

  • dict2 (dict of str to Any) – Second dictionary.

Returns:

Merged dictionary with unique keys.

Return type:

dict of str to Any

normalise_iterable(values: Iterable[float | int]) list[float]

Normalise values in an iterable so that the maximum absolute value is 1.

Parameters:

values (iterable of float or int) – Iterable of numerical values.

Returns:

Normalised values.

Return type:

list of float
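Scaling by the largest absolute value keeps signs intact and maps the extreme value to ±1. In this sketch, normalise_iterable_sketch is a hypothetical name, and an all-zero input is left unhandled (it would raise ZeroDivisionError), since the source does not say what should happen in that case.

```python
def normalise_iterable_sketch(values):
    """Hypothetical stand-in: divide every value by the largest absolute value."""
    values = list(values)
    scale = max(abs(value) for value in values)
    return [value / scale for value in values]


normalised_values = normalise_iterable_sketch([2, -4, 1])
```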

prepare_dataframe(df: DataFrame, dir_name: str) DataFrame

Add a ‘MOLFILE’ column to a DataFrame with sorted file paths.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • dir_name (str) – Directory containing the files.

Returns:

DataFrame with the added ‘MOLFILE’ column.

Return type:

pandas.DataFrame

prepare_datatable(df: DataFrame, height: int = 500, width: int = 866) DataTable

Create a Bokeh DataTable from a DataFrame.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • height (int, default=500) – Height of the DataTable.

  • width (int, default=866) – Width of the DataTable. The default corresponds to int(650 / 0.75).

Returns:

Bokeh DataTable object.

Return type:

bokeh.models.DataTable

process_custom_string(s: str, target_substring: str, replacement_list: list[str], separator: str = ';') str

Process a string by replacing a target substring with elements from a list and formatting the result.

Parameters:
  • s (str) – The original string.

  • target_substring (str) – The substring to replace.

  • replacement_list (list of str) – List of strings to replace the target substring.

  • separator (str, default=';') – Separator used in formatting the result.

Returns:

The processed and formatted string.

Return type:

str

reset_string(input_string: str) str

Remove punctuation and convert a string to lowercase.

Parameters:

input_string (str) – Input string.

Returns:

Processed string.

Return type:

str
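A sketch of this clean-up using the standard library. It assumes "punctuation" means the ASCII characters in string.punctuation, which the source does not specify; reset_string_sketch is a hypothetical name.

```python
import string


def reset_string_sketch(input_string):
    """Hypothetical stand-in: strip ASCII punctuation, then lowercase."""
    table = str.maketrans("", "", string.punctuation)
    return input_string.translate(table).lower()


cleaned = reset_string_sketch("Hello, World!")
```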

show_png(data: bytes) Image

Display a PNG image from binary data.

Parameters:

data (bytes) – Binary data of the PNG image.

Returns:

PIL Image object.

Return type:

Image.Image

size_ratio(size1: int, size2: int) float

Compute a ratio based on the relative sizes of two values.

Parameters:
  • size1 (int) – First size.

  • size2 (int) – Second size.

Returns:

Computed ratio.

Return type:

float

sort_list_by_other_list(list_1: list, list_2: list[int | float]) tuple[list[str], list[int | float]]

Sort one list based on the absolute values of another list.

Parameters:
  • list_1 (list) – List of elements to sort.

  • list_2 (list of int or float) – List of values used to determine sort order.

Returns:

Sorted list of elements and corresponding sorted values.

Return type:

tuple of list
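Sorting one list by the absolute values of another can be sketched by zipping the two lists and sorting the pairs. The sort direction is not documented; the descending parameter below is hypothetical and defaults to largest-magnitude-first as an assumption, as is the name sort_list_by_other_list_sketch.

```python
def sort_list_by_other_list_sketch(list_1, list_2, descending=True):
    """Hypothetical stand-in: order both lists by |value| from list_2."""
    pairs = sorted(
        zip(list_1, list_2),
        key=lambda pair: abs(pair[1]),
        reverse=descending,
    )
    sorted_elements = [element for element, _ in pairs]
    sorted_values = [value for _, value in pairs]
    return sorted_elements, sorted_values


names, weights = sort_list_by_other_list_sketch(["a", "b", "c"], [0.2, -0.9, 0.5])
```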

standardise_path(path: str) str

Convert a Windows-style path to a standardised format.

Parameters:

path (str) – Path string with backslashes.

Returns:

Path string with forward slashes.

Return type:

str

suppress_warnings() None

Suppress all warnings in the current Python session.

Return type:

None

try_except(func: Callable[[], Any], exc: Any = None) Any

Execute a function and return its result or a fallback value on exception.

Parameters:
  • func (callable) – Function to execute.

  • exc (any, optional) – Value to return if an exception occurs. Default is None.

Returns:

Result of the function or fallback value.

Return type:

any
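The pattern described above is a thin wrapper around try/except. In this sketch, try_except_sketch is a hypothetical name, and catching Exception (rather than a narrower or broader class) is an assumption.

```python
def try_except_sketch(func, exc=None):
    """Hypothetical stand-in: call func, return exc if anything goes wrong."""
    try:
        return func()
    except Exception:
        return exc


parsed = try_except_sketch(lambda: int("42"))
fallback = try_except_sketch(lambda: int("not a number"), exc=-1)
```

Because func takes no arguments, a lambda is a convenient way to bind the inputs at the call site.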

visualise_colour(rgb_tuple: tuple[float, float, float]) None

Display a single RGB color.

Parameters:

rgb_tuple (tuple of float) – RGB values in the range [0, 1].

Return type:

None

visualise_colour_grid(colour_dictionary: dict[str, tuple[float, float, float]], save: bool = False, filename: str = '', figsize: tuple[int, int] = (20, 20)) None

Display a grid of RGB colors.

Parameters:
  • colour_dictionary (dict of str to tuple of float) – Dictionary mapping color names to RGB tuples.

  • save (bool, default=False) – Whether to save the figure.

  • filename (str, optional) – Filename to save the figure if save is True.

  • figsize (tuple of int, default=(20, 20)) – Size of the figure.

Return type:

None

mlchem.importables module

Useful collections to perform various tasks:
  1. metal_list: List with all metals in SMILES notation.

  2. chemical_dictionary: dictionary in the form {‘fragment’: int} to create a bag of fragments useful in chem.manipulation.

  3. colour_dictionary: dictionary of RGB tuples for a number of predefined colours.

  4. chemotype_dictionary: collection of many pattern recognition functions that will be used by chem.calculator.descriptors.get_chemotypes().

  5. bokeh_dictionary: collection of predefined bokeh plotting parameters.

  6. bokeh_tooltips: predefined HTML bokeh tooltips to interactively visualise chemical space.

  7. interpretable_descriptors_rdkit: list of rdkit descriptors with a simple meaning.

  8. interpretable_descriptors_mordred: list of mordred descriptors with a simple meaning.

  9. similarity_metric_dictionary: dictionary in the form {metric_name: func} collecting various similarity metric functions (functions can be called from mlchem.metrics module as well).

mlchem.metrics module

AllBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the AllBit similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.AllBitSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The AllBit similarity, considering both on and off bits in the fingerprints.

Return type:

float

AsymmetricSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Asymmetric similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.AsymmetricSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Asymmetric similarity, emphasizing features present in the first fingerprint.

Return type:

float

BraunBlanquetSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Braun-Blanquet similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.BraunBlanquetSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Braun-Blanquet similarity, calculated as the intersection over the maximum bit count.

Return type:

float

CosineSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Cosine similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.CosineSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Cosine similarity, measuring the cosine of the angle between two bit vectors.

Return type:

float

DiceSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Dice similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.DiceSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Dice similarity coefficient, ranging from 0 (no similarity) to 1 (identical).

Return type:

float

FingerprintSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, metric: callable) float

Compute the fingerprint similarity using a specified similarity metric.

This function is a shortcut for the RDKit method DataStructs.FingerprintSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.

  • metric (callable) – The similarity metric function to use.

Returns:

The fingerprint similarity between the two fingerprints.

Return type:

float

KulczynskiSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Kulczynski similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.KulczynskiSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Kulczynski similarity, a symmetric measure of bit overlap.

Return type:

float

McConnaugheySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the McConnaughey similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.McConnaugheySimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The McConnaughey similarity, a measure of structural similarity based on bit patterns.

Return type:

float

OffBitProjSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the OffBitProj similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.OffBitProjSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The OffBitProj similarity, based on the projection of off bits between fingerprints.

Return type:

float

OnBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the OnBit similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.OnBitSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The OnBit similarity, based on the number of bits set in both fingerprints.

Return type:

float

RogotGoldbergSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Rogot-Goldberg similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.RogotGoldbergSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Rogot-Goldberg similarity, a weighted measure of bit agreement.

Return type:

float

RusselSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Russel similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.RusselSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Russel similarity between the two fingerprints.

Return type:

float

SokalSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Sokal similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.SokalSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Sokal similarity coefficient, a normalized measure of bit overlap.

Return type:

float

TanimotoSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float

Compute the Tanimoto similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.TanimotoSimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Tanimoto similarity coefficient, commonly used for chemical structure comparison.

Return type:

float

TverskySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, a: float = 0.5, b: float = 0.5) float

Compute the Tversky similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.TverskySimilarity.

Parameters:
  • fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.

  • fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.

  • a (float, optional) – Weight for features in fp1. Default is 0.5.

  • b (float, optional) – Weight for features in fp2. Default is 0.5.

Returns:

The Tversky similarity between the two fingerprints.

Return type:

float
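To illustrate how the weights a and b act, the sketch below computes the Tversky index on plain Python sets of on-bit positions instead of RDKit bit vectors. With c = |A ∩ B|, the index is c / (a·|A − B| + b·|B − A| + c); setting a = b = 1 gives the Tanimoto coefficient and a = b = 0.5 gives the Dice coefficient. The name tversky_sketch and the set-based representation are illustrative assumptions, not the library's implementation.

```python
def tversky_sketch(on_bits_1, on_bits_2, a=0.5, b=0.5):
    """Tversky index on sets of on-bit positions (illustrative only)."""
    common = len(on_bits_1 & on_bits_2)
    only_1 = len(on_bits_1 - on_bits_2)
    only_2 = len(on_bits_2 - on_bits_1)
    denominator = a * only_1 + b * only_2 + common
    # Returning 0.0 for two empty fingerprints is an assumption.
    return common / denominator if denominator else 0.0


fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
tanimoto_like = tversky_sketch(fp_a, fp_b, a=1, b=1)  # 2 / 6
dice_like = tversky_sketch(fp_a, fp_b)                # 2 / 4
```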

get_geometric_S(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float

Compute the geometric mean of sensitivity and specificity.

Parameters:
  • y_true (Iterable[int or str]) – True labels.

  • y_pred (Iterable[int or str]) – Predicted labels.

  • labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The geometric mean of sensitivity and specificity.

Return type:

float

get_mcc(y_true: Iterable[int | str], y_pred: Iterable[int | str]) float

Compute the Matthews Correlation Coefficient (MCC).

Parameters:
  • y_true (Iterable[int or str]) – True labels.

  • y_pred (Iterable[int or str]) – Predicted labels.

Returns:

The Matthews Correlation Coefficient.

Return type:

float
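For a binary problem the MCC is (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below evaluates that formula directly; mcc_sketch and its positive parameter are hypothetical, and restricting to two classes is a simplifying assumption.

```python
from math import sqrt


def mcc_sketch(y_true, y_pred, positive=1):
    """Binary Matthews correlation coefficient (illustrative only)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0.0 when the denominator vanishes is an assumed convention.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


mcc = mcc_sketch([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1])
```

In this example TP = 3, TN = 3, FP = 1 and FN = 1, so the value is (9 − 1) / 16.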

get_r2(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float

Compute the R-squared value using Pearson’s correlation coefficient.

Parameters:
  • y_true (Iterable[float or int]) – True values.

  • y_pred (Iterable[float or int]) – Predicted values.

Returns:

The R-squared value.

Return type:

float
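An R-squared value derived from Pearson's correlation coefficient is the square of r, which measures linear association rather than agreement. The sketch below makes the point: predictions that are exactly twice the true values still score 1.0. The name pearson_r2_sketch is hypothetical, and constant inputs (zero variance) are not handled.

```python
def pearson_r2_sketch(y_true, y_pred):
    """Squared Pearson correlation between two sequences (illustrative only)."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    covariance = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return covariance ** 2 / (var_true * var_pred)


r2 = pearson_r2_sketch([1, 2, 3, 4], [2, 4, 6, 8])
```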

get_rmse(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float

Compute the root mean squared error (RMSE) of a prediction.

Parameters:
  • y_true (Iterable[float or int]) – True values.

  • y_pred (Iterable[float or int]) – Predicted values.

Returns:

The root mean squared error.

Return type:

float
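RMSE is the square root of the mean squared difference between true and predicted values. A minimal sketch using only the standard library follows; rmse_sketch is a hypothetical name.

```python
from math import sqrt


def rmse_sketch(y_true, y_pred):
    """Root mean squared error (illustrative only)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


rmse = rmse_sketch([1, 2, 3, 4], [1, 2, 3, 6])  # one error of 2 over 4 samples
```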

get_sensitivity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float

Compute the sensitivity (recall) of a prediction.

Parameters:
  • y_true (Iterable[int or str]) – True labels.

  • y_pred (Iterable[int or str]) – Predicted labels.

  • labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The sensitivity (recall) of the prediction.

Return type:

float

get_specificity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float

Compute the specificity of a prediction.

Parameters:
  • y_true (Iterable[int or str]) – True labels.

  • y_pred (Iterable[int or str]) – Predicted labels.

  • labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The specificity of the prediction.

Return type:

float
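For a binary problem, sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), and the geometric mean reported by get_geometric_S is the square root of their product. The sketch below computes all three from raw labels; the function names and the positive parameter are hypothetical, and the handling of the labels argument is not modelled.

```python
from math import sqrt


def binary_counts_sketch(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for a binary problem (illustrative only)."""
    tp = fn = tn = fp = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive:
            if prediction == positive:
                tp += 1
            else:
                fn += 1
        elif prediction == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
tp, fn, tn, fp = binary_counts_sketch(y_true, y_pred)
sensitivity = tp / (tp + fn)   # recall on the positive class
specificity = tn / (tn + fp)   # recall on the negative class
geometric_s = sqrt(sensitivity * specificity)
```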

rmse_to_std_ratio(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float

Compute the ratio of the standard deviation of true values to RMSE.

Parameters:
  • y_true (Iterable[float or int]) – True values.

  • y_pred (Iterable[float or int]) – Predicted values.

Returns:

The ratio of standard deviation to RMSE.

Return type:

float