chem and ml subpackages¶

mlchem.helper module¶

add_inchi_to_dataframe(df: DataFrame, loc: int, smiles_column_name: str) → DataFrame¶

Add an InChI column to a DataFrame by converting SMILES strings.

Parameters:

df (pandas.DataFrame) – Input DataFrame.
loc (int) – Column index to insert the InChI column.
smiles_column_name (str) – Name of the column containing SMILES strings.

Returns:

DataFrame with the added InChI column.

Return type:

pandas.DataFrame

assign_sign(x: float | int) → str¶

Return the sign of a number as ‘+’ or ‘-’.

Parameters:

x (float or int) – Input value.

Returns:

‘+’ if x is non-negative, ‘-’ otherwise.

Return type:

str

bokeh_plot(p: figure, classnames: list[str], dict_datatables: dict[str, DataTable]) → None¶

Display a Bokeh plot with associated DataTables.

Parameters:

p (bokeh.plotting.Figure) – Bokeh plot to display.
classnames (list of str) – List of class names.
dict_datatables (dict of str to DataTable) – Mapping of class names to DataTables.

Return type:

None

compute_alpha(size: int) → float¶

Compute transparency value based on sample size.

Parameters:

size (int) – Sample size.

Returns:

Computed alpha value.

Return type:

float

convert_rgb(rgb_tuple: tuple[int, int, int], mode: Literal['normalise', 'denormalise']) → tuple[float, float, float] | tuple[int, int, int]¶

Convert RGB values between 0-255 and 0-1 ranges.

Parameters:

rgb_tuple (tuple of int) – RGB values in the form (R, G, B).
mode ({'normalise', 'denormalise'}) – Conversion mode. ‘normalise’ converts from 0-255 to 0-1, ‘denormalise’ converts from 0-1 to 0-255.

Returns:

Converted RGB values.

Return type:

tuple of float or tuple of int
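The two modes can be illustrated with a minimal sketch. It assumes plain division and multiplication by 255, with rounding on the way back to integers; convert_rgb_sketch is a hypothetical stand-in and not the library's implementation.

```python
from typing import Literal


def convert_rgb_sketch(rgb_tuple, mode: Literal["normalise", "denormalise"]):
    """Hypothetical stand-in for convert_rgb (illustration only)."""
    if mode == "normalise":
        # 0-255 integers -> 0-1 floats
        return tuple(channel / 255 for channel in rgb_tuple)
    if mode == "denormalise":
        # 0-1 floats -> 0-255 integers; rounding is an assumption
        return tuple(round(channel * 255) for channel in rgb_tuple)
    raise ValueError(f"unknown mode: {mode!r}")


print(convert_rgb_sketch((255, 0, 0), "normalise"))
print(convert_rgb_sketch((0.0, 1.0, 0.0), "denormalise"))
```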

convert_size(size: tuple[float, float] | None = None, pixel_size: tuple[int, int] | None = None, dpi: int = 100) → tuple[int, int] | tuple[float, float]¶

Convert between size in inches and size in pixels.

Parameters:

size (tuple of float, optional) – Size in inches as (width, height).
pixel_size (tuple of int, optional) – Size in pixels as (width, height).
dpi (int, default=100) – Dots per inch used for conversion.

Returns:

Converted size in pixels if size is provided, or in inches if pixel_size is provided.

Return type:

tuple of int or tuple of float
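The inch and pixel conversion is a multiplication or division by dpi. The sketch below assumes that size takes precedence when both arguments are given and that pixel values are rounded to the nearest integer; convert_size_sketch is a hypothetical name.

```python
def convert_size_sketch(size=None, pixel_size=None, dpi=100):
    """Hypothetical stand-in for convert_size (illustration only)."""
    if size is not None:
        # inches -> pixels; rounding to whole pixels is an assumption
        return tuple(int(round(value * dpi)) for value in size)
    if pixel_size is not None:
        # pixels -> inches
        return tuple(value / dpi for value in pixel_size)
    raise ValueError("provide either size or pixel_size")


print(convert_size_sketch(size=(6.4, 4.8)))
print(convert_size_sketch(pixel_size=(640, 480)))
```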

count_features(list_features: Iterable[str]) → int¶

Count the total number of features, including interaction terms.

Parameters:

list_features (iterable of str) – List of feature names, possibly including interaction terms.

Returns:

Total number of features.

Return type:

int

Example

>>> count_features(['a', 'b', 'a b'])
4
>>> count_features(['a', 'a b', 'c^2'])
5
>>> count_features(['a', 'b', 'c', 'c a', 'c b^2', 'a^3'])
11
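One counting rule that reproduces the three examples above is to split each name on whitespace and let every term contribute its exponent (1 when no ‘^’ is present). This is an inference from the examples, not the library's code, and count_features_sketch is a hypothetical name.

```python
def count_features_sketch(list_features):
    """Sum the exponents of all space-separated terms (inferred rule)."""
    total = 0
    for feature in list_features:
        for term in feature.split():
            # 'c^2' contributes 2, a bare name such as 'a' contributes 1
            _, _, power = term.partition("^")
            total += int(power) if power else 1
    return total


print(count_features_sketch(["a", "b", "a b"]))
```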

create_mask(array: ndarray, lower: float | int, upper: float | int) → ndarray¶

Create a boolean mask for values within a specified range.

Parameters:

array (numpy.ndarray) – Input array.
lower (float or int) – Lower bound.
upper (float or int) – Upper bound.

Returns:

Boolean mask array.

Return type:

numpy.ndarray

create_progressive_column_names(serial_name: str, n: int) → list[str]¶

Generate a list of sequential column names.

Parameters:

serial_name (str) – Base name for the columns.
n (int) – Number of columns to generate.

Returns:

List of column names.

Return type:

list of str

create_smooth_gradient_circle(radius: int, color: tuple[int, int, int], alpha: float) → Image¶

Create a smooth gradient circle with transparency.

Parameters:

radius (int) – Radius of the circle.
color (tuple of int) – Base RGB color in the range [0, 255].
alpha (float) – Transparency level in the range [0, 1].

Returns:

PIL Image object containing the gradient circle.

Return type:

Image.Image

create_structure_files(df: DataFrame, structure_column_name: str, folder_name: str) → None¶

Create PNG structure files for molecules in a DataFrame.

Parameters:

df (pandas.DataFrame) – Input DataFrame.
structure_column_name (str) – Column containing molecular structures.
folder_name (str) – Folder to save the PNG images.

Return type:

None

dfs_to_excel(file_name: str, dfs: Iterable[DataFrame], sheet_names: Iterable[str]) → None¶

Write multiple DataFrames to an Excel file, each on a separate sheet.

Parameters:

file_name (str) – Name of the Excel file.
dfs (iterable of pandas.DataFrame) – DataFrames to write.
sheet_names (iterable of str) – Names of the sheets.

Return type:

None

find_all_occurrences(text: str, substring: str) → list[int]¶

Find all starting indices of a substring in a text.

Parameters:

text (str) – Text to search.
substring (str) – Substring to find.

Returns:

List of starting indices where the substring occurs.

Return type:

list of int
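A minimal sketch of the search, built on str.find. It assumes overlapping matches are reported, which the description does not state; find_all_occurrences_sketch is a hypothetical name.

```python
def find_all_occurrences_sketch(text, substring):
    """Collect every start index of substring in text (overlaps included)."""
    indices = []
    start = text.find(substring)
    while start != -1:
        indices.append(start)
        # advance by one character so overlapping matches are also found
        start = text.find(substring, start + 1)
    return indices


print(find_all_occurrences_sketch("banana", "ana"))
```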

flatten(args: Any) → tuple[Any, ...]¶

Flatten a nested structure into a single tuple.

Parameters:

args (Any) – The nested structure to flatten.

Returns:

A flattened tuple containing all elements.

Return type:

tuple
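Flattening of this kind is usually done recursively. The sketch below assumes that only lists and tuples are descended into and that strings count as single elements; flatten_sketch is a hypothetical name.

```python
def flatten_sketch(args):
    """Recursively flatten nested lists and tuples into one tuple."""
    if isinstance(args, (list, tuple)):
        result = ()
        for item in args:
            result += flatten_sketch(item)
        return result
    # anything else (numbers, strings, ...) is treated as a single element
    return (args,)


print(flatten_sketch([1, [2, (3, 4)], "ab"]))
```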

generate_combination_cascade(elements: Iterable, n: int) → Iterable[Iterable]¶

Generate all combinations of elements from size 1 to n.

Parameters:

elements (iterable) – Input elements to combine.
n (int) – Maximum size of combinations.

Returns:

List of combinations of elements.

Return type:

list of list
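The cascade can be sketched with itertools.combinations, looping over sizes 1 to n. The output order (by size, then in input order) is an assumption, and combination_cascade_sketch is a hypothetical name.

```python
from itertools import combinations


def combination_cascade_sketch(elements, n):
    """List all combinations of sizes 1..n as lists."""
    items = list(elements)
    return [
        list(combo)
        for size in range(1, n + 1)
        for combo in combinations(items, size)
    ]


print(combination_cascade_sketch("abc", 2))
```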

generate_random_rgb() → tuple[float, float, float]¶

Generate a random RGB color.

Returns:

A tuple of three floats representing an RGB color, with each component in the range [0, 1].

Return type:

tuple of float

identify_df_duplicates(df: DataFrame, column_name: str, keep: Literal['first', 'last'] = 'last') → tuple[DataFrame, DataFrame]¶

Identify and separate duplicate rows in a DataFrame based on a column.

Parameters:

df (pandas.DataFrame) – Input DataFrame.
column_name (str) – Column to check for duplicates.
keep ({'first', 'last'}, default='last') – Which duplicate to keep.

Returns:

Cleaned DataFrame and DataFrame of removed duplicates.

Return type:

tuple of pandas.DataFrame

insert_string_piece(text: str, substring: str, index: int) → str¶

Insert a substring into a string at a specified index.

Parameters:

text (str) – Original string.
substring (str) – Substring to insert.
index (int) – Index at which to insert the substring.

Returns:

Modified string.

Return type:

str

Raises:

ValueError – If index is less than 0.
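A minimal sketch of the insertion using slicing, including the documented ValueError for negative indices. How indices beyond the end of the string are handled is not documented; here they simply append. insert_string_piece_sketch is a hypothetical name.

```python
def insert_string_piece_sketch(text, substring, index):
    """Insert substring into text before position index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    # slicing tolerates index > len(text), in which case substring is appended
    return text[:index] + substring + text[index:]


print(insert_string_piece_sketch("CCO", "(=O)", 2))
```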

loadingbar(count: int, total: int, size: int) → None¶

Display a loading bar to indicate progress.

Parameters:

count (int) – Current iteration.
total (int) – Total number of iterations.
size (int) – Length of the loading bar.

Return type:

None

make_rgb_transparent(fg_rgb: tuple[float, float, float], bg_rgb: tuple[float, float, float] = (1.0, 1.0, 1.0), alpha: float = 0.5) → tuple[float, float, float]¶

Apply transparency to a foreground RGB color over a background color.

Parameters:

fg_rgb (tuple of float) – Foreground RGB color in the range [0, 1].
bg_rgb (tuple of float, optional) – Background RGB color in the range [0, 1]. Default is white (1.0, 1.0, 1.0).
alpha (float, default=0.5) – Transparency level, where 0 is fully transparent and 1 is fully opaque.

Returns:

RGB color after applying transparency.

Return type:

tuple of float
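Applying transparency over a background is commonly done with per-channel alpha blending, alpha * fg + (1 - alpha) * bg. The sketch below assumes that formula; blend_sketch is a hypothetical name and not the library's implementation.

```python
def blend_sketch(fg_rgb, bg_rgb=(1.0, 1.0, 1.0), alpha=0.5):
    """Alpha-blend a foreground colour over a background, channel by channel."""
    return tuple(
        alpha * fg + (1 - alpha) * bg
        for fg, bg in zip(fg_rgb, bg_rgb)
    )


# pure red at 50 % opacity over white gives a light red
print(blend_sketch((1.0, 0.0, 0.0)))
```

With alpha = 1 the foreground is returned unchanged, and with alpha = 0 the result equals the background.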

merge_dicts_with_duplicates(dict1: dict[str, Any], dict2: dict[str, Any]) → dict[str, Any]¶

Merge two dictionaries, renaming duplicate keys from the second dictionary.

Parameters:

dict1 (dict of str to Any) – First dictionary.
dict2 (dict of str to Any) – Second dictionary.

Returns:

Merged dictionary with unique keys.

Return type:

dict of str to Any

normalise_iterable(values: Iterable[float | int]) → list[float]¶

Normalise values in an iterable so that the maximum absolute value is 1.

Parameters:

values (iterable of float or int) – Iterable of numerical values.

Returns:

Normalised values.

Return type:

list of float
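A minimal sketch of the scaling: every value is divided by the largest absolute value, so signs are preserved. Returning zeros for an all-zero input is an assumption, and normalise_sketch is a hypothetical name.

```python
def normalise_sketch(values):
    """Scale values so that the largest absolute value becomes 1."""
    values = list(values)
    peak = max(abs(value) for value in values)
    if peak == 0:
        # all-zero input: nothing to scale (assumed behaviour)
        return [0.0 for _ in values]
    return [value / peak for value in values]


print(normalise_sketch([2, -4, 1]))
```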

prepare_dataframe(df: DataFrame, dir_name: str) → DataFrame¶

Add a ‘MOLFILE’ column to a DataFrame with sorted file paths.

Parameters:

df (pandas.DataFrame) – Input DataFrame.
dir_name (str) – Directory containing the files.

Returns:

DataFrame with the added ‘MOLFILE’ column.

Return type:

pandas.DataFrame

prepare_datatable(df: DataFrame, height: int = 500, width: int = 866) → DataTable¶

Create a Bokeh DataTable from a DataFrame.

Parameters:

df (pandas.DataFrame) – Input DataFrame.
height (int, default=500) – Height of the DataTable.
width (int, default=866) – Width of the DataTable. The default corresponds to int(650 / 0.75).

Returns:

Bokeh DataTable object.

Return type:

bokeh.models.DataTable

process_custom_string(s: str, target_substring: str, replacement_list: list[str], separator: str = ';') → str¶

Process a string by replacing a target substring with elements from a list and formatting the result.

Parameters:

s (str) – The original string.
target_substring (str) – The substring to replace.
replacement_list (list of str) – List of strings to replace the target substring.
separator (str, default=';') – Separator used in formatting the result.

Returns:

The processed and formatted string.

Return type:

str

reset_string(input_string: str) → str¶

Remove punctuation and convert a string to lowercase.

Parameters:

input_string (str) – Input string.

Returns:

Processed string.

Return type:

str

show_png(data: bytes) → Image¶

Display a PNG image from binary data.

Parameters:

data (bytes) – Binary data of the PNG image.

Returns:

PIL Image object.

Return type:

Image.Image

size_ratio(size1: int, size2: int) → float¶

Compute a ratio based on the relative sizes of two values.

Parameters:

size1 (int) – First size.
size2 (int) – Second size.

Returns:

Computed ratio.

Return type:

float

sort_list_by_other_list(list_1: list, list_2: list[int | float]) → tuple[list[str], list[int | float]]¶

Sort one list based on the absolute values of another list.

Parameters:

list_1 (list) – List of elements to sort.
list_2 (list of int or float) – List of values used to determine sort order.

Returns:

Sorted list of elements and corresponding sorted values.

Return type:

tuple of list
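The paired sort can be sketched by zipping the two lists and sorting on the absolute value of the second. The description does not state the sort direction; descending order is assumed here and exposed as a hypothetical descending flag. sort_by_abs_sketch is likewise a hypothetical name.

```python
def sort_by_abs_sketch(list_1, list_2, descending=True):
    """Sort both lists together by the absolute values in list_2."""
    pairs = sorted(
        zip(list_1, list_2),
        key=lambda pair: abs(pair[1]),
        reverse=descending,  # direction is an assumption
    )
    return [pair[0] for pair in pairs], [pair[1] for pair in pairs]


print(sort_by_abs_sketch(["a", "b", "c"], [0.1, -3, 2]))
```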

standardise_path(path: str) → str¶

Convert a Windows-style path to a standardised format.

Parameters:

path (str) – Path string with backslashes.

Returns:

Path string with forward slashes.

Return type:

str

suppress_warnings() → None¶

Suppress all warnings in the current Python session.

Return type:

None

try_except(func: Callable[[], Any], exc: Any = None) → Any¶

Execute a function and return its result or a fallback value on exception.

Parameters:

func (callable) – Function to execute.
exc (any, optional) – Value to return if an exception occurs. Default is None.

Returns:

Result of the function or fallback value.

Return type:

any
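The pattern amounts to wrapping a zero-argument callable in a try block. The sketch below assumes that any Exception subclass triggers the fallback; try_except_sketch is a hypothetical name.

```python
def try_except_sketch(func, exc=None):
    """Call func() and return its result, or exc if it raises."""
    try:
        return func()
    except Exception:
        # which exception types are caught is an assumption
        return exc


print(try_except_sketch(lambda: int("42")))
print(try_except_sketch(lambda: int("not a number"), exc=-1))
```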

visualise_colour(rgb_tuple: tuple[float, float, float]) → None¶

Display a single RGB color.

Parameters:

rgb_tuple (tuple of float) – RGB values in the range [0, 1].

Return type:

None

visualise_colour_grid(colour_dictionary: dict[str, tuple[float, float, float]], save: bool = False, filename: str = '', figsize: tuple[int, int] = (20, 20)) → None¶

Display a grid of RGB colors.

Parameters:

colour_dictionary (dict of str to tuple of float) – Dictionary mapping color names to RGB tuples.
save (bool, default=False) – Whether to save the figure.
filename (str, optional) – Filename to save the figure if save is True.
figsize (tuple of int, default=(20, 20)) – Size of the figure.

Return type:

None

mlchem.importables module¶

Useful collections to perform various tasks:

metal_list: list of all metals in SMILES notation.
chemical_dictionary: dictionary of the form {‘fragment’: int}, used to build a bag of fragments in chem.manipulation.
colour_dictionary: dictionary of RGB tuples for a number of predefined colours.
chemotype_dictionary: collection of pattern recognition functions used by chem.calculator.descriptors.get_chemotypes().
bokeh_dictionary: collection of predefined Bokeh plotting parameters.
bokeh_tooltips: predefined HTML Bokeh tooltips for interactively visualising chemical space.
interpretable_descriptors_rdkit: list of RDKit descriptors with a simple meaning.
interpretable_descriptors_mordred: list of Mordred descriptors with a simple meaning.
similarity_metric_dictionary: dictionary of the form {metric_name: func} collecting various similarity metric functions (the functions can also be called from the mlchem.metrics module).

mlchem.metrics module¶

AllBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the AllBit similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.AllBitSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The AllBit similarity, considering both on and off bits in the fingerprints.

Return type:

float

AsymmetricSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Asymmetric similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.AsymmetricSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Asymmetric similarity, emphasizing features present in the first fingerprint.

Return type:

float

BraunBlanquetSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Braun-Blanquet similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.BraunBlanquetSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Braun-Blanquet similarity, calculated as the intersection over the maximum bit count.

Return type:

float
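The "intersection over the maximum bit count" can be illustrated on plain Python sets of on-bit indices, without RDKit. The set representation and the name braun_blanquet_sketch are assumptions made for illustration.

```python
def braun_blanquet_sketch(on_bits_1, on_bits_2):
    """Shared on bits divided by the larger of the two on-bit counts."""
    largest = max(len(on_bits_1), len(on_bits_2))
    if largest == 0:
        # two empty fingerprints: returning 0.0 is an assumption
        return 0.0
    return len(on_bits_1 & on_bits_2) / largest


# two shared bits, larger fingerprint has four on bits
print(braun_blanquet_sketch({1, 2, 3, 4}, {3, 4, 5}))
```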

CosineSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Cosine similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.CosineSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Cosine similarity, measuring the cosine of the angle between two bit vectors.

Return type:

float
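For binary vectors the cosine of the angle reduces to the number of shared on bits divided by the geometric mean of the two on-bit counts. The sketch below shows this on plain sets of on-bit indices rather than RDKit objects; cosine_sketch is a hypothetical name.

```python
import math


def cosine_sketch(on_bits_1, on_bits_2):
    """Cosine similarity of two binary vectors given as sets of on-bit indices."""
    if not on_bits_1 or not on_bits_2:
        # undefined for an empty vector; returning 0.0 is an assumption
        return 0.0
    common = len(on_bits_1 & on_bits_2)
    return common / math.sqrt(len(on_bits_1) * len(on_bits_2))


print(cosine_sketch({1, 2, 3, 4}, {3, 4, 5}))
```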

DiceSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Dice similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.DiceSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Dice similarity coefficient, ranging from 0 (no similarity) to 1 (identical).

Return type:

float

FingerprintSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, metric: callable) → float¶

Compute the fingerprint similarity using a specified similarity metric.

This function is a shortcut for the RDKit method DataStructs.FingerprintSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.
metric (callable) – The similarity metric function to use.

Returns:

The fingerprint similarity between the two fingerprints.

Return type:

float

KulczynskiSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Kulczynski similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.KulczynskiSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Kulczynski similarity, a symmetric measure of bit overlap.

Return type:

float

McConnaugheySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the McConnaughey similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.McConnaugheySimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The McConnaughey similarity, a measure of structural similarity based on bit patterns.

Return type:

float

OffBitProjSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the OffBitProj similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.OffBitProjSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The OffBitProj similarity, based on the projection of off bits between fingerprints.

Return type:

float

OnBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the OnBit similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.OnBitSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The OnBit similarity, based on the number of bits set in both fingerprints.

Return type:

float

RogotGoldbergSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Rogot-Goldberg similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.RogotGoldbergSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Rogot-Goldberg similarity, a weighted measure of bit agreement.

Return type:

float

RusselSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Russel similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.RusselSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Russel similarity, based on the on bits shared by both fingerprints relative to the fingerprint length.

Return type:

float

SokalSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Sokal similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.SokalSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Sokal similarity coefficient, a normalized measure of bit overlap.

Return type:

float

TanimotoSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) → float¶

Compute the Tanimoto similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.TanimotoSimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.

Returns:

The Tanimoto similarity coefficient, commonly used for chemical structure comparison.

Return type:

float

TverskySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, a: float = 0.5, b: float = 0.5) → float¶

Compute the Tversky similarity between two fingerprints.

This function is a shortcut for the RDKit method DataStructs.TverskySimilarity.

Parameters:

fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.
a (float, optional) – Weight for features in fp1. Default is 0.5.
b (float, optional) – Weight for features in fp2. Default is 0.5.

Returns:

The Tversky similarity between the two fingerprints.

Return type:

float
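The role of the weights a and b can be illustrated on plain Python sets of on-bit indices. The sketch assumes the standard Tversky index formula and does not call RDKit; tversky_sketch and the example sets are hypothetical.

```python
def tversky_sketch(on_bits_1, on_bits_2, a=0.5, b=0.5):
    """Tversky index: common / (common + a * only_in_1 + b * only_in_2)."""
    common = len(on_bits_1 & on_bits_2)
    only_1 = len(on_bits_1 - on_bits_2)
    only_2 = len(on_bits_2 - on_bits_1)
    denominator = common + a * only_1 + b * only_2
    # a zero denominator (e.g. two empty sets) maps to 0.0 by assumption
    return common / denominator if denominator else 0.0


fp1, fp2 = {1, 2, 3, 4}, {3, 4, 5}
print(tversky_sketch(fp1, fp2, a=1.0, b=1.0))  # same value as Tanimoto
print(tversky_sketch(fp1, fp2))                # same value as Dice
```

With a = b = 1 the index reduces to the Tanimoto coefficient, and with the default a = b = 0.5 it reduces to the Dice coefficient. Unequal weights make the measure asymmetric in fp1 and fp2.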

get_geometric_S(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) → float¶

Compute the geometric mean of sensitivity and specificity.

Parameters:

y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The geometric mean of sensitivity and specificity.

Return type:

float
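For a binary problem, the quantity is sqrt(sensitivity * specificity), with sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below assumes that binary case with a hypothetical positive argument; it does not model the labels parameter, and the function names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def geometric_s_sketch(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (both classes must occur)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(geometric_s_sketch(y_true, y_pred))
```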

get_mcc(y_true: Iterable[int | str], y_pred: Iterable[int | str]) → float¶

Compute the Matthews Correlation Coefficient (MCC).

Parameters:

y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.

Returns:

The Matthews Correlation Coefficient.

Return type:

float
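For binary labels the coefficient is (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that formula directly; the positive argument, the zero-denominator fallback, and the name mcc_sketch are assumptions.

```python
import math


def mcc_sketch(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # degenerate confusion matrix; 0.0 is an assumed convention
        return 0.0
    return (tp * tn - fp * fn) / denominator


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(mcc_sketch(y_true, y_pred))
```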

get_r2(y_true: Iterable[float | int], y_pred: Iterable[float | int]) → float¶

Compute the R-squared value using Pearson’s correlation coefficient.

Parameters:

y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.

Returns:

The R-squared value.

Return type:

float
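An R-squared value based on Pearson's correlation coefficient is the square of r between true and predicted values. The sketch below computes r by hand under that reading; r2_pearson_sketch is a hypothetical name and inputs with zero variance are not handled.

```python
import math


def r2_pearson_sketch(y_true, y_pred):
    """Square of Pearson's r between y_true and y_pred."""
    y_true, y_pred = list(y_true), list(y_pred)
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    covariance = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = covariance / math.sqrt(var_t * var_p)
    return r * r


# predictions are exactly twice the true values, yet r squared is still 1
print(r2_pearson_sketch([1, 2, 3, 4], [2, 4, 6, 8]))
```

A squared Pearson coefficient measures linear association only, so a systematic offset or scale error in the predictions does not lower it. This differs from the coefficient of determination computed from residuals.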

get_rmse(y_true: Iterable[float | int], y_pred: Iterable[float | int]) → float¶

Compute the root mean squared error (RMSE) of a prediction.

Parameters:

y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.

Returns:

The root mean squared error.

Return type:

float
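RMSE is the square root of the mean squared difference between true and predicted values. A minimal pure-Python sketch of that definition follows; rmse_sketch is a hypothetical name.

```python
import math


def rmse_sketch(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# one prediction is off by 2, so the mean squared error is 4 / 4 = 1
print(rmse_sketch([1, 2, 3, 4], [1, 2, 3, 6]))
```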

get_sensitivity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) → float¶

Compute the sensitivity (recall) of a prediction.

Parameters:

y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The sensitivity (recall) of the prediction.

Return type:

float

get_specificity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) → float¶

Compute the specificity of a prediction.

Parameters:

y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.

Returns:

The specificity of the prediction.

Return type:

float

rmse_to_std_ratio(y_true: Iterable[float | int], y_pred: Iterable[float | int]) → float¶

Compute the ratio of the standard deviation of true values to RMSE.

Parameters:

y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.

Returns:

The ratio of standard deviation to RMSE.

Return type:

float
