chem and ml subpackages¶
mlchem.helper module¶
- add_inchi_to_dataframe(df: DataFrame, loc: int, smiles_column_name: str) DataFrame¶
Add an InChI column to a DataFrame by converting SMILES strings.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
loc (int) – Column index to insert the InChI column.
smiles_column_name (str) – Name of the column containing SMILES strings.
- Returns:
DataFrame with the added InChI column.
- Return type:
pandas.DataFrame
- assign_sign(x: float | int) str¶
Return the sign of a number as ‘+’ or ‘-’.
- Parameters:
x (float or int) – Input value.
- Returns:
‘+’ if x is non-negative, ‘-’ otherwise.
- Return type:
str
- bokeh_plot(p: figure, classnames: list[str], dict_datatables: dict[str, DataTable]) None¶
Display a Bokeh plot with associated DataTables.
- Parameters:
p (bokeh.plotting.Figure) – Bokeh plot to display.
classnames (list of str) – List of class names.
dict_datatables (dict of str to DataTable) – Mapping of class names to DataTables.
- Return type:
None
- compute_alpha(size: int) float¶
Compute transparency value based on sample size.
- Parameters:
size (int) – Sample size.
- Returns:
Computed alpha value.
- Return type:
float
- convert_rgb(rgb_tuple: tuple[int, int, int], mode: Literal['normalise', 'denormalise']) tuple[float, float, float] | tuple[int, int, int]¶
Convert RGB values between 0-255 and 0-1 ranges.
- Parameters:
rgb_tuple (tuple of int) – RGB values in the form (R, G, B).
mode ({'normalise', 'denormalise'}) – Conversion mode. ‘normalise’ converts from 0-255 to 0-1, ‘denormalise’ converts from 0-1 to 0-255.
- Returns:
Converted RGB values.
- Return type:
tuple of float or tuple of int
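The two modes can be illustrated with a minimal sketch. `convert_rgb_sketch` is a hypothetical stand-in written from the description above, not the mlchem implementation; rounding to the nearest integer when denormalising is an assumption.

```python
def convert_rgb_sketch(rgb_tuple, mode):
    """Hypothetical stand-in for convert_rgb, based only on its description."""
    if mode == 'normalise':
        # 0-255 integers -> 0-1 floats
        return tuple(c / 255 for c in rgb_tuple)
    if mode == 'denormalise':
        # 0-1 floats -> 0-255 integers (rounding is an assumption)
        return tuple(round(c * 255) for c in rgb_tuple)
    raise ValueError("mode must be 'normalise' or 'denormalise'")


print(convert_rgb_sketch((255, 0, 51), 'normalise'))       # (1.0, 0.0, 0.2)
print(convert_rgb_sketch((1.0, 0.0, 0.2), 'denormalise'))  # (255, 0, 51)
```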
- convert_size(size: tuple[float, float] | None = None, pixel_size: tuple[int, int] | None = None, dpi: int = 100) tuple[int, int] | tuple[float, float]¶
Convert between size in inches and size in pixels.
- Parameters:
size (tuple of float, optional) – Size in inches as (width, height).
pixel_size (tuple of int, optional) – Size in pixels as (width, height).
dpi (int, default=100) – Dots per inch used for conversion.
- Returns:
Converted size in pixels if size is provided, or in inches if pixel_size is provided.
- Return type:
tuple of int or tuple of float
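The conversion amounts to multiplying or dividing by dpi. The sketch below is a hypothetical re-implementation from the description, not the mlchem source; giving size priority when both arguments are passed, and rounding pixel values, are assumptions.

```python
def convert_size_sketch(size=None, pixel_size=None, dpi=100):
    """Hypothetical stand-in for convert_size, based only on its description."""
    if size is not None:
        # inches -> pixels (rounding to whole pixels is an assumption)
        return tuple(round(v * dpi) for v in size)
    if pixel_size is not None:
        # pixels -> inches
        return tuple(v / dpi for v in pixel_size)
    raise ValueError("provide either size or pixel_size")


print(convert_size_sketch(size=(6.4, 4.8)))                   # (640, 480)
print(convert_size_sketch(pixel_size=(1280, 960), dpi=200))   # (6.4, 4.8)
```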
- count_features(list_features: Iterable[str]) int¶
Count the total number of features, including interaction terms.
- Parameters:
list_features (iterable of str) – List of feature names, possibly including interaction terms.
- Returns:
Total number of features.
- Return type:
int
Example
>>> count_features(['a', 'b', 'a b'])
4
>>> count_features(['a', 'a b', 'c^2'])
5
>>> count_features(['a', 'b', 'c', 'c a', 'c b^2', 'a^3'])
11
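The documented results are consistent with summing the degree of every term: each space-separated factor counts once, or k times when written as `name^k`. The sketch below encodes that reading; it is inferred from the examples only, and `count_features_sketch` is a hypothetical name, not the mlchem function.

```python
def count_features_sketch(list_features):
    """Sum the degree of every term (rule inferred from the documented examples)."""
    total = 0
    for term in list_features:
        for factor in term.split():
            # 'b^2' -> power '2'; a bare name such as 'a' has no power part
            _, _, power = factor.partition('^')
            total += int(power) if power else 1
    return total


print(count_features_sketch(['a', 'b', 'a b']))                        # 4
print(count_features_sketch(['a', 'a b', 'c^2']))                      # 5
print(count_features_sketch(['a', 'b', 'c', 'c a', 'c b^2', 'a^3']))   # 11
```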
- create_mask(array: ndarray, lower: float | int, upper: float | int) ndarray¶
Create a boolean mask for values within a specified range.
- Parameters:
array (numpy.ndarray) – Input array.
lower (float or int) – Lower bound.
upper (float or int) – Upper bound.
- Returns:
Boolean mask array.
- Return type:
numpy.ndarray
- create_progressive_column_names(serial_name: str, n: int) list[str]¶
Generate a list of sequential column names.
- Parameters:
serial_name (str) – Base name for the columns.
n (int) – Number of columns to generate.
- Returns:
List of column names.
- Return type:
list of str
- create_smooth_gradient_circle(radius: int, color: tuple[int, int, int], alpha: float) Image¶
Create a smooth gradient circle with transparency.
- Parameters:
radius (int) – Radius of the circle.
color (tuple of int) – Base RGB color in the range [0, 255].
alpha (float) – Transparency level in the range [0, 1].
- Returns:
PIL Image object containing the gradient circle.
- Return type:
Image.Image
- create_structure_files(df: DataFrame, structure_column_name: str, folder_name: str) None¶
Create PNG structure files for molecules in a DataFrame.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
structure_column_name (str) – Column containing molecular structures.
folder_name (str) – Folder to save the PNG images.
- Return type:
None
- dfs_to_excel(file_name: str, dfs: Iterable[DataFrame], sheet_names: Iterable[str]) None¶
Write multiple DataFrames to an Excel file, each on a separate sheet.
- Parameters:
file_name (str) – Name of the Excel file.
dfs (iterable of pandas.DataFrame) – DataFrames to write.
sheet_names (iterable of str) – Names of the sheets.
- Return type:
None
- find_all_occurrences(text: str, substring: str) list[int]¶
Find all starting indices of a substring in a text.
- Parameters:
text (str) – Text to search.
substring (str) – Substring to find.
- Returns:
List of starting indices where the substring occurs.
- Return type:
list of int
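A minimal sketch of such a search using only `str.find`. Whether the library counts overlapping matches is not stated; this hypothetical version does, and it rejects an empty substring.

```python
def find_all_occurrences_sketch(text, substring):
    """Hypothetical stand-in: start indices of every match, overlaps included."""
    if not substring:
        raise ValueError("substring must be non-empty")
    indices = []
    start = text.find(substring)
    while start != -1:
        indices.append(start)
        # Advance by one character so overlapping matches are found (assumption)
        start = text.find(substring, start + 1)
    return indices


print(find_all_occurrences_sketch('CC(=O)OC', 'C'))  # [0, 1, 7]
print(find_all_occurrences_sketch('banana', 'ana'))  # [1, 3]
```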
- flatten(args: Any) tuple[Any, ...]¶
Flatten a nested structure into a single tuple.
- Parameters:
args (Any) – The nested structure to flatten.
- Returns:
A flattened tuple containing all elements.
- Return type:
tuple
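One common way to flatten arbitrary nesting is recursion over iterables. The sketch below is an assumption about the behaviour, not the mlchem source; in particular, treating strings and bytes as single elements is a choice made here.

```python
from collections.abc import Iterable


def flatten_sketch(args):
    """Hypothetical stand-in: recursively flatten nested iterables into a tuple."""
    # Strings are iterable but are kept whole here (assumption)
    if isinstance(args, (str, bytes)) or not isinstance(args, Iterable):
        return (args,)
    flat = []
    for item in args:
        flat.extend(flatten_sketch(item))
    return tuple(flat)


print(flatten_sketch([1, (2, [3, 'ab']), 4]))  # (1, 2, 3, 'ab', 4)
```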
- generate_combination_cascade(elements: Iterable, n: int) Iterable[Iterable]¶
Generate all combinations of elements from size 1 to n.
- Parameters:
elements (iterable) – Input elements to combine.
n (int) – Maximum size of combinations.
- Returns:
List of combinations of elements.
- Return type:
list of list
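The cascade can be sketched with `itertools.combinations`, iterating over sizes 1 to n. The ordering of the output (by size, then by input order) is an assumption, and `generate_combination_cascade_sketch` is a hypothetical name.

```python
from itertools import combinations


def generate_combination_cascade_sketch(elements, n):
    """Hypothetical stand-in: all combinations of sizes 1..n as lists."""
    pool = list(elements)
    return [list(combo)
            for k in range(1, n + 1)
            for combo in combinations(pool, k)]


print(generate_combination_cascade_sketch(['a', 'b', 'c'], 2))
# [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c']]
```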
- generate_random_rgb() tuple[float, float, float]¶
Generate a random RGB color.
- Returns:
A tuple of three floats representing an RGB color, with each component in the range [0, 1].
- Return type:
tuple of float
- identify_df_duplicates(df: DataFrame, column_name: str, keep: Literal['first', 'last'] = 'last') tuple[DataFrame, DataFrame]¶
Identify and separate duplicate rows in a DataFrame based on a column.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
column_name (str) – Column to check for duplicates.
keep ({'first', 'last'}, default='last') – Which duplicate to keep.
- Returns:
Cleaned DataFrame and DataFrame of removed duplicates.
- Return type:
tuple of pandas.DataFrame
- insert_string_piece(text: str, substring: str, index: int) str¶
Insert a substring into a string at a specified index.
- Parameters:
text (str) – Original string.
substring (str) – Substring to insert.
index (int) – Index at which to insert the substring.
- Returns:
Modified string.
- Return type:
str
- Raises:
ValueError – If index is less than 0.
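A minimal sketch of the insertion with slicing, including the documented ValueError. The handling of an index past the end of the string (appending) follows Python slice semantics and is an assumption about the library.

```python
def insert_string_piece_sketch(text, substring, index):
    """Hypothetical stand-in for insert_string_piece, based on its description."""
    if index < 0:
        raise ValueError("index must be non-negative")
    # Slicing past the end simply appends the substring (assumption)
    return text[:index] + substring + text[index:]


print(insert_string_piece_sketch('CCO', '(=O)', 2))  # CC(=O)O
```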
- loadingbar(count: int, total: int, size: int) None¶
Display a loading bar to indicate progress.
- Parameters:
count (int) – Current iteration.
total (int) – Total number of iterations.
size (int) – Length of the loading bar.
- Return type:
None
- make_rgb_transparent(fg_rgb: tuple[float, float, float], bg_rgb: tuple[float, float, float] = (1.0, 1.0, 1.0), alpha: float = 0.5) tuple[float, float, float]¶
Apply transparency to a foreground RGB color over a background color.
- Parameters:
fg_rgb (tuple of float) – Foreground RGB color in the range [0, 1].
bg_rgb (tuple of float, optional) – Background RGB color in the range [0, 1]. Default is white (1.0, 1.0, 1.0).
alpha (float, default=0.5) – Transparency level, where 0 is fully transparent and 1 is fully opaque.
- Returns:
RGB color after applying transparency.
- Return type:
tuple of float
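The description matches standard alpha compositing, where each channel is `alpha * fg + (1 - alpha) * bg`. The sketch below assumes that formula; it is not taken from the mlchem source.

```python
def make_rgb_transparent_sketch(fg_rgb, bg_rgb=(1.0, 1.0, 1.0), alpha=0.5):
    """Hypothetical stand-in: blend fg over bg, channel by channel."""
    # alpha = 1 returns the foreground, alpha = 0 returns the background
    return tuple(alpha * fg + (1 - alpha) * bg
                 for fg, bg in zip(fg_rgb, bg_rgb))


print(make_rgb_transparent_sketch((1.0, 0.0, 0.0)))  # (1.0, 0.5, 0.5)
```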
- merge_dicts_with_duplicates(dict1: dict[str, Any], dict2: dict[str, Any]) dict[str, Any]¶
Merge two dictionaries, renaming duplicate keys from the second dictionary.
- Parameters:
dict1 (dict of str to Any) – First dictionary.
dict2 (dict of str to Any) – Second dictionary.
- Returns:
Merged dictionary with unique keys.
- Return type:
dict of str to Any
- normalise_iterable(values: Iterable[float | int]) list[float]¶
Normalise values in an iterable so that the maximum absolute value is 1.
- Parameters:
values (iterable of float or int) – Iterable of numerical values.
- Returns:
Normalised values.
- Return type:
list of float
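A minimal sketch of this scaling: divide every value by the largest absolute value. Returning zeros when every input is zero is an assumption made here to avoid dividing by zero; the library's behaviour in that case is not documented.

```python
def normalise_iterable_sketch(values):
    """Hypothetical stand-in: scale values so that max(|v|) == 1."""
    values = list(values)
    peak = max(abs(v) for v in values)
    if peak == 0:
        # All-zero input: nothing to scale (assumption)
        return [0.0 for _ in values]
    return [v / peak for v in values]


print(normalise_iterable_sketch([2, -4, 1]))  # [0.5, -1.0, 0.25]
```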
- prepare_dataframe(df: DataFrame, dir_name: str) DataFrame¶
Add a ‘MOLFILE’ column to a DataFrame with sorted file paths.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
dir_name (str) – Directory containing the files.
- Returns:
DataFrame with the added ‘MOLFILE’ column.
- Return type:
pandas.DataFrame
- prepare_datatable(df: DataFrame, height: int = 500, width: int = 866) DataTable¶
Create a Bokeh DataTable from a DataFrame.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
height (int, default=500) – Height of the DataTable.
width (int, default=866) – Width of the DataTable, computed as int(650 / 0.75).
- Returns:
Bokeh DataTable object.
- Return type:
bokeh.models.DataTable
- process_custom_string(s: str, target_substring: str, replacement_list: list[str], separator: str = ';') str¶
Process a string by replacing a target substring with elements from a list and formatting the result.
- Parameters:
s (str) – The original string.
target_substring (str) – The substring to replace.
replacement_list (list of str) – List of strings to replace the target substring.
separator (str, default=';') – Separator used in formatting the result.
- Returns:
The processed and formatted string.
- Return type:
str
- reset_string(input_string: str) str¶
Remove punctuation and convert a string to lowercase.
- Parameters:
input_string (str) – Input string.
- Returns:
Processed string.
- Return type:
str
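One way to obtain this behaviour uses `str.translate` with `string.punctuation`. Which characters the library treats as punctuation is not documented, so the ASCII set used here is an assumption.

```python
import string


def reset_string_sketch(input_string):
    """Hypothetical stand-in: drop ASCII punctuation, then lowercase."""
    table = str.maketrans('', '', string.punctuation)
    return input_string.translate(table).lower()


print(reset_string_sketch('Hello, World!'))  # hello world
```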
- show_png(data: bytes) Image¶
Display a PNG image from binary data.
- Parameters:
data (bytes) – Binary data of the PNG image.
- Returns:
PIL Image object.
- Return type:
Image.Image
- size_ratio(size1: int, size2: int) float¶
Compute a ratio based on the relative sizes of two values.
- Parameters:
size1 (int) – First size.
size2 (int) – Second size.
- Returns:
Computed ratio.
- Return type:
float
- sort_list_by_other_list(list_1: list, list_2: list[int | float]) tuple[list[str], list[int | float]]¶
Sort one list based on the absolute values of another list.
- Parameters:
list_1 (list) – List of elements to sort.
list_2 (list of int or float) – List of values used to determine sort order.
- Returns:
Sorted list of elements and corresponding sorted values.
- Return type:
tuple of list
- standardise_path(path: str) str¶
Convert a Windows-style path to a standardised format.
- Parameters:
path (str) – Path string with backslashes.
- Returns:
Path string with forward slashes.
- Return type:
str
- suppress_warnings() None¶
Suppress all warnings in the current Python session.
- Return type:
None
- try_except(func: Callable[[], Any], exc: Any = None) Any¶
Execute a function and return its result or a fallback value on exception.
- Parameters:
func (callable) – Function to execute.
exc (any, optional) – Value to return if an exception occurs. Default is None.
- Returns:
Result of the function or fallback value.
- Return type:
any
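The pattern can be sketched as a thin wrapper around try/except. Catching `Exception` (rather than every `BaseException`) is an assumption, and `try_except_sketch` is a hypothetical name.

```python
def try_except_sketch(func, exc=None):
    """Hypothetical stand-in: call func(), return exc if it raises."""
    try:
        return func()
    except Exception:  # breadth of the caught exceptions is an assumption
        return exc


print(try_except_sketch(lambda: int('12')))        # 12
print(try_except_sketch(lambda: int('x'), exc=0))  # 0
```

Because func takes no arguments, a lambda is a convenient way to defer a call that may fail, for example when parsing one value per row of a column.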
- visualise_colour(rgb_tuple: tuple[float, float, float]) None¶
Display a single RGB color.
- Parameters:
rgb_tuple (tuple of float) – RGB values in the range [0, 1].
- Return type:
None
- visualise_colour_grid(colour_dictionary: dict[str, tuple[float, float, float]], save: bool = False, filename: str = '', figsize: tuple[int, int] = (20, 20)) None¶
Display a grid of RGB colors.
- Parameters:
colour_dictionary (dict of str to tuple of float) – Dictionary mapping color names to RGB tuples.
save (bool, default=False) – Whether to save the figure.
filename (str, optional) – Filename to save the figure if save is True.
figsize (tuple of int, default=(20, 20)) – Size of the figure.
- Return type:
None
mlchem.importables module¶
- Useful collections to perform various tasks:
metal_list: List with all metals in SMILES notation.
chemical_dictionary: dictionary in the form {‘fragment’: int} to create a bag of fragments useful in chem.manipulation.
colour_dictionary: dictionary of RGB tuples for a number of predefined colours.
chemotype_dictionary: collection of many pattern recognition functions that will be used by chem.calculator.descriptors.get_chemotypes().
bokeh_dictionary: collection of predefined bokeh plotting parameters.
bokeh_tooltips: predefined HTML bokeh tooltips to interactively visualise chemical space.
interpretable_descriptors_rdkit: list of rdkit descriptors with a simple meaning.
interpretable_descriptors_mordred: list of mordred descriptors with a simple meaning.
similarity_metric_dictionary: dictionary in the form {metric_name: func} collecting various similarity metric functions (functions can be called from mlchem.metrics module as well).
mlchem.metrics module¶
- AllBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the AllBit similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.AllBitSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The AllBit similarity, considering both on and off bits in the fingerprints.
- Return type:
float
- AsymmetricSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Asymmetric similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.AsymmetricSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Asymmetric similarity, calculated as the number of shared on bits divided by the smaller of the two on-bit counts.
- Return type:
float
- BraunBlanquetSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Braun-Blanquet similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.BraunBlanquetSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Braun-Blanquet similarity, calculated as the intersection over the maximum bit count.
- Return type:
float
- CosineSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Cosine similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.CosineSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Cosine similarity, measuring the cosine of the angle between two bit vectors.
- Return type:
float
- DiceSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Dice similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.DiceSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Dice similarity coefficient, ranging from 0 (no similarity) to 1 (identical).
- Return type:
float
- FingerprintSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, metric: callable) float¶
Compute the fingerprint similarity using a specified similarity metric.
This function is a shortcut for the RDKit method DataStructs.FingerprintSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.
metric (callable) – The similarity metric function to use.
- Returns:
The fingerprint similarity between the two fingerprints.
- Return type:
float
- KulczynskiSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Kulczynski similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.KulczynskiSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Kulczynski similarity, a symmetric measure of bit overlap.
- Return type:
float
- McConnaugheySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the McConnaughey similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.McConnaugheySimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The McConnaughey similarity, a measure of structural similarity based on bit patterns.
- Return type:
float
- OffBitProjSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the OffBitProj similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.OffBitProjSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The OffBitProj similarity, based on the projection of off bits between fingerprints.
- Return type:
float
- OnBitSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the OnBit similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.OnBitSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The OnBit similarity, based on the number of bits set in both fingerprints.
- Return type:
float
- RogotGoldbergSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Rogot-Goldberg similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.RogotGoldbergSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Rogot-Goldberg similarity, a weighted measure of bit agreement.
- Return type:
float
- RusselSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Russel similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.RusselSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Russel similarity between the two fingerprints.
- Return type:
float
- SokalSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Sokal similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.SokalSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Sokal similarity coefficient, a normalized measure of bit overlap.
- Return type:
float
- TanimotoSimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect) float¶
Compute the Tanimoto similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.TanimotoSimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first molecular fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second molecular fingerprint.
- Returns:
The Tanimoto similarity coefficient, commonly used for chemical structure comparison.
- Return type:
float
- TverskySimilarity(fp1: ExplicitBitVect, fp2: ExplicitBitVect, a: float = 0.5, b: float = 0.5) float¶
Compute the Tversky similarity between two fingerprints.
This function is a shortcut for the RDKit method DataStructs.TverskySimilarity.
- Parameters:
fp1 (DataStructs.cDataStructs.ExplicitBitVect) – The first fingerprint.
fp2 (DataStructs.cDataStructs.ExplicitBitVect) – The second fingerprint.
a (float, optional) – Weight for features in fp1. Default is 0.5.
b (float, optional) – Weight for features in fp2. Default is 0.5.
- Returns:
The Tversky similarity between the two fingerprints.
- Return type:
float
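The Tanimoto, Dice and Tversky coefficients reduce to set arithmetic on the on bits of the two fingerprints. The sketch below uses plain Python sets of on-bit indices in place of RDKit bit vectors, so it illustrates the formulas only and does not call RDKit; all function names are hypothetical, and returning 0.0 for two empty fingerprints is an assumption.

```python
def tanimoto_sketch(a, b):
    """|A & B| / |A | B| on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice_sketch(a, b):
    """2 |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tversky_sketch(a, b, alpha=0.5, beta=0.5):
    """|A & B| / (|A & B| + alpha |A - B| + beta |B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


fp1, fp2 = {1, 2, 3, 4}, {3, 4, 5}   # toy fingerprints: indices of on bits
print(tanimoto_sketch(fp1, fp2))     # 0.4
print(tversky_sketch(fp1, fp2, 1.0, 1.0))  # 0.4
```

With both weights set to 1 the Tversky index equals Tanimoto, and with the defaults of 0.5 it equals Dice, which is why it is often described as a generalisation of the two.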
- get_geometric_S(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float¶
Compute the geometric mean of sensitivity and specificity.
- Parameters:
y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.
- Returns:
The geometric mean of sensitivity and specificity.
- Return type:
float
- get_mcc(y_true: Iterable[int | str], y_pred: Iterable[int | str]) float¶
Compute the Matthews Correlation Coefficient (MCC).
- Parameters:
y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
- Returns:
The Matthews Correlation Coefficient.
- Return type:
float
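For a binary problem the coefficient follows directly from the confusion-matrix counts. The sketch below is a hand-rolled binary version for illustration, not the mlchem implementation; the `positive` argument and the 0.0 returned for a zero denominator are assumptions.

```python
from math import sqrt


def mcc_binary_sketch(y_true, y_pred, positive=1):
    """Hypothetical binary MCC computed from TP, TN, FP and FN counts."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A zero denominator is mapped to 0.0 (assumption)
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(mcc_binary_sketch(y_true, y_pred))  # 0.5
```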
- get_r2(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float¶
Compute the R-squared value using Pearson’s correlation coefficient.
- Parameters:
y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.
- Returns:
The R-squared value.
- Return type:
float
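An R-squared defined as the square of Pearson's r measures linear association and can differ from the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes both by hand to show the difference; the function names are hypothetical, and the coefficient of determination is included only for contrast.

```python
from math import sqrt


def r2_pearson_sketch(y_true, y_pred):
    """Square of Pearson's correlation coefficient (hypothetical stand-in)."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return (cov / sqrt(var_t * var_p)) ** 2


def r2_determination_sketch(y_true, y_pred):
    """Coefficient of determination, shown only for comparison."""
    y_true, y_pred = list(y_true), list(y_pred)
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true, y_pred = [1, 2, 3, 4], [2, 4, 6, 8]   # perfectly correlated, but biased
print(r2_pearson_sketch(y_true, y_pred))        # 1.0
print(r2_determination_sketch(y_true, y_pred))  # -5.0
```

A Pearson-based value close to 1 therefore does not by itself show that predictions are accurate, so it is usually read alongside an error measure such as RMSE.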
- get_rmse(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float¶
Compute the root mean squared error (RMSE) of a prediction.
- Parameters:
y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.
- Returns:
The root mean squared error.
- Return type:
float
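RMSE is the square root of the mean squared difference between true and predicted values. A minimal standard-library sketch follows; `rmse_sketch` is a hypothetical name, not the mlchem function.

```python
from math import sqrt


def rmse_sketch(y_true, y_pred):
    """Root mean squared error of paired true and predicted values."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(errors) / len(errors))


print(rmse_sketch([0, 0, 0, 0], [2, 2, 2, 2]))  # 2.0
```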
- get_sensitivity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float¶
Compute the sensitivity (recall) of a prediction.
- Parameters:
y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.
- Returns:
The sensitivity (recall) of the prediction.
- Return type:
float
- get_specificity(y_true: Iterable[int | str], y_pred: Iterable[int | str], labels: Iterable[int | str] | None = None) float¶
Compute the specificity of a prediction.
- Parameters:
y_true (Iterable[int or str]) – True labels.
y_pred (Iterable[int or str]) – Predicted labels.
labels (Iterable[int or str], optional) – Label names to include in the calculation. If None, all labels are used.
- Returns:
The specificity of the prediction.
- Return type:
float
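For a binary problem, sensitivity, specificity and their geometric mean follow from the confusion-matrix counts: TP / (TP + FN), TN / (TN + FP), and the square root of their product. The sketch below is a hand-rolled binary illustration with hypothetical names; it does not model the labels argument, and the `positive` parameter is an assumption.

```python
from math import sqrt


def sensitivity_specificity_sketch(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def geometric_s_sketch(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity_sketch(y_true, y_pred, positive)
    return sqrt(sens * spec)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(sensitivity_specificity_sketch(y_true, y_pred))  # (0.75, 0.75)
print(geometric_s_sketch(y_true, y_pred))              # 0.75
```

The geometric mean drops to zero as soon as either rate is zero, which makes it a stricter summary than the arithmetic mean on imbalanced data.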
- rmse_to_std_ratio(y_true: Iterable[float | int], y_pred: Iterable[float | int]) float¶
Compute the ratio of the standard deviation of true values to RMSE.
- Parameters:
y_true (Iterable[float or int]) – True values.
y_pred (Iterable[float or int]) – Predicted values.
- Returns:
The ratio of standard deviation to RMSE.
- Return type:
float