mlchem.chem.calculator package

Submodules

mlchem.chem.calculator.descriptors module

get_EHT_descriptors(mol_input: Mol, conf_id: int = -1) → dict

Calculate quantum chemistry descriptors using Extended Hückel Theory (EHT).

This function computes various quantum chemistry properties for a 3D-embedded molecule using RDKit’s EHT implementation. It includes orbital energies, overlap matrices, and Mulliken charges.

More information: https://dasher.wustl.edu/chem478/reading/extended-huckel-lowe.pdf

Parameters:
  • mol_input (rdkit.Chem.rdchem.Mol) – RDKit Mol object with at least one conformer.

  • conf_id (int, optional) – Conformer ID to use. Default is -1 (use the first conformer).

Returns:

Dictionary containing quantum chemistry descriptors:
  • AtomicCharges
  • Hamiltonian
  • OrbitalEnergies
  • OverlapMatrix
  • ReducedChargeMatrix
  • ReducedOverlapPopulationMatrix
  • FermiEnergy
  • NumElectrons
  • NumOrbitals
  • TotalEnergy

Return type:

dict

Raises:

ValueError – If the molecule has no conformers.

Examples

>>> get_EHT_descriptors(mol_with_conformer)

get_allDesc(mol_input_list: list[str | Mol] | ndarray[str | Mol], include_3D: bool = False) → DataFrame

Calculate both Mordred and RDKit descriptors for a list of molecules.

This function computes both Mordred and RDKit descriptors for each molecule in the input list. If include_3D is True, 3D descriptors are included in both sets.

Parameters:
  • mol_input_list (list or np.ndarray of str or rdkit.Chem.rdchem.Mol) – List or array of molecules in SMILES format or as RDKit Mol objects.

  • include_3D (bool, optional) – Whether to include 3D descriptors. Default is False.

Returns:

DataFrame containing the combined descriptors for each molecule.

Return type:

pd.DataFrame

Examples

>>> get_allDesc(["CCO", "c1ccccc1"], include_3D=True)

get_atomicDesc(mol_input: str | Mol, atom_index: int) → DataFrame

Calculate atomic descriptors for a specific atom in a molecule.

This function computes a comprehensive set of atomic-level descriptors for a given atom in a molecule. These include properties related to bond types, hybridisation, charges, ring membership, and statistics on neighbouring atoms up to the third order.

Parameters:
  • mol_input (str or rdkit.Chem.rdchem.Mol) – Molecule in SMILES format or as an RDKit Mol object.

  • atom_index (int) – Index of the atom for which descriptors are calculated.

Returns:

A DataFrame containing the descriptors for the specified atom.

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the molecule cannot be created from the input.

  • IndexError – If the atom index is out of bounds.

Examples

>>> get_atomicDesc("CC(=O)O", atom_index=1)

get_chemotypes(mol_input_list: list | ndarray[str | Mol], chemotype_dict: dict | None = None) → DataFrame

Identify chemotypes for a list of molecules.

This function applies a dictionary of chemotype definitions to each molecule in the input list. Each chemotype is defined by a function and its arguments. If no dictionary is provided, a default one is used.

Parameters:
  • mol_input_list (list or np.ndarray of str or rdkit.Chem.rdchem.Mol) – List or array of molecules in SMILES format or as RDKit Mol objects.

  • chemotype_dict (dict, optional) – Dictionary of chemotype definitions. Each key maps to a tuple of (function, argument_dict). If None, a default dictionary is used.

Returns:

DataFrame containing the identified chemotypes for each molecule.

Return type:

pd.DataFrame

Examples

>>> get_chemotypes(["CCO", "c1ccccc1"])
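
The (function, argument_dict) layout described above can be pictured with a toy dispatcher. This is only a sketch: contains_fragment and apply_chemotypes are hypothetical helpers, the substring test stands in for real substructure matching, and the calling convention func(molecule, **argument_dict) is an assumption rather than the documented behaviour of get_chemotypes.

```python
def contains_fragment(smiles: str, fragment: str) -> bool:
    """Toy check: plain substring match on the SMILES text (hypothetical helper)."""
    return fragment in smiles


# Each key maps to a (function, argument_dict) tuple, as in the parameter description.
chemotype_dict = {
    "alcohol_like": (contains_fragment, {"fragment": "O"}),
    "aromatic_ring": (contains_fragment, {"fragment": "c1"}),
}


def apply_chemotypes(smiles_list, definitions):
    """Apply every chemotype definition to every molecule (assumed calling convention)."""
    rows = []
    for smiles in smiles_list:
        row = {"molecule": smiles}
        for name, (func, kwargs) in definitions.items():
            row[name] = func(smiles, **kwargs)
        rows.append(row)
    return rows


for row in apply_chemotypes(["CCO", "c1ccccc1"], chemotype_dict):
    print(row)
```

The list of row dictionaries has one entry per molecule and one column per chemotype, which is the shape a DataFrame of results would take.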

get_fingerprint(mol_input: Mol | str, fp_type: Literal['m', 'ap', 'rk', 'tt', 'mac'] = 'm', radius: int = 2, nBits: int = 2048, include_chirality: bool = False, include_bit_info: bool = False) → tuple | Mol

Generate a molecular fingerprint using RDKit.

This function generates a fingerprint for a molecule using one of several RDKit-supported types. Optionally, bit information can be returned for interpretability.

Parameters:
  • mol_input (str or rdkit.Chem.rdchem.Mol) – Molecule in SMILES format or as an RDKit Mol object.

  • fp_type ({'m', 'ap', 'rk', 'tt', 'mac'}, optional) – Type of fingerprint to generate. Default is ‘m’.
      - ‘m’: Morgan
      - ‘ap’: Atom Pair
      - ‘rk’: RDKit
      - ‘tt’: Topological Torsion
      - ‘mac’: MACCS keys

  • radius (int, optional) – Radius or path length depending on fingerprint type. Default is 2.

  • nBits (int, optional) – Size of the fingerprint. Default is 2048.

  • include_chirality (bool, optional) – Whether to include chirality. Default is False.

  • include_bit_info (bool, optional) – Whether to return bit information. Default is False.

Returns:

Fingerprint of the molecule. If include_bit_info is True, returns a tuple (fingerprint, bit_info_dict).

Return type:

tuple or rdkit.DataStructs.cDataStructs.ExplicitBitVect

Examples

>>> get_fingerprint("CCO", fp_type='m', include_bit_info=True)

get_fingerprint_df(mol_input_list: list[str | Mol] | ndarray[str | Mol], fp_type: Literal['m', 'ap', 'rk', 'tt', 'mac'] = 'm', radius: int = 2, nBits: int = 2048, include_chirality: bool = False, include_bit_info: bool = False) → DataFrame | tuple[DataFrame, dict]

Generate a DataFrame of fingerprints for a list of molecules.

This function computes fingerprints for each molecule in the input list and returns them as a DataFrame. Optionally, bit information can also be returned.

Parameters:
  • mol_input_list (list or np.ndarray of str or rdkit.Chem.rdchem.Mol) – List or array of molecules in SMILES format or as RDKit Mol objects.

  • fp_type ({'m', 'ap', 'rk', 'tt', 'mac'}, optional) – Type of fingerprint to generate. Default is ‘m’.

  • radius (int, optional) – Radius or path length depending on fingerprint type. Default is 2.

  • nBits (int, optional) – Size of the fingerprint. Default is 2048.

  • include_chirality (bool, optional) – Whether to include chirality. Default is False.

  • include_bit_info (bool, optional) – Whether to return bit information. Default is False.

Returns:

DataFrame of fingerprints. If include_bit_info is True, also returns a dictionary of bit information.

Return type:

pd.DataFrame or tuple of (pd.DataFrame, dict)

Examples

>>> get_fingerprint_df(["CCO", "c1ccccc1"], fp_type='m')

get_mordredDesc(mol_input_list: list | ndarray[str | Mol], include_3D: bool = False) → DataFrame

Calculate Mordred descriptors for a list of molecules.

This function computes Mordred descriptors for each molecule in the input list. If include_3D is True, 3D descriptors are included.

Parameters:
  • mol_input_list (list or np.ndarray of str or rdkit.Chem.rdchem.Mol) – List or array of molecules in SMILES format or as RDKit Mol objects.

  • include_3D (bool, optional) – Whether to include 3D descriptors. Default is False.

Returns:

DataFrame containing the descriptors for each molecule.

Return type:

pd.DataFrame

Examples

>>> get_mordredDesc(["CCO", "c1ccccc1"], include_3D=True)

get_rdkitDesc(mol_input_list: Iterable[str | Mol], include_3D: bool = False) → DataFrame

Calculate RDKit descriptors for a list of molecules.

This function computes 2D descriptors for each molecule in the input list. If include_3D is True, it also calculates 3D descriptors and merges them with the 2D descriptors.

Parameters:
  • mol_input_list (Iterable[str or rdkit.Chem.rdchem.Mol]) – Iterable of molecules in SMILES format or as RDKit Mol objects.

  • include_3D (bool, optional) – Whether to include 3D descriptors. Default is False.

Returns:

DataFrame containing the descriptors for each molecule.

Return type:

pd.DataFrame

Examples

>>> get_rdkitDesc(["CCO", "c1ccccc1"], include_3D=False)

mlchem.chem.calculator.tools module

bernoulli(n: int, k: int, p: float) → float

Calculate the probability of exactly k successes in n independent Bernoulli trials.

This uses the binomial distribution formula, P(k) = C(n, k) · p^k · (1 − p)^(n − k).

Parameters:
  • n (int) – Number of trials.

  • k (int) – Number of successes.

  • p (float) – Probability of success on a single trial.

Returns:

Probability of exactly k successes in n trials.

Return type:

float

Examples

>>> bernoulli(10, 3, 0.5)
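
For reference, the binomial formula can be written out with the standard library alone. The sketch below is an independent illustration, not the package's source; the name binomial_pmf is hypothetical.

```python
from math import comb


def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0  # impossible outcome
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# C(10, 3) = 120 and 0.5**10 = 1/1024, so the result is 120/1024.
print(binomial_pmf(10, 3, 0.5))  # 0.1171875
```

Summing the function over k = 0..n gives 1, which is a quick sanity check for any implementation.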

boltzmann_probability(energy_levels: Iterable[float], temperature: int | float, energy_unit: Literal['eV', 'J', 'cal', 'kJ', 'kcal', 'kJ/mol', 'kcal/mol'] = 'kcal/mol') → list[float]

Calculate the Boltzmann probability for a set of energy levels at a given temperature.

Parameters:
  • energy_levels (Iterable[float]) – Energy levels, given in the unit named by energy_unit.

  • temperature (float or int) – Temperature in Kelvin.

  • energy_unit (str, optional) – Unit of energy levels. Default is ‘kcal/mol’. Supported units: ‘eV’, ‘J’, ‘cal’, ‘kJ’, ‘kcal’, ‘kJ/mol’, ‘kcal/mol’.

Returns:

Boltzmann probabilities for the given energy levels.

Return type:

list[float]

Examples

>>> boltzmann_probability([0, 1, 2], 298, 'kcal/mol')
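
A Boltzmann distribution assigns each level a weight exp(−E / RT) and normalises the weights to sum to 1. The sketch below covers only the ‘kcal/mol’ case and assumes molar energies with the gas constant R; the name boltzmann_weights and the shift by the lowest energy (for numerical stability) are choices made here, not details taken from the package.

```python
import math

# Gas constant converted from J/(mol*K) to kcal/(mol*K).
R_KCAL_PER_MOL_K = 8.314462618 / 4184.0


def boltzmann_weights(energies, temperature):
    """Normalised Boltzmann probabilities for energies in kcal/mol."""
    energies = list(energies)
    beta = 1.0 / (R_KCAL_PER_MOL_K * temperature)
    e_min = min(energies)  # shifting by the minimum leaves the ratios unchanged
    factors = [math.exp(-(e - e_min) * beta) for e in energies]
    partition = sum(factors)
    return [f / partition for f in factors]


probs = boltzmann_weights([0.0, 1.0, 2.0], 298.0)
print(probs)  # the lowest level carries the largest probability
```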

calc_centroid(coordinates: ndarray, masses: Iterable | None = None) → ndarray

Calculate the centroid of a set of points, optionally weighted by masses.

Parameters:
  • coordinates (np.ndarray) – A 2D array of shape (N, D), where N is the number of points and D is the number of dimensions.

  • masses (Iterable, optional) – An iterable of length N representing the masses of each point.

Returns:

The coordinates of the centroid.

Return type:

np.ndarray

Examples

>>> calc_centroid(np.array([[0, 0], [2, 0], [1, 2]]))
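
The centroid is the (optionally mass-weighted) mean of the points. The pure-Python sketch below assumes the weighted form sum(m_i · r_i) / sum(m_i); the name centroid is hypothetical and plain lists are used instead of NumPy arrays.

```python
def centroid(points, masses=None):
    """Mean position of the points, weighted by masses when given."""
    points = [[float(x) for x in p] for p in points]
    masses = [1.0] * len(points) if masses is None else [float(m) for m in masses]
    total = sum(masses)
    dim = len(points[0])
    return [sum(m * p[d] for m, p in zip(masses, points)) / total for d in range(dim)]


triangle = [[0, 0], [2, 0], [1, 2]]
print(centroid(triangle))             # x = 1.0, y = 2/3
print(centroid(triangle, [1, 1, 2]))  # a heavier apex pulls y up to 1.0
```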

calc_gyration_tensor(coordinates: ndarray, masses: Iterable | None = None) → ndarray

Calculate the gyration tensor of a set of coordinates.

Parameters:
  • coordinates (np.ndarray) – A 2D array of shape (N, 3) representing spatial coordinates.

  • masses (Iterable, optional) – An iterable of length N representing the masses of each point.

Returns:

The 3x3 gyration tensor.

Return type:

np.ndarray

Examples

>>> calc_gyration_tensor(np.random.rand(5, 3))
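
A common definition of the gyration tensor is S_ab = sum(m_i · (r_ia − c_a) · (r_ib − c_b)) / sum(m_i), where c is the centroid. The sketch below implements that definition in pure Python; whether the package normalises in exactly this way is an assumption, and the name gyration_tensor is hypothetical.

```python
def gyration_tensor(points, masses=None):
    """3x3 gyration tensor as nested lists, relative to the (weighted) centroid."""
    points = [tuple(float(x) for x in p) for p in points]
    masses = [1.0] * len(points) if masses is None else [float(m) for m in masses]
    total = sum(masses)
    center = [sum(m * p[a] for m, p in zip(masses, points)) / total for a in range(3)]
    return [
        [
            sum(m * (p[a] - center[a]) * (p[b] - center[b]) for m, p in zip(masses, points)) / total
            for b in range(3)
        ]
        for a in range(3)
    ]


# Two points on the x axis at -1 and +1: only the xx element (1.0) is non-zero.
tensor = gyration_tensor([(-1, 0, 0), (1, 0, 0)])
print(tensor)
```

The tensor is symmetric by construction, and its trace equals the squared radius of gyration.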

calc_logD_HH(pH: float, logP: float, pKa: float, behaviour: Literal['acid', 'base']) → tuple

Calculate the distribution coefficient (logD) at a given pH using the Henderson-Hasselbalch equation.

Parameters:
  • pH (float) – The pH at which to calculate the distribution coefficient.

  • logP (float) – The base-10 logarithm of the partition coefficient.

  • pKa (float) – The negative base-10 logarithm of the acid dissociation constant (Ka).

  • behaviour ({'acid', 'base'}) – Whether the ionisable group behaves as an acid or as a base.

Returns:

A tuple containing:
  • Ion-neutral ratio (float)
  • Ionised percentage (float)
  • logD (float)

Return type:

tuple

Examples

>>> calc_logD_HH(7.4, 3.0, 4.5, 'acid')
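
The Henderson-Hasselbalch relation gives the ion-to-neutral ratio as 10^(pH − pKa) for an acid and 10^(pKa − pH) for a base. The sketch below derives the three returned quantities from that ratio, assuming that only the neutral species partitions into the organic phase, so that logD = logP − log10(1 + ratio). The function name is hypothetical and the package may differ in detail.

```python
import math


def logd_henderson_hasselbalch(ph, logp, pka, behaviour):
    """Return (ion/neutral ratio, ionised percentage, logD)."""
    if behaviour == "acid":
        exponent = ph - pka
    elif behaviour == "base":
        exponent = pka - ph
    else:
        raise ValueError("behaviour must be 'acid' or 'base'")
    ratio = 10.0 ** exponent
    ionised_pct = 100.0 * ratio / (1.0 + ratio)
    logd = logp - math.log10(1.0 + ratio)  # assumes the ion stays in the aqueous phase
    return ratio, ionised_pct, logd


# An acid with pKa 4.5 is almost fully ionised at pH 7.4, so logD falls well below logP.
print(logd_henderson_hasselbalch(7.4, 3.0, 4.5, "acid"))
```

At pH = pKa the ratio is 1, the compound is 50% ionised, and logD sits log10(2) ≈ 0.30 below logP.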

calc_shape_descriptors_from_gyration_tensor(gyration_tensor: ndarray) → dict

Calculate shape descriptors from a 3x3 gyration tensor.

Parameters:

gyration_tensor (np.ndarray) – A 3x3 gyration tensor.

Returns:

A dictionary containing:
  • ‘moments_of_inertia’
  • ‘principal_axes’
  • ‘asphericity’
  • ‘acylindricity’
  • ‘relative_shape_anisotropy’

Return type:

dict

Examples

>>> tensor = calc_gyration_tensor(np.random.rand(5, 3))
>>> calc_shape_descriptors_from_gyration_tensor(tensor)
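
The three scalar descriptors are usually defined from the eigenvalues l1 ≤ l2 ≤ l3 of the gyration tensor. The sketch below uses those textbook definitions and starts from the eigenvalues directly (they could be obtained with numpy.linalg.eigh, for example). It is an assumption that the package follows the same conventions, and shape_from_eigenvalues is a hypothetical name.

```python
def shape_from_eigenvalues(eigenvalues):
    """Asphericity, acylindricity and relative shape anisotropy from 3 eigenvalues."""
    l1, l2, l3 = sorted(float(v) for v in eigenvalues)
    rg_squared = l1 + l2 + l3            # squared radius of gyration (trace)
    asphericity = l3 - 0.5 * (l1 + l2)   # 0 for a spherically symmetric distribution
    acylindricity = l2 - l1              # 0 for a cylindrically symmetric distribution
    anisotropy = (asphericity**2 + 0.75 * acylindricity**2) / rg_squared**2
    return {
        "asphericity": asphericity,
        "acylindricity": acylindricity,
        "relative_shape_anisotropy": anisotropy,
    }


print(shape_from_eigenvalues([1.0, 1.0, 1.0]))  # sphere-like: all three values are 0
print(shape_from_eigenvalues([0.0, 0.0, 1.0]))  # rod-like: anisotropy reaches 1
```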

logit_to_proba(logit: float) → float

Convert a logit value to probability.

This applies the logistic (sigmoid) function.

Parameters:

logit (float) – Logit value to be converted.

Returns:

Corresponding probability.

Return type:

float

Examples

>>> logit_to_proba(0)
0.5
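
The logistic function is 1 / (1 + exp(−x)). The sketch below is a standalone version that branches on the sign of the input so that exp never overflows for large negative logits; the branching is a choice made here and the name sigmoid is hypothetical.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a logit (log-odds) to a probability in [0, 1]."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # safe: logit < 0, so z lies in (0, 1)
    return z / (1.0 + z)


print(sigmoid(0))                  # 0.5
print(sigmoid(2.0), sigmoid(-2.0)) # the two values sum to 1
```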

pairwise_euclidean_distance(matrix: ndarray) → ndarray

Calculate the pairwise Euclidean distances between rows of a matrix.

Uses SciPy’s pdist and squareform functions.

Parameters:

matrix (np.ndarray) – Input 2D array of shape (N, D), where N is the number of points and D is the number of dimensions.

Returns:

2D array of pairwise Euclidean distances.

Return type:

np.ndarray

Examples

>>> pairwise_euclidean_distance(np.array([[0, 0], [1, 0], [0, 1]]))
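
The result is a symmetric N × N matrix with zeros on the diagonal. The sketch below reproduces that layout with the standard library's math.dist rather than SciPy, purely to show what the output looks like; pairwise_distances is a hypothetical name.

```python
import math


def pairwise_distances(rows):
    """Full N x N matrix of Euclidean distances between the rows."""
    rows = [tuple(float(x) for x in r) for r in rows]
    return [[math.dist(a, b) for b in rows] for a in rows]


distances = pairwise_distances([[0, 0], [1, 0], [0, 1]])
for row in distances:
    print(row)
# The origin is 1.0 away from each of the other points,
# and the other two points are sqrt(2) apart.
```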

shannon_entropy(vector: ndarray) → float

Calculate Shannon entropy of a vector.

This function computes the entropy based on the frequency of unique elements in the input array.

Parameters:

vector (np.ndarray) – Input array for which Shannon entropy is calculated.

Returns:

Shannon entropy of the input vector.

Return type:

float

Examples

>>> shannon_entropy(np.array([1, 1, 2, 2, 3, 3]))
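
Entropy from element frequencies is H = sum(p_i · log(1 / p_i)), where p_i is the fraction of entries equal to the i-th unique value. The sketch below uses base-2 logarithms (bits); the base used by the package is not stated here, so that choice is an assumption, as is the name entropy_bits.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of the values."""
    counts = Counter(values)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Three equally frequent symbols give log2(3), roughly 1.585 bits.
print(entropy_bits([1, 1, 2, 2, 3, 3]))
print(entropy_bits([7, 7, 7, 7]))  # a constant vector has zero entropy
```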