datamol
Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly.
Tools: datamol
The full skill
---
name: datamol
description: Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly.
license: Apache-2.0 license
metadata:
  skill-author: K-Dense Inc.
---
# Datamol Cheminformatics Skill
## Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
**Key capabilities**:
– Molecular format conversion (SMILES, SELFIES, InChI)
– Structure standardization and sanitization
– Molecular descriptors and fingerprints
– 3D conformer generation and analysis
– Clustering and diversity selection
– Scaffold and fragment analysis
– Chemical reaction application
– Visualization and alignment
– Batch processing with parallelization
– Cloud storage support via fsspec
## Installation and Setup
Guide users to install datamol:
```bash
uv pip install datamol
```
**Import convention**:
```python
import datamol as dm
```
## Core Workflows
### 1. Basic Molecule Handling
**Creating molecules from SMILES**:
```python
import datamol as dm
# Single molecule
mol = dm.to_mol("CCO")  # Ethanol
# From a list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
# Error handling: to_mol returns None when parsing fails
mol = dm.to_mol("invalid_smiles")
if mol is None:
    print("Failed to parse SMILES")
```
**Converting molecules to SMILES**:
```python
# Canonical SMILES
smiles = dm.to_smiles(mol)
# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)
# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```
**Standardization and sanitization** (always recommend for user-provided molecules):
```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)
# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True,
)
# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```
### 2. Reading and Writing Molecular Files
Refer to `references/io_module.md` for comprehensive I/O documentation.
**Reading files**:
```python
# SDF files (most common in chemistry); as_df=True returns a DataFrame instead of a list of molecules
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column="mol")
# SMILES files (returns a list of molecules)
mols = dm.read_smi("molecules.smi")
# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
# Universal reader (auto-detects format)
df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
```
**Writing files**:
```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
# Save as SMILES file
dm.to_smi(mols, "output.smi")
# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol…
```
# Datamol Conformers Module Reference
The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.
## Conformer Generation
### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, …)`
Generate 3D molecular conformers.
– **Parameters**:
– `mol`: Input molecule
– `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
– `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
– `minimize_energy`: Apply UFF energy minimization (default: True)
– `method`: Embedding method – options:
– `'ETDG'` – Experimental Torsion Distance Geometry
– `'ETKDG'` – ETDG with additional basic knowledge
– `'ETKDGv2'` – Enhanced version 2
– `'ETKDGv3'` – Enhanced version 3 (default, recommended)
– `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
– `random_seed`: Set for reproducibility
– **Returns**: Molecule with embedded conformers
– **Example**:
```python
mol = dm.to_mol("CCO")
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
conformers = mol_3d.GetConformers()  # Access all conformers
```
## Conformer Clustering
### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`
Group conformers by RMS distance.
– **Parameters**:
– `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
– `already_aligned`: Whether conformers are pre-aligned
– `centroids`: Return centroid conformers (True) or cluster groups (False)
– **Returns**: Cluster information or centroid conformers
– **Use case**: Identify distinct conformational families
### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`
Extract representative conformers from clusters.
– **Parameters**:
– `conf_clusters`: Sequence of cluster indices from `cluster()`
– `centroids`: Return single molecule (True) or list of molecules (False)
– **Returns**: Centroid conformer(s)
## Conformer Analysis
### `dm.conformers.rmsd(mol)`
Calculate pairwise RMSD matrix across all conformers.
– **Requirements**: Minimum 2 conformers
– **Returns**: NxN matrix of RMSD values
– **Use case**: Quantify conformer diversity
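To make the shape and meaning of that matrix concrete, here is a minimal pure-Python sketch of a pairwise RMSD matrix over conformer coordinates. It assumes the conformers are already aligned and share the same atom ordering. `rmsd` and `pairwise_rmsd` are hypothetical helpers written for illustration, not datamol functions, and they do not reproduce any alignment step the library may perform.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must have the same number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def pairwise_rmsd(conformers):
    """Return a symmetric N x N matrix (list of lists) with zeros on the diagonal."""
    n = len(conformers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(conformers[i], conformers[j])
    return matrix

# Two toy "conformers" of a two-atom molecule; the second is shifted 1 Å along x.
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
matrix = pairwise_rmsd([conf_a, conf_b])
```

Larger off-diagonal values indicate conformers that differ more, which is why the matrix is a convenient input for diversity checks and clustering.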
### `dm.conformers.sasa(mol, n_jobs=1, …)`
Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.
– **Parameters**:
– `n_jobs`: Parallelization for multiple conformers
– **Returns**: Array of SASA values (one per conformer)
– **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
– **Example**:
```python
sasa_values = dm.conformers.sasa(mol_3d)
# Or access from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
## Low-Level Conformer Manipulation
### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`
Calculate molecular center.
– **Parameters**:
– `conf_id`: Conformer index (-1 for first conformer)
– `use_atoms`: Use atomic masses (True) or geometric center (False)
– `round_coord`: Decimal precision for rounding
– **Returns**: 3D coordinates of center
– **Use case**: Centering molecules for visualization or alignment
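The difference between the two `use_atoms` modes can be sketched in plain Python. This is an illustrative sketch only: `geometric_center` and `center_of_mass` below are hypothetical helpers, and the coordinates and approximate atomic masses are made-up inputs rather than values taken from datamol.

```python
def geometric_center(coords):
    """Unweighted mean of the atomic positions."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def center_of_mass(coords, masses):
    """Mass-weighted mean of the atomic positions."""
    total = sum(masses)
    return tuple(
        sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)
    )

# Toy diatomic: carbon at the origin, oxygen 1.2 Å along x (approximate masses).
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
masses = [12.011, 15.999]

centroid = geometric_center(coords)   # midpoint between the two atoms
com = center_of_mass(coords, masses)  # pulled toward the heavier oxygen
```

For molecules made of similar-mass atoms the two centers nearly coincide; they diverge when heavy atoms sit off to one side.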
### `dm.conformers.get_coords(mol, conf_id=-1)`
Retrieve atomic coordinates from a conformer.
– **Returns**: Nx3 numpy array of atomic positions
– **Example**:
```python
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
# positions.shape: (num_atoms, 3)
```
### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`
Reposition conformer using transformation matrix.
– **Modification**: Operates in-place
– **Use case**: Aligning or repositioning molecules
## Workflow Example
```python
import datamol as dm
# 1. Create molecule and generate conformers
mol = dm.to_mol("CC(C)CCO")  # Isopentanol
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,  # Generate 50 initial conformers
    rms_cutoff=0.5,  # Filter similar conformers
    minimize_energy=True,  # Minimize energy
)
# 2. Analyze conformers
n_conformers = mol_3d.GetNumConformers()
print(f"Generated {n_conformers} unique conformers")
# 3. Calculate SASA
sasa_values = dm.conformers.sasa(mol_3d)
# 4. Cluster conformers
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)
# 5. Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
# 6. Access 3D coordinates
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
```
## Key Concepts
– **Distance Geometry**: Method for generating 3D structures from connectivity information
– **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
– **RMS Cutoff**: Lower values = more unique conformers; higher values = fewer, more distinct conformers
– **Energy Minimization**: Relaxes structures to nearest local energy minimum
– **Hydrogens**: Critical for accurate 3D geometry – always include during embedding
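The effect of the RMS cutoff can be illustrated with a simplified greedy filter. This is a sketch under stated assumptions, not the pruning that RDKit or datamol actually runs during embedding: `prune_by_rms` is a hypothetical helper, and the distance matrix holds invented RMSD values.

```python
def prune_by_rms(dist, cutoff):
    """Greedy filter: keep index i only if it is farther than `cutoff` from every kept index."""
    kept = []
    for i in range(len(dist)):
        if all(dist[i][j] > cutoff for j in kept):
            kept.append(i)
    return kept

# Hypothetical pairwise RMSD matrix (in Å) for four conformers.
dist = [
    [0.0, 0.3, 0.8, 1.5],
    [0.3, 0.0, 0.7, 1.4],
    [0.8, 0.7, 0.0, 0.9],
    [1.5, 1.4, 0.9, 0.0],
]

kept_low_cutoff = prune_by_rms(dist, cutoff=0.2)   # little is filtered out
kept_high_cutoff = prune_by_rms(dist, cutoff=1.0)  # only well-separated conformers survive
```

With the low cutoff all four conformers are retained, while the high cutoff keeps only the two that are more than 1.0 Å apart.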
# Datamol Core API Reference
This document covers the main functions available in the datamol namespace.
## Molecule Creation and Conversion
### `to_mol(mol, …)`
Convert SMILES string or other molecular representations to RDKit molecule objects.
– **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
– **Returns**: `rdkit.Chem.Mol` object
– **Common usage**: `mol = dm.to_mol("CCO")`
### `from_inchi(inchi)`
Convert InChI string to molecule object.
### `from_smarts(smarts)`
Convert SMARTS pattern to molecule object.
### `from_selfies(selfies)`
Convert SELFIES string to molecule object.
### `copy_mol(mol)`
Create a copy of a molecule object to avoid modifying the original.
## Molecule Export
### `to_smiles(mol, …)`
Convert molecule object to SMILES string.
– **Common parameters**: `canonical=True`, `isomeric=True`
### `to_inchi(mol, …)`
Convert molecule to InChI string representation.
### `to_inchikey(mol)`
Convert molecule to InChI key (fixed-length hash).
### `to_smarts(mol)`
Convert molecule to SMARTS pattern.
### `to_selfies(mol)`
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
## Sanitization and Standardization
### `sanitize_mol(mol, …)`
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
– **Purpose**: Fix common molecular structure issues
– **Returns**: Sanitized molecule or None if sanitization fails
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, …)`
Apply comprehensive standardization procedures including:
– Metal disconnection
– Normalization (charge corrections)
– Reionization
– Fragment handling (largest fragment selection)
### `standardize_smiles(smiles, …)`
Apply SMILES standardization procedures directly to a SMILES string.
### `fix_mol(mol)`
Attempt to fix molecular structure issues automatically.
### `fix_valence(mol)`
Correct valence errors in molecular structures.
## Molecular Properties
### `reorder_atoms(mol, …)`
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
– **Purpose**: Maintain reproducible feature generation
### `remove_hs(mol, …)`
Remove hydrogen atoms from molecular structure.
### `add_hs(mol, …)`
Add explicit hydrogen atoms to molecular structure.
## Fingerprints and Similarity
### `to_fp(mol, fp_type='ecfp', …)`
Generate molecular fingerprints for similarity calculations.
– **Fingerprint types**:
– `'ecfp'` – Extended Connectivity Fingerprints (Morgan)
– `'fcfp'` – Functional Connectivity Fingerprints
– `'maccs'` – MACCS keys
– `'topological'` – Topological fingerprints
– `'atompair'` – Atom pair fingerprints
– **Common parameters**: `n_bits`, `radius`
– **Returns**: Numpy array or RDKit fingerprint object
### `pdist(mols, …)`
Calculate pairwise Tanimoto distances between all molecules in a list.
– **Supports**: Parallel processing via `n_jobs` parameter
– **Returns**: Distance matrix
### `cdist(mols1, mols2, …)`
Calculate Tanimoto distances between two sets of molecules.
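For readers unfamiliar with the metric, the following stdlib-only sketch shows what a Tanimoto distance is and how the square (`pdist`-style) and rectangular (`cdist`-style) results differ in shape. Fingerprints are modelled as sets of "on" bit indices. All function names here are hypothetical and the fingerprints are made up, so this does not reflect how datamol computes or stores them.

```python
def tanimoto_similarity(bits_a, bits_b):
    """|A ∩ B| / |A ∪ B| for two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def tanimoto_distance(bits_a, bits_b):
    return 1.0 - tanimoto_similarity(bits_a, bits_b)

def pairwise_distances(fps):
    """Square N x N matrix of distances within one set."""
    return [[tanimoto_distance(a, b) for b in fps] for a in fps]

def cross_distances(fps_a, fps_b):
    """Rectangular len(fps_a) x len(fps_b) matrix of distances between two sets."""
    return [[tanimoto_distance(a, b) for b in fps_b] for a in fps_a]

# Made-up fingerprints expressed as sets of set bits.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
fp3 = {2, 3}

square = pairwise_distances([fp1, fp2, fp3])
rect = cross_distances([fp1], [fp2, fp3])
```

A distance of 0 means identical bit sets and 1 means no shared bits, which is the scale the clustering cutoff in the next section operates on.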
## Clustering and Diversity
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
Cluster molecules using Butina clustering algorithm.
– **Parameters**:
– `cutoff`: Distance threshold (default 0.2)
– `feature_fn`: Custom function for molecular features
– `n_jobs`: Parallelization (-1 for all cores)
– **Important**: Builds full distance matrix – suitable for ~1000 structures, not for 10,000+
– **Returns**: List of clusters (each cluster is a list of molecule indices)
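The Butina procedure itself is short enough to sketch without any cheminformatics dependency. The version below works on a precomputed distance matrix and follows the usual description of the algorithm (rank items by neighbor count, then peel off clusters). `butina` is a hypothetical helper and the distances are invented; it is meant to show the idea and the output shape, not to mirror datamol's internals.

```python
def butina(dist, cutoff):
    """Cluster indices 0..n-1 given a full symmetric distance matrix."""
    n = len(dist)
    # Neighbors of i: every other item within the distance cutoff.
    neighbors = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff] for i in range(n)
    ]
    # Visit the most "crowded" items first; they become cluster centroids.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters

# Invented Tanimoto-style distances for five molecules.
dist = [
    [0.0, 0.1, 0.15, 0.9, 0.8],
    [0.1, 0.0, 0.3, 0.9, 0.85],
    [0.15, 0.3, 0.0, 0.95, 0.9],
    [0.9, 0.9, 0.95, 0.0, 0.1],
    [0.8, 0.85, 0.9, 0.1, 0.0],
]
clusters = butina(dist, cutoff=0.2)
```

Because the matrix has n × n entries, memory and time grow quadratically with the number of molecules, which is the reason for the size caveat above.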
### `pick_diverse(mols, npick, …)`
Select diverse subset of molecules based on fingerprint diversity.
### `pick_centroids(mols, npick, …)`
Select centroid molecules representing clusters.
## Graph Operations
### `to_graph(mol)`
Convert molecule to graph representation for graph-based analysis.
### `get_all_path_between(mol, start, end)`
Find all paths between two atoms in molecular structure.
## DataFrame Integration
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
Convert list of molecules to pandas DataFrame.
### `from_df(df, smiles_column='smiles', mol_column='mol')`
Convert pandas DataFrame to list of molecules.
# Datamol Descriptors and Visualization Reference
## Descriptors Module (`datamol.descriptors`)
The descriptors module provides tools for computing molecular properties and descriptors.
### Specialized Descriptor Functions
#### `dm.descriptors.n_aromatic_atoms(mol)`
Calculate the number of aromatic atoms.
– **Returns**: Integer count
– **Use case**: Aromaticity analysis
#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`
Calculate ratio of aromatic atoms to total heavy atoms.
– **Returns**: Float between 0 and 1
– **Use case**: Quantifying aromatic character
#### `dm.descriptors.n_charged_atoms(mol)`
Count atoms with nonzero formal charge.
– **Returns**: Integer count
– **Use case**: Charge distribution analysis
#### `dm.descriptors.n_rigid_bonds(mol)`
Count non-rotatable bonds (neither single bonds nor ring bonds).
– **Returns**: Integer count
– **Use case**: Molecular flexibility assessment
#### `dm.descriptors.n_stereo_centers(mol)`
Count stereogenic centers (chiral centers).
– **Returns**: Integer count
– **Use case**: Stereochemistry analysis
#### `dm.descriptors.n_stereo_centers_unspecified(mol)`
Count stereocenters lacking stereochemical specification.
– **Returns**: Integer count
– **Use case**: Identifying incomplete stereochemistry
### Batch Descriptor Computation
#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`
Compute multiple molecular properties for a single molecule.
– **Parameters**:
– `properties_fn`: Custom list of descriptor functions
– `add_properties`: Include additional computed properties
– **Returns**: Dictionary of descriptor name → value pairs
– **Default descriptors include**:
– Molecular weight, LogP, number of H-bond donors/acceptors
– Aromatic atoms, stereocenters, rotatable bonds
– TPSA (Topological Polar Surface Area)
– Ring count, heteroatom count
– **Example**:
```python
mol = dm.to_mol("CCO")
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns a dict keyed by descriptor name, e.g. 'mw', 'clogp', 'tpsa', 'n_lipinski_hbd', 'n_lipinski_hba', …
```
#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`
Compute descriptors for multiple molecules in parallel.
– **Parameters**:
– `mols`: List of molecules
– `n_jobs`: Number of parallel jobs (-1 for all cores)
– `batch_size`: Chunk size for parallel processing
– `progress`: Show progress bar
– **Returns**: Pandas DataFrame with one row per molecule
– **Example**:
```python
mols = [dm.to_mol(smi) for smi in smiles_list]
df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,
    progress=True,
)
```
### RDKit Descriptor Access
#### `dm.descriptors.any_rdkit_descriptor(name)`
Retrieve any descriptor function from RDKit by name.
– **Parameters**: `name` – Descriptor function name (e.g., 'MolWt', 'TPSA')
– **Returns**: RDKit descriptor function
– **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
– **Example**:
```python
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
tpsa_value = tpsa_fn(mol)
```
### Common Use Cases
**Drug-likeness Filtering (Lipinski's Rule of Five)**:
```python
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Key names follow datamol's default descriptor set
is_druglike = (
    descriptors['mw'] <= 500 and
    descriptors['clogp'] <= 5 and
    descriptors['n_lipinski_hbd'] <= 5 and
    descriptors['n_lipinski_hba'] <= 10
)
```
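In practice the Rule of Five is often applied leniently, with one violation tolerated. The sketch below shows one way to express that, taking plain numbers rather than a descriptor dictionary so it does not depend on any particular key names. `lipinski_violations` and `is_druglike` are hypothetical helpers, and the property values are invented.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Return the names of the Rule of Five criteria that a compound violates."""
    checks = {
        "mw": mw <= 500,
        "logp": logp <= 5,
        "hbd": hbd <= 5,
        "hba": hba <= 10,
    }
    return [name for name, ok in checks.items() if not ok]

def is_druglike(mw, logp, hbd, hba, max_violations=0):
    """True when the number of violated criteria does not exceed max_violations."""
    return len(lipinski_violations(mw, logp, hbd, hba)) <= max_violations

# Invented property values for a compound slightly over the weight limit.
violations = lipinski_violations(mw=520.0, logp=3.2, hbd=2, hba=7)
strict = is_druglike(mw=520.0, logp=3.2, hbd=2, hba=7)
lenient = is_druglike(mw=520.0, logp=3.2, hbd=2, hba=7, max_violations=1)
```

Returning the list of violated criteria, rather than only a boolean, makes it easier to report why a compound was filtered out.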
**ADME Property Analysis**:
```python
df = dm.descriptors.batch_compute_many_descriptors(compound_library)
# Filter by TPSA for blood-brain barrier penetration
bbb_candidates = df[df['tpsa'] < 90]
```
---
## Visualization Module (`datamol.viz`)
The viz module provides tools for rendering molecules and conformers as images.
### Main Visualization Function
#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, …)`
Generate image grid from molecules.
– **Parameters**:
– `mols`: Single molecule or list of molecules
– `legends`: String or list of strings as labels (one per molecule)
– `n_cols`: Number of molecules per row (default: 4)
– `use_svg`: Output SVG format (True) or PNG (False, default)
– `mol_size`: Tuple (width, height) or single int for square images
– `highlight_atom`: Atom indices to highlight (list or dict)
– `highlight_bond`: Bond indices to highlight (list or dict)
– `outfile`: Save path (local or remote, supports fsspec)
– `max_mols`: Maximum number of molecules to display
– `indices`: Draw atom indices on structures (default: False)
– `align`: Align molecules using MCS (Maximum Common Substructure)
– **Returns**: Image object (can be displayed in Jupyter) or saves to file
– **Example**:
```python
# Basic grid
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])
# Save to file
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)
# Highlight substructure
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])
# Aligned visualization
dm.viz.to_image(mols, align=True, legends=activity_labels)
```
### Conformer Visualization
#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, …)`
Display multiple conformers in grid layout.
– **Parameters**:
– `mol`: Molecule with embedded conformers
– `n_confs`: Number or list of conformer indices to display (None = all)
– `align_conf`: Align conformers for comparison (default: True)
– `n_cols`: Grid columns (default: 3)
– `sync_views`: Synchronize 3D views when interactive (default: True)
– `remove_hs`: Remove hydrogens for clarity (default: True)
– **Returns**: Grid of conformer visualizations
– **Use case**: Comparing conformational diversity
– **Example**:
```python
mol_3d = dm.conformers.generate(mol, n_confs=20)
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
```
### Circle Grid Visualization
#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, …)`
Create concentric ring visualization with central molecule.
– **Parameters**:
– `center_mol`: Molecule at center
– `circle_mols`: List of molecule lists (one list per ring)
– `mol_size`: Image size per molecule
– `circle_margin`: Spacing between rings (default: 50)
– `act_mapper`: Activity mapping dictionary for color-coding
– **Returns**: Circular grid image
– **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
– **Example**:
```python
# Show a reference molecule surrounded by similar compounds
dm.viz.circle_grid(
    center_mol=reference,
    circle_mols=[nearest_neighbors, second_tier],
)
```
### Visualization Best Practices
1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
2. **Align related molecules**: Use `align=True` in `to