Skill


datamol

Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly.


Tools: datamol

The full skill

---
name: datamol
description: Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly.
license: Apache-2.0 license
metadata:
  skill-author: K-Dense Inc.
---

# Datamol Cheminformatics Skill

## Overview

Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.

**Key capabilities**:

- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec

## Installation and Setup

Guide users to install datamol:

```bash
uv pip install datamol
```

**Import convention**:

```python
import datamol as dm
```

## Core Workflows

### 1. Basic Molecule Handling

**Creating molecules from SMILES**:

```python
import datamol as dm

# Single molecule
mol = dm.to_mol("CCO")  # Ethanol

# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]

# Error handling
mol = dm.to_mol("invalid_smiles")  # Returns None
if mol is None:
    print("Failed to parse SMILES")
```

**Converting molecules to SMILES**:

```python
# Canonical SMILES
smiles = dm.to_smiles(mol)

# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)

# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```

**Standardization and sanitization** (always recommended for user-provided molecules):

```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)

# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)

# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```

### 2. Reading and Writing Molecular Files

Refer to `references/io_module.md` for comprehensive I/O documentation.
**Reading files**:

```python
# SDF files (most common in chemistry); as_df=True returns a DataFrame rather than a list of molecules
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column='mol')

# SMILES files (returns a list of molecules, not a DataFrame)
mols = dm.read_smi("molecules.smi")

# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")

# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")

# Universal reader (auto-detects format)
df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
```

**Writing files**:

```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")

# Save as SMILES file
dm.to_smi(mols, "output.smi")

# Excel with rendered molecule images (the call is cut off at this point in the source)
# dm.to_xlsx(df, "output.xlsx", mol_columns=["mol …
```

# Datamol Conformers Module Reference

The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.

## Conformer Generation

### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, …)`

Generate 3D molecular conformers.

- **Parameters**:
  - `mol`: Input molecule
  - `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
  - `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
  - `minimize_energy`: Apply UFF energy minimization (default: True)
  - `method`: Embedding method, one of:
    - `'ETDG'` - Experimental Torsion Distance Geometry
    - `'ETKDG'` - ETDG with additional basic knowledge
    - `'ETKDGv2'` - Enhanced version 2
    - `'ETKDGv3'` - Enhanced version 3 (default, recommended)
  - `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
  - `random_seed`: Set for reproducibility
- **Returns**: Molecule with embedded conformers
- **Example**:

```python
mol = dm.to_mol("CCO")
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
conformers = mol_3d.GetConformers()  # Access all conformers
```

## Conformer Clustering

### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`

Group conformers by RMS distance.

- **Parameters**:
  - `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
  - `already_aligned`: Whether conformers are pre-aligned
  - `centroids`: Return centroid conformers (True) or cluster groups (False)
- **Returns**: Cluster information or centroid conformers
- **Use case**: Identify distinct conformational families

### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`

Extract representative conformers from clusters.

- **Parameters**:
  - `conf_clusters`: Sequence of cluster indices from `cluster()`
  - `centroids`: Return single molecule (True) or list of molecules (False)
- **Returns**: Centroid conformer(s)

## Conformer Analysis

### `dm.conformers.rmsd(mol)`

Calculate pairwise RMSD matrix across all conformers.

- **Requirements**: Minimum 2 conformers
- **Returns**: NxN matrix of RMSD values
- **Use case**: Quantify conformer diversity

### `dm.conformers.sasa(mol, n_jobs=1, …)`

Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.

- **Parameters**:
  - `n_jobs`: Parallelization for multiple conformers
- **Returns**: Array of SASA values (one per conformer)
- **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
- **Example**:

```python
sasa_values = dm.conformers.sasa(mol_3d)

# Or access from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```

## Low-Level Conformer Manipulation

### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`

Calculate molecular center.

- **Parameters**:
  - `conf_id`: Conformer index (-1 for first conformer)
  - `use_atoms`: Use atomic masses (True) or geometric center (False)
  - `round_coord`: Decimal precision for rounding
- **Returns**: 3D coordinates of center
- **Use case**: Centering molecules for visualization or alignment

### `dm.conformers.get_coords(mol, conf_id=-1)`

Retrieve atomic coordinates from a conformer.

- **Returns**: Nx3 numpy array of atomic positions
- **Example**:

```python
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
# positions.shape: (num_atoms, 3)
```

### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`

Reposition conformer using transformation matrix.

- **Modification**: Operates in-place
- **Use case**: Aligning or repositioning molecules

## Workflow Example

```python
import datamol as dm

# 1. Create molecule and generate conformers
mol = dm.to_mol("CC(C)CCO")  # Isopentanol
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,            # Generate 50 initial conformers
    rms_cutoff=0.5,        # Filter similar conformers
    minimize_energy=True   # Minimize energy
)

# 2. Analyze conformers
n_conformers = mol_3d.GetNumConformers()
print(f"Generated {n_conformers} unique conformers")

# 3. Calculate SASA
sasa_values = dm.conformers.sasa(mol_3d)

# 4. Cluster conformers
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)

# 5. Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)

# 6. Access 3D coordinates
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
```

## Key Concepts

- **Distance Geometry**: Method for generating 3D structures from connectivity information
- **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
- **RMS Cutoff**: Lower values keep more conformers (only near-duplicates are pruned); higher values keep fewer, more distinct conformers
- **Energy Minimization**: Relaxes structures to the nearest local energy minimum
- **Hydrogens**: Critical for accurate 3D geometry; always include them during embedding

# Datamol Core API Reference

This document covers the main functions available in the datamol namespace.

## Molecule Creation and Conversion

### `to_mol(mol, …)`

Convert SMILES string or other molecular representations to RDKit molecule objects.

- **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
- **Returns**: `rdkit.Chem.Mol` object
- **Common usage**: `mol = dm.to_mol("CCO")`

### `from_inchi(inchi)`

Convert InChI string to molecule object.

### `from_smarts(smarts)`

Convert SMARTS pattern to molecule object.

### `from_selfies(selfies)`

Convert SELFIES string to molecule object.

### `copy_mol(mol)`

Create a copy of a molecule object to avoid modifying the original.

## Molecule Export

### `to_smiles(mol, …)`

Convert molecule object to SMILES string.

- **Common parameters**: `canonical=True`, `isomeric=True`

### `to_inchi(mol, …)`

Convert molecule to InChI string representation.

### `to_inchikey(mol)`

Convert molecule to InChI key (fixed-length hash).

### `to_smarts(mol)`

Convert molecule to SMARTS pattern.

### `to_selfies(mol)`

Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
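The "fixed-length hash" produced for an InChI key can be checked structurally before it is used as a deduplication key. The following is a minimal sketch using only the standard library, not datamol itself: the helper names `looks_like_inchikey` and `unique_by_inchikey` are hypothetical, and the check validates layout only (14 letters, 10 letters, 1 letter, separated by hyphens), not whether the key corresponds to a real structure.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(key: str) -> bool:
    """Return True if `key` has the InChIKey layout (format check only)."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def unique_by_inchikey(keys):
    """Keep the first occurrence of each well-formed key, preserving input order."""
    seen = set()
    unique = []
    for key in keys:
        if looks_like_inchikey(key) and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# InChIKey of ethanol, used here as sample input.
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
deduplicated = unique_by_inchikey([ETHANOL_KEY, ETHANOL_KEY, "not-a-key"])
```

In practice the keys would come from `dm.to_inchikey(mol)`; malformed or missing keys are silently dropped in this sketch, which may or may not be the behavior a given pipeline wants.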
## Sanitization and Standardization

### `sanitize_mol(mol, …)`

Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.

- **Purpose**: Fix common molecular structure issues
- **Returns**: Sanitized molecule or None if sanitization fails

### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, …)`

Apply comprehensive standardization procedures including:

- Metal disconnection
- Normalization (charge corrections)
- Reionization
- Fragment handling (largest fragment selection)

### `standardize_smiles(smiles, …)`

Apply SMILES standardization procedures directly to a SMILES string.

### `fix_mol(mol)`

Attempt to fix molecular structure issues automatically.

### `fix_valence(mol)`

Correct valence errors in molecular structures.

## Molecular Properties

### `reorder_atoms(mol, …)`

Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.

- **Purpose**: Maintain reproducible feature generation

### `remove_hs(mol, …)`

Remove hydrogen atoms from molecular structure.

### `add_hs(mol, …)`

Add explicit hydrogen atoms to molecular structure.

## Fingerprints and Similarity

### `to_fp(mol, fp_type='ecfp', …)`

Generate molecular fingerprints for similarity calculations.

- **Fingerprint types**:
  - `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
  - `'fcfp'` - Functional Connectivity Fingerprints
  - `'maccs'` - MACCS keys
  - `'topological'` - Topological fingerprints
  - `'atompair'` - Atom pair fingerprints
- **Common parameters**: `n_bits`, `radius`
- **Returns**: Numpy array or RDKit fingerprint object

### `pdist(mols, …)`

Calculate pairwise Tanimoto distances between all molecules in a list.

- **Supports**: Parallel processing via `n_jobs` parameter
- **Returns**: Distance matrix

### `cdist(mols1, mols2, …)`

Calculate Tanimoto distances between two sets of molecules.
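To make the Tanimoto distance behind `pdist` and `cdist` concrete, here is a minimal pure-Python sketch of the metric on toy fingerprints. It is not datamol's implementation: fingerprints are represented as sets of "on" bit indices, the function names are hypothetical, and treating two empty fingerprints as distance 0.0 is an assumption made for this illustration.

```python
def tanimoto_distance(bits_a, bits_b):
    """Tanimoto distance: 1 - |A ∩ B| / |A ∪ B| over sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty fingerprints count as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(fingerprints):
    """Build the full symmetric NxN distance matrix as a list of lists."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = tanimoto_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy fingerprints: the first and third are identical, the second shares 2 of 6 bits.
toy_fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 4}]
dist = pairwise_distances(toy_fps)
```

Because the matrix holds N×N entries, memory and time grow quadratically with the number of molecules, which is why full-matrix approaches suit collections of roughly a thousand structures better than tens of thousands.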
## Clustering and Diversity

### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`

Cluster molecules using the Butina clustering algorithm.

- **Parameters**:
  - `cutoff`: Distance threshold (default 0.2)
  - `feature_fn`: Custom function for molecular features
  - `n_jobs`: Parallelization (-1 for all cores)
- **Important**: Builds the full distance matrix, so it is suitable for ~1000 structures, not for 10,000+
- **Returns**: List of clusters (each cluster is a list of molecule indices)

### `pick_diverse(mols, npick, …)`

Select diverse subset of molecules based on fingerprint diversity.

### `pick_centroids(mols, npick, …)`

Select centroid molecules representing clusters.

## Graph Operations

### `to_graph(mol)`

Convert molecule to graph representation for graph-based analysis.

### `get_all_path_between(mol, start, end)`

Find all paths between two atoms in molecular structure.

## DataFrame Integration

### `to_df(mols, smiles_column='smiles', mol_column='mol')`

Convert list of molecules to pandas DataFrame.

### `from_df(df, smiles_column='smiles', mol_column='mol')`

Convert pandas DataFrame to list of molecules.

# Datamol Descriptors and Visualization Reference

## Descriptors Module (`datamol.descriptors`)

The descriptors module provides tools for computing molecular properties and descriptors.

### Specialized Descriptor Functions

#### `dm.descriptors.n_aromatic_atoms(mol)`

Calculate the number of aromatic atoms.

- **Returns**: Integer count
- **Use case**: Aromaticity analysis

#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`

Calculate ratio of aromatic atoms to total heavy atoms.

- **Returns**: Float between 0 and 1
- **Use case**: Quantifying aromatic character

#### `dm.descriptors.n_charged_atoms(mol)`

Count atoms with nonzero formal charge.

- **Returns**: Integer count
- **Use case**: Charge distribution analysis

#### `dm.descriptors.n_rigid_bonds(mol)`

Count rigid bonds, defined here as bonds that are neither single bonds nor ring bonds.

- **Returns**: Integer count
- **Use case**: Molecular flexibility assessment

#### `dm.descriptors.n_stereo_centers(mol)`

Count stereogenic centers (chiral centers).

- **Returns**: Integer count
- **Use case**: Stereochemistry analysis

#### `dm.descriptors.n_stereo_centers_unspecified(mol)`

Count stereocenters lacking stereochemical specification.

- **Returns**: Integer count
- **Use case**: Identifying incomplete stereochemistry

### Batch Descriptor Computation

#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`

Compute multiple molecular properties for a single molecule.

- **Parameters**:
  - `properties_fn`: Custom list of descriptor functions
  - `add_properties`: Include additional computed properties
- **Returns**: Dictionary of descriptor name → value pairs
- **Default descriptors include**:
  - Molecular weight, LogP, number of H-bond donors/acceptors
  - Aromatic atoms, stereocenters, rotatable bonds
  - TPSA (Topological Polar Surface Area)
  - Ring count, heteroatom count
- **Example**:

```python
mol = dm.to_mol("CCO")
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns a dict keyed by descriptor name, e.g. 'mw', 'clogp',
# 'n_lipinski_hbd', 'n_lipinski_hba', 'tpsa', …
```

#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`

Compute descriptors for multiple molecules in parallel.

- **Parameters**:
  - `mols`: List of molecules
  - `n_jobs`: Number of parallel jobs (-1 for all cores)
  - `batch_size`: Chunk size for parallel processing
  - `progress`: Show progress bar
- **Returns**: Pandas DataFrame with one row per molecule
- **Example**:

```python
mols = [dm.to_mol(smi) for smi in smiles_list]
df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,
    progress=True
)
```

### RDKit Descriptor Access

#### `dm.descriptors.any_rdkit_descriptor(name)`

Retrieve any descriptor function from RDKit by name.

- **Parameters**: `name` - Descriptor function name (e.g., 'MolWt', 'TPSA')
- **Returns**: RDKit descriptor function
- **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
- **Example**:

```python
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
tpsa_value = tpsa_fn(mol)
```

### Common Use Cases

**Drug-likeness Filtering (Lipinski's Rule of Five)**:

```python
descriptors = dm.descriptors.compute_many_descriptors(mol)

# Key names follow datamol's default descriptor set
# ('clogp', 'n_lipinski_hbd', 'n_lipinski_hba'); check descriptors.keys() if unsure
is_druglike = (
    descriptors['mw'] <= 500 and
    descriptors['clogp'] <= 5 and
    descriptors['n_lipinski_hbd'] <= 5 and
    descriptors['n_lipinski_hba'] <= 10
)
```

**ADME Property Analysis**:

```python
df = dm.descriptors.batch_compute_many_descriptors(compound_library)

# Filter by TPSA for blood-brain barrier penetration
bbb_candidates = df[df['tpsa'] < 90]
```

---

## Visualization Module (`datamol.viz`)

The viz module provides tools for rendering molecules and conformers as images.

### Main Visualization Function

#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, …)`

Generate image grid from molecules.

- **Parameters**:
  - `mols`: Single molecule or list of molecules
  - `legends`: String or list of strings as labels (one per molecule)
  - `n_cols`: Number of molecules per row (default: 4)
  - `use_svg`: Output SVG format (True) or PNG (False, default)
  - `mol_size`: Tuple (width, height) or single int for square images
  - `highlight_atom`: Atom indices to highlight (list or dict)
  - `highlight_bond`: Bond indices to highlight (list or dict)
  - `outfile`: Save path (local or remote, supports fsspec)
  - `max_mols`: Maximum number of molecules to display
  - `indices`: Draw atom indices on structures (default: False)
  - `align`: Align molecules using MCS (Maximum Common Substructure)
- **Returns**: Image object (can be displayed in Jupyter) or saves to file
- **Example**:

```python
# Basic grid
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])

# Save to file
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)

# Highlight substructure
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])

# Aligned visualization
dm.viz.to_image(mols, align=True, legends=activity_labels)
```

### Conformer Visualization

#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, …)`

Display multiple conformers in grid layout.

- **Parameters**:
  - `mol`: Molecule with embedded conformers
  - `n_confs`: Number or list of conformer indices to display (None = all)
  - `align_conf`: Align conformers for comparison (default: True)
  - `n_cols`: Grid columns (default: 3)
  - `sync_views`: Synchronize 3D views when interactive (default: True)
  - `remove_hs`: Remove hydrogens for clarity (default: True)
- **Returns**: Grid of conformer visualizations
- **Use case**: Comparing conformational diversity
- **Example**:

```python
mol_3d = dm.conformers.generate(mol, n_confs=20)
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
```

### Circle Grid Visualization

#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, …)`

Create concentric ring visualization with central molecule.

- **Parameters**:
  - `center_mol`: Molecule at center
  - `circle_mols`: List of molecule lists (one list per ring)
  - `mol_size`: Image size per molecule
  - `circle_margin`: Spacing between rings (default: 50)
  - `act_mapper`: Activity mapping dictionary for color-coding
- **Returns**: Circular grid image
- **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
- **Example**:

```python
# Show a reference molecule surrounded by similar compounds
dm.viz.circle_grid(
    center_mol=reference,
    circle_mols=[nearest_neighbors, second_tier]
)
```

### Visualization Best Practices

1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
2. **Align related molecules**: Use `align=True` in `to