Pubchem Database

"Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics."

Tools: pubchempy,requests,pandas

The full skill

---
name: pubchem-database
description: "Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics."
---

# PubChem Database

## Overview

PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES, retrieve molecular properties, perform similarity and substructure searches, and access bioactivity data using the PUG-REST API and PubChemPy.

## When to Use This Skill

This skill should be used when:

- Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
- Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
- Performing similarity searches to find structurally related compounds
- Conducting substructure searches for specific chemical motifs
- Accessing bioactivity data from screening assays
- Converting between chemical identifier formats (CID, SMILES, InChI)
- Batch processing multiple compounds for drug-likeness screening or property analysis

## Core Capabilities

### 1. Chemical Structure Search

Search for compounds using multiple identifier types:

**By Chemical Name**:

```python
import pubchempy as pcp

compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```

**By CID (Compound ID)**:

```python
compound = pcp.Compound.from_cid(2244)  # Aspirin
```

**By SMILES**:

```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```

**By InChI**:

```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/…', 'inchi')[0]
```

**By Molecular Formula**:

```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```

### 2. Property Retrieval

Retrieve molecular properties for compounds using either high-level or low-level approaches:

**Using PubChemPy (Recommended)**:

```python
import pubchempy as pcp

# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]

# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp  # Partition coefficient
tpsa = compound.tpsa    # Topological polar surface area
```

**Get Specific Properties**:

```python
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```

**Batch Property Retrieval**:

```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []

for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)

df = pd.DataFrame(all_properties)
```

**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).

### 3. Similarity Search

Find structurally similar compounds using Tanimoto similarity:

```python
import pubchempy as pcp

# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)

# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```

**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.

### 4. Substructure Search

Find compounds containing a specific structural motif:

```python
import pubchempy as pcp

# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")
```

**Common Substructures**:

- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`

### 5. Format Conversion

Convert between different chemical structure formats:

```python
import pubchempy as pcp

compound = pcp.get_compounds('aspirin', 'name')[0]

# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid

# Download structure files
# Argument order is download(outformat, path, identifier, namespace)
pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)
pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)
```

### 6. Structure Visualization

Generate 2D structure images:

```python
import pubchempy as pcp
import requests

# Download compound structure as PNG (output path comes before the identifier)
pcp.download('PNG', 'caffeine.png', 'caffeine', 'name', overwrite=True)

# Using direct URL (via requests)
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # Avoid writing an error page to the image file

with open('structure.png', 'wb') as f:
    f.write(response.content)
```

### 7. Synonym Retrieval

Get all known names and synonyms for a compound:

```python
import pubchempy as pcp

synonyms_data = pcp.get_synonyms('aspirin', 'name')

if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```

### 8. Bioactivity Data Access

Retrieve biological activity data from assays:

```python
import requests

# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```

**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script, which provides:

- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays

### 9. Comprehensive Compound Annotations

Access detailed compound information through PUG-View:

```python
import requests

cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)

if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```

**Get Specific Section**:

```python
# Get only drug information
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug and Medication Information"
```

## Installation Requirements

Install PubChemPy for Python-based access:

```bash
uv pip install pubchempy
```

For direct API access and bioactivity queries:

```bash
uv pip install requests
```

Optional for data analysis:

```bash
uv pip install pandas
```

## Helper Scripts

This skill includes Python scripts for common PubChem tasks:

### scripts/compound_search.py

Provides utility functions for searching and retrieving compound information:

**Key Functions**:

- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information

**Usage**:

```python
from scripts.compound_search import search_by_name, get_compound_properties

# Search for a compound
compounds = search_by_name('ibuprofen')

# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```

### scripts/bioactivity_query.py

Provides functions for retrieving biological activity data:

**Key Functions**:

- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target

**Usage**:

```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities

# Get bioactivity summary
summary = summarize_bioactivities(2244)  # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```

## API Rate Limits and Best Practices

**Rate Limits**:

- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute

**Best Practices**:

1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts

**Error Handling**:

```python
import pubchempy as pcp
# Note: importing TimeoutError here shadows Python's built-in TimeoutError
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    print("No results returned")
```

## Common Workflows

### Workflow 1: Chemical Identifier Conversion Pipeline

Convert between different chemical identifiers:

```python
import pubchempy as pcp

# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]

# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
```

### Workflow 2: Drug-Like Property Screening

Screen compounds using Lipinski's Rule of Five:

```python
import pubchempy as pcp

def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]

    # Lipinski's Rule of Five
    # float() guards against PubChemPy versions that return the weight as a string
    rules = {
        'MW <= 500': float(compound.molecular_weight) <= 500,
        # Compare against None so that an XLogP of 0 is still evaluated
        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }

    # Rules that could not be evaluated (None) are not counted as violations
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations

rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```

### Workflow 3: Finding Similar Drug Candidates

Identify structurally similar compounds to a known drug:

```python
import pubchempy as pcp

# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles

# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)

# Filter by drug-like properties
candidates = []
for comp in similar:
    if comp.molecular_weight and 200 <= float(comp.molecular_weight) <= 600:
        # Compare against None so that an XLogP of 0 is not skipped
        if comp.xlogp is not None and -1 <= comp.xlogp <= 5:
            candidates.append(comp)

print(f"Found {len(candidates)} drug-like candidates")
```

### Workflow 4: Batch Compound Property Comparison

Compare properties across multiple compounds:

```python
import pubchempy as pcp
import pandas as pd

compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']

properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")

df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```

### Workflow 5: Substructure-Based Virtual Screening

Screen for compounds containing specific pharmacophores:

```python
import pubchempy as pcp

# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'

# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

# Further filter by properties
filtered_hits = [
    comp for comp in hits
    if comp.molecular_weight and float(comp.molecular_weight) < 500
]

print(f"Found {len(filtered_hits)} compounds with desired substructure")
```

## Reference Documentation

For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:

- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation

## Troubleshooting

**Compound Not Found**:

- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format

**Timeout Errors**:

- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries

**Empty Property Values**:

- Not all properties are available for all compounds
- Check if property exists before accessing: `if compound.xlogp is not None:`
- Some properties are only available for certain compound types

**Rate Limit Exceeded**:

- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally

**Similarity/Substructure Search Hangs**:

- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out

## Additional Resources

- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy
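
## Example: Throttled Batch Property Requests

The best practices above recommend batching requests and adding 0.2-0.3 second delays to stay under 5 requests per second. The following is a minimal sketch of that idea, not part of the skill's helper scripts. The helper names (`chunked`, `property_url`, `fetch_properties`), the chunk size of 100, and the fixed 0.25 second delay are illustrative assumptions. It uses only the standard library; `requests` could be used the same way.

```python
import json
import time
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
MIN_INTERVAL = 0.25  # assumed delay in seconds; keeps the rate under 5 requests/second


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def property_url(cids, properties):
    """Build one PUG-REST URL requesting several properties for several CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def fetch_properties(cids, properties, chunk_size=100):
    """Fetch properties for many CIDs, sending one throttled request per chunk."""
    rows = []
    for chunk in chunked(list(cids), chunk_size):
        url = property_url(chunk, properties)
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        # PUG-REST wraps property results in PropertyTable -> Properties
        rows.extend(data["PropertyTable"]["Properties"])
        time.sleep(MIN_INTERVAL)  # simple fixed delay between requests
    return rows
```

Calling `fetch_properties([2244, 3672], ["MolecularWeight", "XLogP"])` would issue a single request for both CIDs and return a list of dictionaries that can be passed to `pd.DataFrame`. A fixed sleep is the simplest throttle; a production client might also retry on HTTP 503 responses, which this sketch does not attempt.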