
Gwas Database

"Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."


Tools: requests,gzip,pandas

The full skill

---
name: gwas-database
description: "Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."
---

# GWAS Catalog Database

## Overview

The GWAS Catalog is a curated repository of published genome-wide association studies maintained by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EBI). It contains SNP-trait associations from thousands of GWAS publications, including genetic variants, associated traits and diseases, p-values, effect sizes, and full summary statistics for many studies.

## When to Use This Skill

Use this skill when a query involves:

- **Genetic variant associations**: Finding SNPs associated with diseases or traits
- **SNP lookups**: Retrieving information about specific genetic variants (rs IDs)
- **Trait/disease searches**: Discovering genetic associations for phenotypes
- **Gene associations**: Finding variants in or near specific genes
- **GWAS summary statistics**: Accessing complete genome-wide association data
- **Study metadata**: Retrieving publication and cohort information
- **Population genetics**: Exploring ancestry-specific associations
- **Polygenic risk scores**: Identifying variants for risk prediction models
- **Functional genomics**: Understanding variant effects and genomic context
- **Systematic reviews**: Comprehensive literature synthesis of genetic associations

## Core Capabilities

### 1. Understanding GWAS Catalog Data Structure

The GWAS Catalog is organized around four core entities:

- **Studies**: GWAS publications with metadata (PMID, author, cohort details)
- **Associations**: SNP-trait associations with statistical evidence (the Catalog includes associations with p < 1×10⁻⁵; genome-wide significance is conventionally p ≤ 5×10⁻⁸)
- **Variants**: Genetic markers (SNPs) with genomic coordinates and alleles
- **Traits**: Phenotypes and diseases (mapped to EFO ontology terms)

**Key Identifiers:**

- Study accessions: `GCST` IDs (e.g., GCST001234)
- Variant IDs: `rs` numbers (e.g., rs7903146) or `variant_id` format
- Trait IDs: EFO terms (e.g., EFO_0001360 for type 2 diabetes)
- Gene symbols: HGNC approved names (e.g., TCF7L2)

### 2. Web Interface Searches

The web interface at https://www.ebi.ac.uk/gwas/ supports multiple search modes:

**By Variant (rs ID):**

```
rs7903146
```

Returns all trait associations for this SNP.

**By Disease/Trait:**

```
type 2 diabetes
Parkinson disease
body mass index
```

Returns all associated genetic variants.

**By Gene:**

```
APOE
TCF7L2
```

Returns variants in or near the gene region.

**By Chromosomal Region:**

```
10:114000000-115000000
```

Returns variants in the specified genomic interval.

**By Publication:**

```
PMID:20581827
Author: McCarthy MI
GCST001234
```

Returns study details and all reported associations.

### 3. REST API Access

The GWAS Catalog provides two REST APIs for programmatic access:

**Base URLs:**

- GWAS Catalog API: `https://www.ebi.ac.uk/gwas/rest/api`
- Summary Statistics API: `https://www.ebi.ac.uk/gwas/summary-statistics/api`

**API Documentation:**

- Main API docs: https://www.ebi.ac.uk/gwas/rest/docs/api
- Summary stats docs: https://www.ebi.ac.uk/gwas/summary-statistics/docs/

**Core Endpoints:**

1. **Studies endpoint**: `/studies/{accessionID}`

```python
import requests

# Get a specific study
url = "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795"
# A GET request has no body, so ask for JSON with Accept rather than Content-Type
response = requests.get(url, headers={"Accept": "application/json"})
study = response.json()
```

2. **Associations endpoint**: `/associations`

```python
# Find associations for a variant
variant = "rs7903146"
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Accept": "application/json"})
associations = response.json()
```

3. **Variants endpoint**: `/singleNucleotidePolymorphisms/{rsID}`

```python
# Get variant details
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146"
response = requests.get(url, headers={"Accept": "application/json"})
variant_info = response.json()
```

4. **Traits endpoint**: `/efoTraits/{efoID}`

```python
# Get trait information
url = "https://www.ebi.ac.uk/gwas/rest/api/efoTraits/EFO_0001360"
response = requests.get(url, headers={"Accept": "application/json"})
trait_info = response.json()
```

### 4. Query Examples and Patterns

**Example 1: Find all associations for a disease**

```python
import requests

trait = "EFO_0001360"  # Type 2 diabetes
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Query associations for this trait
url = f"{base_url}/efoTraits/{trait}/associations"
response = requests.get(url, headers={"Accept": "application/json"})
associations = response.json()

# Process results
for assoc in associations.get('_embedded', {}).get('associations', []):
    variant = assoc.get('rsId')
    pvalue = assoc.get('pvalue')
    risk_allele = assoc.get('strongestAllele')
    print(f"{variant}: p={pvalue}, risk allele={risk_allele}")
```

**Example 2: Get variant information and all trait associations**

```python
import requests

variant = "rs7903146"
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Get variant details
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}"
response = requests.get(url, headers={"Accept": "application/json"})
variant_data = response.json()

# Get all associations for this variant
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Accept": "application/json"})
associations = response.json()

# Extract trait names and p-values
for assoc in associations.get('_embedded', {}).get('associations', []):
    trait = assoc.get('efoTrait')
    pvalue = assoc.get('pvalue')
    print(f"Trait: {trait}, p-value: {pvalue}")
```

**Example 3: Access summary statistics**

```python
import requests

# Query summary statistics API
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"

# Find associations by trait with p-value threshold
trait = "EFO_0001360"  # Type 2 diabetes
p_upper = "0.000000001"  # p < 1e-9
url = f"{base_url}/traits/{trait}/associations"
params = {
    "p_upper": p_upper,
    "size": 100  # Number of results
}
response = requests.get(url, params=params)
results = response.json()

# Process hits below the p-value threshold
for hit in results.get('_embedded', {}).get('associations', []):
    variant_id = hit.get('variant_id')
    chromosome = hit.get('chromosome')
    position = hit.get('base_pair_location')
    pvalue = hit.get('p_value')
    print(f"{chromosome}:{position} ({variant_id}): p={pvalue}")
```

**Example 4: Query by chromosomal region**

```python
import requests

# Find variants in a specific genomic region
chromosome = "10"
start_pos = 114000000
end_pos = 115000000

base_url = "https://www.ebi.ac.uk/gwas/rest/api"
url = f"{base_url}/singleNucleotidePolymorphisms/search/findByChromBpLocationRange"
params = {
    "chrom": chromosome,
    "bpStart": start_pos,
    "bpEnd": end_pos
}
response = requests.get(url, params=params, headers={"Accept": "application/json"})
variants_in_region = response.json()
```

### 5. Working with Summary Statistics

The GWAS Catalog hosts full summary statistics for many studies, providing access to all tested variants (not just genome-wide significant hits).

**Access Methods:**

1. **FTP download**: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/
2. **REST API**: Query-based access to summary statistics
3. **Web interface**: Browse and download via the website

**Summary Statistics API Features:**

- Filter by chromosome, position, p-value
- Query specific variants across studies
- Retrieve effect sizes and allele frequencies
- Access harmonized and standardized data

**Example: Download summary statistics for a study**

```python
import requests

# Get available summary statistics
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
url = f"{base_url}/studies/GCST001234"
response = requests.get(url)
study_info = response.json()

# Download link is provided in the response
# Alternatively, use FTP:
# ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/
```

### 6. Data Integration and Cross-referencing

The GWAS Catalog provides links to external resources:

**Genomic Databases:**

- Ensembl: Gene annotations and variant consequences
- dbSNP: Variant identifiers and population frequencies
- gnomAD: Population allele frequencies

**Functional Resources:**

- Open Targets: Target-disease associations
- PGS Catalog: Polygenic risk scores
- UCSC Genome Browser: Genomic context

**Phenotype Resources:**

- EFO (Experimental Factor Ontology): Standardized trait terms
- OMIM: Disease gene relationships
- Disease Ontology: Disease hierarchies

**Following Links in API Responses:**

```python
import requests

# API responses include _links for related resources
response = requests.get("https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234")
study = response.json()

# Follow link to associations
associations_url = study['_links']['associations']['href']
associations_response = requests.get(associations_url)
```

## Query Workflows

### Workflow 1: Exploring Genetic Associations for a Disease

1. **Identify the trait** using EFO terms or free text:
   - Search web interface for disease name
   - Note the EFO ID (e.g., EFO_0001360 for type 2 diabetes)

2. **Query associations via API:**

```python
efo_id = "EFO_0001360"  # example: type 2 diabetes
url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
```

3. **Filter by significance and population:**
   - Check p-values (genome-wide significant: p ≤ 5×10⁻⁸)
   - Review ancestry information in study metadata
   - Filter by sample size or discovery/replication status

4. **Extract variant details:**
   - rs IDs for each association
   - Effect alleles and directions
   - Effect sizes (odds ratios, beta coefficients)
   - Population allele frequencies

5. **Cross-reference with other databases:**
   - Look up variant consequences in Ensembl
   - Check population frequencies in gnomAD
   - Explore gene function and pathways

### Workflow 2: Investigating a Specific Genetic Variant

1. **Query the variant:**

```python
rs_id = "rs7903146"  # example variant
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}"
```

2. **Retrieve all trait associations:**

```python
rs_id = "rs7903146"  # example variant
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
```

3. **Analyze pleiotropy:**
   - Identify all traits associated with this variant
   - Review effect directions across traits
   - Look for shared biological pathways

4. **Check genomic context:**
   - Determine nearby genes
   - Identify if variant is in coding/regulatory regions
   - Review linkage disequilibrium with other variants

### Workflow 3: Gene-Centric Association Analysis

1. **Search by gene symbol** in web interface or:

```python
gene_symbol = "TCF7L2"  # example gene
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"
params = {"geneName": gene_symbol}
```

2. **Retrieve variants in gene region:**
   - Get chromosomal coordinates for gene
   - Query variants in region
   - Include promoter and regulatory regions (extend boundaries)

3. **Analyze association patterns:**
   - Identify traits associated with variants in this gene
   - Look for consistent associations across studies
   - Review effect sizes and directions

4. **Functional interpretation:**
   - Determine variant consequences (missense, regulatory, etc.)
   - Check expression QTL (eQTL) data
   - Review pathway and network context

### Workflow 4: Systematic Review of Genetic Evidence

1. **Define research question:**
   - Specific trait or disease of interest
   - Population considerations
   - Study design requirements

2. **Comprehensive variant extraction:**
   - Query all associations for trait
   - Set significance threshold
   - Note discovery and replication studies

3. **Quality assessment:**
   - Review study sample sizes
   - Check for population diversity
   - Assess heterogeneity across studies
   - Identify potential biases

4. **Data synthesis:**
   - Aggregate associations across studies
   - Perform meta-analysis if applicable
   - Create summary tables
   - Generate Manhattan or forest plots

5. **Export and documentation:**
   - Download full association data
   - Export summary statistics if needed
   - Document search strategy and date
   - Create reproducible analysis scripts

### Workflow 5: Accessing and Analyzing Summary Statistics

1. **Identify studies with summary statistics:**
   - Browse summary statistics portal
   - Check FTP directory listings
   - Query API for available studies

2. **Download summary statistics:**

```bash
# Via FTP
wget ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/harmonised/GCSTXXXXXX-harmonised.tsv.gz
```

3. **Query via API for specific variants:**

```python
chrom, start_pos, end_pos = "10", 114000000, 115000000  # example region
url = f"https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/{chrom}/associations"
params = {"start": start_pos, "end": end_pos}
```

4. **Process and analyze:**
   - Filter by p-value thresholds
   - Extract effect sizes and confidence intervals
   - Perform downstream analyses (fine-mapping, colocalization, etc.)
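
The "process and analyze" step of Workflow 5 can be sketched with pandas roughly as follows. This is a minimal illustration, not the Catalog's own tooling: the helper `significant_hits` is hypothetical, the column names `beta` and `standard_error` are assumed to be present in the harmonised file (check the header of the file you download), and the three rows are made-up values used only to exercise the code. The interval uses the usual normal approximation, beta ± 1.96 × SE.

```python
import io

import pandas as pd


def significant_hits(df: pd.DataFrame, p_threshold: float = 5e-8) -> pd.DataFrame:
    """Keep rows at or below the p-value threshold and add an approximate 95% CI.

    Assumes columns named p_value, beta and standard_error (an assumption
    about the file layout, not something this function verifies).
    """
    hits = df[df["p_value"] <= p_threshold].copy()
    # Normal-approximation 95% confidence interval for the effect size
    hits["ci_lower"] = hits["beta"] - 1.96 * hits["standard_error"]
    hits["ci_upper"] = hits["beta"] + 1.96 * hits["standard_error"]
    return hits.sort_values("p_value").reset_index(drop=True)


# Made-up rows standing in for a downloaded summary statistics file
SAMPLE_TSV = (
    "variant_id\tchromosome\tbase_pair_location\tp_value\tbeta\tstandard_error\n"
    "rs_demo_1\t10\t1000\t1e-50\t0.30\t0.02\n"
    "rs_demo_2\t1\t2000\t0.2\t0.01\t0.05\n"
    "rs_demo_3\t2\t3000\t3e-9\t-0.10\t0.015\n"
)

sample = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
hits = significant_hits(sample)
print(hits[["variant_id", "p_value", "beta", "ci_lower", "ci_upper"]])
```

For a real download, the same function could be applied to `pd.read_csv(path, sep="\t")`, since pandas infers gzip compression from a `.gz` suffix. Full summary statistics files can be large, so reading in chunks with the `chunksize` argument may be preferable.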
## Response Formats and Data Fields

**Key Fields in Association Records:**

- `rsId`: Variant identifier (rs number)
- `strongestAllele`: Risk allele for the association
- `pvalue`: Association p-value
- `pvalueText`: P-value as text (may include inequality)
- `orPerCopyNum`: Odds ratio (per copy of the risk allele)
- `betaNum`: Effect size (for quantitative traits)
- `betaUnit`: Unit of measurement for beta
- `range`: Confidence interval
- `efoTrait`: Associated trait name
- `mappedLabel`: EFO-mapped trait term

**Study Metadata Fields:**

- `accessionId`: GCST study identifier
- `pubmedId`: PubMed ID
- `author`: First author
- `publicationDate`: Publication date
- `ancestryInitial`: Discovery population ancestry
- `ancestryReplication`: Replication population ancestry
- `sampleSize`: Total sample size

**Pagination:**

Results are paginated (default 20 items per page). Navigate using:

- `size` parameter: Number of results per page
- `page` parameter: Page number (0-indexed)
- `_links` in response: URLs for next/previous pages

## Best Practices

### Query Strategy

- Start with web interface to identify relevant EFO terms and study accessions
- Use API for bulk data extraction and automated analyses
- Implement pagination handling for large result sets
- Cache API responses to minimize redundant requests

### Data Interpretation

- Always check p-value thresholds (genome-wide: 5×10⁻⁸)
- Review ancestry information for population applicability
- Consider sample size when assessing evidence strength
- Check for replication across independent studies
- Be aware of winner's curse in effect size estimates

### Rate Limiting and Ethics

- Respect API usage guidelines (no excessive requests)
- Use summary statistics downloads for genome-wide analyses
- Implement appropriate delays between API calls
- Cache results locally when performing iterative analyses
- Cite the GWAS Catalog in publications

### Data Quality Considerations

- GWAS Catalog curates published associations (may contain inconsistencies)
- Effect sizes reported as published (may need harmonization)
- Some studies report conditional or joint associations
- Check for study overlap when combining results
- Be aware of ascertainment and selection biases

## Python Integration Example

Complete workflow for querying and analyzing GWAS data:

```python
import requests
import pandas as pd
from time import sleep


def query_gwas_catalog(trait_id, p_threshold=5e-8):
    """
    Query GWAS Catalog for trait associations

    Args:
        trait_id: EFO trait identifier (e.g., 'EFO_0001360')
        p_threshold: P-value threshold for filtering

    Returns:
        pandas DataFrame with association results
    """
    base_url = "https://www.ebi.ac.uk/gwas/rest/api"
    url = f"{base_url}/efoTraits/{trait_id}/associations"
    headers = {"Accept": "application/json"}
    columns = ['variant', 'pvalue', 'risk_allele', 'or_beta', 'trait', 'pubmed_id']
    results = []
    page = 0

    while True:
        params = {"page": page, "size": 100}
        response = requests.get(url, params=params, headers=headers, timeout=30)

        if response.status_code != 200:
            break  # Stop on any error response

        data = response.json()
        associations = data.get('_embedded', {}).get('associations', [])

        if not associations:
            break  # No more pages

        for assoc in associations:
            pvalue = assoc.get('pvalue')
            if pvalue is not None and float(pvalue) <= p_threshold:
                results.append({
                    'variant': assoc.get('rsId'),
                    'pvalue': pvalue,
                    'risk_allele': assoc.get('strongestAllele'),
                    'or_beta': assoc.get('orPerCopyNum') or assoc.get('betaNum'),
                    'trait': assoc.get('efoTrait'),
                    'pubmed_id': assoc.get('pubmedId')
                })

        page += 1
        sleep(0.1)  # Rate limiting

    # Fixed columns keep the DataFrame usable even when nothing passes the filter
    return pd.DataFrame(results, columns=columns)


# Example usage
df = query_gwas_catalog('EFO_0001360')  # Type 2 diabetes
print(df.head())
print(f"\nTotal associations: {len(df)}")
print(f"Unique variants: {df['variant'].nunique()}")
```

## Resources

### references/api_reference.md

Comprehensive API documentation including:

- Detailed endpoint specifications for both APIs
- Complete list of query parameters and filters
- Response format specifications and field descriptions
- Advanced query examples and patterns
- Error handling and troubleshooting
- Integration with external databases

Consult this reference when:

- Constructing complex API queries
- Understanding response structures
- Implementing pagination or batch operations
- Troubleshooting API errors
- Exploring advanced filtering options

### Training Materials

The GWAS Catalog team provides workshop materials:

- GitHub repository: https://github.com/EBISPOT/GWAS_Catalog-workshop
- Jupyter notebooks with example queries
- Google Colab integration for cloud execution

## Important Notes

### Data Updates

- The GWAS Catalog is updated regularly with new publications
- Re-run queries periodically for comprehensive coverage
- Summary statistics are added as studies release data
- EFO mappings may be updated over time

### Citation Requirements

When using GWAS Catalog data, cite:

- Sollis E, et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. PMID: 37953337
- Include access date and version when available
- Cite original studies when discussing specific findings

### Limitations

- Not all GWAS publications are included (curation criteria apply)
- Full summary statistics are available for only a subset of studies
- Effect sizes may require harmonization across studies
- Population diversity is growing but historically limited
- Some associations represent conditional or joint effects

### Data Access

- Web interface: Free, no registration required
- REST APIs: Free, no API key needed
- FTP downloads: Open access
- Rate limiting applies to the API (keep request volume modest)

## Additional Resources

- **GWAS Catalog website**: https://www.ebi.ac.uk/gwas/
- **Documentation**: https://www.ebi.ac.uk/gwas/docs
- **API documentation**: https://www.ebi.ac.uk/gwas/rest/docs/api
- **Summary Statistics API**: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
- **FTP site**: http://ftp.ebi.ac.uk/pub/databases/gwas/
- **Training materials**: https://github.com/EBISPOT/GWAS_Catalog-workshop
- **PGS Catalog** (polygenic scores): https://www.pgscatalog.org/
- **Help and support**: [email protected]
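
As a closing illustration of the pagination advice above (navigating via `_links` rather than counting pages), the following sketch walks a paginated response by following each page's `next` link. The helper `iter_associations` and the injected `fetch_json` callable are hypothetical names introduced here; the `_links.next.href` layout is assumed from the `_links` description in this document, and the fake pages exist only so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_associations(
    first_url: str,
    fetch_json: Callable[[str], Dict],
    max_pages: int = 1000,
) -> Iterator[Dict]:
    """Yield association records page by page, following _links.next.href.

    fetch_json is any callable that maps a URL to a decoded JSON dict;
    max_pages is a safety cap against endless link loops.
    """
    url: Optional[str] = first_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        page = fetch_json(url)
        yield from page.get("_embedded", {}).get("associations", [])
        # A missing "next" link means this was the last page
        url = page.get("_links", {}).get("next", {}).get("href")
        pages_seen += 1


# Fake two-page response used only to demonstrate the traversal
FAKE_PAGES = {
    "page0": {
        "_embedded": {"associations": [{"rsId": "rs1"}, {"rsId": "rs2"}]},
        "_links": {"next": {"href": "page1"}},
    },
    "page1": {
        "_embedded": {"associations": [{"rsId": "rs3"}]},
        "_links": {},
    },
}

results = list(iter_associations("page0", FAKE_PAGES.__getitem__))
print([record["rsId"] for record in results])
```

Against the live API, `fetch_json` could be something like `lambda u: requests.get(u, headers={"Accept": "application/json"}, timeout=30).json()`, optionally wrapped with a short delay or a local cache to keep request volume low.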