GWAS Database
"Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."
Tools: requests,gzip,pandas
The full skill
---
name: gwas-database
description: "Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."
---
# GWAS Catalog Database
## Overview
The GWAS Catalog is a comprehensive repository of published genome-wide association studies maintained by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EBI). The catalog contains curated SNP-trait associations from thousands of GWAS publications, including genetic variants, associated traits and diseases, p-values, effect sizes, and full summary statistics for many studies.
## When to Use This Skill
This skill should be used when queries involve:
– **Genetic variant associations**: Finding SNPs associated with diseases or traits
– **SNP lookups**: Retrieving information about specific genetic variants (rs IDs)
– **Trait/disease searches**: Discovering genetic associations for phenotypes
– **Gene associations**: Finding variants in or near specific genes
– **GWAS summary statistics**: Accessing complete genome-wide association data
– **Study metadata**: Retrieving publication and cohort information
– **Population genetics**: Exploring ancestry-specific associations
– **Polygenic risk scores**: Identifying variants for risk prediction models
– **Functional genomics**: Understanding variant effects and genomic context
– **Systematic reviews**: Comprehensive literature synthesis of genetic associations
## Core Capabilities
### 1. Understanding GWAS Catalog Data Structure
The GWAS Catalog is organized around four core entities:
– **Studies**: GWAS publications with metadata (PMID, author, cohort details)
– **Associations**: SNP-trait associations with statistical evidence (p ≤ 5×10⁻⁸)
– **Variants**: Genetic markers (SNPs) with genomic coordinates and alleles
– **Traits**: Phenotypes and diseases (mapped to EFO ontology terms)
**Key Identifiers:**
– Study accessions: `GCST` IDs (e.g., GCST001234)
– Variant IDs: `rs` numbers (e.g., rs7903146) or `variant_id` format
– Trait IDs: EFO terms (e.g., EFO_0001360 for type 2 diabetes)
– Gene symbols: HGNC approved names (e.g., TCF7L2)
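As a quick illustration of the identifier formats listed above, the sketch below classifies a string as a study accession, variant ID, or trait ID. The regular expressions are inferred from the examples in this section and are an assumption, not an official specification; `classify_identifier` is a hypothetical helper name.

```python
import re

# Patterns inferred from the examples above (GCST001234, rs7903146, EFO_0001360).
# They are assumptions: real identifiers may follow other schemes as well.
ID_PATTERNS = {
    "study": re.compile(r"^GCST\d+$"),
    "variant": re.compile(r"^rs\d+$"),
    "trait": re.compile(r"^EFO_\d+$"),
}


def classify_identifier(value):
    """Return 'study', 'variant', 'trait', or None for an unrecognised string."""
    value = value.strip()
    for kind, pattern in ID_PATTERNS.items():
        if pattern.match(value):
            return kind
    return None


for example in ("GCST001234", "rs7903146", "EFO_0001360", "TCF7L2"):
    print(example, "->", classify_identifier(example))
```

Gene symbols are deliberately left unclassified here, since they cannot be told apart from free text by pattern alone.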
### 2. Web Interface Searches
The web interface at https://www.ebi.ac.uk/gwas/ supports multiple search modes:
**By Variant (rs ID):**
```
rs7903146
```
Returns all trait associations for this SNP.
**By Disease/Trait:**
```
type 2 diabetes
Parkinson disease
body mass index
```
Returns all associated genetic variants.
**By Gene:**
```
APOE
TCF7L2
```
Returns variants in or near the gene region.
**By Chromosomal Region:**
```
10:114000000-115000000
```
Returns variants in the specified genomic interval.
**By Publication:**
```
PMID:20581827
Author: McCarthy MI
GCST001234
```
Returns study details and all reported associations.
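When the same region string used in the web search (`chrom:start-end`) needs to feed a script, it can be split into its parts first. A minimal sketch, assuming the format shown in the chromosomal region example; `parse_region` is a hypothetical helper, not part of any GWAS Catalog client.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) with integer positions."""
    chrom, _, span = region.strip().partition(":")
    start_text, _, end_text = span.partition("-")
    if not chrom or not start_text or not end_text:
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    # Tolerate thousands separators such as 114,000,000
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("start position must not exceed end position")
    return chrom, start, end


print(parse_region("10:114000000-115000000"))
```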
### 3. REST API Access
The GWAS Catalog provides two REST APIs for programmatic access:
**Base URLs:**
– GWAS Catalog API: `https://www.ebi.ac.uk/gwas/rest/api`
– Summary Statistics API: `https://www.ebi.ac.uk/gwas/summary-statistics/api`
**API Documentation:**
– Main API docs: https://www.ebi.ac.uk/gwas/rest/docs/api
– Summary stats docs: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
**Core Endpoints:**
1. **Studies endpoint** – `/studies/{accessionID}`
```python
import requests

# Get a specific study
url = "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795"
response = requests.get(url, headers={"Content-Type": "application/json"})
study = response.json()
```
2. **Associations endpoint** – `/associations`
```python
import requests

# Find associations for a variant
variant = "rs7903146"
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
associations = response.json()
```
3. **Variants endpoint** – `/singleNucleotidePolymorphisms/{rsID}`
```python
import requests

# Get variant details
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146"
response = requests.get(url, headers={"Content-Type": "application/json"})
variant_info = response.json()
```
4. **Traits endpoint** – `/efoTraits/{efoID}`
```python
import requests

# Get trait information
url = "https://www.ebi.ac.uk/gwas/rest/api/efoTraits/EFO_0001360"
response = requests.get(url, headers={"Content-Type": "application/json"})
trait_info = response.json()
```
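The endpoint paths above follow one pattern (base URL, resource name, identifier, optional sub-resource), so they can be assembled by a small helper instead of being hard-coded. This is a sketch only: `resource_url` and the `RESOURCES` mapping are hypothetical names, and only the paths already listed in this section are covered.

```python
from urllib.parse import quote

BASE_URL = "https://www.ebi.ac.uk/gwas/rest/api"

# Resource paths taken from the core endpoints listed above.
RESOURCES = {
    "study": "studies",
    "variant": "singleNucleotidePolymorphisms",
    "trait": "efoTraits",
}


def resource_url(kind, identifier, sub_resource=None):
    """Build a REST URL such as <base>/studies/GCST001795[/associations]."""
    if kind not in RESOURCES:
        raise ValueError(f"unknown resource kind: {kind!r}")
    parts = [BASE_URL, RESOURCES[kind], quote(identifier, safe="")]
    if sub_resource:
        parts.append(sub_resource)
    return "/".join(parts)


print(resource_url("variant", "rs7903146", "associations"))
```

The returned string can be passed directly to `requests.get`, as in the examples above.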
### 4. Query Examples and Patterns
**Example 1: Find all associations for a disease**
```python
import requests

trait = "EFO_0001360"  # Type 2 diabetes
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Query associations for this trait
url = f"{base_url}/efoTraits/{trait}/associations"
response = requests.get(url, headers={"Content-Type": "application/json"})
associations = response.json()

# Process results
for assoc in associations.get('_embedded', {}).get('associations', []):
    variant = assoc.get('rsId')
    pvalue = assoc.get('pvalue')
    risk_allele = assoc.get('strongestAllele')
    print(f"{variant}: p={pvalue}, risk allele={risk_allele}")
```
**Example 2: Get variant information and all trait associations**
```python
import requests

variant = "rs7903146"
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Get variant details
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}"
response = requests.get(url, headers={"Content-Type": "application/json"})
variant_data = response.json()

# Get all associations for this variant
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
associations = response.json()

# Extract trait names and p-values
for assoc in associations.get('_embedded', {}).get('associations', []):
    trait = assoc.get('efoTrait')
    pvalue = assoc.get('pvalue')
    print(f"Trait: {trait}, p-value: {pvalue}")
```
**Example 3: Access summary statistics**
```python
import requests

# Query summary statistics API
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"

# Find associations by trait with p-value threshold
trait = "EFO_0001360"  # Type 2 diabetes
p_upper = "0.000000001"  # p < 1e-9
url = f"{base_url}/traits/{trait}/associations"
params = {
    "p_upper": p_upper,
    "size": 100,  # Number of results
}
response = requests.get(url, params=params)
results = response.json()

# The embedded associations may arrive as a list or as a mapping keyed by
# index; handle both so the loop below always sees individual records.
associations = results.get('_embedded', {}).get('associations', [])
hits = associations.values() if isinstance(associations, dict) else associations

# Process genome-wide significant hits
for hit in hits:
    variant_id = hit.get('variant_id')
    chromosome = hit.get('chromosome')
    position = hit.get('base_pair_location')
    pvalue = hit.get('p_value')
    print(f"{chromosome}:{position} ({variant_id}): p={pvalue}")
```
**Example 4: Query by chromosomal region**
```python
import requests

# Find variants in a specific genomic region
chromosome = "10"
start_pos = 114000000
end_pos = 115000000
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
url = f"{base_url}/singleNucleotidePolymorphisms/search/findByChromBpLocationRange"
params = {
    "chrom": chromosome,
    "bpStart": start_pos,
    "bpEnd": end_pos,
}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
variants_in_region = response.json()
```
### 5. Working with Summary Statistics
The GWAS Catalog hosts full summary statistics for many studies, providing access to all tested variants (not just genome-wide significant hits).
**Access Methods:**
1. **FTP download**: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/
2. **REST API**: Query-based access to summary statistics
3. **Web interface**: Browse and download via the website
**Summary Statistics API Features:**
– Filter by chromosome, position, p-value
– Query specific variants across studies
– Retrieve effect sizes and allele frequencies
– Access harmonized and standardized data
**Example: Download summary statistics for a study**
```python
import requests
import gzip

# Get available summary statistics
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
url = f"{base_url}/studies/GCST001234"
response = requests.get(url)
study_info = response.json()

# Download link is provided in the response
# Alternatively, use FTP:
# ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/
```
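Once a summary statistics file has been downloaded, it is usually too large to load in one go. The sketch below streams a gzipped TSV row by row and keeps only rows at or below a p-value threshold. It assumes a tab-separated file with a `p_value` column (the field name used in the Summary Statistics API example above); actual column names can differ between files, so check the header first. `iter_significant_rows` is a hypothetical helper.

```python
import csv
import gzip


def iter_significant_rows(path, p_threshold=5e-8, p_column="p_value"):
    """Yield rows (as dicts) whose p-value is at or below the threshold."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                p_value = float(row[p_column])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric p-value
            if p_value <= p_threshold:
                yield row
```

Because the function is a generator, memory use stays flat regardless of file size; wrap the call in `list(...)` only when the filtered result is known to be small.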
### 6. Data Integration and Cross-referencing
The GWAS Catalog provides links to external resources:
**Genomic Databases:**
– Ensembl: Gene annotations and variant consequences
– dbSNP: Variant identifiers and population frequencies
– gnomAD: Population allele frequencies
**Functional Resources:**
– Open Targets: Target-disease associations
– PGS Catalog: Polygenic risk scores
– UCSC Genome Browser: Genomic context
**Phenotype Resources:**
– EFO (Experimental Factor Ontology): Standardized trait terms
– OMIM: Disease gene relationships
– Disease Ontology: Disease hierarchies
**Following Links in API Responses:**
```python
import requests

# API responses include _links for related resources
response = requests.get("https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234")
study = response.json()

# Follow link to associations
associations_url = study['_links']['associations']['href']
associations_response = requests.get(associations_url)
```
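Indexing `_links` directly raises `KeyError` when a link is absent, and link values in HAL-style responses may be URI templates (for example ending in `{?projection}`). The sketch below is one defensive way to read a link; `link_href` is a hypothetical helper, and the template handling is an assumption about the response format rather than documented behaviour.

```python
def link_href(resource, rel):
    """Return the href for a named link, or None if the link is absent."""
    href = resource.get("_links", {}).get(rel, {}).get("href")
    if href is None:
        return None
    # Drop any URI-template suffix such as "{?projection}" to get a plain URL.
    return href.split("{", 1)[0]


# Illustrative response fragment (not real API output)
example = {
    "_links": {
        "associations": {
            "href": "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234/associations{?projection}"
        }
    }
}
print(link_href(example, "associations"))
print(link_href(example, "snps"))
```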
## Query Workflows
### Workflow 1: Exploring Genetic Associations for a Disease
1. **Identify the trait** using EFO terms or free text:
– Search web interface for disease name
– Note the EFO ID (e.g., EFO_0001360 for type 2 diabetes)
2. **Query associations via API:**
```python
url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
```
3. **Filter by significance and population:**
– Check p-values (genome-wide significant: p ≤ 5×10⁻⁸)
– Review ancestry information in study metadata
– Filter by sample size or discovery/replication status
4. **Extract variant details:**
– rs IDs for each association
– Effect alleles and directions
– Effect sizes (odds ratios, beta coefficients)
– Population allele frequencies
5. **Cross-reference with other databases:**
– Look up variant consequences in Ensembl
– Check population frequencies in gnomAD
– Explore gene function and pathways
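The filtering step in this workflow can be sketched as a pure function over association records that have already been fetched. It assumes each record is a dict with a `pvalue` entry, as in the examples above; the sample records are invented for illustration, and `significant_associations` is a hypothetical helper.

```python
def significant_associations(records, p_threshold=5e-8):
    """Keep records with a parseable p-value at or below the threshold, smallest first."""
    kept = []
    for record in records:
        try:
            p_value = float(record.get("pvalue"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric p-value
        if p_value <= p_threshold:
            kept.append({**record, "pvalue": p_value})
    return sorted(kept, key=lambda r: r["pvalue"])


# Invented records, for illustration only
sample = [
    {"rsId": "rs1", "pvalue": "3e-10"},
    {"rsId": "rs2", "pvalue": 1e-6},
    {"rsId": "rs3", "pvalue": None},
    {"rsId": "rs4", "pvalue": 2e-12},
]
for hit in significant_associations(sample):
    print(hit["rsId"], hit["pvalue"])
```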
### Workflow 2: Investigating a Specific Genetic Variant
1. **Query the variant:**
```python
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}"
```
2. **Retrieve all trait associations:**
```python
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
```
3. **Analyze pleiotropy:**
– Identify all traits associated with this variant
– Review effect directions across traits
– Look for shared biological pathways
4. **Check genomic context:**
– Determine nearby genes
– Identify if variant is in coding/regulatory regions
– Review linkage disequilibrium with other variants
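For the pleiotropy step, one simple approach is to group fetched association records by variant and count distinct traits. The sketch below assumes records carry `rsId` and `efoTrait` keys, following the field names used elsewhere in this document; the sample data and function names are hypothetical.

```python
from collections import defaultdict


def traits_by_variant(records):
    """Map each rsId to the sorted list of distinct traits reported for it."""
    traits = defaultdict(set)
    for record in records:
        rs_id, trait = record.get("rsId"), record.get("efoTrait")
        if rs_id and trait:
            traits[rs_id].add(trait)
    return {rs_id: sorted(names) for rs_id, names in traits.items()}


def pleiotropic_variants(records, min_traits=2):
    """Return only the variants associated with at least `min_traits` traits."""
    grouped = traits_by_variant(records)
    return {rs: names for rs, names in grouped.items() if len(names) >= min_traits}


# Invented records, for illustration only
sample_records = [
    {"rsId": "rsA", "efoTrait": "trait one"},
    {"rsId": "rsA", "efoTrait": "trait two"},
    {"rsId": "rsA", "efoTrait": "trait one"},
    {"rsId": "rsB", "efoTrait": "trait one"},
]
print(pleiotropic_variants(sample_records))
```

Counting trait labels is only a first pass: closely related or overlapping trait terms can inflate the count, so the grouped output still needs manual review.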
### Workflow 3: Gene-Centric Association Analysis
1. **Search by gene symbol** in web interface or:
```python
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"
params = {"geneName": gene_symbol}
```
2. **Retrieve variants in gene region:**
– Get chromosomal coordinates for gene
– Query variants in region
– Include promoter and regulatory regions (extend boundaries)
3. **Analyze association patterns:**
– Identify traits associated with variants in this gene
– Look for consistent associations across studies
– Review effect sizes and directions
4. **Functional interpretation:**
– Determine variant consequences (missense, regulatory, etc.)
– Check expression QTL (eQTL) data
– Review pathway and network context
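Extending gene boundaries to cover promoter and regulatory sequence amounts to padding the interval on both sides. A minimal sketch follows; the 50 kb default flank is an arbitrary assumption, the helper names are hypothetical, and the parameter names (`chrom`, `bpStart`, `bpEnd`) follow the region query in Example 4.

```python
def extend_region(start, end, flank=50_000):
    """Widen an interval by `flank` bases on each side, clamping at position 1."""
    if start > end:
        raise ValueError("start must not exceed end")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return max(1, start - flank), end + flank


def region_params(chrom, start, end, flank=50_000):
    """Build query parameters for a padded region search."""
    padded_start, padded_end = extend_region(start, end, flank)
    return {"chrom": str(chrom), "bpStart": padded_start, "bpEnd": padded_end}


# Coordinates reuse the example interval from this document, not a real gene model
print(region_params("10", 114000000, 115000000))
```

Gene coordinates themselves have to come from an annotation source such as Ensembl, and must use the same genome build as the catalog data being queried.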
### Workflow 4: Systematic Review of Genetic Evidence
1. **Define research question:**
– Specific trait or disease of interest
– Population considerations
– Study design requirements
2. **Comprehensive variant extraction:**
– Query all associations for trait
– Set significance threshold
– Note discovery and replication studies
3. **Quality assessment:**
– Review study sample sizes
– Check for population diversity
– Assess heterogeneity across studies
– Identify potential biases
4. **Data synthesis:**
– Aggregate associations across studies
– Perform meta-analysis if applicable
– Create summary tables
– Generate Manhattan or forest plots
5. **Export and documentation:**
– Download full association data
– Export summary statistics if needed
– Document search strategy and date
– Create reproducible analysis scripts
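If a meta-analysis is appropriate in the data synthesis step, the simplest option is a fixed-effect, inverse-variance weighted estimate. The sketch below shows that calculation only. It assumes the studies are independent and that every effect has already been aligned to the same effect allele and scale; it is not a method prescribed by the GWAS Catalog, and `fixed_effect_meta` is a hypothetical helper.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs using inverse-variance weights.

    Returns the pooled beta and its standard error.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    weights = []
    for _, se in estimates:
        if se <= 0:
            raise ValueError("standard errors must be positive")
        weights.append(1.0 / se ** 2)
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Invented effect estimates, for illustration only
beta, se = fixed_effect_meta([(0.2, 0.1), (0.4, 0.1)])
print(f"pooled beta = {beta:.3f}, se = {se:.4f}")
```

With equal standard errors the pooled estimate is the plain mean of the betas; studies with smaller standard errors otherwise dominate. Substantial heterogeneity between studies would call for a random-effects model instead.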
### Workflow 5: Accessing and Analyzing Summary Statistics
1. **Identify studies with summary statistics:**
– Browse summary statistics portal
– Check FTP directory listings
– Query API for available studies
2. **Download summary statistics:**
```bash
# Via FTP
wget ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/harmonised/GCSTXXXXXX-harmonised.tsv.gz
```
3. **Query via API for specific variants:**
```python
url = f"https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/{chrom}/associations"
params = {"start": start_pos, "end": end_pos}
```
4. **Process and analyze:**
– Filter by p-value thresholds
– Extract effect sizes and confidence intervals
– Perform downstream analyses (fine-mapping, colocalization, etc.)
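Extracting confidence intervals from summary statistics usually means deriving them from an effect size and its standard error. The sketch below uses the normal approximation (beta ± 1.96 × SE) and, for binary traits, exponentiates a log-odds beta to an odds ratio. It assumes the file supplies a beta on the log-odds scale and a standard error; the function names are hypothetical.

```python
import math


def confidence_interval(beta, standard_error, z=1.96):
    """Return the (lower, upper) normal-approximation interval for an effect size."""
    if standard_error < 0:
        raise ValueError("standard error must be non-negative")
    margin = z * standard_error
    return beta - margin, beta + margin


def odds_ratio_interval(log_odds_beta, standard_error, z=1.96):
    """Convert a log-odds effect to (odds_ratio, lower, upper)."""
    lower, upper = confidence_interval(log_odds_beta, standard_error, z)
    return math.exp(log_odds_beta), math.exp(lower), math.exp(upper)


# Invented values, for illustration only
print(confidence_interval(0.10, 0.05))
print(odds_ratio_interval(0.10, 0.05))
```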
## Response Formats and Data Fields
**Key Fields in Association Records:**
– `rsId`: Variant identifier (rs number)
– `strongestAllele`: Risk allele for the association
– `pvalue`: Association p-value
– `pvalueText`: P-value as text (may include inequality)
– `orPerCopyNum`: Odds ratio per copy of the risk allele
– `betaNum`: Effect size (for quantitative traits)
– `betaUnit`: Unit of measurement for beta
– `range`: Confidence interval
– `efoTrait`: Associated trait name
– `mappedLabel`: EFO-mapped trait term
**Study Metadata Fields:**
– `accessionId`: GCST study identifier
– `pubmedId`: PubMed ID
– `author`: First author
– `publicationDate`: Publication date
– `ancestryInitial`: Discovery population ancestry
– `ancestryReplication`: Replication population ancestry
– `sampleSize`: Total sample size
**Pagination:**
Results are paginated (default 20 items per page). Navigate using:
– `size` parameter: Number of results per page
– `page` parameter: Page number (0-indexed)
– `_links` in response: URLs for next/previous pages
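One way to handle pagination is to follow the `next` link in `_links` until none remains. The sketch below takes the fetch function as an argument so it can be run without network access; in practice something like `lambda url: requests.get(url, timeout=30).json()` could be passed in. The helper names and the `max_pages` safety cap are assumptions.

```python
def iter_pages(first_url, fetch_json, max_pages=1000):
    """Yield each page of a paginated collection by following _links.next."""
    url, seen = first_url, 0
    while url and seen < max_pages:
        page = fetch_json(url)
        yield page
        seen += 1
        url = page.get("_links", {}).get("next", {}).get("href")


def iter_items(first_url, fetch_json, key="associations"):
    """Yield the individual records embedded in every page."""
    for page in iter_pages(first_url, fetch_json):
        yield from page.get("_embedded", {}).get(key, [])


# Stand-in for the API: two fake pages keyed by URL (illustrative data only)
fake_pages = {
    "page0": {
        "_embedded": {"associations": [{"rsId": "rs1"}]},
        "_links": {"next": {"href": "page1"}},
    },
    "page1": {"_embedded": {"associations": [{"rsId": "rs2"}]}, "_links": {}},
}
print([item["rsId"] for item in iter_items("page0", fake_pages.__getitem__)])
```

Following the server-supplied link avoids having to guess when the last page has been reached, which a manual `page` counter would require.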
## Best Practices
### Query Strategy
– Start with web interface to identify relevant EFO terms and study accessions
– Use API for bulk data extraction and automated analyses
– Implement pagination handling for large result sets
– Cache API responses to minimize redundant requests
### Data Interpretation
– Always check p-value thresholds (genome-wide: 5×10⁻⁸)
– Review ancestry information for population applicability
– Consider sample size when assessing evidence strength
– Check for replication across independent studies
– Be aware of winner's curse in effect size estimates
### Rate Limiting and Ethics
– Respect API usage guidelines (no excessive requests)
– Use summary statistics downloads for genome-wide analyses
– Implement appropriate delays between API calls
– Cache results locally when performing iterative analyses
– Cite the GWAS Catalog in publications
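The caching and delay recommendations above can be combined in a small wrapper around whatever function performs the request. This is a sketch under stated assumptions: the 0.1 second interval mirrors the delay used in the integration example in this document and is not an official limit, and `ThrottledCache` is a hypothetical class.

```python
import time


class ThrottledCache:
    """Wrap a fetch function with an in-memory cache and a minimum delay between calls."""

    def __init__(self, fetch, min_interval=0.1):
        self._fetch = fetch
        self._min_interval = min_interval
        self._cache = {}
        self._last_call = None

    def get(self, url):
        if url in self._cache:
            return self._cache[url]  # cache hit: no request and no delay
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        result = self._fetch(url)
        self._last_call = time.monotonic()
        self._cache[url] = result
        return result


# Stand-in fetch function that records which URLs were actually requested
requested = []


def fake_fetch(url):
    requested.append(url)
    return {"url": url}


client = ThrottledCache(fake_fetch, min_interval=0.01)
client.get("a")
client.get("a")  # served from cache
client.get("b")
print(requested)
```

The cache here lives only for the lifetime of the process; for iterative analyses across sessions, writing responses to disk would be the natural extension.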
### Data Quality Considerations
– GWAS Catalog curates published associations (may contain inconsistencies)
– Effect sizes reported as published (may need harmonization)
– Some studies report conditional or joint associations
– Check for study overlap when combining results
– Be aware of ascertainment and selection biases
## Python Integration Example
Complete workflow for querying and analyzing GWAS data:
```python
import requests
import pandas as pd
from time import sleep

# Column order for the result table; also keeps the columns present when empty
COLUMNS = ['variant', 'pvalue', 'risk_allele', 'or_beta', 'trait', 'pubmed_id']


def query_gwas_catalog(trait_id, p_threshold=5e-8):
    """
    Query GWAS Catalog for trait associations

    Args:
        trait_id: EFO trait identifier (e.g., 'EFO_0001360')
        p_threshold: P-value threshold for filtering

    Returns:
        pandas DataFrame with association results
    """
    base_url = "https://www.ebi.ac.uk/gwas/rest/api"
    url = f"{base_url}/efoTraits/{trait_id}/associations"
    headers = {"Content-Type": "application/json"}
    results = []
    page = 0
    while True:
        params = {"page": page, "size": 100}
        response = requests.get(url, params=params, headers=headers)
        if response.status_code != 200:
            break
        data = response.json()
        associations = data.get('_embedded', {}).get('associations', [])
        if not associations:
            break
        for assoc in associations:
            pvalue = assoc.get('pvalue')
            if pvalue and float(pvalue) <= p_threshold:
                results.append({
                    'variant': assoc.get('rsId'),
                    'pvalue': pvalue,
                    'risk_allele': assoc.get('strongestAllele'),
                    'or_beta': assoc.get('orPerCopyNum') or assoc.get('betaNum'),
                    'trait': assoc.get('efoTrait'),
                    'pubmed_id': assoc.get('pubmedId')
                })
        page += 1
        sleep(0.1)  # Rate limiting
    # Explicit columns avoid a KeyError below when no associations are returned
    return pd.DataFrame(results, columns=COLUMNS)


# Example usage
df = query_gwas_catalog('EFO_0001360')  # Type 2 diabetes
print(df.head())
print(f"\nTotal associations: {len(df)}")
print(f"Unique variants: {df['variant'].nunique()}")
```
## Resources
### references/api_reference.md
Comprehensive API documentation including:
– Detailed endpoint specifications for both APIs
– Complete list of query parameters and filters
– Response format specifications and field descriptions
– Advanced query examples and patterns
– Error handling and troubleshooting
– Integration with external databases
Consult this reference when:
– Constructing complex API queries
– Understanding response structures
– Implementing pagination or batch operations
– Troubleshooting API errors
– Exploring advanced filtering options
### Training Materials
The GWAS Catalog team provides workshop materials:
– GitHub repository: https://github.com/EBISPOT/GWAS_Catalog-workshop
– Jupyter notebooks with example queries
– Google Colab integration for cloud execution
## Important Notes
### Data Updates
– The GWAS Catalog is updated regularly with new publications
– Re-run queries periodically for comprehensive coverage
– Summary statistics are added as studies release data
– EFO mappings may be updated over time
### Citation Requirements
When using GWAS Catalog data, cite:
– Sollis E, et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. PMID: 37953337
– Include access date and version when available
– Cite original studies when discussing specific findings
### Limitations
– Not all GWAS publications are included (curation criteria apply)
– Full summary statistics available for subset of studies
– Effect sizes may require harmonization across studies
– Population diversity is growing but historically limited
– Some associations represent conditional or joint effects
### Data Access
– Web interface: Free, no registration required
– REST APIs: Free, no API key needed
– FTP downloads: Open access
– Rate limiting applies to API (be respectful)
## Additional Resources
– **GWAS Catalog website**: https://www.ebi.ac.uk/gwas/
– **Documentation**: https://www.ebi.ac.uk/gwas/docs
– **API documentation**: https://www.ebi.ac.uk/gwas/rest/docs/api
– **Summary Statistics API**: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
– **FTP site**: http://ftp.ebi.ac.uk/pub/databases/gwas/
– **Training materials**: https://github.com/EBISPOT/GWAS_Catalog-workshop
– **PGS Catalog** (polygenic scores): https://www.pgscatalog.org/
– **Help and support**: [email protected]