ENA Database

"Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."

Free, risk: low
ena, database, python

Tools: requests

The full skill

---
name: ena-database
description: "Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."
---

# ENA Database

## Overview

The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.

## When to Use This Skill

This skill should be used when:

- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera

## Core Capabilities

### 1. Data Types and Structure

ENA organizes data into hierarchical object types:

**Studies/Projects** - Group related data and control release dates. Studies are the primary unit for citing archived data.

**Samples** - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.

**Raw Reads** - Consist of:
- **Experiments**: Metadata about sequencing methods, library preparation, and instrument details
- **Runs**: References to data files containing raw sequencing reads from a single sequencing run

**Assemblies** - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.

**Sequences** - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.

**Analyses** - Results from computational analyses of sequence data.

**Taxonomy Records** - Taxonomic information including lineage and rank.

### 2. Programmatic Access

ENA provides multiple REST APIs for data access. Consult `references/api_reference.md` for detailed endpoint documentation.

**Key APIs:**

**ENA Portal API**
- Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches

**ENA Browser API**
- Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns data in XML format

**ENA Taxonomy REST API**
- Query taxonomic information
- Access lineage, rank, and related taxonomic data

**ENA Cross Reference Service**
- Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/

**CRAM Reference Registry**
- Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums

**Rate Limiting**: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).

### 3. Searching and Retrieving Data

**Browser-Based Search:**
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder

**Programmatic Queries:**
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records

**Example API Query Pattern:**

```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    "query": 'study_accession="PRJEB1234"',  # text values are double-quoted
    "format": "json",
    "limit": 100,
}
response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()  # stop on HTTP errors such as 429
samples = response.json()
```

### 4. Data Retrieval Formats

**Metadata Formats:**
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)

**Sequence Data:**
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)

**Download Methods:**
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads

### 5. Common Use Cases

**Retrieve the record for a raw sequencing run by accession:**

```python
# Fetch the run's XML metadata record using the Browser API
# (this URL returns the record, not the read files themselves)
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
```

**Search for all samples in a study:**

```python
# Use Portal API to list samples; the accession is double-quoted in the query
study_id = "PRJNA123456"
url = f'https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession="{study_id}"&format=tsv'
```

**Find assemblies for a specific organism:**

```python
# Search assemblies by taxonomy; tax_tree() takes an NCBI taxon ID
# (not an organism name) and also matches descendant taxa
taxon_id = 562  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({taxon_id})&format=json"
```

**Get taxonomic lineage:**

```python
# Query taxonomy API
taxon_id = "562"  # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```

### 6. Integration with Analysis Pipelines

**Bulk Download Pattern:**
1. Search for accessions matching criteria using Portal API
2. Extract file URLs from search results
3. Download files via FTP or using enaBrowserTools
4. Process downloaded data in pipeline

**BLAST Integration:** Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.

### 7. Best Practices

**Rate Limiting:**
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within the limit of 50 requests per second
- Use bulk download tools for large datasets instead of iterating API calls

**Data Citation:**
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used

**API Response Handling:**
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets

**Performance:**
- Use FTP/Aspera for downloading large files (>100 MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records

## Resources

This skill includes detailed reference documentation for working with ENA:

### references/

**api_reference.md** - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples

Load this reference when constructing complex API queries, debugging API responses, or needing specific parameter details.
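
The rate-limiting advice under Best Practices and the first two steps of the Bulk Download Pattern can be sketched together as follows. This is a minimal illustration, not ENA's recommended client. The helper names (`backoff_delays`, `get_with_backoff`, `fastq_urls`, `list_fastq_for_study`) are hypothetical. The use of the `read_run` result type with a `fastq_ftp` field holding `;`-separated paths is an assumption; check it against `references/api_reference.md` before relying on it.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List

PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield wait times base, 2*base, 4*base, ... with each value capped at `cap`."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)


def get_with_backoff(fetch: Callable[[], object], retries: int = 5,
                     sleep: Callable[[float], None] = time.sleep):
    """Call `fetch` until it stops answering HTTP 429, sleeping longer each time.

    `fetch` is any zero-argument callable returning a response object with
    `status_code` and `raise_for_status()` (for example a requests.Response).
    """
    for delay in backoff_delays(retries):
        response = fetch()
        if response.status_code != 429:
            response.raise_for_status()  # surface other HTTP errors immediately
            return response
        sleep(delay)
    raise RuntimeError(f"still rate limited after {retries} attempts")


def fastq_urls(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Collect file URLs from rows whose `fastq_ftp` value is a ';'-separated list.

    Assumption: paths come back without a scheme, so 'ftp://' is prepended.
    """
    urls = []
    for row in rows:
        for path in (row.get("fastq_ftp") or "").split(";"):
            if path:
                urls.append(path if "://" in path else f"ftp://{path}")
    return urls


def list_fastq_for_study(study_accession: str) -> List[str]:
    """Steps 1-2 of the bulk download pattern: search, then extract file URLs."""
    import requests  # imported here so the pure helpers above need no third-party package

    params = {
        "result": "read_run",  # assumed result type for run-level file fields
        "query": f'study_accession="{study_accession}"',
        "fields": "run_accession,fastq_ftp",
        "format": "json",
    }
    response = get_with_backoff(
        lambda: requests.get(PORTAL_SEARCH, params=params, timeout=30)
    )
    return fastq_urls(response.json())
```

Calling `list_fastq_for_study("PRJEB1234")` would return a flat list of URLs to pass to an FTP client or to enaBrowserTools. Passing the request as a callable keeps the retry logic independent of the HTTP library, so the same wrapper can serve Browser API and Taxonomy API calls.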