---
name: ai-multimodal
description: Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
---
# AI Multimodal Processing Skill
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
## Core Capabilities
### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
## Capability Matrix
| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
## Model Selection Guide
### Gemini 2.5 Series (Recommended)
- **gemini-2.5-pro**: Highest quality, all features, 1M-2M context
- **gemini-2.5-flash**: Best balance, all features, 1M-2M context
- **gemini-2.5-flash-lite**: Lightweight, segmentation support
- **gemini-2.5-flash-image**: Image generation only
### Gemini 2.0 Series
- **gemini-2.0-flash**: Fast processing, object detection
- **gemini-2.0-flash-lite**: Lightweight option
### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-video**: Requires 2.5+ models
- **Image Generation**: Requires flash-image model
### Context Windows
- **2M tokens**: ~6 hours video (low-res) or ~2 hours (default)
- **1M tokens**: ~3 hours video (low-res) or ~1 hour (default)
- **Audio**: 32 tokens/second (1 min = 1,920 tokens)
- **PDF**: 258 tokens/page (fixed)
- **Image**: 258-1,548 tokens based on size
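The rates above can be turned into a quick budget check before sending a request. The following is a minimal sketch, not part of the skill's scripts: `estimate_tokens` and `fits_context` are hypothetical helper names, and the video rates (~300 tokens/second default, ~100 low-res) are the approximate figures quoted under Token Rates in this document.

```python
# Approximate token rates quoted in this skill; treat them as estimates.
AUDIO_TOKENS_PER_SECOND = 32
PDF_TOKENS_PER_PAGE = 258
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}


def estimate_tokens(audio_seconds=0, pdf_pages=0, video_seconds=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for one request."""
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + pdf_pages * PDF_TOKENS_PER_PAGE
        + video_seconds * VIDEO_TOKENS_PER_SECOND[video_resolution]
    )


def fits_context(tokens, context_window=1_000_000):
    """True when the estimate fits inside the given context window."""
    return tokens <= context_window


print(estimate_tokens(audio_seconds=60))   # 1920
print(estimate_tokens(pdf_pages=100))      # 25800
half_hour_video = estimate_tokens(video_seconds=30 * 60)
print(half_hour_video, fits_context(half_hour_video))  # 540000 True
```

Because the per-second video figures are rounded, an estimate close to the window size should be treated as "may not fit" rather than as a guarantee.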
## Quick Start
### Prerequisites
**API Key Setup**: Supports both Google AI Studio and Vertex AI.
The skill checks for `GEMINI_API_KEY` in this order:
1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Project root: `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`
**Get API key**: https://aistudio.google.com/apikey
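The lookup order above can be sketched as follows. This is an illustration under assumptions, not the skill's actual implementation: `find_api_key` and `read_key_from_env_file` are hypothetical names, and a small hand-rolled parser for plain `KEY=value` lines stands in for `python-dotenv` so the example has no dependencies.

```python
import os
from pathlib import Path

# Lookup order taken from the list above, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_key_from_env_file(path, name):
    """Return the value of `name` from a simple KEY=value file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root=".", name="GEMINI_API_KEY", environ=None):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for relative in ENV_FILES:
        value = read_key_from_env_file(Path(project_root) / relative, name)
        if value:
            return value
    return None
```

The first match wins, so a key exported in the shell overrides any `.env` file, and the project-root `.env` overrides the files under `.claude/`.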
**For Vertex AI**:
```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```
**Install SDK**:
```bash
pip install google-genai python-dotenv pillow
```
### Common Patterns
**Transcribe Audio**:
```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```
**Analyze Image**:
```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```
**Process Video**:
```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```
**Extract from PDF**:
```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```
**Generate Image**:
```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```
**Optimize Media**:
```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```
**Convert Documents to Markdown**:
```bash
# Convert a DOCX file to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract a page range from a large PDF
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```
## Supported Formats
### Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
### Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled
### Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
### Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
### Size Limits
- **Inline**: <20MB total request
- **File API**: 2GB per file, 20GB project quota
- **Retention**: 48 hours auto-delete
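The size limits above suggest a simple routing rule between inline data and the File API. The sketch below is illustrative only: `choose_upload_method` is a hypothetical helper, and reading "MB" and "GB" as binary units (1024-based) is an assumption.

```python
import os

# Limits quoted above; binary units are an assumption.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # inline requests must stay under 20MB total
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # File API accepts up to 2GB per file


def choose_upload_method(size_bytes, reused=False):
    """Return "inline", "file_api", or "too_large" for a payload size."""
    if size_bytes > FILE_API_LIMIT_BYTES:
        return "too_large"
    # Files queried repeatedly are worth uploading once, even when small.
    if reused or size_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"


def choose_for_path(path, reused=False):
    """Apply the same rule to a file on disk."""
    return choose_upload_method(os.path.getsize(path), reused)


print(choose_upload_method(5 * 1024 * 1024))    # inline
print(choose_upload_method(150 * 1024 * 1024))  # file_api
```

The inline limit applies to the whole request, prompt included, so a file just under 20MB may still need the File API. A "too_large" result is a cue to compress or split the media first.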
## Reference Navigation
For detailed implementation guidance, see:
### Audio Processing
- `references/audio-processing.md` - Transcription, analysis, TTS
  - Timestamp handling and segment analysis
  - Multi-speaker identification
  - Non-speech audio analysis
  - Text-to-speech generation
### Image Understanding
- `references/vision-understanding.md` - Captioning, detection, OCR
  - Object detection and localization
  - Pixel-level segmentation
  - Visual question answering
  - Multi-image comparison
### Video Analysis
- `references/video-analysis.md` - Scene detection, temporal understanding
  - YouTube URL processing
  - Timestamp-based queries
  - Video clipping and FPS control
  - Long video optimization
### Document Extraction
- `references/document-extraction.md` - PDF processing, structured output
  - Table and form extraction
  - Chart and diagram analysis
  - JSON schema validation
  - Multi-page handling
### Image Generation
- `references/image-generation.md` - Text-to-image, editing
  - Prompt engineering strategies
  - Image editing and composition
  - Aspect ratio selection
  - Safety settings
## Cost Optimization
### Token Costs
**Model Pricing**:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

**Token Rates**:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size

**TTS Pricing**:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
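Combining a token count with per-million prices gives a rough cost per request. This is a minimal sketch with a hypothetical helper, `estimate_cost_usd`. Prices are passed in as parameters because published rates change, and the example uses the Gemini 2.5 Flash figures listed in this section purely for illustration.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Return the approximate USD cost of one request."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Example: a 10-minute audio file (32 tokens/second) and a 2,000-token summary.
audio_tokens = 10 * 60 * 32  # 19,200 input tokens
cost = estimate_cost_usd(
    audio_tokens,
    2_000,
    input_price_per_million=1.00,   # figure from the list above
    output_price_per_million=0.10,  # figure from the list above
)
print(f"${cost:.4f}")  # $0.0194
```

Prompt text also counts toward input tokens, so real costs run slightly above a media-only estimate.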
### Best Practices
1. Use `gemini-2.5-flash` for most tasks (best price/performance)
2. Use File API for files >20MB or repeated queries
3. Optimize media before upload (see `media_optimizer.py`)
4. Process specific segments instead of full videos
5. Use lower FPS for static content
6. Implement context caching for repeated queries
7. Batch process multiple files in parallel
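Practice 7 can be sketched with a thread pool, since each file spends most of its time waiting on network I/O. This is an illustration, not the logic of `gemini_batch_process.py`: `process_in_parallel` is a hypothetical helper and `worker` is any callable that handles one file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(paths, worker, max_workers=4):
    """Run worker(path) for each path; return ({path: result}, {path: error})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when one file fails
                errors[path] = exc
    return results, errors
```

Collecting failures separately lets a batch finish even when one file is rejected. `max_workers` should stay low enough to respect the requests-per-minute limit of the API tier in use.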
## Rate Limits
**Free Tier**:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)

**YouTube Limits**:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only

**Storage Limits**:
- 20GB per project
- 2GB per file
- 48-hour retention
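One way to stay under a requests-per-minute limit is to space calls evenly. The class below is a minimal sketch under that assumption: `RequestPacer` is a hypothetical name, it tracks only RPM (not tokens per minute or requests per day), and the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time


class RequestPacer:
    """Space calls so that no more than `rpm` start within any minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next request may be sent; return seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


# Usage: call pacer.wait() immediately before each API request.
pacer = RequestPacer(rpm=10)  # 10 RPM -> one request every 6 seconds
```

Even spacing is conservative: it never bursts, which wastes some headroom but keeps the request count within the limit for any 60-second window.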
## Error Handling
Common errors and solutions:
- **400**: Invalid format/size - validate before upload
- **401**: Invalid API key - check configuration
- **403**: Permission denied - verify API key restrictions
- **404**: File not found - ensure file uploaded and active
- **429**: Rate limit exceeded - implement exponential backoff
- **500**: Server error - retry with backoff
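The backoff advice for 429 and 500 can be sketched as a small retry wrapper. This is a sketch under assumptions, not the scripts' actual retry logic: `ApiError` is a hypothetical exception carrying an HTTP status, so real code would map the SDK's own exception onto the same check. Other status codes are raised immediately because retrying cannot fix them.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}  # the two codes the list above says to retry


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))
```

With the defaults, the base waits are 1, 2, 4 and 8 seconds before the fifth and final attempt, each extended by up to 50% jitter.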
## Scripts Overview
All scripts support unified API key detection and error handling:
**gemini_batch_process.py**: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode

**media_optimizer.py**: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization

**document_converter.py**: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support

Run any script with `--help` for detailed usage.
## Resources
- [Audio API Docs](https://ai.google.dev/gemini-api/docs/audio)
- [Image API Docs](https://ai.google.dev/gemini-api/docs/image-understanding)
- [Video API Docs](https://ai.google.dev/gemini-api/docs/video-understanding)
- [Document API Docs](https://ai.google.dev/gemini-api/docs/document-processing)
- [Image Gen Docs](https://ai.google.dev/gemini-api/docs/image-generation)
- [Get API Key](https://aistudio.google.com/apikey)
- [Pricing](https://ai.google.dev/pricing)