Skills › Research & Science › Research assistant
exploration
Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.
The full skill
—
name: exploration
description: "Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas."
—
# Data Exploration Skill
Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.
## Data Profiling Methodology
### Phase 1: Structural Understanding
Before analyzing any data, understand its structure:
**Table-level questions:**
– How many rows and columns?
– What is the grain (one row per what)?
– What is the primary key? Is it unique?
– When was the data last updated?
– How far back does the data go?
**Column classification:**
Categorize each column as one of:
– **Identifier**: Unique keys, foreign keys, entity IDs
– **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category)
– **Metric**: Quantitative values for measurement (revenue, count, duration, score)
– **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
– **Text**: Free-form text fields (description, notes, name)
– **Boolean**: True/false flags
– **Structural**: JSON, arrays, nested structures
### Phase 2: Column-Level Profiling
For each column, compute:
**All columns:**
– Null count and null rate
– Distinct count and cardinality ratio (distinct / total)
– Most common values (top 5-10 with frequencies)
– Least common values (bottom 5 to spot anomalies)
**Numeric columns (metrics):**
“`
min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
“`
**String columns (dimensions, text):**
“`
min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
“`
**Date/timestamp columns:**
“`
min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
“`
**Boolean columns:**
“`
true count, false count, null count
true rate
“`
### Phase 3: Relationship Discovery
After profiling individual columns:
– **Foreign key candidates**: ID columns that might link to other tables
– **Hierarchies**: Columns that form natural drill-down paths (country > state > city)
– **Correlations**: Numeric columns that move together
– **Derived columns**: Columns that appear to be computed from others
– **Redundant columns**: Columns with identical or near-identical information
## Quality Assessment Framework
### Completeness Score
Rate each column:
– **Complete** (>99% non-null): Green
– **Mostly complete** (95-99%): Yellow — investigate the nulls
– **Incomplete** (80-95%): Orange — understand why and whether it matters
– **Sparse** (<80%): Red — may not be usable without imputation
### Consistency Checks
Look for:
– **Value format inconsistency**: Same concept represented differently ("USA", "US", "United States", "us")
– **Type inconsistency**: Numbers stored as strings, dates in various formats
– **Referential integrity**: Foreign keys that don't match any parent record
– **Business rule violations**: Negative quantities, end dates before start dates, percentages > 100
– **Cross-column consistency**: Status = "completed" but completed_at is null
### Accuracy Indicators
Red flags that suggest accuracy issues:
– **Placeholder values**: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
– **Default values**: Suspiciously high frequency of a single value
– **Stale data**: Updated_at shows no recent changes in an active system
– **Impossible values**: Ages > 150, dates in the far future, negative durations
– **Round number bias**: All values ending in 0 or 5 (suggests estimation, not measurement)
### Timeliness Assessment
– When was the table last updated?
– What is the expected update frequency?
– Is there a lag between event time and load time?
– Are there gaps in the time series?
## Pattern Discovery Techniques
### Distribution Analysis
For numeric columns, characterize the distribution:
– **Normal**: Mean and median are close, bell-shaped
– **Skewed right**: Long tail of high values (common for revenue, session duration)
– **Skewed left**: Long tail of low values (less common)
– **Bimodal**: Two peaks (suggests two distinct populations)
– **Power law**: Few very large values, many small ones (common for user activity)
– **Uniform**: Roughly equal frequency across range (often synthetic or random)
### Temporal Patterns
For time series data, look for:
– **Trend**: Sustained upward or downward movement
– **Seasonality**: Repeating patterns (weekly, monthly, quarterly, annual)
– **Day-of-week effects**: Weekday vs. weekend differences
– **Holiday effects**: Drops or spikes around known holidays
– **Change points**: Sudden shifts in level or trend
– **Anomalies**: Individual data points that break the pattern
### Segmentation Discovery
Identify natural segments by:
– Finding categorical columns with 3-20 distinct values
– Comparing metric distributions across segment values
– Looking for segments with significantly different behavior
– Testing whether segments are homogeneous or contain sub-segments
### Correlation Exploration
Between numeric columns:
– Compute correlation matrix for all metric pairs
– Flag strong correlations (|r| > 0.7) for investigation
– Note: Correlation does not imply causation — flag this explicitly
– Check for non-linear relationships (e.g., quadratic, logarithmic)
## Schema Understanding and Documentation
### Schema Documentation Template
When documenting a dataset for team use:
“`markdown
## Table: [schema.table_name]
**Description**: [What this table represents]
**Grain**: [One row per…]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]
### Key Columns
| Column | Type | Description | Example Values | Notes |
|——–|——|————-|—————-|——-|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |
### Relationships
– Joins to `users` on `user_id`
– Joins to `products` on `product_id`
– Parent of `event_details` (1:many on event_id)
### Known Issues
– [List any known data quality issues]
– [Note any gotchas for analysts]
### Common Query Patterns
– [Typical use cases for this table]
“`
### Schema Exploration Queries
When connected to a data warehouse, use these patterns to discover schema:
“`sql
— List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
— Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;
— Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
— Row counts for all tables (general pattern)
— Run per-table: SELECT COUNT(*) FROM table_name
“`
### Lineage and Dependencies
When exploring an unfamiliar data environment:
1. Start with the "output" tables (what reports or dashboards consume)
2. Trace upstream: What tables feed into them?
3. Identify raw/staging/mart layers
4. Map the transformation chain from raw data to analytical tables
5. Note where data is enriched, filtered, or aggregated