Skill

SkillsResearch & Science › Research assistant

exploration

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Freerisk: low
explorationsqlgopostgresqlschemamarkdown

The full skill

— name: exploration description: "Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas." — # Data Exploration Skill Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas. ## Data Profiling Methodology ### Phase 1: Structural Understanding Before analyzing any data, understand its structure: **Table-level questions:** – How many rows and columns? – What is the grain (one row per what)? – What is the primary key? Is it unique? – When was the data last updated? – How far back does the data go? **Column classification:** Categorize each column as one of: – **Identifier**: Unique keys, foreign keys, entity IDs – **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category) – **Metric**: Quantitative values for measurement (revenue, count, duration, score) – **Temporal**: Dates and timestamps (created_at, updated_at, event_date) – **Text**: Free-form text fields (description, notes, name) – **Boolean**: True/false flags – **Structural**: JSON, arrays, nested structures ### Phase 2: Column-Level Profiling For each column, compute: **All columns:** – Null count and null rate – Distinct count and cardinality ratio (distinct / total) – Most common values (top 5-10 with frequencies) – Least common values (bottom 5 to spot anomalies) **Numeric columns (metrics):** “` min, max, mean, median (p50) standard deviation percentiles: p1, p5, p25, p75, p95, p99 zero count negative count (if unexpected) “` **String columns (dimensions, text):** “` min length, max length, avg length empty string count pattern analysis (do values follow a format?) case consistency (all upper, all lower, mixed?) leading/trailing whitespace count “` **Date/timestamp columns:** “` min date, max date null dates future dates (if unexpected) distribution by month/week gaps in time series “` **Boolean columns:** “` true count, false count, null count true rate “` ### Phase 3: Relationship Discovery After profiling individual columns: – **Foreign key candidates**: ID columns that might link to other tables – **Hierarchies**: Columns that form natural drill-down paths (country > state > city) – **Correlations**: Numeric columns that move together – **Derived columns**: Columns that appear to be computed from others – **Redundant columns**: Columns with identical or near-identical information ## Quality Assessment Framework ### Completeness Score Rate each column: – **Complete** (>99% non-null): Green – **Mostly complete** (95-99%): Yellow — investigate the nulls – **Incomplete** (80-95%): Orange — understand why and whether it matters – **Sparse** (<80%): Red — may not be usable without imputation ### Consistency Checks Look for: – **Value format inconsistency**: Same concept represented differently ("USA", "US", "United States", "us") – **Type inconsistency**: Numbers stored as strings, dates in various formats – **Referential integrity**: Foreign keys that don't match any parent record – **Business rule violations**: Negative quantities, end dates before start dates, percentages > 100 – **Cross-column consistency**: Status = "completed" but completed_at is null ### Accuracy Indicators Red flags that suggest accuracy issues: – **Placeholder values**: 0, -1, 999999, "N/A", "TBD", "test", "xxx" – **Default values**: Suspiciously high frequency of a single value – **Stale data**: Updated_at shows no recent changes in an active system – **Impossible values**: Ages > 150, dates in the far future, negative durations – **Round number bias**: All values ending in 0 or 5 (suggests estimation, not measurement) ### Timeliness Assessment – When was the table last updated? – What is the expected update frequency? – Is there a lag between event time and load time? – Are there gaps in the time series? ## Pattern Discovery Techniques ### Distribution Analysis For numeric columns, characterize the distribution: – **Normal**: Mean and median are close, bell-shaped – **Skewed right**: Long tail of high values (common for revenue, session duration) – **Skewed left**: Long tail of low values (less common) – **Bimodal**: Two peaks (suggests two distinct populations) – **Power law**: Few very large values, many small ones (common for user activity) – **Uniform**: Roughly equal frequency across range (often synthetic or random) ### Temporal Patterns For time series data, look for: – **Trend**: Sustained upward or downward movement – **Seasonality**: Repeating patterns (weekly, monthly, quarterly, annual) – **Day-of-week effects**: Weekday vs. weekend differences – **Holiday effects**: Drops or spikes around known holidays – **Change points**: Sudden shifts in level or trend – **Anomalies**: Individual data points that break the pattern ### Segmentation Discovery Identify natural segments by: – Finding categorical columns with 3-20 distinct values – Comparing metric distributions across segment values – Looking for segments with significantly different behavior – Testing whether segments are homogeneous or contain sub-segments ### Correlation Exploration Between numeric columns: – Compute correlation matrix for all metric pairs – Flag strong correlations (|r| > 0.7) for investigation – Note: Correlation does not imply causation — flag this explicitly – Check for non-linear relationships (e.g., quadratic, logarithmic) ## Schema Understanding and Documentation ### Schema Documentation Template When documenting a dataset for team use: “`markdown ## Table: [schema.table_name] **Description**: [What this table represents] **Grain**: [One row per…] **Primary Key**: [column(s)] **Row Count**: [approximate, with date] **Update Frequency**: [real-time / hourly / daily / weekly] **Owner**: [team or person responsible] ### Key Columns | Column | Type | Description | Example Values | Notes | |——–|——|————-|—————-|——-| | user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id | | event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values | | revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events | | created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column | ### Relationships – Joins to `users` on `user_id` – Joins to `products` on `product_id` – Parent of `event_details` (1:many on event_id) ### Known Issues – [List any known data quality issues] – [Note any gotchas for analysts] ### Common Query Patterns – [Typical use cases for this table] “` ### Schema Exploration Queries When connected to a data warehouse, use these patterns to discover schema: “`sql — List all tables in a schema (PostgreSQL) SELECT table_name, table_type FROM information_schema.tables WHERE table_schema = 'public' ORDER BY table_name; — Column details (PostgreSQL) SELECT column_name, data_type, is_nullable, column_default FROM information_schema.columns WHERE table_name = 'my_table' ORDER BY ordinal_position; — Table sizes (PostgreSQL) SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC; — Row counts for all tables (general pattern) — Run per-table: SELECT COUNT(*) FROM table_name “` ### Lineage and Dependencies When exploring an unfamiliar data environment: 1. Start with the "output" tables (what reports or dashboards consume) 2. Trace upstream: What tables feed into them? 3. Identify raw/staging/mart layers 4. Map the transformation chain from raw data to analytical tables 5. Note where data is enriched, filtered, or aggregated