feat: add gallery examples registry mapping examples to datasets #724

dsmedia · 2025-10-26T02:12:29Z

Gallery Examples Registry: Mapping Examples to Datasets

Summary

Our galleries are siloed across three repositories with no unified view. This PR adds a cross-ecosystem registry (gallery_examples.json) cataloging ~470 examples from Vega, Vega-Lite, and Altair, tracking which datasets each example uses. These examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns. When joined with datapackage.json, the registry enables dataset-first learning ("show me everything I can do with flights data") and curation analytics (dataset coverage matrices), and it provides high-quality structured training data for visualization AI/ML systems.

Open Questions & Decisions Needed

Status: Draft - This implementation is nearly complete. I'm seeking feedback on design decisions before locking in the approach.

1. JSON Schema Design

Question: Is the current structure optimal, or should we adjust before it becomes a committed format?

Current approach:

Flat array of examples (not nested by gallery)
datasets_used as string array: ["cars", "movies"]
Each example has: id, gallery_name, example_name, example_url, spec_url, categories, description, datasets_used

Alternatives considered:

Nest by gallery: {"vega": [...], "vega-lite": [...], "altair": [...]}
Object array for datasets: [{"name": "cars", "role": "primary"}]
Add spec metadata: marks, transforms, encodings

Recommended approach: Keep current flat structure

Rationale: Easier to query (no nested navigation), follows the datapackage.json pattern (flat resources array), and stays extensible through the DatasetReference type alias
Migration: Type alias includes documented path to object structure if needed later
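As a rough illustration of the flat structure and the alias-based migration path, the record could be typed as below. This is a sketch only: the field names come from the list above and DatasetReference is named in this PR, but the GalleryExample class name and the examples_for_gallery helper are hypothetical and may not match the script.

```python
from typing import TypedDict

# Today a dataset reference is just a canonical name. If roles are needed
# later, this alias could point at an object type such as
# {"name": "cars", "role": "primary"} without renaming the field.
DatasetReference = str


class GalleryExample(TypedDict):
    """One flat registry record (hypothetical class name)."""

    id: int
    gallery_name: str
    example_name: str
    example_url: str
    spec_url: str
    categories: list[str]
    description: str
    datasets_used: list[DatasetReference]


def examples_for_gallery(
    examples: list[GalleryExample], gallery: str
) -> list[GalleryExample]:
    """Filter the flat array by gallery; no nested navigation is needed."""
    return [ex for ex in examples if ex["gallery_name"] == gallery]
```

With the nested alternative, the same filter would be a key lookup, but every cross-gallery query would first have to concatenate the per-gallery arrays.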

2. URL Validation Strategy

Question: Should we add CI validation for ~470 external URLs? There are also URLs in the datapackage file, which can become out of date.

Options:

A) No validation (current): Manual regeneration when galleries change
B) Weekly scheduled check: lychee in GitHub Actions
C) Pre-commit hook: Validate before committing changes

Recommended approach: Option A (no automated validation)

Rationale: A broken external URL doesn't indicate a bug in our code, automated checks would create noisy failures, the galleries validate their own URLs, and manual regeneration on demand is sufficient
Alternative: If we add validation, use weekly scheduled check (non-blocking) for monitoring only
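If a monitoring-only check were added, it could also be a small script rather than lychee. The sketch below is a standard-library stand-in, not what Option B proposes; head_ok and find_broken_urls are hypothetical names. It collects failures into a list instead of raising, which is what keeps the check non-blocking.

```python
from http.client import HTTPException
from typing import Callable, Iterable
from urllib.request import Request, urlopen


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds after following redirects."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx) and socket timeouts;
        # ValueError covers malformed URLs.
        return False


def find_broken_urls(
    urls: Iterable[str], probe: Callable[[str], bool] = head_ok
) -> list[str]:
    """Collect failing URLs for a report rather than failing the run."""
    return [url for url in urls if not probe(url)]
```

The probe parameter lets the reporting logic be tested without network access. Some servers reject HEAD requests, so a real check might need a GET fallback.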

3. Merge Timing with Altair

Question: Wait for Altair PR #3859 or merge now?

Current state: 3-line name mapping config handles Altair's camelCase API names
After Altair lands: Name mapping can be removed entirely (confirmed by @mattijn)

Recommended approach: Merge now, without waiting for Altair PR #3859

Rationale: Minimal temporary code (3 mappings), immediate value for analysis, easy cleanup after Altair migrates, doesn't block either PR
Follow-up: Simple config edit + regeneration after Altair PR merges

4. Dataset Usage Analytics

Question: Add inverse mapping (datasets → examples) to the JSON output?

Example:

"dataset_usage": {
  "cars": {
    "usage_count": 42,
    "galleries": ["vega", "vega-lite", "altair"],
    "example_ids": [1, 5, 12, ...]
  }
}

Recommended approach: Defer to v2

Rationale: The mapping can be computed from the current structure, including it adds file size, and it is easy to add later; deferring lets usage patterns emerge first
Note: Added to "Considered but deferred" section
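To show that the inverse mapping really is derivable from the current structure, here is a minimal sketch. build_dataset_usage is a hypothetical helper, not part of the generator script; the output keys mirror the example above.

```python
from collections import defaultdict
from typing import Any


def build_dataset_usage(examples: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Invert examples -> datasets into datasets -> examples."""
    ids: dict[str, set[int]] = defaultdict(set)
    galleries: dict[str, set[str]] = defaultdict(set)
    for ex in examples:
        for name in ex["datasets_used"]:
            ids[name].add(ex["id"])
            galleries[name].add(ex["gallery_name"])
    return {
        name: {
            "usage_count": len(ids[name]),
            "galleries": sorted(galleries[name]),
            "example_ids": sorted(ids[name]),
        }
        for name in sorted(ids)
    }
```

Consumers who need the inverse view can run something like this over registry["examples"], so nothing is lost by deferring it.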

5. Build Integration

Question: Integrate into npm run build or keep manual?

Options:

A) Manual regeneration (current): uv run scripts/generate_gallery_examples.py
B) Automatic in build: Add to npm run build
C) Hybrid: Add npm script npm run update-gallery for convenient manual trigger

Recommended approach: Option C (hybrid)

Rationale: Avoids slowing down every build while still offering a convenient trigger; galleries change less frequently than datasets, so regeneration on demand is sufficient
Implementation: Add "update-gallery": "uv run scripts/generate_gallery_examples.py" to package.json

6. Documentation

Question: Where to document this feature?

Recommended approach: Both README.md and CONTRIBUTING.md

README.md: Add link to gallery_examples.json in "About" section (shows it's a first-class artifact)
CONTRIBUTING.md: Document regeneration process and when to update
Alternative: Create GALLERY_REGISTRY.md if more extensive docs needed

Motivation

Building on the recent addition of datapackage.json to vega-datasets, this PR addresses a gap in our example galleries: they're currently siloed across Vega, Vega-Lite, and Altair repositories with no unified registry.

While our galleries are one of the best ways for users to learn, there's no single place to see all examples and the datasets they use. This PR creates that unified view.

Key Benefits

1. Enhanced Learning & Discovery

Users can easily find all examples using a specific dataset (e.g., flights-2k.json) across Vega, Vega-Lite, and Altair, making it easier to learn different techniques on the same data
Creates clear learning pathways: see how to extend a Vega-Lite spec into Vega by comparing examples using the same dataset
Powers richer, cross-library tutorials and documentation (e.g., on Kaggle, Observable, etc.)
Enables dataset-first exploration: "Show me everything I can do with the cars dataset"

2. Improved Gallery Maintenance & Curation

Build a dataset coverage matrix to identify under-utilized datasets and diversify examples beyond the usual suspects (cars, movies, etc.)
Spot inconsistencies or opportunities for alignment across galleries
Identify gaps: which datasets lack examples? Which chart types are under-represented?
Simplify testing: changes to vega-datasets can be cross-checked against all dependent examples

3. Foundation for Advanced Tooling (ML/AI)

Combines structured example specs + structured dataset metadata (datapackage.json) to provide rich training data
Initial testing shows LLMs get significantly better at recommending appropriate chart types when given both files
Potential grows with richer semantic metadata in datapackage.json (cardinality, skew, distributions) - aligning with Draco concepts
Enables visualization recommendation systems based on real-world design patterns

4. Community & Research

Provides structured corpus for researchers studying visualization design, perception, and recommendation systems
Makes it easier for community to contribute examples by showing where gaps exist
Supports cross-ecosystem collaboration and alignment

Coordination with Altair Changes

This PR includes temporary manual name mappings for Altair's dataset API (e.g., data.londonBoroughs() → london_boroughs) in _data/gallery_examples.toml. Altair PR #3859 is migrating Altair to use an internal altair.datasets module that sources from vega-datasets directly.

Current state (temporary):

Altair examples use API names like data.londonBoroughs() that don't match datapackage.json canonical names (london_boroughs)
This PR uses explicit mappings in config to handle these differences
Only 3 mappings needed currently: londonBoroughs, londonCentroids, londonTubeLines

After Altair PR #3859 lands:

✅ Altair will use the canonical identifiers from datapackage.json (confirmed by @mattijn)
✅ Name mapping section can be removed entirely - no more workarounds needed
✅ Simply regenerate gallery_examples.json to pick up the aligned names
✅ Any new Altair examples will automatically work without config changes

Why this matters for the registry:

Once Altair migrates, all three galleries (Vega, Vega-Lite, Altair) will reference datasets consistently
Dataset extraction becomes simpler and more robust
This is one reason this PR is marked as Draft - we're coordinating timing with Altair's migration to minimize temporary code

The Power of the Join

When gallery_examples.json is joined with datapackage.json, we unlock:

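-- Illustrative queries. They assume gallery_examples has been flattened to
-- one row per (example, dataset) pair with the name in dataset_used, and that
-- datapackage.json resources are loaded as a datasets table.

-- Find all examples using time-series data
SELECT e.example_name, e.gallery_name
FROM gallery_examples e
JOIN datasets d ON e.dataset_used = d.dataset_name
WHERE d.temporal_coverage IS NOT NULL;

-- Identify under-utilized datasets (datasets with no examples count as 0)
SELECT d.dataset_name, COUNT(e.id) AS example_count
FROM datasets d
LEFT JOIN gallery_examples e ON e.dataset_used = d.dataset_name
GROUP BY d.dataset_name
HAVING COUNT(e.id) < 3;

-- Cross-library learning paths
SELECT vl.example_name AS vega_lite_example,
       v.example_name AS vega_example
FROM gallery_examples vl
JOIN gallery_examples v
  ON vl.dataset_used = v.dataset_used
WHERE vl.gallery_name = 'vega-lite'
  AND v.gallery_name = 'vega';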

This registry is the first step toward treating our gallery as data - enabling the same kinds of analysis and tooling we build for visualizing datasets.
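The queries above treat the dataset reference as a joinable column, while the registry stores datasets_used as an array. One way to make them executable is to flatten the array into a link table first. The sketch below does this with an in-memory SQLite database; the table names, the underused helper, and the toy rows are all assumptions for illustration, not real registry contents.

```python
import sqlite3

# Toy rows standing in for gallery_examples.json and datapackage.json.
examples = [
    {"id": 1, "gallery_name": "vega", "datasets_used": ["cars"]},
    {"id": 2, "gallery_name": "vega-lite", "datasets_used": ["cars", "movies"]},
]
dataset_names = ["cars", "movies", "flights"]


def underused(examples, dataset_names, threshold=3):
    """Return (dataset_name, example_count) for datasets below `threshold`."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE datasets (dataset_name TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE example_datasets (example_id INTEGER, dataset_used TEXT)")
    con.executemany("INSERT INTO datasets VALUES (?)", [(n,) for n in dataset_names])
    # Flatten datasets_used: one row per (example, dataset) pair.
    con.executemany(
        "INSERT INTO example_datasets VALUES (?, ?)",
        [(ex["id"], name) for ex in examples for name in ex["datasets_used"]],
    )
    rows = con.execute(
        """
        SELECT d.dataset_name, COUNT(e.example_id) AS example_count
        FROM datasets d
        LEFT JOIN example_datasets e ON e.dataset_used = d.dataset_name
        GROUP BY d.dataset_name
        HAVING COUNT(e.example_id) < ?
        ORDER BY d.dataset_name
        """,
        (threshold,),
    ).fetchall()
    con.close()
    return rows
```

Counting e.example_id rather than rows matters here: with a LEFT JOIN, COUNT(*) would report 1 for a dataset that has no examples at all.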

Implementation Details

Architecture

Follows vega-datasets' two-phase metadata generation pattern:

Configuration: _data/gallery_examples.toml (externalizes URLs, mappings, settings)
Generation: scripts/generate_gallery_examples.py (collection and extraction logic)
Output: gallery_examples.json (committed artifact, ~470 examples)

Data Collection Process

Fetches example metadata from three galleries (Vega, Vega-Lite, Altair)
Retrieves individual specifications/code files (~470 HTTP requests)
Extracts dataset references (handles different spec formats per framework)
Normalizes references to canonical datapackage.json names
Outputs cross-reference catalog with comprehensive metadata
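For JSON specs, the extraction and normalization steps might look roughly like this. It is a simplified sketch, not the script's logic: it only follows "url" keys, and find_data_urls, canonical_names, and the path_to_name lookup table are hypothetical.

```python
from typing import Any


def find_data_urls(node: Any) -> list[str]:
    """Recursively collect string values stored under any "url" key."""
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(find_data_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_data_urls(item))
    return found


def canonical_names(spec: dict[str, Any], path_to_name: dict[str, str]) -> list[str]:
    """Map data URLs to canonical names by file name; unknown URLs are skipped."""
    names: list[str] = []
    for url in find_data_urls(spec):
        name = path_to_name.get(url.rsplit("/", 1)[-1])
        if name is not None and name not in names:
            names.append(name)
    return names
```

Altair examples are Python source rather than JSON, so they need a separate, pattern-based extractor; that is where the name mappings described elsewhere in this PR come in.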

Note on Altair examples: Altair maintains examples in two syntax styles: method-based (preferred as of Altair 5) and attribute-based. When the same example exists in both directories (116 cases), this PR uses the method-based version per Altair's documentation. The remaining 69 examples unique to the attribute-based directory are also included, for a total of 185 Altair examples.
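The deduplication rule in the note above amounts to a dictionary merge in which method-based entries win. A minimal sketch, assuming examples are keyed by file name (merge_altair_examples is a hypothetical name):

```python
def merge_altair_examples(
    method_based: dict[str, str], attribute_based: dict[str, str]
) -> dict[str, str]:
    """Merge two {filename: spec_url} maps, preferring method-based on conflict."""
    merged = dict(attribute_based)
    merged.update(method_based)  # method-based wins for shared file names
    return merged
```

With the counts reported above, 116 shared file names plus 69 attribute-only ones give 116 + 69 = 185 merged entries.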

Type Safety

Comprehensive TypedDict definitions for all data structures
Semantic type aliases for domain clarity (CanonicalName, FilePath, etc.)
Protocol-based validation infrastructure for extensibility
Pyright type checking enabled on select scripts
Future-proofing with documented migration paths

Files Changed

New Files

_data/gallery_examples.toml - Configuration (URLs, Altair name mappings, settings)
scripts/generate_gallery_examples.py - Generator script (2,289 lines, fully typed)
gallery_examples.json - Generated output (6,389 lines, ~470 examples)

Modified Files

pyproject.toml - Added Pyright configuration, expanded script coverage
scripts/species.py - Type safety improvements (TypedDict, type guards, aliases)

Quality Assurance

All Checks Pass ✓

✅ TOML formatting: uvx taplo fmt --check --diff
✅ Python linting: uvx ruff check
✅ Python formatting: uvx ruff format --check
✅ Type checking: pyright (0 errors, 1 warning for missing geopandas stubs)
✅ Build success: npm run build
✅ Script execution: uv run scripts/generate_gallery_examples.py --dry-run

Testing

Manual testing: ~470 examples collected successfully
Runtime: ~15 seconds (network-dependent)
Validation: Dataset references validated against datapackage.json
Error handling: Individual example failures don't crash collection
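The error-handling behavior can be pictured as a loop that logs and records failures instead of propagating them. This is a sketch of the idea only; collect_specs and the injectable fetch callable are assumptions, not the script's actual functions.

```python
import logging
from typing import Callable

logger = logging.getLogger("gallery_examples")


def collect_specs(
    urls: list[str], fetch: Callable[[str], str]
) -> tuple[dict[str, str], list[str]]:
    """Fetch every URL; log and record failures instead of aborting the run."""
    specs: dict[str, str] = {}
    failed: list[str] = []
    for url in urls:
        try:
            specs[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad example must not stop the rest
            logger.warning("Skipping %s: %s", url, exc)
            failed.append(url)
    return specs, failed
```

Returning the failed list alongside the results makes it easy to print a summary at the end of a run.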

Usage

Regenerate Gallery Examples Registry

# Standard regeneration
uv run scripts/generate_gallery_examples.py

# Test without writing output
uv run scripts/generate_gallery_examples.py --dry-run

# Enable debug logging
uv run scripts/generate_gallery_examples.py --verbose

# Custom output path
uv run scripts/generate_gallery_examples.py --output custom.json
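The flags above could be parsed with argparse roughly as follows. This is a sketch, not the script's actual parser: the help strings and the hard-coded default for --output are assumptions (per this PR, the default output path is set in _data/gallery_examples.toml).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Regenerate the gallery examples registry."
    )
    parser.add_argument(
        "--dry-run", action="store_true", help="collect examples but write nothing"
    )
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    # Assumed default; the real default comes from the TOML configuration.
    parser.add_argument("--output", default="gallery_examples.json", help="output path")
    return parser
```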

Configuration

Edit _data/gallery_examples.toml to:

Update source URLs (e.g., switch to a specific branch)
Add Altair API name mappings (for camelCase → snake_case)
Adjust network timeout settings
Change default output path

Output Format

The generated gallery_examples.json follows this structure:

{
  "name": "gallery-examples",
  "title": "Vega Ecosystem Gallery Examples Registry",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "created": "2025-10-26T00:35:42.508794+00:00",
  "datapackage": {
    "name": "vega-datasets",
    "version": "3.2.1",
    "path": "./datapackage.json"
  },
  "examples": [
    {
      "id": 1,
      "gallery_name": "altair",
      "example_name": "2D Histogram Heatmap",
      "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
      "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/examples_methods_syntax/histogram_heatmap.py",
      "categories": ["Distributions"],
      "description": "This example shows how to make a heatmap from binned quantitative data.",
      "datasets_used": ["movies"]
    },
    ...
  ]
}
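Given that structure, the dataset-first question ("show me everything I can do with the movies dataset") reduces to a filter over the examples array. A minimal sketch, where examples_using is a hypothetical helper and the inline JSON is a trimmed copy of the record above:

```python
import json
from typing import Any


def examples_using(registry: dict[str, Any], dataset: str) -> list[tuple[str, str]]:
    """Return sorted (gallery_name, example_name) pairs for examples using `dataset`."""
    return sorted(
        (ex["gallery_name"], ex["example_name"])
        for ex in registry["examples"]
        if dataset in ex["datasets_used"]
    )


# In practice the registry would be read from gallery_examples.json.
registry = json.loads("""
{"examples": [{"id": 1, "gallery_name": "altair",
               "example_name": "2D Histogram Heatmap",
               "datasets_used": ["movies"]}]}
""")
print(examples_using(registry, "movies"))  # [('altair', '2D Histogram Heatmap')]
```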

Design Decisions

1. Altair Name Mapping Strategy

Challenge: Altair's Python API uses camelCase names (e.g., data.londonBoroughs.url) while datapackage.json uses snake_case canonical names (e.g., london_boroughs).

Solution: Explicit mappings in _data/gallery_examples.toml (excerpt showing two of the three current mappings):

[altair.name_mapping]
londonBoroughs = "london_boroughs"
londonCentroids = "london_centroids"

Rationale: Transparent, maintainable, and documented as temporary until Altair naming aligns or dataset mapper includes extension-less variants.
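Applying the mapping is a dictionary lookup with pass-through for names that already match. The sketch below uses only the two mappings shown in the TOML excerpt; to_canonical is a hypothetical helper, not the script's function.

```python
from typing import Optional

# The two mappings shown in the TOML excerpt above.
NAME_MAPPING = {
    "londonBoroughs": "london_boroughs",
    "londonCentroids": "london_centroids",
}


def to_canonical(api_name: str, mapping: Optional[dict[str, str]] = None) -> str:
    """Translate an Altair API name; unmapped names pass through unchanged."""
    return (mapping or {}).get(api_name, api_name)
```

Because unmapped names pass through, removing the mapping section later should not require code changes: calling the function with no mapping still returns the name as given.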

2. Committed Generated Artifact

Decision: gallery_examples.json is committed to the repository (like datapackage.json).

Rationale:

Matches vega-datasets' pattern for generated metadata
Enables downstream tools to consume without generation
Changes are visible in git diff for review
Manual regeneration when upstream galleries change

3. Type Safety with TypedDict

Decision: Use TypedDict throughout instead of dataclasses.

Rationale:

Direct JSON mapping for I/O boundaries
Lightweight (no runtime overhead)
Consistent with vega-datasets' JSON-centric architecture
Validation handled explicitly by DatasetValidator protocol

4. Graceful Degradation for External Datasets

Decision: Log warnings for unknown datasets but don't fail processing.

Rationale:

Gallery examples may reference datasets not yet in vega-datasets
Examples may use custom/external data sources
Full collection more valuable than strict validation
Warnings provide visibility for investigation
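A sketch of how warn-but-continue validation might sit behind a Protocol. DatasetValidator is named in this PR, but the is_known method, the KnownNamesValidator class, and check_references are assumptions made for illustration.

```python
import logging
from typing import Protocol

logger = logging.getLogger("gallery_examples")


class DatasetValidator(Protocol):
    """Structural interface: any object with a matching is_known() qualifies."""

    def is_known(self, name: str) -> bool: ...


class KnownNamesValidator:
    """Hypothetical validator backed by canonical names from datapackage.json."""

    def __init__(self, names: set[str]) -> None:
        self._names = names

    def is_known(self, name: str) -> bool:
        return name in self._names


def check_references(names: list[str], validator: DatasetValidator) -> list[str]:
    """Warn about unknown datasets without failing; return the unknown names."""
    unknown = [n for n in names if not validator.is_known(n)]
    for n in unknown:
        logger.warning("Dataset %r not found in datapackage.json", n)
    return unknown
```

Because the Protocol is structural, a stricter validator (for example, one that raises) could be swapped in without touching the collection code.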

Breaking Changes

None - this is purely additive.

Related Work

Complements datapackage.json by adding usage perspective
Could inform future dataset curation decisions
May help identify candidates for deprecation or expansion

Future Enhancements

Considered but deferred:

Async HTTP requests: Could reduce runtime from ~15s to ~5s (out of scope for v1)
Dataset usage aggregation: Inverse mapping from datasets to examples (can add later)
Visualization spec metadata: Extract marks, transforms, encodings, interaction types from specs (valuable for analysis but adds complexity)
Trend tracking: Historical gallery growth over time (requires periodic snapshots)

Extension points:

DatasetValidator protocol enables custom validation strategies
DatasetReference type alias has documented migration path to object structure
Configuration externalized for easy adjustment without code changes

Checklist

Implementation Quality ✅

Code follows vega-datasets conventions
All quality checks pass (taplo, ruff, pyright, build)
Documentation is comprehensive
Configuration is externalized
Type safety is comprehensive
Error handling is robust
No breaking changes

Decisions Needed 🤔

JSON Schema - Approve current flat structure or request changes (see Question 1)
URL Validation - Approve no CI validation or request addition (see Question 2)
Merge Timing - Approve merge now or wait for Altair PR (see Question 3)
Dataset Analytics - Approve defer to v2 or request now (see Question 4)
Build Integration - Approve hybrid npm script approach (see Question 5)
Documentation - Approve README + CONTRIBUTING approach (see Question 6)

Before Marking Ready for Review

Fix bad URLs in gallery_examples.json (identified during testing)
Address feedback on open questions above
Add npm script for convenient regeneration (if approved)
Add documentation per approved approach
Squash commits to single atomic commit

Adds Pyright type checking to the project with initial coverage of select scripts. Configuration uses 'basic' mode for gradual typing adoption.

Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py

Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem, GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension, ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports

These changes ensure the build passes with Pyright enabled while improving code maintainability and IDE support.

Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite, and Altair galleries, tracking which datasets each example uses.

New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)

When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems

Examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns.

Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5) when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility

Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)

Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package to altair.datasets module with canonical vega-datasets naming. This updates the gallery examples collection to track Altair v6+ main branch.

Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)

Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns

Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)

Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing

All three galleries (Vega, Vega-Lite, Altair) now use consistent canonical dataset naming from datapackage.json.

Related: vega/altair#3859

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

dsmedia mentioned this pull request Oct 26, 2025

Best practice for dependency version pinning: ruff false positive gap between versions #725

Closed

dsmedia added enhancement documentation labels Oct 26, 2025

dsmedia requested review from domoritz and mattijn October 26, 2025 02:47

dsmedia mentioned this pull request Oct 26, 2025

chore(deps-dev): bump ruff from 0.8.3 to 0.14.2 in the dev group #726

Merged

dsmedia added 2 commits October 26, 2025 13:56

dsmedia force-pushed the feat/generate-gallery-examples branch from 823c2b9 to 0815882 Compare October 26, 2025 14:00
