@dsmedia dsmedia commented Oct 26, 2025

Gallery Examples Registry: Mapping Examples to Datasets

Summary

Our galleries are siloed across three repositories with no unified view. This PR adds a cross-ecosystem registry (gallery_examples.json) that catalogs ~470 examples from Vega, Vega-Lite, and Altair and tracks which datasets each example uses. These examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns. When joined with datapackage.json, the registry enables dataset-first learning ("show me everything I can do with flights data") and curation analytics (dataset coverage matrices), and it provides structured training data for visualization AI/ML systems.


Open Questions & Decisions Needed

Status: Draft - This implementation is nearly complete. I'm seeking feedback on design decisions before locking in the approach.

1. JSON Schema Design

Question: Is the current structure optimal, or should we adjust before it becomes a committed format?

Current approach:

  • Flat array of examples (not nested by gallery)
  • datasets_used as string array: ["cars", "movies"]
  • Each example has: id, gallery_name, example_name, example_url, spec_url, categories, description, datasets_used

Alternatives considered:

  • Nest by gallery: {"vega": [...], "vega-lite": [...], "altair": [...]}
  • Object array for datasets: [{"name": "cars", "role": "primary"}]
  • Add spec metadata: marks, transforms, encodings

Recommended approach: Keep current flat structure

  • Rationale: Easier to query (no nested navigation), follows the datapackage.json pattern (flat resources array), and stays extensible because the DatasetReference type alias has a documented migration path
  • Migration: Type alias includes documented path to object structure if needed later
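
A minimal sketch of the flat structure as Python types, using the field names listed above. Only DatasetReference is named in this PR; GalleryExample, DatasetReferenceObject, and dataset_name are hypothetical names for illustration, and the definitions in the script may differ.

```python
from typing import TypedDict, Union

# Today a dataset reference is just a canonical name such as "cars".
DatasetReference = str


class DatasetReferenceObject(TypedDict):
    """Hypothetical future shape if references ever carry a role."""

    name: str
    role: str


class GalleryExample(TypedDict):
    """One entry of the flat `examples` array."""

    id: int
    gallery_name: str
    example_name: str
    example_url: str
    spec_url: str
    categories: list[str]
    description: str
    datasets_used: list[DatasetReference]


def dataset_name(ref: Union[str, DatasetReferenceObject]) -> str:
    """Return the canonical name for either reference shape."""
    return ref if isinstance(ref, str) else ref["name"]
```

Consumers that read references through a helper like dataset_name would keep working if the string array were later migrated to an object array.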

2. URL Validation Strategy

Question: Should we add CI validation for ~470 external URLs? There are also URLs in the datapackage file, which can become out of date.

Options:

  • A) No validation (current): Manual regeneration when galleries change
  • B) Weekly scheduled check: lychee in GitHub Actions
  • C) Pre-commit hook: Validate before committing changes

Recommended approach: Option A (no automated validation)

  • Rationale: A broken external URL doesn't indicate a bug in our code, automated checks would create noisy failures, the galleries already validate their own URLs, and manual regeneration on demand is sufficient
  • Alternative: If we add validation, use weekly scheduled check (non-blocking) for monitoring only

3. Merge Timing with Altair

Question: Wait for Altair PR #3859 or merge now?

Current state: 3-line name mapping config handles Altair's camelCase API names
After Altair lands: Name mapping can be removed entirely (confirmed by @mattijn)

Recommended approach: Merge before Altair PR #3859

  • Rationale: Minimal temporary code (3 mappings), immediate value for analysis, easy cleanup after Altair migrates, doesn't block either PR
  • Follow-up: Simple config edit + regeneration after Altair PR merges

4. Dataset Usage Analytics

Question: Add inverse mapping (datasets → examples) to the JSON output?

Example:

"dataset_usage": {
  "cars": {
    "usage_count": 42,
    "galleries": ["vega", "vega-lite", "altair"],
    "example_ids": [1, 5, 12, ...]
  }
}

Recommended approach: Defer to v2

  • Rationale: The mapping can be computed from the current structure, including it would increase file size, and it is easy to add later once usage patterns emerge
  • Note: Added to "Considered but deferred" section
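
To illustrate why the inverse mapping can be deferred, here is a sketch that derives it from the flat examples array. It assumes each example carries the id, gallery_name, and datasets_used fields described above; build_dataset_usage is a hypothetical helper, not part of the script.

```python
from collections import defaultdict
from typing import Any


def build_dataset_usage(examples: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Invert the flat examples array into a datasets -> examples mapping."""
    ids: dict[str, set[int]] = defaultdict(set)
    galleries: dict[str, set[str]] = defaultdict(set)
    for example in examples:
        for dataset in example["datasets_used"]:
            ids[dataset].add(example["id"])
            galleries[dataset].add(example["gallery_name"])
    return {
        dataset: {
            "usage_count": len(ids[dataset]),
            "galleries": sorted(galleries[dataset]),
            "example_ids": sorted(ids[dataset]),
        }
        for dataset in sorted(ids)
    }


# Invented sample rows, only to show the call shape.
sample = [
    {"id": 1, "gallery_name": "altair", "datasets_used": ["movies"]},
    {"id": 2, "gallery_name": "vega-lite", "datasets_used": ["cars", "movies"]},
]
usage = build_dataset_usage(sample)
# usage["movies"]["example_ids"] is [1, 2]
```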

5. Build Integration

Question: Integrate into npm run build or keep manual?

Options:

  • A) Manual regeneration (current): uv run scripts/generate_gallery_examples.py
  • B) Automatic in build: Add to npm run build
  • C) Hybrid: Add npm script npm run update-gallery for convenient manual trigger

Recommended approach: Option C (hybrid)

  • Rationale: Avoids slowing down every build while keeping regeneration convenient; galleries change less frequently than datasets, so on-demand regeneration is sufficient
  • Implementation: Add "update-gallery": "uv run scripts/generate_gallery_examples.py" to package.json

6. Documentation

Question: Where to document this feature?

Recommended approach: Both README.md and CONTRIBUTING.md

  • README.md: Add link to gallery_examples.json in "About" section (shows it's a first-class artifact)
  • CONTRIBUTING.md: Document regeneration process and when to update
  • Alternative: Create GALLERY_REGISTRY.md if more extensive docs needed

Motivation

Building on the recent addition of datapackage.json to vega-datasets, this PR addresses a gap in our example galleries: they're currently siloed across Vega, Vega-Lite, and Altair repositories with no unified registry.

While our galleries are one of the best ways for users to learn, there's no single place to see all examples and the datasets they use. This PR creates that unified view.

Key Benefits

1. Enhanced Learning & Discovery

  • Users can easily find all examples using a specific dataset (e.g., flights-2k.json) across Vega, Vega-Lite, and Altair, making it easier to learn different techniques on the same data
  • Creates clear learning pathways: see how to extend a Vega-Lite spec into Vega by comparing examples using the same dataset
  • Powers richer, cross-library tutorials and documentation (e.g., on Kaggle, Observable, etc.)
  • Enables dataset-first exploration: "Show me everything I can do with the cars dataset"

2. Improved Gallery Maintenance & Curation

  • Build a dataset coverage matrix to identify under-utilized datasets and diversify examples beyond the usual suspects (cars, movies, etc.)
  • Spot inconsistencies or opportunities for alignment across galleries
  • Identify gaps: which datasets lack examples? Which chart types are under-represented?
  • Simplify testing: changes to vega-datasets can be cross-checked against all dependent examples

3. Foundation for Advanced Tooling (ML/AI)

  • Combines structured example specs + structured dataset metadata (datapackage.json) to provide rich training data
  • Initial testing shows LLMs get significantly better at recommending appropriate chart types when given both files
  • Potential grows with richer semantic metadata in datapackage.json (cardinality, skew, distributions) - aligning with Draco concepts
  • Enables visualization recommendation systems based on real-world design patterns

4. Community & Research

  • Provides structured corpus for researchers studying visualization design, perception, and recommendation systems
  • Makes it easier for community to contribute examples by showing where gaps exist
  • Supports cross-ecosystem collaboration and alignment

Coordination with Altair Changes

This PR includes temporary manual name mappings for Altair's dataset API (e.g., data.londonBoroughs() → london_boroughs) in _data/gallery_examples.toml. Altair PR #3859 is migrating Altair to use an internal altair.datasets module that sources from vega-datasets directly.

Current state (temporary):

  • Altair examples use API names like data.londonBoroughs() that don't match datapackage.json canonical names (london_boroughs)
  • This PR uses explicit mappings in config to handle these differences
  • Only 3 mappings needed currently: londonBoroughs, londonCentroids, londonTubeLines

After Altair PR #3859 lands:

  • Altair will use the canonical identifiers from datapackage.json (confirmed by @mattijn)
  • Name mapping section can be removed entirely - no more workarounds needed
  • ✅ Simply regenerate gallery_examples.json to pick up the aligned names
  • ✅ Any new Altair examples will automatically work without config changes

Why this matters for the registry:

  • Once Altair migrates, all three galleries (Vega, Vega-Lite, Altair) will reference datasets consistently
  • Dataset extraction becomes simpler and more robust
  • This is one reason this PR is marked as Draft - we're coordinating timing with Altair's migration to minimize temporary code

The Power of the Join

When gallery_examples.json is joined with datapackage.json, queries like the following become possible (illustrative SQL; gallery_examples is assumed to be flattened to one row per example-dataset pair, with the dataset name in dataset_used):

-- Find all examples using time-series data
SELECT example_name, gallery_name
FROM gallery_examples
JOIN datasets ON dataset_used = dataset_name
WHERE temporal_coverage IS NOT NULL;

-- Identify under-utilized datasets (includes datasets with no examples)
SELECT dataset_name, COUNT(example_name) AS example_count
FROM datasets
LEFT JOIN gallery_examples ON dataset_used = dataset_name
GROUP BY dataset_name
HAVING COUNT(example_name) < 3;

-- Cross-library learning paths: pairs of examples that share a dataset
SELECT vl.example_name AS vega_lite_example,
       v.example_name AS vega_example
FROM gallery_examples vl
JOIN gallery_examples v
  ON vl.dataset_used = v.dataset_used
WHERE vl.gallery_name = 'vega-lite'
  AND v.gallery_name = 'vega';

This registry is the first step toward treating our gallery as data - enabling the same kinds of analysis and tooling we build for visualizing datasets.
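
The same kind of join can be sketched in plain Python. This assumes the resources array in datapackage.json exposes a name field that matches the canonical names stored in datasets_used; that match is an assumption, and underused_datasets is a hypothetical helper.

```python
from collections import Counter
from typing import Any


def underused_datasets(
    resources: list[dict[str, Any]],
    examples: list[dict[str, Any]],
    threshold: int = 3,
) -> dict[str, int]:
    """Left-join datasets to examples and keep those below the threshold.

    Datasets with no examples at all are reported with a count of 0.
    """
    counts = Counter(
        name for example in examples for name in set(example["datasets_used"])
    )
    return {
        resource["name"]: counts[resource["name"]]
        for resource in resources
        if counts[resource["name"]] < threshold
    }


# Invented sample data; in practice both lists would come from
# json.load() on datapackage.json and gallery_examples.json.
resources = [{"name": "cars"}, {"name": "movies"}, {"name": "flights"}]
examples = [
    {"datasets_used": ["cars"]},
    {"datasets_used": ["cars", "movies"]},
    {"datasets_used": ["cars"]},
]
print(underused_datasets(resources, examples))
```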

Implementation Details

Architecture

Follows vega-datasets' two-phase metadata generation pattern (configuration, then generation), which produces a committed output file:

  1. Configuration: _data/gallery_examples.toml (externalizes URLs, mappings, settings)
  2. Generation: scripts/generate_gallery_examples.py (collection and extraction logic)
  3. Output: gallery_examples.json (committed artifact, ~470 examples)

Data Collection Process

  1. Fetches example metadata from three galleries (Vega, Vega-Lite, Altair)
  2. Retrieves individual specifications/code files (~470 HTTP requests)
  3. Extracts dataset references (handles different spec formats per framework)
  4. Normalizes references to canonical datapackage.json names
  5. Outputs cross-reference catalog with comprehensive metadata
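
Steps 3 and 4 can be sketched for JSON specs as follows. This is not the script's actual extraction logic: it assumes gallery specs reference datasets through "url" values of the form data/<file>, and that the file stem is a reasonable candidate for the canonical name. Both function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Any


def extract_dataset_urls(node: Any) -> list[str]:
    """Collect every string stored under a "url" key, at any nesting depth."""
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(extract_dataset_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_dataset_urls(item))
    return found


def to_candidate_names(urls: list[str]) -> list[str]:
    """Reduce "data/cars.json" to "cars"; keeps order, drops duplicates.

    URLs outside data/ are treated as external and ignored here.
    """
    names = [PurePosixPath(url).stem for url in urls if url.startswith("data/")]
    return list(dict.fromkeys(names))


spec = {
    "data": {"url": "data/cars.json"},
    "layer": [{"data": {"url": "data/movies.json"}}],
}
print(to_candidate_names(extract_dataset_urls(spec)))  # ['cars', 'movies']
```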

Note on Altair examples: Altair maintains examples in two syntax styles: method-based (preferred as of Altair 5) and attribute-based. When the same example exists in both directories (116 cases), this PR uses the method-based version per Altair's documentation. The remaining 69 examples unique to the attribute-based directory are also included, for a total of 185 Altair examples.
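
The deduplication rule can be expressed as a small sketch. The directory labels "method" and "attribute" and the function name are hypothetical; the only behavior taken from this PR is that the method-based version wins when an example exists in both directories.

```python
def select_altair_examples(
    method_based: set[str], attribute_based: set[str]
) -> dict[str, str]:
    """Map each example filename to the syntax directory it is read from."""
    # Every method-based example is used as-is.
    chosen = {name: "method" for name in method_based}
    # Attribute-based examples are added only when no method-based twin exists.
    for name in attribute_based:
        chosen.setdefault(name, "attribute")
    return chosen


picked = select_altair_examples(
    {"histogram_heatmap.py", "bar_chart.py"},
    {"bar_chart.py", "legacy_only.py"},
)
# picked["bar_chart.py"] is "method"; picked["legacy_only.py"] is "attribute"
```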

Type Safety

  • Comprehensive TypedDict definitions for all data structures
  • Semantic type aliases for domain clarity (CanonicalName, FilePath, etc.)
  • Protocol-based validation infrastructure for extensibility
  • Pyright type checking enabled on select scripts
  • Future-proofing with documented migration paths

Files Changed

New Files

  • _data/gallery_examples.toml - Configuration (URLs, Altair name mappings, settings)
  • scripts/generate_gallery_examples.py - Generator script (2,289 lines, fully typed)
  • gallery_examples.json - Generated output (6,389 lines, ~470 examples)

Modified Files

  • pyproject.toml - Added Pyright configuration, expanded script coverage
  • scripts/species.py - Type safety improvements (TypedDict, type guards, aliases)

Quality Assurance

All Checks Pass ✓

  • ✅ TOML formatting: uvx taplo fmt --check --diff
  • ✅ Python linting: uvx ruff check
  • ✅ Python formatting: uvx ruff format --check
  • ✅ Type checking: pyright (0 errors, 1 warning for missing geopandas stubs)
  • ✅ Build success: npm run build
  • ✅ Script execution: uv run scripts/generate_gallery_examples.py --dry-run

Testing

  • Manual testing: ~470 examples collected successfully
  • Runtime: ~15 seconds (network-dependent)
  • Validation: Dataset references validated against datapackage.json
  • Error handling: Individual example failures don't crash collection
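
The error-handling behavior above can be sketched like this. The fetch callable is injected so the loop can be exercised without network access; collect_all and the logger name are hypothetical, and the real script may structure this differently.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("gallery_examples")


def collect_all(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> tuple[dict[str, str], list[str]]:
    """Fetch every URL, recording failures instead of aborting the run."""
    collected: dict[str, str] = {}
    failed: list[str] = []
    for url in urls:
        try:
            collected[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad example must not stop the rest
            logger.warning("Skipping %s: %s", url, exc)
            failed.append(url)
    return collected, failed
```

Returning the failed URLs alongside the results keeps the run going while still making the gaps visible to whoever regenerates the file.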

Usage

Regenerate Gallery Examples Registry

# Standard regeneration
uv run scripts/generate_gallery_examples.py

# Test without writing output
uv run scripts/generate_gallery_examples.py --dry-run

# Enable debug logging
uv run scripts/generate_gallery_examples.py --verbose

# Custom output path
uv run scripts/generate_gallery_examples.py --output custom.json
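
For orientation, the three flags above could be declared with argparse roughly as follows. Whether the script uses argparse, and the default output path shown here, are assumptions; only the flag names come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags shown in the usage commands."""
    parser = argparse.ArgumentParser(
        description="Generate the gallery examples registry."
    )
    parser.add_argument(
        "--dry-run", action="store_true", help="collect examples without writing output"
    )
    parser.add_argument(
        "--verbose", action="store_true", help="enable debug logging"
    )
    parser.add_argument(
        "--output",
        default="gallery_examples.json",  # assumed default
        help="path of the generated JSON file",
    )
    return parser


args = build_parser().parse_args(["--dry-run", "--output", "custom.json"])
# args.dry_run is True and args.output is "custom.json"
```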

Configuration

Edit _data/gallery_examples.toml to:

  • Update source URLs (e.g., switch to a specific branch)
  • Add Altair API name mappings (for camelCase → snake_case)
  • Adjust network timeout settings
  • Change default output path

Output Format

The generated gallery_examples.json follows this structure:

{
  "name": "gallery-examples",
  "title": "Vega Ecosystem Gallery Examples Registry",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "created": "2025-10-26T00:35:42.508794+00:00",
  "datapackage": {
    "name": "vega-datasets",
    "version": "3.2.1",
    "path": "./datapackage.json"
  },
  "examples": [
    {
      "id": 1,
      "gallery_name": "altair",
      "example_name": "2D Histogram Heatmap",
      "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
      "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/examples_methods_syntax/histogram_heatmap.py",
      "categories": ["Distributions"],
      "description": "This example shows how to make a heatmap from binned quantitative data.",
      "datasets_used": ["movies"]
    },
    ...
  ]
}

Design Decisions

1. Altair Name Mapping Strategy

Challenge: Altair's Python API uses camelCase names (e.g., data.londonBoroughs.url) while datapackage.json uses snake_case canonical names (e.g., london_boroughs).

Solution: Explicit mappings in _data/gallery_examples.toml (two of the three shown):

[altair.name_mapping]
londonBoroughs = "london_boroughs"
londonCentroids = "london_centroids"

Rationale: Transparent, maintainable, and documented as temporary until Altair naming aligns or dataset mapper includes extension-less variants.
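
A sketch of how such a mapping might be applied while scanning Altair example source. The regular expression and the function name altair_dataset_names are hypothetical and cover only the data.<name>.url and data.<name>() accessors mentioned in this PR; the script's actual patterns may differ.

```python
import re
from typing import Optional

# Matches accessors such as data.londonBoroughs.url or data.cars()
_DATA_ACCESS = re.compile(r"\bdata\.(\w+)(?=\.url\b|\()")


def altair_dataset_names(
    source: str, name_mapping: Optional[dict[str, str]] = None
) -> list[str]:
    """Return dataset names referenced in Altair code, mapped to canonical form.

    Names without a mapping entry pass through unchanged.
    """
    mapping = name_mapping or {}
    names = [mapping.get(name, name) for name in _DATA_ACCESS.findall(source)]
    return list(dict.fromkeys(names))  # de-duplicate, keep first-seen order


code = 'geo = alt.topo_feature(data.londonBoroughs.url, "boroughs")\ncars = data.cars()\n'
print(altair_dataset_names(code, {"londonBoroughs": "london_boroughs"}))
# ['london_boroughs', 'cars']
```

Passing the mapping as a parameter rather than reading it from module state means an empty or missing mapping simply leaves every name unchanged.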

2. Committed Generated Artifact

Decision: gallery_examples.json is committed to the repository (like datapackage.json).

Rationale:

  • Matches vega-datasets' pattern for generated metadata
  • Enables downstream tools to consume without generation
  • Changes are visible in git diff for review
  • Regeneration is manual and happens when upstream galleries change

3. Type Safety with TypedDict

Decision: Use TypedDict throughout instead of dataclasses.

Rationale:

  • Direct JSON mapping for I/O boundaries
  • Lightweight (no runtime overhead)
  • Consistent with vega-datasets' JSON-centric architecture
  • Validation handled explicitly by DatasetValidator protocol

4. Graceful Degradation for External Datasets

Decision: Log warnings for unknown datasets but don't fail processing.

Rationale:

  • Gallery examples may reference datasets not yet in vega-datasets
  • Examples may use custom/external data sources
  • Full collection more valuable than strict validation
  • Warnings provide visibility for investigation
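
A sketch of protocol-based validation with graceful degradation. DatasetValidator is named in this PR, but its method set is not shown here, so is_known, DatapackageValidator, and check_references are hypothetical.

```python
import logging
from typing import Protocol

logger = logging.getLogger("gallery_examples")


class DatasetValidator(Protocol):
    """Anything with an is_known() method satisfies this protocol."""

    def is_known(self, name: str) -> bool: ...


class DatapackageValidator:
    """Validates names against the set of names found in datapackage.json."""

    def __init__(self, known_names: set[str]) -> None:
        self._known = known_names

    def is_known(self, name: str) -> bool:
        return name in self._known


def check_references(names: list[str], validator: DatasetValidator) -> list[str]:
    """Return unknown names, logging a warning for each instead of raising."""
    unknown = [name for name in names if not validator.is_known(name)]
    for name in unknown:
        logger.warning("Dataset %r not found in datapackage.json", name)
    return unknown
```

Because the check only warns, an example that uses an external or not-yet-published dataset still ends up in the registry.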

Breaking Changes

None - this is purely additive.

Related Work

  • Complements datapackage.json by adding usage perspective
  • Could inform future dataset curation decisions
  • May help identify candidates for deprecation or expansion

Future Enhancements

Considered but deferred:

  1. Async HTTP requests: Could reduce runtime from ~15s to ~5s (out of scope for v1)
  2. Dataset usage aggregation: Inverse mapping from datasets to examples (can add later)
  3. Visualization spec metadata: Extract marks, transforms, encodings, interaction types from specs (valuable for analysis but adds complexity)
  4. Trend tracking: Historical gallery growth over time (requires periodic snapshots)

Extension points:

  • DatasetValidator protocol enables custom validation strategies
  • DatasetReference type alias has documented migration path to object structure
  • Configuration externalized for easy adjustment without code changes

Checklist

Implementation Quality

  • Code follows vega-datasets conventions
  • All quality checks pass (taplo, ruff, pyright, build)
  • Documentation is comprehensive
  • Configuration is externalized
  • Type safety is comprehensive
  • Error handling is robust
  • No breaking changes

Decisions Needed 🤔

  • JSON Schema - Approve current flat structure or request changes (see Question 1)
  • URL Validation - Approve no CI validation or request addition (see Question 2)
  • Merge Timing - Approve merge now or wait for Altair PR (see Question 3)
  • Dataset Analytics - Approve defer to v2 or request now (see Question 4)
  • Build Integration - Approve hybrid npm script approach (see Question 5)
  • Documentation - Approve README + CONTRIBUTING approach (see Question 6)

Before Marking Ready for Review

  • Fix bad URLs in gallery_examples.json (identified during testing)
  • Address feedback on open questions above
  • Add npm script for convenient regeneration (if approved)
  • Add documentation per approved approach
  • Squash commits to single atomic commit

Adds Pyright type checking to the project with initial coverage of select
scripts. Configuration uses 'basic' mode for gradual typing adoption.

Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py

Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem,
  GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension,
  ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports

These changes ensure the build passes with Pyright enabled while improving
code maintainability and IDE support.

Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite,
and Altair galleries, tracking which datasets each example uses.

New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)

When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems

Examples are curated by the Vega community to demonstrate essential
visualization techniques and design patterns.

Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5)
  when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after
  Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility

Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)

@dsmedia dsmedia force-pushed the feat/generate-gallery-examples branch from 823c2b9 to 0815882 on October 26, 2025 14:00

Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package
to altair.datasets module with canonical vega-datasets naming. This
updates the gallery examples collection to track Altair v6+ main branch.

Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)

Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter
  instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns

Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)

Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing

All three galleries (Vega, Vega-Lite, Altair) now use consistent
canonical dataset naming from datapackage.json.

Related: vega/altair#3859

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>