feat: add gallery examples registry mapping examples to datasets #724
Draft
dsmedia wants to merge 3 commits into vega:main from dsmedia:feat/generate-gallery-examples
+8,880
−18
Adds Pyright type checking to the project with initial coverage of select scripts. Configuration uses 'basic' mode for gradual typing adoption.
Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py
Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem, GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension, ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports
These changes ensure the build passes with Pyright enabled while improving code maintainability and IDE support.
Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite, and Altair galleries, tracking which datasets each example uses.
New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)
When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems
Examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns.
Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5) when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility
Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)
823c2b9 to 0815882
Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package to altair.datasets module with canonical vega-datasets naming. This updates the gallery examples collection to track Altair v6+ main branch.
Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)
Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns
Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)
Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing
All three galleries (Vega, Vega-Lite, Altair) now use consistent canonical dataset naming from datapackage.json.
Related: vega/altair#3859
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This was referenced Oct 27, 2025
Gallery Examples Registry: Mapping Examples to Datasets
Summary
Our galleries are siloed across three repositories with no unified view. This PR adds a cross-ecosystem registry (gallery_examples.json) cataloging ~470 examples from Vega, Vega-Lite, and Altair, tracking which datasets each example uses. These examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns. When joined with datapackage.json, the registry enables dataset-first learning ("show me everything I can do with flights data"), curation analytics (dataset coverage matrices), and provides high-quality structured training data for visualization AI/ML systems.
Open Questions & Decisions Needed
Status: Draft - This implementation is nearly complete. I'm seeking feedback on design decisions before locking in the approach.
1. JSON Schema Design
Question: Is the current structure optimal, or should we adjust before it becomes a committed format?
Current approach:
- datasets_used as string array: ["cars", "movies"]
Alternatives considered:
- {"vega": [...], "vega-lite": [...], "altair": [...]}
- [{"name": "cars", "role": "primary"}]
Recommended approach: Keep current flat structure
- DatasetReference type alias migration path
2. URL Validation Strategy
Question: Should we add CI validation for ~470 external URLs? There are also URLs in the datapackage file, which can become out of date.
Options:
- lychee in GitHub Actions
Recommended approach: Option A (no automated validation)
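If the URLs are checked by hand from time to time rather than in CI, they first need to be gathered from the registry. A minimal sketch, assuming the record fields shown in this PR's output format (example_url, spec_url); collect_urls is a hypothetical helper and is not part of the generator script:

```python
from typing import Any


def collect_urls(registry: dict[str, Any]) -> list[str]:
    """Return the sorted, de-duplicated URLs referenced by registry examples."""
    urls: set[str] = set()
    for example in registry.get("examples", []):
        for key in ("example_url", "spec_url"):
            url = example.get(key)
            if url:  # skip missing or empty fields
                urls.add(url)
    return sorted(urls)
```

The resulting list could be written to a text file, one URL per line, and handed to a link checker such as lychee whenever a manual check is wanted.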
3. Merge Timing with Altair
Question: Wait for Altair PR #3859 or merge now?
Current state: 3-line name mapping config handles Altair's camelCase API names
After Altair lands: Name mapping can be removed entirely (confirmed by @mattijn)
Recommended approach: can merge before Altair PR #3859
4. Dataset Usage Analytics
Question: Add inverse mapping (datasets → examples) to the JSON output?
Example:
Recommended approach: Defer to v2
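Until an inverse mapping ships in the JSON itself, consumers can derive it from the examples array. A minimal sketch, assuming each record carries the id and datasets_used fields from the output format in this PR; invert_dataset_usage is a hypothetical name:

```python
from collections import defaultdict
from typing import Any


def invert_dataset_usage(examples: list[dict[str, Any]]) -> dict[str, list[int]]:
    """Map each dataset name to the ids of the examples that use it."""
    usage: defaultdict[str, list[int]] = defaultdict(list)
    for example in examples:
        for dataset in example.get("datasets_used", []):
            usage[dataset].append(example["id"])
    # Sort keys and ids so the result is stable across regenerations.
    return {name: sorted(ids) for name, ids in sorted(usage.items())}
```

Deriving the mapping on demand keeps the committed file smaller and avoids storing the same relationship twice, which is one argument for deferring it.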
5. Build Integration
Question: Integrate into npm run build or keep manual?
Options:
- uv run scripts/generate_gallery_examples.py
- npm run build
- npm run update-gallery for convenient manual trigger
Recommended approach: Option C (hybrid)
- "update-gallery": "uv run scripts/generate_gallery_examples.py" to package.json
6. Documentation
Question: Where to document this feature?
Recommended approach: Both README.md and CONTRIBUTING.md
- gallery_examples.json in "About" section (shows it's a first-class artifact)
- GALLERY_REGISTRY.md if more extensive docs needed
Motivation
Building on the recent addition of datapackage.json to vega-datasets, this PR addresses a gap in our example galleries: they're currently siloed across Vega, Vega-Lite, and Altair repositories with no unified registry.
While our galleries are one of the best ways for users to learn, there's no single place to see all examples and the datasets they use. This PR creates that unified view.
Key Benefits
1. Enhanced Learning & Discovery
flights-2k.json) across Vega, Vega-Lite, and Altair, making it easier to learn different techniques on the same data
2. Improved Gallery Maintenance & Curation
3. Foundation for Advanced Tooling (ML/AI)
datapackage.json) to provide rich training data
datapackage.json (cardinality, skew, distributions) - aligning with Draco concepts
4. Community & Research
Coordination with Altair Changes
This PR includes temporary manual name mappings for Altair's dataset API (e.g., data.londonBoroughs() → london_boroughs) in _data/gallery_examples.toml. Altair PR #3859 is migrating Altair to use an internal altair.datasets module that sources from vega-datasets directly.
Current state (temporary):
- data.londonBoroughs() that don't match datapackage.json canonical names (london_boroughs)
- londonBoroughs, londonCentroids, londonTubeLines
After Altair PR #3859 lands:
- gallery_examples.json to pick up the aligned names
Why this matters for the registry:
The Power of the Join
When gallery_examples.json is joined with datapackage.json, we unlock:
This registry is the first step toward treating our gallery as data - enabling the same kinds of analysis and tooling we build for visualizing datasets.
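To make the join concrete, here is a minimal sketch of a coverage and gap analysis. It assumes that each datapackage.json resource has a name field whose value matches the entries in datasets_used; the function name and the sample records in the usage note are hypothetical.

```python
from typing import Any


def join_examples_with_resources(
    examples: list[dict[str, Any]],
    resources: list[dict[str, Any]],
) -> tuple[dict[str, list[str]], list[str]]:
    """Return (example names per dataset, datasets with no examples)."""
    by_dataset: dict[str, list[str]] = {r["name"]: [] for r in resources}
    for example in examples:
        for dataset in example.get("datasets_used", []):
            # Ignore references that are not part of the data package.
            if dataset in by_dataset:
                by_dataset[dataset].append(example["example_name"])
    unused = sorted(name for name, used in by_dataset.items() if not used)
    return by_dataset, unused
```

Called with the examples array from gallery_examples.json and the resources array from datapackage.json, the first return value answers "which examples use this dataset" and the second lists datasets that no gallery example touches yet.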
Implementation Details
Architecture
Follows vega-datasets' two-phase metadata generation pattern:
- _data/gallery_examples.toml (externalizes URLs, mappings, settings)
- scripts/generate_gallery_examples.py (collection and extraction logic)
- gallery_examples.json (committed artifact, ~470 examples)
Data Collection Process
Note on Altair examples: Altair maintains examples in two syntax styles: method-based (preferred as of Altair 5) and attribute-based. When the same example exists in both directories (116 cases), this PR uses the method-based version per Altair's documentation. The remaining 69 examples unique to the attribute-based directory are also included, for a total of 185 Altair examples.
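The deduplication rule can be sketched as a dictionary merge. This is an illustration only: it assumes examples are matched by file name across the two directories, and deduplicate_altair_examples is a hypothetical name rather than the function used in the generator script.

```python
def deduplicate_altair_examples(
    method_based: dict[str, str],
    attribute_based: dict[str, str],
) -> dict[str, str]:
    """Merge two {example file name: spec URL} maps, preferring method-based."""
    merged = dict(attribute_based)
    merged.update(method_based)  # the method-based entry wins on a shared name
    return merged
```

With 116 names present in both maps and 69 names present only in the attribute-based map, the merged result has 116 + 69 = 185 entries, matching the Altair total stated above.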
Type Safety
Files Changed
New Files
- _data/gallery_examples.toml - Configuration (URLs, Altair name mappings, settings)
- scripts/generate_gallery_examples.py - Generator script (2,289 lines, fully typed)
- gallery_examples.json - Generated output (6,389 lines, ~470 examples)
Modified Files
- pyproject.toml - Added Pyright configuration, expanded script coverage
- scripts/species.py - Type safety improvements (TypedDict, type guards, aliases)
Quality Assurance
All Checks Pass ✓
- uvx taplo fmt --check --diff
- uvx ruff check
- uvx ruff format --check
- pyright (0 errors, 1 warning for missing geopandas stubs)
- npm run build
- uv run scripts/generate_gallery_examples.py --dry-run
Testing
Usage
Regenerate Gallery Examples Registry
Configuration
Edit _data/gallery_examples.toml to:
Output Format
The generated gallery_examples.json follows this structure:
{
  "name": "gallery-examples",
  "title": "Vega Ecosystem Gallery Examples Registry",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "created": "2025-10-26T00:35:42.508794+00:00",
  "datapackage": {
    "name": "vega-datasets",
    "version": "3.2.1",
    "path": "./datapackage.json"
  },
  "examples": [
    {
      "id": 1,
      "gallery_name": "altair",
      "example_name": "2D Histogram Heatmap",
      "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
      "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/examples_methods_syntax/histogram_heatmap.py",
      "categories": ["Distributions"],
      "description": "This example shows how to make a heatmap from binned quantitative data.",
      "datasets_used": ["movies"]
    },
    ...
  ]
}
Design Decisions
1. Altair Name Mapping Strategy
Challenge: Altair's Python API uses camelCase names (e.g., data.londonBoroughs.url) while datapackage.json uses snake_case canonical names (e.g., london_boroughs).
Solution: Explicit mappings in _data/gallery_examples.toml:
Rationale: Transparent, maintainable, and documented as temporary until Altair naming aligns or dataset mapper includes extension-less variants.
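How such a mapping might be applied during extraction is sketched below. The regular expression and the function name extract_dataset_names are assumptions made for illustration and are not taken from the generator script; only the londonBoroughs → london_boroughs pair comes from this PR.

```python
import re
from typing import Optional

# Matches data.<name>.url and data.<name>(...) style references.
_DATA_REF = re.compile(r"\bdata\.([A-Za-z_]\w*)(?=\.url\b|\()")


def extract_dataset_names(
    source: str, name_mapping: Optional[dict[str, str]] = None
) -> list[str]:
    """Return canonical dataset names referenced in Altair example source."""
    mapping = name_mapping or {}
    # Names without a mapping entry are assumed to be canonical already.
    found = {mapping.get(name, name) for name in _DATA_REF.findall(source)}
    return sorted(found)
```

Passing the mapping as an optional argument means the same function works with an empty mapping once the API names and the canonical names agree.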
2. Committed Generated Artifact
Decision: gallery_examples.json is committed to the repository (like datapackage.json).
Rationale:
3. Type Safety with TypedDict
Decision: Use TypedDict throughout instead of dataclasses.
Rationale:
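For illustration, a TypedDict for one registry entry could look like the sketch below. The class name GalleryExample is hypothetical; the field names and sample values are taken from the output format shown in this PR.

```python
from typing import TypedDict


class GalleryExample(TypedDict):
    """One entry of the "examples" array in gallery_examples.json."""

    id: int
    gallery_name: str
    example_name: str
    example_url: str
    spec_url: str
    categories: list[str]
    description: str
    datasets_used: list[str]


example: GalleryExample = {
    "id": 1,
    "gallery_name": "altair",
    "example_name": "2D Histogram Heatmap",
    "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
    "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/examples_methods_syntax/histogram_heatmap.py",
    "categories": ["Distributions"],
    "description": "This example shows how to make a heatmap from binned quantitative data.",
    "datasets_used": ["movies"],
}
```

A TypedDict value is an ordinary dict at runtime, so a record like this can be passed to json.dumps without conversion while Pyright still checks its keys and value types.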
4. Graceful Degradation for External Datasets
Decision: Log warnings for unknown datasets but don't fail processing.
Rationale:
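The warn-and-continue behavior in decision 4 can be sketched with a small validator protocol. The method signature, the KnownNamesValidator class, and the choice to drop unknown names from the result are all assumptions made for this sketch; the protocol in the PR may look different.

```python
import logging
from typing import Protocol

logger = logging.getLogger("gallery_examples")


class DatasetValidator(Protocol):
    """Hypothetical shape; the PR's actual protocol may differ."""

    def is_known(self, name: str) -> bool: ...


class KnownNamesValidator:
    """Accepts only names present in a fixed set of canonical names."""

    def __init__(self, known: set[str]) -> None:
        self._known = known

    def is_known(self, name: str) -> bool:
        return name in self._known


def filter_datasets(names: list[str], validator: DatasetValidator) -> list[str]:
    """Keep known datasets; warn about, but do not fail on, unknown ones."""
    kept: list[str] = []
    for name in names:
        if validator.is_known(name):
            kept.append(name)
        else:
            logger.warning("Unknown dataset reference: %s", name)
    return kept
```

Because the protocol is structural, any object with a matching is_known method can be passed in, which is what makes alternative validation strategies possible without subclassing.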
Breaking Changes
None - this is purely additive.
Related Work
datapackage.json by adding usage perspective
Future Enhancements
Considered but deferred:
Extension points:
- DatasetValidator protocol enables custom validation strategies
- DatasetReference type alias has documented migration path to object structure
Checklist
Implementation Quality ✅
Decisions Needed 🤔
Before Marking Ready for Review