-
Notifications
You must be signed in to change notification settings - Fork 0
Add Model Registry & Storage #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
alicup29
wants to merge
8
commits into
main
Choose a base branch
from
amick/model-registry
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Implements a comprehensive model registry for tracking trained SLEAP models
with automatic checkpoint recovery on connection failures.
## Model Registry
- Hash-based model identification (SHA256, 8-char IDs)
- JSON/YAML registry format support with auto-detection
- Atomic writes with corruption recovery
- Model lifecycle tracking (training → completed/interrupted)
- Metadata storage (dataset, GPU, metrics, timestamps)
## Worker Integration
- Registry initialization in RTCWorkerClient
- Hash-based directory naming: {model_type}_{model_id}
- Model registration at training start
- Automatic checkpoint resumption for interrupted jobs
- Connection drop detection with mark_interrupted
- Training completion tracking with metrics
## Checkpoint Recovery
- Detects WebRTC connection failures during training
- Marks models as interrupted with last checkpoint path
- Resumes from best.ckpt on reconnection
- Leverages PyTorch Lightning's native checkpoint support
## Testing
- 22/24 unit tests passing for ModelRegistry class
- Test coverage: initialization, CRUD ops, status transitions,
corruption recovery, atomic writes, hash collisions
Related: openspec/changes/add-model-registry
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Enables clients to query worker registries via WebRTC data channels, supporting the workflow where clients discover available models before running inference. Worker-side changes: - Add REGISTRY_QUERY_LIST handler for listing models with filters - Add REGISTRY_QUERY_INFO handler for detailed model information - Add REGISTRY_RESPONSE_* messages for query responses - Track training job hash (MD5) for package identification - Mark models as interrupted on connection loss Client-side changes: - Add RegistryQueryClient for WebRTC-based registry queries - Add query_registry_list() and query_registry_info() helpers - Support both session string and room-based connections CLI changes: - Update list-models with --session-string, --room-id, --token options - Update model-info with same remote connection options - Add --model flag to client-track for model ID-based inference - Support dual-mode (local/remote) for all registry commands This enables the key use cases: 1. Client queries available models: sleap-rtc list-models --room-id ROOM --token TOKEN 2. Client views model details: sleap-rtc model-info a3f5e8c9 --session-string SESSION 3. Client runs inference by ID: sleap-rtc client-track --model a3f5e8c9 --data video.slp 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Fixes two critical bugs found during manual testing:
1. Client-side registry access bug:
- CLI commands now check if registry directory exists before attempting
local queries
- Provides helpful error message directing users to use remote query
options (--room-id/--token or --session-string)
- Prevents clients from accidentally accessing/creating local registries
2. Model completion status bug:
- Fixed scope issue where self.current_model_id was being overwritten
in training loop before mark_completed() was called
- Now uses local variable current_training_model_id within loop scope
- Added better error handling and logging for completion marking
Testing notes:
- Models should now correctly show "completed" status after training
- Client machines should not be able to query local registries
- Only worker machines with existing registries can use local queries
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Fixes multiple issues discovered during testing: 1. Worker discovery message fix: - Changed from 'discover_workers' to 'list_peers' message type - Filters peer list for role='worker' - Compatible with existing signaling server implementation - Better error messages when discovery fails 2. Async/await fix in anonymous signin: - Use asyncio.run_in_executor() for synchronous requests.post() - Prevents blocking in async context - Properly awaits HTTP response 3. Improved CLI validation messages: - Better error messages when registry directory not found - Added info logging when using local registry - Clarifies when to use remote vs local queries - Helps distinguish between testing scenarios Testing notes: - Remote queries now work with: SLEAP_RTC_ENV=development uv run sleap-rtc list-models --room-id ROOM --token TOKEN - Worker discovery uses standard list_peers message - Local testing on same machine shows clear logging 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
The signaling server uses 'discover_peers' message type, not 'list_peers'. This matches the implementation in client_class.py. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Registry query clients close their connections after receiving responses, which was causing workers to enter reconnect mode and eventually shut down. Solution: - Added is_query_only_connection flag to track registry-only connections - Set flag when receiving REGISTRY_QUERY_LIST or REGISTRY_QUERY_INFO messages - In ICE connection state handler, detect query-only disconnects and handle gracefully - Query-only connections clean up quietly and return worker to "available" status - Worker continues running and accepting new connections This allows clients to query registries multiple times without disrupting the worker. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
The previous fix called `await self.pc.close()` on a peer connection that was already in "closed" state (closed by the client). This corrupted the peer connection's internal state, causing subsequent connection attempts to fail with "'int' object is not callable" errors. Solution: - Remove the `await self.pc.close()` call - The connection is already closed by the client - we're just detecting it - Worker's peer connection object can be reused for new offers when in closed state - Manually closing it again causes state corruption The worker now correctly: - Handles first query-only connection ✅ - Returns to available state ✅ - Accepts subsequent query connections ✅ - Reuses the same peer connection object without corruption ✅ 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Model Registry & Storage
This PR implements a comprehensive model registry and the beginning to a checkpoint recovery system for tracking trained models on workers, enabling model discovery, and supporting training resumption after connection loss.
Overview
The model registry solves two key problems:
Key Features
1. Model Registry System
models/.registry/manifest.json2. Remote Registry Queries
discover_peers3. Enhanced CLI Commands
List Models:
Model Info:
Model ID-Based Inference:
# Use model ID instead of paths sleap-rtc client-track \ --model a3f5e8c9 \ --data_path video.slp \ --room-id ROOM --token TOKEN4. Checkpoint Recovery
resume_ckpt_path)Implementation Details
New Files
sleap_rtc/worker/model_registry.py(432 lines) - Core registry implementationsleap_rtc/client/registry_query.py(295 lines) - Client-side remote query supporttests/worker/test_model_registry.py(650+ lines) - Comprehensive test suiteModified Files
sleap_rtc/worker/worker_class.py- Registry integration, training lifecycle trackingsleap_rtc/cli.py- New registry commands with remote query supportRegistry Message Protocol
REGISTRY_QUERY_LIST::<filters_json>- Query for model listREGISTRY_QUERY_INFO::<model_id>- Query for specific model infoREGISTRY_RESPONSE_LIST::<response_json>- List responseREGISTRY_RESPONSE_INFO::<response_json>- Info responseREGISTRY_RESPONSE_ERROR::<error_json>- Error responseUse Cases Enabled
1. Client Discovers and Uses Remote Models
2. Worker Provides Model Options
Workers can respond to registry queries, allowing clients to discover what models are available before running inference.
3. Training Resumption After Failure
If connection drops during training:
Testing
Test Coverage
test_model_registry.pyManual Testing Performed
Bug Fixes Included
Issue 1: Model Completion Status
Problem: Models stuck in "training" status after completion
Cause: Loop variable scope issue -
self.current_model_idoverwritten beforemark_completed()calledFix: Use local variable
current_training_model_idwithin loop scopeIssue 2: Client Registry Access
Problem: Clients could access local registry on same machine
Cause: No validation that registry directory should exist
Fix: Check registry exists before local access, provide helpful error messages
Issue 3: Worker Discovery Message
Problem: Used
discover_workersmessage type not supported by signaling serverCause: Wrong message type
Fix: Changed to
discover_peerswith role filteringIssue 4: Async/Await in Anonymous Signin
Problem: Synchronous
requests.post()called in async functionCause: Blocking I/O in async context
Fix: Use
asyncio.run_in_executor()to run in thread poolBreaking Changes
None - this is entirely new functionality.
Migration Notes
{model_type}_{model_id}/Future Work
Related Issues
Addresses the model identification and checkpoint recovery concerns discussed in initial requirements.
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]