Add Model Registry & Storage #18

alicup29 · 2025-11-06T20:47:49Z

Model Registry & Storage

This PR implements a model registry and the first stage of a checkpoint recovery system. The registry tracks trained models on workers, enables model discovery, and supports resuming training after a connection loss.

Overview

The model registry solves two key problems:

Model Identification: Models are now identified by deterministic hash-based IDs, making it easy to reference specific trained models
Checkpoint Recovery: Training can resume from checkpoints if connections drop during training

Key Features

1. Model Registry System

Hash-based Model IDs: 8-character SHA256 hashes generated from config + dataset + run name
Persistent Storage: JSON/YAML registry file at models/.registry/manifest.json
Model Metadata Tracking:
- Status (training, completed, interrupted, failed)
- Checkpoint paths
- Training metrics
- Timestamps and run information
- GPU model and dataset info
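
As an illustration of the hash-based IDs described above, the following is a minimal sketch, not the code in model_registry.py. The function names `make_model_id` and `model_dir_name` and the choice to serialize the inputs as sorted JSON are assumptions; only the SHA256 digest, the 8-character length, and the config + dataset + run name inputs come from this PR.

```python
import hashlib
import json


def make_model_id(config: dict, dataset_name: str, run_name: str, length: int = 8) -> str:
    """Return a short deterministic ID derived from config + dataset + run name.

    Hypothetical helper: the real registry may serialize its inputs differently.
    """
    # sort_keys makes the serialization independent of dict insertion order,
    # so the same logical config always hashes to the same ID.
    payload = json.dumps(
        {"config": config, "dataset": dataset_name, "run_name": run_name},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def model_dir_name(model_type: str, model_id: str) -> str:
    """Build the {model_type}_{model_id} directory name used for new models."""
    return f"{model_type}_{model_id}"
```

Because the ID depends only on its inputs, resubmitting the same config, dataset, and run name yields the same ID, which is what lets a worker match a resubmitted job to an interrupted one.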

2. Remote Registry Queries

Client-to-Worker Queries: Clients can query remote worker registries via WebRTC
Worker Discovery: Automatic worker discovery in rooms using discover_peers
Dual-Mode CLI: Commands work both locally (on worker) and remotely (via network)

3. Enhanced CLI Commands

List Models:

# Local (on worker machine)
sleap-rtc list-models
sleap-rtc list-models --status completed --model-type centroid

# Remote (from client)
sleap-rtc list-models --room-id ROOM --token TOKEN
sleap-rtc list-models --session-string SESSION

Model Info:

# Local
sleap-rtc model-info a3f5e8c9

# Remote  
sleap-rtc model-info a3f5e8c9 --room-id ROOM --token TOKEN

Model ID-Based Inference:

# Use model ID instead of paths
sleap-rtc client-track \
  --model a3f5e8c9 \
  --data_path video.slp \
  --room-id ROOM --token TOKEN

4. Checkpoint Recovery

Models marked as "interrupted" on connection loss
Training automatically resumes from last checkpoint when same config is submitted again
Uses PyTorch Lightning's native checkpoint recovery (resume_ckpt_path)

Implementation Details

New Files

sleap_rtc/worker/model_registry.py (432 lines) - Core registry implementation
sleap_rtc/client/registry_query.py (295 lines) - Client-side remote query support
tests/worker/test_model_registry.py (650+ lines) - Comprehensive test suite

Modified Files

sleap_rtc/worker/worker_class.py - Registry integration, training lifecycle tracking
sleap_rtc/cli.py - New registry commands with remote query support

Registry Message Protocol

REGISTRY_QUERY_LIST::<filters_json> - Query for model list
REGISTRY_QUERY_INFO::<model_id> - Query for specific model info
REGISTRY_RESPONSE_LIST::<response_json> - List response
REGISTRY_RESPONSE_INFO::<response_json> - Info response
REGISTRY_RESPONSE_ERROR::<error_json> - Error response
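
The messages above share a `TYPE::payload` shape, so encoding and decoding can be sketched as below. This is an assumption-laden illustration: `encode_message`, `decode_message`, and `JSON_PAYLOAD_TYPES` are hypothetical names, and the actual handlers in worker_class.py and registry_query.py may parse messages differently.

```python
import json

SEPARATOR = "::"

# Message types whose payload is JSON; REGISTRY_QUERY_INFO carries a bare model ID.
JSON_PAYLOAD_TYPES = {
    "REGISTRY_QUERY_LIST",
    "REGISTRY_RESPONSE_LIST",
    "REGISTRY_RESPONSE_INFO",
    "REGISTRY_RESPONSE_ERROR",
}


def encode_message(msg_type: str, payload) -> str:
    """Serialize a registry message as '<TYPE>::<payload>'."""
    body = json.dumps(payload) if msg_type in JSON_PAYLOAD_TYPES else str(payload)
    return f"{msg_type}{SEPARATOR}{body}"


def decode_message(raw: str):
    """Split a raw data-channel string into (type, payload)."""
    # partition() splits on the first '::' only, so a payload that itself
    # contains '::' is left intact.
    msg_type, sep, body = raw.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"not a registry message: {raw!r}")
    if msg_type in JSON_PAYLOAD_TYPES:
        return msg_type, json.loads(body)
    return msg_type, body
```

For example, `decode_message("REGISTRY_QUERY_INFO::a3f5e8c9")` returns the message type and the model ID as a plain string, while list queries round-trip their filter dictionary through JSON.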

Use Cases Enabled

1. Client Discovers and Uses Remote Models

# 1. Query available models
sleap-rtc list-models --room-id ROOM --token TOKEN

# 2. Get model details
sleap-rtc model-info a3f5e8c9 --room-id ROOM --token TOKEN

# 3. Run inference using model ID
sleap-rtc client-track --model a3f5e8c9 --data video.slp --room-id ROOM --token TOKEN

2. Worker Provides Model Options

Workers can respond to registry queries, allowing clients to discover what models are available before running inference.

3. Training Resumption After Failure

If connection drops during training:

Worker marks model as "interrupted" in registry
Client reconnects and submits same training package
Worker detects interrupted job and resumes from checkpoint
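
The resumption decision in the steps above can be sketched as a lookup against the registry. The entry layout here (a dict keyed by model ID with `status` and `checkpoint_path` fields) and the function name `find_resume_checkpoint` are assumptions made for illustration, not the schema used by this PR.

```python
from typing import Optional


def find_resume_checkpoint(registry: dict, model_id: str) -> Optional[str]:
    """Return a checkpoint path to resume from, or None to train from scratch.

    Hypothetical schema: registry maps model_id -> {"status": ..., "checkpoint_path": ...}.
    """
    entry = registry.get(model_id)
    if entry is None:
        return None  # unknown job: start a fresh training run
    if entry.get("status") != "interrupted":
        return None  # completed, failed, or still training: nothing to resume
    return entry.get("checkpoint_path")
```

A non-None result would then be handed to the trainer as the resume checkpoint (the PR mentions resume_ckpt_path for this); a None result means the job trains from the beginning.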

Testing

Test Coverage

22/24 tests passing in test_model_registry.py
2 skipped (YAML-specific, PyYAML not required)

Manual Testing Performed

✅ Local registry queries
✅ Remote registry queries via WebRTC
✅ Model completion status tracking
✅ Worker discovery and auto-selection
✅ Model ID-based inference
✅ Registry file persistence and corruption recovery

Bug Fixes Included

Issue 1: Model Completion Status

Problem: Models stuck in "training" status after completion
Cause: Loop variable scope issue - self.current_model_id overwritten before mark_completed() called
Fix: Use local variable current_training_model_id within loop scope
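
A toy reproduction of this class of bug, under stated assumptions: `ToyWorker` and its methods are hypothetical, and `_train` clearing the attribute is only a stand-in for whatever overwrote `self.current_model_id` in the real worker. The point is that a loop-local variable cannot be changed by other code paths, while a shared instance attribute can.

```python
class ToyWorker:
    """Hypothetical worker used only to illustrate the scope issue."""

    def __init__(self):
        self.current_model_id = None
        self.status = {}

    def _train(self, model_id):
        # Stand-in for training. As a side effect it resets shared state,
        # mimicking another code path overwriting the attribute mid-job.
        self.current_model_id = None

    def run_jobs_buggy(self, model_ids):
        for model_id in model_ids:
            self.current_model_id = model_id
            self.status[model_id] = "training"
            self._train(model_id)
            # The attribute no longer holds this job's ID, so the
            # completion update is skipped and the status stays "training".
            if self.current_model_id is not None:
                self.status[self.current_model_id] = "completed"

    def run_jobs_fixed(self, model_ids):
        for model_id in model_ids:
            # A local variable keeps the ID stable for the whole iteration.
            current_training_model_id = model_id
            self.current_model_id = model_id
            self.status[current_training_model_id] = "training"
            self._train(model_id)
            self.status[current_training_model_id] = "completed"
```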

Issue 2: Client Registry Access

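Problem: CLI commands run on a client machine could access, or accidentally create, a local registry
Cause: No check that the registry directory exists before attempting a local query
Fix: Check that the registry exists before local access; otherwise show an error directing users to the remote query options (--room-id/--token or --session-string)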

Issue 3: Worker Discovery Message

Problem: Used discover_workers message type not supported by signaling server
Cause: Wrong message type
Fix: Changed to discover_peers with role filtering

Issue 4: Async/Await in Anonymous Signin

Problem: Synchronous requests.post() called in async function
Cause: Blocking I/O in async context
Fix: Use the event loop's run_in_executor() (loop.run_in_executor()) to run the blocking call in a thread pool
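
A minimal sketch of this pattern, assuming nothing about the real signin code: `blocking_signin` is a hypothetical stand-in for the synchronous `requests.post()` call so that the example runs without network access. Note that `run_in_executor` is a method of the event loop, obtained here with `asyncio.get_running_loop()`.

```python
import asyncio
import time


def blocking_signin(delay: float = 0.05) -> dict:
    """Stand-in for a synchronous HTTP call such as requests.post()."""
    time.sleep(delay)  # blocks the calling thread
    return {"status": "ok"}


async def signin() -> dict:
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor, so the
    # blocking call runs in a worker thread while the event loop stays free.
    return await loop.run_in_executor(None, blocking_signin)


async def main() -> dict:
    # The sleep runs concurrently with the signin instead of waiting behind it.
    result, _ = await asyncio.gather(signin(), asyncio.sleep(0.01))
    return result
```

Running `asyncio.run(main())` returns the dictionary produced by the blocking function.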

Breaking Changes

None - this is entirely new functionality.

Migration Notes

Registry is created automatically on first training job
Existing model directories are not affected
New models will use hash-based directory names: {model_type}_{model_id}/

Future Work

Parse training logs to extract actual validation loss metrics
Track actual epoch numbers during training for better resumption
Support for model deletion/archival in registry
Registry synchronization across multiple workers
Web UI for registry browsing

Related Issues

Addresses the model identification and checkpoint recovery concerns discussed in initial requirements.

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

Implements a comprehensive model registry for tracking trained SLEAP models with automatic checkpoint recovery on connection failures.

## Model Registry
- Hash-based model identification (SHA256, 8-char IDs)
- JSON/YAML registry format support with auto-detection
- Atomic writes with corruption recovery
- Model lifecycle tracking (training → completed/interrupted)
- Metadata storage (dataset, GPU, metrics, timestamps)

## Worker Integration
- Registry initialization in RTCWorkerClient
- Hash-based directory naming: {model_type}_{model_id}
- Model registration at training start
- Automatic checkpoint resumption for interrupted jobs
- Connection drop detection with mark_interrupted
- Training completion tracking with metrics

## Checkpoint Recovery
- Detects WebRTC connection failures during training
- Marks models as interrupted with last checkpoint path
- Resumes from best.ckpt on reconnection
- Leverages PyTorch Lightning's native checkpoint support

## Testing
- 22/24 unit tests passing for ModelRegistry class
- Test coverage: initialization, CRUD ops, status transitions, corruption recovery, atomic writes, hash collisions

Related: openspec/changes/add-model-registry
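
One common way to get atomic writes with corruption recovery for a JSON manifest is to write to a temporary file and rename it over the target. The sketch below shows that technique under stated assumptions: `save_registry`, `load_registry`, the `{"models": {}}` empty layout, and the `.corrupt` backup suffix are hypothetical and may not match what ModelRegistry does.

```python
import json
import os
import tempfile
from pathlib import Path


def save_registry(path: Path, data: dict) -> None:
    """Write the manifest so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic rename over the old manifest
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_registry(path: Path) -> dict:
    """Load the manifest, falling back to an empty registry if missing or corrupt."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"models": {}}
    except json.JSONDecodeError:
        # Keep the damaged file for inspection instead of overwriting it.
        os.replace(path, Path(str(path) + ".corrupt"))
        return {"models": {}}
```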


Enables clients to query worker registries via WebRTC data channels, supporting the workflow where clients discover available models before running inference.

Worker-side changes:
- Add REGISTRY_QUERY_LIST handler for listing models with filters
- Add REGISTRY_QUERY_INFO handler for detailed model information
- Add REGISTRY_RESPONSE_* messages for query responses
- Track training job hash (MD5) for package identification
- Mark models as interrupted on connection loss

Client-side changes:
- Add RegistryQueryClient for WebRTC-based registry queries
- Add query_registry_list() and query_registry_info() helpers
- Support both session string and room-based connections

CLI changes:
- Update list-models with --session-string, --room-id, --token options
- Update model-info with same remote connection options
- Add --model flag to client-track for model ID-based inference
- Support dual-mode (local/remote) for all registry commands

This enables the key use cases:
1. Client queries available models: sleap-rtc list-models --room-id ROOM --token TOKEN
2. Client views model details: sleap-rtc model-info a3f5e8c9 --session-string SESSION
3. Client runs inference by ID: sleap-rtc client-track --model a3f5e8c9 --data video.slp

Fixes two critical bugs found during manual testing:

1. Client-side registry access bug:
- CLI commands now check if registry directory exists before attempting local queries
- Provides helpful error message directing users to use remote query options (--room-id/--token or --session-string)
- Prevents clients from accidentally accessing/creating local registries

2. Model completion status bug:
- Fixed scope issue where self.current_model_id was being overwritten in training loop before mark_completed() was called
- Now uses local variable current_training_model_id within loop scope
- Added better error handling and logging for completion marking

Testing notes:
- Models should now correctly show "completed" status after training
- Client machines should not be able to query local registries
- Only worker machines with existing registries can use local queries

Fixes multiple issues discovered during testing:

1. Worker discovery message fix:
- Changed from 'discover_workers' to 'list_peers' message type
- Filters peer list for role='worker'
- Compatible with existing signaling server implementation
- Better error messages when discovery fails

2. Async/await fix in anonymous signin:
- Use asyncio.run_in_executor() for synchronous requests.post()
- Prevents blocking in async context
- Properly awaits HTTP response

3. Improved CLI validation messages:
- Better error messages when registry directory not found
- Added info logging when using local registry
- Clarifies when to use remote vs local queries
- Helps distinguish between testing scenarios

Testing notes:
- Remote queries now work with: SLEAP_RTC_ENV=development uv run sleap-rtc list-models --room-id ROOM --token TOKEN
- Worker discovery uses standard list_peers message
- Local testing on same machine shows clear logging

The signaling server uses 'discover_peers' message type, not 'list_peers'. This matches the implementation in client_class.py.

Registry query clients close their connections after receiving responses, which was causing workers to enter reconnect mode and eventually shut down.

Solution:
- Added is_query_only_connection flag to track registry-only connections
- Set flag when receiving REGISTRY_QUERY_LIST or REGISTRY_QUERY_INFO messages
- In ICE connection state handler, detect query-only disconnects and handle gracefully
- Query-only connections clean up quietly and return worker to "available" status
- Worker continues running and accepting new connections

This allows clients to query registries multiple times without disrupting the worker.

The previous fix called `await self.pc.close()` on a peer connection that was already in "closed" state (closed by the client). This corrupted the peer connection's internal state, causing subsequent connection attempts to fail with "'int' object is not callable" errors.

Solution:
- Remove the `await self.pc.close()` call
- The connection is already closed by the client - we're just detecting it
- Worker's peer connection object can be reused for new offers when in closed state
- Manually closing it again causes state corruption

The worker now correctly:
- Handles first query-only connection ✅
- Returns to available state ✅
- Accepts subsequent query connections ✅
- Reuses the same peer connection object without corruption ✅

alicup29 and others added 6 commits November 6, 2025 10:29

feat: Add CLI commands for model registry management

13fa99c


alicup29 changed the title ~~Add model registry and checkpoint recovery system~~ Add Model Registry & Storage Nov 6, 2025

alicup29 and others added 2 commits November 6, 2025 14:19
