Collaborator

@alicup29 alicup29 commented Nov 6, 2025

Model Registry & Storage

This PR implements a model registry and the first stage of a checkpoint recovery system. The registry tracks trained models on workers, enables model discovery, and supports resuming training after a connection loss.

Overview

The model registry solves two key problems:

  1. Model Identification: Models are now identified by deterministic hash-based IDs, making it easy to reference specific trained models
  2. Checkpoint Recovery: Training can resume from checkpoints if connections drop during training

Key Features

1. Model Registry System

  • Hash-based Model IDs: 8-character IDs taken from a SHA256 hash of config + dataset + run name
  • Persistent Storage: Registry file in JSON or YAML format (e.g. models/.registry/manifest.json)
  • Model Metadata Tracking:
    • Status (training, completed, interrupted, failed)
    • Checkpoint paths
    • Training metrics
    • Timestamps and run information
    • GPU model and dataset info
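
The ID scheme described above can be sketched as follows. This is a minimal illustration, not the code in model_registry.py: the function name `generate_model_id`, the payload layout, and the choice to serialize with sorted JSON keys are all assumptions made for the example.

```python
import hashlib
import json


def generate_model_id(config: dict, dataset_name: str, run_name: str) -> str:
    """Return an 8-character hex ID derived from a SHA256 digest.

    Hypothetical helper: the real registry may combine its inputs differently.
    """
    # Serialize with sorted keys so that equal configs always hash identically,
    # regardless of the order in which their keys were inserted.
    payload = json.dumps(
        {"config": config, "dataset": dataset_name, "run_name": run_name},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Keep only the first 8 hex characters as the short model ID.
    return digest[:8]


model_id = generate_model_id(
    {"model_type": "centroid", "lr": 1e-4}, "video.slp", "run1"
)
print(model_id)
```

Because the ID depends only on its inputs, resubmitting the same config, dataset, and run name yields the same ID, which is what lets a worker recognize a previously interrupted job.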

2. Remote Registry Queries

  • Client-to-Worker Queries: Clients can query remote worker registries via WebRTC
  • Worker Discovery: Automatic worker discovery in rooms using discover_peers
  • Dual-Mode CLI: Commands work both locally (on worker) and remotely (via network)

3. Enhanced CLI Commands

List Models:

# Local (on worker machine)
sleap-rtc list-models
sleap-rtc list-models --status completed --model-type centroid

# Remote (from client)
sleap-rtc list-models --room-id ROOM --token TOKEN
sleap-rtc list-models --session-string SESSION

Model Info:

# Local
sleap-rtc model-info a3f5e8c9

# Remote  
sleap-rtc model-info a3f5e8c9 --room-id ROOM --token TOKEN

Model ID-Based Inference:

# Use model ID instead of paths
sleap-rtc client-track \
  --model a3f5e8c9 \
  --data_path video.slp \
  --room-id ROOM --token TOKEN

4. Checkpoint Recovery

  • Models are marked as "interrupted" on connection loss
  • Training automatically resumes from the last checkpoint when the same config is submitted again
  • Uses PyTorch Lightning's native checkpoint recovery (resume_ckpt_path)
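
The resume decision might look roughly like the sketch below. The registry layout (a dict keyed by model ID with `status` and `checkpoint_path` fields) and the helper name `find_resume_checkpoint` are assumptions for illustration; the actual lookup lives in the worker and may differ.

```python
import os
from typing import Callable, Dict, Optional


def find_resume_checkpoint(
    registry: Dict[str, dict],
    model_id: str,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Return a checkpoint path to resume from, or None to train from scratch.

    Hypothetical helper: field names are assumed, not taken from the PR.
    """
    entry = registry.get(model_id)
    # Only interrupted jobs are candidates for resumption.
    if entry is None or entry.get("status") != "interrupted":
        return None
    ckpt = entry.get("checkpoint_path")
    # Guard against a registry entry that points at a missing file.
    if ckpt and exists(ckpt):
        return ckpt
    return None


registry = {
    "a3f5e8c9": {
        "status": "interrupted",
        "checkpoint_path": "models/centroid_a3f5e8c9/best.ckpt",
    },
    "b1c2d3e4": {"status": "completed", "checkpoint_path": "models/x/best.ckpt"},
}
# The file check is stubbed out here so the example runs without real files.
print(find_resume_checkpoint(registry, "a3f5e8c9", exists=lambda p: True))
```

The returned path would then be handed to the training configuration (the PR names resume_ckpt_path for this purpose).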

Implementation Details

New Files

  • sleap_rtc/worker/model_registry.py (432 lines) - Core registry implementation
  • sleap_rtc/client/registry_query.py (295 lines) - Client-side remote query support
  • tests/worker/test_model_registry.py (650+ lines) - Comprehensive test suite

Modified Files

  • sleap_rtc/worker/worker_class.py - Registry integration, training lifecycle tracking
  • sleap_rtc/cli.py - New registry commands with remote query support

Registry Message Protocol

  • REGISTRY_QUERY_LIST::<filters_json> - Query for model list
  • REGISTRY_QUERY_INFO::<model_id> - Query for specific model info
  • REGISTRY_RESPONSE_LIST::<response_json> - List response
  • REGISTRY_RESPONSE_INFO::<response_json> - Info response
  • REGISTRY_RESPONSE_ERROR::<error_json> - Error response
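
A minimal sketch of how these `TYPE::payload` messages could be encoded, parsed, and dispatched. Only the message type names and the `::` separator come from the list above; the helper names, the in-memory registry layout, and the JSON payload shapes (`models`, `id`, `error`) are assumptions.

```python
import json
from typing import Dict, Tuple

SEPARATOR = "::"


def encode_message(msg_type: str, payload) -> str:
    """Build a 'TYPE::payload' string; non-string payloads are JSON-encoded."""
    body = payload if isinstance(payload, str) else json.dumps(payload)
    return f"{msg_type}{SEPARATOR}{body}"


def parse_message(raw: str) -> Tuple[str, str]:
    """Split on the first separator only, so '::' inside the payload survives."""
    msg_type, _, body = raw.partition(SEPARATOR)
    return msg_type, body


def handle_registry_query(raw: str, registry: Dict[str, dict]) -> str:
    """Hypothetical worker-side dispatcher over an in-memory registry dict."""
    msg_type, body = parse_message(raw)
    if msg_type == "REGISTRY_QUERY_LIST":
        filters = json.loads(body) if body else {}
        # Keep models whose metadata matches every requested filter.
        models = [
            {"id": mid, **meta}
            for mid, meta in registry.items()
            if all(meta.get(k) == v for k, v in filters.items())
        ]
        return encode_message("REGISTRY_RESPONSE_LIST", {"models": models})
    if msg_type == "REGISTRY_QUERY_INFO":
        if body in registry:
            return encode_message(
                "REGISTRY_RESPONSE_INFO", {"id": body, **registry[body]}
            )
        return encode_message(
            "REGISTRY_RESPONSE_ERROR", {"error": f"unknown model {body}"}
        )
    return encode_message(
        "REGISTRY_RESPONSE_ERROR", {"error": f"unsupported message {msg_type}"}
    )


demo_registry = {"a3f5e8c9": {"status": "completed", "model_type": "centroid"}}
print(handle_registry_query("REGISTRY_QUERY_INFO::a3f5e8c9", demo_registry))
```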

Use Cases Enabled

1. Client Discovers and Uses Remote Models

# 1. Query available models
sleap-rtc list-models --room-id ROOM --token TOKEN

# 2. Get model details
sleap-rtc model-info a3f5e8c9 --room-id ROOM --token TOKEN

# 3. Run inference using model ID
sleap-rtc client-track --model a3f5e8c9 --data video.slp --room-id ROOM --token TOKEN

2. Worker Provides Model Options

Workers can respond to registry queries, allowing clients to discover what models are available before running inference.

3. Training Resumption After Failure

If connection drops during training:

  1. Worker marks model as "interrupted" in registry
  2. Client reconnects and submits same training package
  3. Worker detects interrupted job and resumes from checkpoint

Testing

Test Coverage

  • 22/24 tests passing in test_model_registry.py
  • 2 skipped (YAML-specific, PyYAML not required)

Manual Testing Performed

  • ✅ Local registry queries
  • ✅ Remote registry queries via WebRTC
  • ✅ Model completion status tracking
  • ✅ Worker discovery and auto-selection
  • ✅ Model ID-based inference
  • ✅ Registry file persistence and corruption recovery
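
For readers unfamiliar with the persistence pattern, atomic writes with corruption recovery can be sketched like this. It is an illustration under assumptions, not the registry's actual code: the function names, the `{"models": {}}` empty layout, and the `.corrupt` backup suffix are all hypothetical.

```python
import json
import os
import tempfile


def save_registry(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_registry(path: str) -> dict:
    """Load the registry, falling back to an empty one if missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"models": {}}
    except json.JSONDecodeError:
        # Set the unreadable file aside for inspection and start fresh.
        os.replace(path, path + ".corrupt")
        return {"models": {}}
```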

Bug Fixes Included

Issue 1: Model Completion Status

Problem: Models stayed stuck in "training" status after training completed
Cause: Scope issue in the training loop: self.current_model_id was overwritten before mark_completed() was called
Fix: Use a local variable, current_training_model_id, within the loop scope
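
The pattern behind this fix can be illustrated with a stripped-down sketch. The `Registry` and `Worker` classes here are stand-ins written for the example, and `_train` only simulates other code clobbering the shared attribute; only the names `current_model_id`, `current_training_model_id`, and `mark_completed` come from the PR.

```python
class Registry:
    """Stand-in registry that only tracks a status string per model ID."""

    def __init__(self):
        self.status = {}

    def register(self, model_id):
        self.status[model_id] = "training"

    def mark_completed(self, model_id):
        self.status[model_id] = "completed"


class Worker:
    def __init__(self, registry):
        self.registry = registry
        self.current_model_id = None

    def _train(self, model_id):
        # Simulates code elsewhere overwriting the shared instance attribute
        # while training is still in progress.
        self.current_model_id = None

    def run_jobs(self, model_ids):
        for model_id in model_ids:
            # Keep the ID in a loop-local variable so later writes to
            # self.current_model_id cannot change which model gets marked.
            current_training_model_id = model_id
            self.current_model_id = model_id
            self.registry.register(model_id)
            self._train(model_id)
            self.registry.mark_completed(current_training_model_id)


worker = Worker(Registry())
worker.run_jobs(["a3f5e8c9", "b1c2d3e4"])
print(worker.registry.status)
```

Had `mark_completed` been called with `self.current_model_id` instead, it would have received the overwritten value and the real model would have remained in "training".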

Issue 2: Client Registry Access

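Problem: CLI commands run on a client machine could access (or accidentally create) a local registry instead of querying the worker
Cause: No check that the registry directory exists before a local query
Fix: Check that the registry exists before local access, and show an error message directing users to the remote query options (--room-id/--token or --session-string)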

Issue 3: Worker Discovery Message

Problem: Worker discovery used the discover_workers message type, which the signaling server does not support
Cause: Wrong message type
Fix: Changed to discover_peers, filtering the returned peers by role
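
Role filtering on the client side could be as simple as the sketch below. The shape of the peer records (dicts with `peer_id` and `role` keys) is an assumption, as is picking the first worker for auto-selection; the signaling server's real response format is not shown in this PR.

```python
from typing import Dict, List, Optional


def select_workers(peers: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only peers that advertise role == 'worker' (assumed field name)."""
    return [p for p in peers if p.get("role") == "worker"]


def auto_select_worker(peers: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the first available worker, or None if the room has no workers."""
    workers = select_workers(peers)
    return workers[0] if workers else None


# Hypothetical discover_peers response payload.
peers = [
    {"peer_id": "client-1", "role": "client"},
    {"peer_id": "worker-1", "role": "worker"},
    {"peer_id": "worker-2", "role": "worker"},
]
print(auto_select_worker(peers))
```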

Issue 4: Async/Await in Anonymous Signin

Problem: Synchronous requests.post() was called inside an async function
Cause: Blocking I/O in an async context stalls the event loop
Fix: Use the event loop's run_in_executor() (via asyncio.get_running_loop()) to run the call in a thread pool
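
A minimal sketch of the thread-pool pattern. `blocking_signin` is a stand-in that sleeps instead of making a real `requests.post()` call, and the URL and function names are placeholders invented for the example.

```python
import asyncio
import time


def blocking_signin(url: str) -> dict:
    # Stand-in for a synchronous HTTP call such as requests.post(url).json().
    time.sleep(0.1)
    return {"url": url, "status": "ok"}


async def anonymous_signin(url: str) -> dict:
    loop = asyncio.get_running_loop()
    # Passing None uses the loop's default ThreadPoolExecutor, so the event
    # loop stays free to service other tasks while the call blocks a thread.
    return await loop.run_in_executor(None, blocking_signin, url)


result = asyncio.run(anonymous_signin("https://example.invalid/signin"))
print(result)
```

Note that `run_in_executor` is a method of the event loop, not a module-level function in asyncio. On Python 3.9+, `asyncio.to_thread()` is an equivalent shorthand.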

Breaking Changes

None - this is entirely new functionality.

Migration Notes

  • Registry is created automatically on first training job
  • Existing model directories are not affected
  • New models will use hash-based directory names: {model_type}_{model_id}/

Future Work

  • Parse training logs to extract actual validation loss metrics
  • Track actual epoch numbers during training for better resumption
  • Support for model deletion/archival in registry
  • Registry synchronization across multiple workers
  • Web UI for registry browsing

Related Issues

Addresses the model identification and checkpoint recovery concerns discussed in initial requirements.


🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

alicup29 and others added 6 commits November 6, 2025 10:29
Implements a comprehensive model registry for tracking trained SLEAP models
with automatic checkpoint recovery on connection failures.

## Model Registry
- Hash-based model identification (SHA256, 8-char IDs)
- JSON/YAML registry format support with auto-detection
- Atomic writes with corruption recovery
- Model lifecycle tracking (training → completed/interrupted)
- Metadata storage (dataset, GPU, metrics, timestamps)

## Worker Integration
- Registry initialization in RTCWorkerClient
- Hash-based directory naming: {model_type}_{model_id}
- Model registration at training start
- Automatic checkpoint resumption for interrupted jobs
- Connection drop detection with mark_interrupted
- Training completion tracking with metrics

## Checkpoint Recovery
- Detects WebRTC connection failures during training
- Marks models as interrupted with last checkpoint path
- Resumes from best.ckpt on reconnection
- Leverages PyTorch Lightning's native checkpoint support

## Testing
- 22/24 unit tests passing for ModelRegistry class
- Test coverage: initialization, CRUD ops, status transitions,
  corruption recovery, atomic writes, hash collisions

Related: openspec/changes/add-model-registry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Enables clients to query worker registries via WebRTC data channels,
supporting the workflow where clients discover available models before
running inference.

Worker-side changes:
- Add REGISTRY_QUERY_LIST handler for listing models with filters
- Add REGISTRY_QUERY_INFO handler for detailed model information
- Add REGISTRY_RESPONSE_* messages for query responses
- Track training job hash (MD5) for package identification
- Mark models as interrupted on connection loss

Client-side changes:
- Add RegistryQueryClient for WebRTC-based registry queries
- Add query_registry_list() and query_registry_info() helpers
- Support both session string and room-based connections

CLI changes:
- Update list-models with --session-string, --room-id, --token options
- Update model-info with same remote connection options
- Add --model flag to client-track for model ID-based inference
- Support dual-mode (local/remote) for all registry commands

This enables the key use cases:
1. Client queries available models: sleap-rtc list-models --room-id ROOM --token TOKEN
2. Client views model details: sleap-rtc model-info a3f5e8c9 --session-string SESSION
3. Client runs inference by ID: sleap-rtc client-track --model a3f5e8c9 --data video.slp

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes two critical bugs found during manual testing:

1. Client-side registry access bug:
   - CLI commands now check if registry directory exists before attempting
     local queries
   - Provides helpful error message directing users to use remote query
     options (--room-id/--token or --session-string)
   - Prevents clients from accidentally accessing/creating local registries

2. Model completion status bug:
   - Fixed scope issue where self.current_model_id was being overwritten
     in training loop before mark_completed() was called
   - Now uses local variable current_training_model_id within loop scope
   - Added better error handling and logging for completion marking

Testing notes:
- Models should now correctly show "completed" status after training
- Client machines should not be able to query local registries
- Only worker machines with existing registries can use local queries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes multiple issues discovered during testing:

1. Worker discovery message fix:
   - Changed from 'discover_workers' to 'list_peers' message type
   - Filters peer list for role='worker'
   - Compatible with existing signaling server implementation
   - Better error messages when discovery fails

2. Async/await fix in anonymous signin:
   - Use the event loop's run_in_executor() for synchronous requests.post()
   - Prevents blocking in async context
   - Properly awaits HTTP response

3. Improved CLI validation messages:
   - Better error messages when registry directory not found
   - Added info logging when using local registry
   - Clarifies when to use remote vs local queries
   - Helps distinguish between testing scenarios

Testing notes:
- Remote queries now work with: SLEAP_RTC_ENV=development uv run sleap-rtc list-models --room-id ROOM --token TOKEN
- Worker discovery uses standard list_peers message
- Local testing on same machine shows clear logging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The signaling server uses 'discover_peers' message type, not 'list_peers'.
This matches the implementation in client_class.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@alicup29 alicup29 changed the title Add model registry and checkpoint recovery system Add Model Registry & Storage Nov 6, 2025
alicup29 and others added 2 commits November 6, 2025 14:19
Registry query clients close their connections after receiving responses,
which was causing workers to enter reconnect mode and eventually shut down.

Solution:
- Added is_query_only_connection flag to track registry-only connections
- Set flag when receiving REGISTRY_QUERY_LIST or REGISTRY_QUERY_INFO messages
- In ICE connection state handler, detect query-only disconnects and handle gracefully
- Query-only connections clean up quietly and return worker to "available" status
- Worker continues running and accepting new connections

This allows clients to query registries multiple times without disrupting the worker.
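
The flag-based handling described above can be sketched as a small state holder. Only the `is_query_only_connection` flag, the two query message types, and the "available" status come from this commit message; the class, method names, other status strings, and the set of ICE states treated as a disconnect are assumptions.

```python
class WorkerConnectionState:
    """Hypothetical model of how a worker reacts to query-only disconnects."""

    QUERY_PREFIXES = ("REGISTRY_QUERY_LIST::", "REGISTRY_QUERY_INFO::")

    def __init__(self):
        self.is_query_only_connection = False
        self.status = "available"

    def on_message(self, raw: str) -> None:
        # A registry query marks this connection as query-only.
        if raw.startswith(self.QUERY_PREFIXES):
            self.is_query_only_connection = True

    def on_ice_state_change(self, state: str) -> str:
        if state not in ("closed", "failed", "disconnected"):
            return "noop"
        if self.is_query_only_connection:
            # Expected disconnect: reset quietly and keep serving.
            self.is_query_only_connection = False
            self.status = "available"
            return "cleanup"
        # Unexpected disconnect during real work: enter reconnect mode.
        self.status = "reconnecting"
        return "reconnect"


state = WorkerConnectionState()
state.on_message('REGISTRY_QUERY_LIST::{"status": "completed"}')
print(state.on_ice_state_change("closed"), state.status)
```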

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The previous fix called `await self.pc.close()` on a peer connection that was
already in "closed" state (closed by the client). This corrupted the peer
connection's internal state, causing subsequent connection attempts to fail
with "'int' object is not callable" errors.

Solution:
- Remove the `await self.pc.close()` call
- The connection is already closed by the client - we're just detecting it
- Worker's peer connection object can be reused for new offers when in closed state
- Manually closing it again causes state corruption

The worker now correctly:
- Handles first query-only connection ✅
- Returns to available state ✅
- Accepts subsequent query connections ✅
- Reuses the same peer connection object without corruption ✅

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>