feat: add analyzer plugin system #1825

Open: ryan-arman wants to merge 4 commits into base: main

Conversation

@ryan-arman (Contributor) commented Jul 14, 2025

Description

This is the third PR for the Analyze tool. The complete changes can be found in the full feature branch: main...ryan-arman-anlyze_v0

This PR introduces the core plugin architecture for sample analyzers in Oumi:

  • SampleAnalyzer abstract base class for implementing custom analyzers
  • AnalyzerRegistry for registering and instantiating analyzer plugins by ID
  • Minimal, codebase-aligned unit tests for the registry and plugin interface

Related issues

Towards OPE-1407

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

Add core plugin architecture for sample analyzers:
- SampleAnalyzer abstract base class
- AnalyzerRegistry for plugin management
- Simple unit tests following codebase patterns

This provides the foundation for implementing specific analyzers
in future PRs.
@ryan-arman ryan-arman self-assigned this Jul 14, 2025
@ryan-arman ryan-arman requested a review from a team July 15, 2025 02:17
Comment on lines 48 to 103
class AnalyzerRegistry:
    """Registry for sample analyzer plugins."""

    _analyzers: dict[str, type[SampleAnalyzer]] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type[SampleAnalyzer]) -> None:
        """Register a sample analyzer class.

        Args:
            analyzer_id: Unique identifier for the analyzer
            analyzer_class: The sample analyzer class to register
        """
        cls._analyzers[analyzer_id] = analyzer_class

    @classmethod
    def get_analyzer(cls, analyzer_id: str) -> Union[type[SampleAnalyzer], None]:
        """Get a sample analyzer class by ID.

        Args:
            analyzer_id: The analyzer ID to look up

        Returns:
            The sample analyzer class or None if not found
        """
        return cls._analyzers.get(analyzer_id)

    @classmethod
    def list_analyzers(cls) -> list[str]:
        """List all registered sample analyzer IDs.

        Returns:
            List of registered sample analyzer IDs
        """
        return list(cls._analyzers.keys())

    @classmethod
    def create_analyzer(
        cls, analyzer_id: str, config: dict[str, Any]
    ) -> SampleAnalyzer:
        """Create a sample analyzer instance.

        Args:
            analyzer_id: The analyzer ID to create
            config: Configuration for the analyzer

        Returns:
            An instance of the sample analyzer

        Raises:
            ValueError: If the analyzer ID is not registered
        """
        analyzer_class = cls.get_analyzer(analyzer_id)
        if analyzer_class is None:
            raise ValueError(f"Unknown analyzer ID: {analyzer_id}")
        return analyzer_class(config)
Collaborator

Any reason this doesn't live in our registry.py?

Contributor Author

AnalyzerRegistry is used only for managing analyzers, so a separate class keeps that logic modular and apart from the main dataset/model registry. If we prefer centralization, I can move it to registry.py.

Comment on lines 24 to 30
    def __init__(self, config: dict[str, Any]):
        """Initialize the sample analyzer with configuration.

        Args:
            config: Configuration dictionary for the analyzer
        """
        self.config = config
Collaborator

Why does the base class need to enforce using a config in the initializer? I don't see any methods that use this value.

Contributor Author

That is a fair point. I can't think of a reason right now for the base class to need the config, so I removed it.
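For illustration, a base class without a config-taking initializer might look like the sketch below. This is a guess at the shape, not the code from the PR: the parameter names of analyze_message and the LengthAnalyzer subclass (with its strip_whitespace option) are hypothetical, and only the call pattern and the {"length": 5} result come from the test quoted later in this review.

```python
from abc import ABC, abstractmethod
from typing import Any


class SampleAnalyzer(ABC):
    """Sketch of a base class that stores no config (assumed shape)."""

    @abstractmethod
    def analyze_message(
        self, text_content: str, message_metadata: dict[str, Any]
    ) -> dict[str, Any]:
        """Return metrics for a single message."""


class LengthAnalyzer(SampleAnalyzer):
    """Hypothetical subclass that declares only the option it needs."""

    def __init__(self, strip_whitespace: bool = False):
        self.strip_whitespace = strip_whitespace

    def analyze_message(
        self, text_content: str, message_metadata: dict[str, Any]
    ) -> dict[str, Any]:
        text = text_content.strip() if self.strip_whitespace else text_content
        return {"length": len(text)}


result = LengthAnalyzer().analyze_message("hello", {"role": "user"})
print(result)  # {'length': 5}
```

With this shape, each analyzer decides for itself whether it needs configuration, and the base class only fixes the method every plugin must implement.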

Comment on lines 15 to 35
def test_analyzer_registry_basic():
    """Test basic registry functionality."""
    # Clear registry
    AnalyzerRegistry._analyzers.clear()

    # Register analyzer
    AnalyzerRegistry.register("simple", SimpleAnalyzer)
    assert "simple" in AnalyzerRegistry._analyzers

    # Get analyzer
    analyzer_class = AnalyzerRegistry.get_analyzer("simple")
    assert analyzer_class == SimpleAnalyzer

    # Create instance
    analyzer = AnalyzerRegistry.create_analyzer("simple", {"test": "config"})
    assert isinstance(analyzer, SimpleAnalyzer)
    assert analyzer.config == {"test": "config"}

    # Test analysis
    result = analyzer.analyze_message("hello", {"role": "user"})
    assert result == {"length": 5}
Collaborator

Nit: Each of these should be unique tests.
It's much more informative to see that test_analyzer_registry_get_analyzer failed than test_analyzer_registry_basic failed. The first tells you immediately what happened, whereas the second requires you to parse the test to understand what's broken.

Contributor Author

Good point. I split them into separate tests.
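One way the split could look is sketched below. The stand-in SimpleAnalyzer and the trimmed AnalyzerRegistry are included only so the example runs on its own; the _fresh_registry helper and every test name other than test_analyzer_registry_get_analyzer are hypothetical, and the tests actually committed in the PR may differ.

```python
from typing import Any, Optional


class SimpleAnalyzer:
    """Stand-in for the analyzer used in the original combined test."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    def analyze_message(self, text: str, metadata: dict[str, Any]) -> dict[str, Any]:
        return {"length": len(text)}


class AnalyzerRegistry:
    """Trimmed-down copy of the registry under review."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        cls._analyzers[analyzer_id] = analyzer_class

    @classmethod
    def get_analyzer(cls, analyzer_id: str) -> Optional[type]:
        return cls._analyzers.get(analyzer_id)

    @classmethod
    def create_analyzer(cls, analyzer_id: str, config: dict[str, Any]) -> Any:
        analyzer_class = cls.get_analyzer(analyzer_id)
        if analyzer_class is None:
            raise ValueError(f"Unknown analyzer ID: {analyzer_id}")
        return analyzer_class(config)


def _fresh_registry() -> None:
    """Hypothetical helper: start each test from a known state."""
    AnalyzerRegistry._analyzers.clear()
    AnalyzerRegistry.register("simple", SimpleAnalyzer)


def test_analyzer_registry_register():
    _fresh_registry()
    assert "simple" in AnalyzerRegistry._analyzers


def test_analyzer_registry_get_analyzer():
    _fresh_registry()
    assert AnalyzerRegistry.get_analyzer("simple") is SimpleAnalyzer


def test_analyzer_registry_create_analyzer():
    _fresh_registry()
    analyzer = AnalyzerRegistry.create_analyzer("simple", {"test": "config"})
    assert isinstance(analyzer, SimpleAnalyzer)
    assert analyzer.config == {"test": "config"}


def test_simple_analyzer_analyze_message():
    analyzer = SimpleAnalyzer({})
    assert analyzer.analyze_message("hello", {"role": "user"}) == {"length": 5}
```

Each function now fails under its own name, so a broken lookup shows up as test_analyzer_registry_get_analyzer rather than as a generic "basic" failure.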

            analyzer_id: Unique identifier for the analyzer
            analyzer_class: The sample analyzer class to register
        """
        cls._analyzers[analyzer_id] = analyzer_class
Contributor

Add validation so that we don't register the same ID twice or overwrite an existing entry.

Contributor Author

Good point. I added the check in the register method and added a test for it.
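A minimal sketch of such a duplicate check follows. The exception type (ValueError) and the message wording are assumptions, since the thread does not show the final code, and FirstAnalyzer and SecondAnalyzer are placeholder classes used only for the demonstration.

```python
class AnalyzerRegistry:
    """Minimal stand-in showing only the duplicate-ID guard."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        # Assumed behavior: refuse to overwrite an existing registration.
        if analyzer_id in cls._analyzers:
            raise ValueError(f"Analyzer ID already registered: {analyzer_id}")
        cls._analyzers[analyzer_id] = analyzer_class


class FirstAnalyzer:
    """Placeholder analyzer class."""


class SecondAnalyzer:
    """Placeholder analyzer class."""


AnalyzerRegistry.register("length", FirstAnalyzer)
try:
    AnalyzerRegistry.register("length", SecondAnalyzer)
except ValueError as err:
    print(err)  # Analyzer ID already registered: length
```

Raising on a duplicate makes an accidental ID collision between two plugins visible at registration time instead of silently replacing the first analyzer.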

def test_analyzer_registry_basic():
    """Test basic registry functionality."""
    # Clear registry
    AnalyzerRegistry._analyzers.clear()
Contributor

Probably make this something that runs before every test?

Contributor Author

Great point. I added a clear_registry fixture that runs automatically before each test.
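With pytest, a fixture of this kind is typically declared with autouse=True, roughly as sketched below. Only the fixture name clear_registry comes from the comment above; the stand-in registry and the two test functions are hypothetical and exist only to make the example self-contained.

```python
import pytest


class AnalyzerRegistry:
    """Minimal stand-in for the registry under review."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        cls._analyzers[analyzer_id] = analyzer_class


@pytest.fixture(autouse=True)
def clear_registry():
    """Empty the shared class-level dict before every test in this module."""
    AnalyzerRegistry._analyzers.clear()


def test_registry_starts_empty():
    # Holds regardless of test order because the fixture ran first.
    assert AnalyzerRegistry._analyzers == {}


def test_register_adds_entry():
    AnalyzerRegistry.register("a", object)
    assert list(AnalyzerRegistry._analyzers) == ["a"]
```

Because _analyzers is a class attribute shared by every test, clearing it in an autouse fixture removes the need for each test to remember the reset and keeps the tests independent of execution order.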

3 participants