feat: add analyzer plugin system #1825

ryan-arman · 2025-07-14T22:33:56Z

Description

This is the third PR for the Analyze tool. The complete changes can be found in the full feature branch: main...ryan-arman-anlyze_v0

This PR introduces the core plugin architecture for sample analyzers in Oumi:

SampleAnalyzer abstract base class for implementing custom analyzers
AnalyzerRegistry for registering and instantiating analyzer plugins by ID
Minimal, codebase-aligned unit tests for the registry and plugin interface

Related issues

Towards OPE-1407

Before submitting

This PR only changes documentation. (You can ignore the following checks in that case)
Did you read the contributor guideline Pull Request guidelines?
Did you link the issue(s) related to this PR in the section above?
Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

Add core plugin architecture for sample analyzers:
- SampleAnalyzer abstract base class
- AnalyzerRegistry for plugin management
- Simple unit tests following codebase patterns

This provides the foundation for implementing specific analyzers in future PRs.

taenin · 2025-07-16T18:09:00Z

src/oumi/core/analyze/sample_analyzer.py

+class AnalyzerRegistry:
+    """Registry for sample analyzer plugins."""
+
+    _analyzers: dict[str, type[SampleAnalyzer]] = {}
+
+    @classmethod
+    def register(cls, analyzer_id: str, analyzer_class: type[SampleAnalyzer]) -> None:
+        """Register a sample analyzer class.
+
+        Args:
+            analyzer_id: Unique identifier for the analyzer
+            analyzer_class: The sample analyzer class to register
+        """
+        cls._analyzers[analyzer_id] = analyzer_class
+
+    @classmethod
+    def get_analyzer(cls, analyzer_id: str) -> Union[type[SampleAnalyzer], None]:
+        """Get a sample analyzer class by ID.
+
+        Args:
+            analyzer_id: The analyzer ID to look up
+
+        Returns:
+            The sample analyzer class or None if not found
+        """
+        return cls._analyzers.get(analyzer_id)
+
+    @classmethod
+    def list_analyzers(cls) -> list[str]:
+        """List all registered sample analyzer IDs.
+
+        Returns:
+            List of registered sample analyzer IDs
+        """
+        return list(cls._analyzers.keys())
+
+    @classmethod
+    def create_analyzer(
+        cls, analyzer_id: str, config: dict[str, Any]
+    ) -> SampleAnalyzer:
+        """Create a sample analyzer instance.
+
+        Args:
+            analyzer_id: The analyzer ID to create
+            config: Configuration for the analyzer
+
+        Returns:
+            An instance of the sample analyzer
+
+        Raises:
+            ValueError: If the analyzer ID is not registered
+        """
+        analyzer_class = cls.get_analyzer(analyzer_id)
+        if analyzer_class is None:
+            raise ValueError(f"Unknown analyzer ID: {analyzer_id}")
+        return analyzer_class(config)


Any reason this doesn't live in our registry.py?

AnalyzerRegistry is only used for managing analyzers, so keeping it here keeps that logic modular and separate from the main dataset/model registry. If we prefer centralization, I can move it to registry.py.

taenin · 2025-07-16T18:11:02Z

src/oumi/core/analyze/sample_analyzer.py

+    def __init__(self, config: dict[str, Any]):
+        """Initialize the sample analyzer with configuration.
+
+        Args:
+            config: Configuration dictionary for the analyzer
+        """
+        self.config = config


Why does the base class need to enforce using a config in the initializer? I don't see any methods that use this value.

That is a fair point. I can't think of a reason right now for the base class to need the config, so I removed it.
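A minimal sketch of what the base class might look like once the config requirement is dropped. The analyze_message parameter names and the LengthAnalyzer subclass are assumptions for illustration, not the code merged in this PR; the point is only that a subclass that needs configuration can accept it in its own initializer.

```python
from abc import ABC, abstractmethod
from typing import Any


class SampleAnalyzer(ABC):
    """Base class that defines the analyzer interface and nothing else."""

    @abstractmethod
    def analyze_message(
        self, text_content: str, message_metadata: dict[str, Any]
    ) -> dict[str, Any]:
        """Return metrics computed for a single message."""


class LengthAnalyzer(SampleAnalyzer):
    """Hypothetical analyzer that owns its configuration."""

    def __init__(self, config: dict[str, Any]):
        # Only this subclass reads the config, so only it stores a value from it.
        self.unit = config.get("unit", "chars")

    def analyze_message(
        self, text_content: str, message_metadata: dict[str, Any]
    ) -> dict[str, Any]:
        if self.unit == "words":
            return {"length": len(text_content.split())}
        return {"length": len(text_content)}
```

With this shape, the registry's create_analyzer can still pass a config to analyzer_class(config), as long as each registered subclass accepts one.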

taenin · 2025-07-16T18:13:18Z

tests/unit/core/analyze/test_sample_analyzer.py

+def test_analyzer_registry_basic():
+    """Test basic registry functionality."""
+    # Clear registry
+    AnalyzerRegistry._analyzers.clear()
+
+    # Register analyzer
+    AnalyzerRegistry.register("simple", SimpleAnalyzer)
+    assert "simple" in AnalyzerRegistry._analyzers
+
+    # Get analyzer
+    analyzer_class = AnalyzerRegistry.get_analyzer("simple")
+    assert analyzer_class == SimpleAnalyzer
+
+    # Create instance
+    analyzer = AnalyzerRegistry.create_analyzer("simple", {"test": "config"})
+    assert isinstance(analyzer, SimpleAnalyzer)
+    assert analyzer.config == {"test": "config"}
+
+    # Test analysis
+    result = analyzer.analyze_message("hello", {"role": "user"})
+    assert result == {"length": 5}


Nit: Each of these should be unique tests.
It's much more informative to see that test_analyzer_registry_get_analyzer failed than test_analyzer_registry_basic failed. The first tells you immediately what happened, whereas the second requires you to parse the test to understand what's broken.

Good point. I split them into separate tests.
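One way the split could look, as a sketch. SimpleAnalyzer and AnalyzerRegistry below are simplified stand-ins so the snippet runs on its own; only test_analyzer_registry_get_analyzer is a name taken from the review comment, and the other test names are assumptions.

```python
from typing import Any, Optional


class SimpleAnalyzer:
    """Stand-in analyzer that mirrors the one used in the original test."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    def analyze_message(self, text: str, metadata: dict[str, Any]) -> dict[str, Any]:
        return {"length": len(text)}


class AnalyzerRegistry:
    """Stand-in registry with the methods shown in the diff."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        cls._analyzers[analyzer_id] = analyzer_class

    @classmethod
    def get_analyzer(cls, analyzer_id: str) -> Optional[type]:
        return cls._analyzers.get(analyzer_id)

    @classmethod
    def create_analyzer(cls, analyzer_id: str, config: dict[str, Any]):
        analyzer_class = cls.get_analyzer(analyzer_id)
        if analyzer_class is None:
            raise ValueError(f"Unknown analyzer ID: {analyzer_id}")
        return analyzer_class(config)


def test_analyzer_registry_register():
    AnalyzerRegistry._analyzers.clear()
    AnalyzerRegistry.register("simple", SimpleAnalyzer)
    assert "simple" in AnalyzerRegistry._analyzers


def test_analyzer_registry_get_analyzer():
    AnalyzerRegistry._analyzers.clear()
    AnalyzerRegistry.register("simple", SimpleAnalyzer)
    assert AnalyzerRegistry.get_analyzer("simple") is SimpleAnalyzer


def test_analyzer_registry_create_analyzer():
    AnalyzerRegistry._analyzers.clear()
    AnalyzerRegistry.register("simple", SimpleAnalyzer)
    analyzer = AnalyzerRegistry.create_analyzer("simple", {"test": "config"})
    assert isinstance(analyzer, SimpleAnalyzer)


def test_simple_analyzer_analyze_message():
    analyzer = SimpleAnalyzer({})
    assert analyzer.analyze_message("hello", {"role": "user"}) == {"length": 5}
```

Each test sets up only what it needs, so a failure report names the broken behavior directly.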

jgreer013 · 2025-07-16T20:53:03Z

src/oumi/core/analyze/sample_analyzer.py

+            analyzer_id: Unique identifier for the analyzer
+            analyzer_class: The sample analyzer class to register
+        """
+        cls._analyzers[analyzer_id] = analyzer_class


Add validation that we don't add the same id twice/overwrite

Good point. I added the check in the register method and added a test for it.
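The check described in the reply could look roughly like the sketch below. The exception type and message are assumptions; the thread only says that a check was added, not how it reports the conflict.

```python
class AnalyzerRegistry:
    """Stand-in registry showing only the register method."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        # Refuse to silently overwrite an existing registration.
        if analyzer_id in cls._analyzers:
            raise ValueError(f"Analyzer ID '{analyzer_id}' is already registered")
        cls._analyzers[analyzer_id] = analyzer_class


class FirstAnalyzer:
    """Hypothetical analyzer class used only for this example."""


class SecondAnalyzer:
    """Hypothetical analyzer class used only for this example."""


AnalyzerRegistry.register("simple", FirstAnalyzer)
try:
    AnalyzerRegistry.register("simple", SecondAnalyzer)
except ValueError as err:
    duplicate_error = str(err)
```

Failing loudly here means a second plugin that reuses an ID cannot replace the first one unnoticed.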

jgreer013 · 2025-07-16T20:54:00Z

tests/unit/core/analyze/test_sample_analyzer.py

+def test_analyzer_registry_basic():
+    """Test basic registry functionality."""
+    # Clear registry
+    AnalyzerRegistry._analyzers.clear()


Probably make this something that runs before every test?

Great point. I added a clear_registry fixture that runs automatically before each test.
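A sketch of such a fixture using pytest's autouse option. Only the fixture name clear_registry comes from the reply; the reset_registry helper, the stand-in registry, and the test are assumptions added so the snippet is self-contained.

```python
import pytest


class AnalyzerRegistry:
    """Stand-in registry holding the shared class-level mapping."""

    _analyzers: dict[str, type] = {}

    @classmethod
    def register(cls, analyzer_id: str, analyzer_class: type) -> None:
        cls._analyzers[analyzer_id] = analyzer_class


def reset_registry() -> None:
    """Empty the shared registry so tests cannot leak state into each other."""
    AnalyzerRegistry._analyzers.clear()


@pytest.fixture(autouse=True)
def clear_registry():
    # autouse=True makes pytest run this before every test in the module,
    # without each test having to request the fixture by name.
    reset_registry()


def test_registry_starts_empty():
    assert AnalyzerRegistry._analyzers == {}
    AnalyzerRegistry.register("simple", object)
    assert "simple" in AnalyzerRegistry._analyzers
```

Because the registry is stored on the class, it persists across tests in the same process, which is why the reset has to happen before each test instead of once per module.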

ryan-arman requested review from oelachqar, taenin and jgreer013 July 14, 2025 22:34

ryan-arman self-assigned this Jul 14, 2025

ryan-arman added 2 commits July 14, 2025 15:41

fix: add Apache license header to analyze __init__.py

c13669e

Merge branch 'main' into ryan-arman-analyze-plugin-system

bbc22b6

ryan-arman requested a review from a team July 15, 2025 02:17

taenin reviewed Jul 16, 2025

jgreer013 approved these changes Jul 16, 2025

Add duplicate analyzer ID validation and improve test structure

a73a2b9
