Add Conversation-Level Analysis #1914

ryan-arman · 2025-08-06T20:47:45Z

Description

This PR adds dual-level conversation analysis - enabling both message-level and conversation-level analysis of datasets.

Main changes:

SampleAnalyzer has a analyze_sample method (instead of analyze_message) which computes both message level and conversation level metrics and returns tuple[list[MessageAnalysisResult], ConversationAnalysisResult]
LengthAnalyzer implements SampleAnalyzer in a way that conversation level metrics are the aggregate of message level metrics for all of the metrics (char, word, sentence) except for token. For token, it tokenizes the conversation directly using the datast.tokenize
dataset is passed to the LengthAnalyzer so it can use it for tokenziation
we can remove the tokenizer in a follow up PR since we are going to use the dataset directly
All of the tests are updated accordingly
The check for sample_count has been moved to analyze_config

Related issues

Fixes OPE-1455

Before submitting

This PR only changes documentation. (You can ignore the following checks in that case)
Did you read the contributor guideline Pull Request guidelines?
Did you link the issue(s) related to this PR in the section above?
Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

- Add analyze_conversation() method to SampleAnalyzer base class - Implement conversation analysis in LengthAnalyzer with conversation rendering - Add ConversationAnalysisResult dataclass and update DatasetAnalysisResult - Add conversation-level querying and DataFrame generation methods - Update test files to implement analyze_conversation() in mock analyzers This allows analyzers to compute both message-level and conversation-level metrics, with conversation-level analysis potentially giving different results than aggregating individual message metrics.

…t into SampleAnalysisResult - Create new SampleAnalysisResult class that combines message-level and conversation-level results - Update SampleAnalyzer to return SampleAnalysisResult instead of DatasetAnalysisResult - Refactor DatasetAnalysisResult to contain list of SampleAnalysisResult objects - Update LengthAnalyzer and all test files to work with new structure - Simplify data organization and make analyzer results more intuitive - All tests passing and code formatting compliant

- Fix DatasetAnalyzer to automatically pass tokenizer from config to analyzer constructors - Resolves issue where LengthAnalyzer with token_count=True failed without tokenizer - Maintains backward compatibility for analyzers that don't need tokenizers - Fix import issue in test file - All tests passing and code formatting compliant

…c clarity

…t wrapper

…f execution

src/oumi/core/analyze/length_analyzer.py

…tokenization

ryan-arman self-assigned this Aug 6, 2025

ryan-arman requested review from jgreer013, oelachqar and taenin August 6, 2025 20:47

ryan-arman added 14 commits August 7, 2025 10:07

refactor: remove unused samples field from DatasetAnalysisResult

5cdd274

fix: restore copyright header in analyze __init__.py

4155028

refactor: move dataclasses from types.py to dataset_analyzer.py

0dc51a2

update comments

4033048

undo unintended changes

0e25a2a

move sample_count to config

079e508

add logging back

d22bcb0

refactor: rename compute_metrics to analyze_sample for better semanti…

12f1863

…c clarity

refactor LengthAnalyzer

85232cb

refactor: return individual components instead of SampleAnalysisResul…

073bff0

…t wrapper

git statusfix: validate sample_count during config creation instead o…

7c9d4db

…f execution

ryan-arman force-pushed the ryan-arman-conversation-analysis branch from b0944cd to 7c9d4db Compare August 7, 2025 17:14

oelachqar reviewed Aug 7, 2025

View reviewed changes

src/oumi/core/analyze/length_analyzer.py Outdated Show resolved Hide resolved

src/oumi/core/analyze/length_analyzer.py Outdated Show resolved Hide resolved

ryan-arman and others added 2 commits August 7, 2025 14:01

Merge branch 'main' into ryan-arman-conversation-analysis

108f81d

refactor: remove dataset dependency from LengthAnalyzer and simplify …

aa8c225

…tokenization

ryan-arman requested a review from oelachqar August 8, 2025 00:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Conversation-Level Analysis #1914

Add Conversation-Level Analysis #1914

ryan-arman commented Aug 6, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Add Conversation-Level Analysis #1914

Are you sure you want to change the base?

Add Conversation-Level Analysis #1914

Conversation

ryan-arman commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Before submitting

Reviewers

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryan-arman commented Aug 6, 2025 •

edited

Loading