Skip to content

Add Conversation-Level Analysis #1914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

ryan-arman
Copy link
Contributor

@ryan-arman ryan-arman commented Aug 6, 2025

Description

This PR adds dual-level conversation analysis - enabling both message-level and conversation-level analysis of datasets.

Main changes:

  • SampleAnalyzer has a analyze_sample method (instead of analyze_message) which computes both message level and conversation level metrics and returns tuple[list[MessageAnalysisResult], ConversationAnalysisResult]
  • LengthAnalyzer implements SampleAnalyzer in a way that conversation level metrics are the aggregate of message level metrics for all of the metrics (char, word, sentence) except for token. For token, it tokenizes the conversation directly using the datast.tokenize
  • dataset is passed to the LengthAnalyzer so it can use it for tokenziation
  • we can remove the tokenizer in a follow up PR since we are going to use the dataset directly
  • All of the tests are updated accordingly
  • The check for sample_count has been moved to analyze_config

Related issues

Fixes OPE-1455

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

@ryan-arman ryan-arman self-assigned this Aug 6, 2025
- Add analyze_conversation() method to SampleAnalyzer base class
- Implement conversation analysis in LengthAnalyzer with conversation rendering
- Add ConversationAnalysisResult dataclass and update DatasetAnalysisResult
- Add conversation-level querying and DataFrame generation methods
- Update test files to implement analyze_conversation() in mock analyzers

This allows analyzers to compute both message-level and conversation-level metrics,
with conversation-level analysis potentially giving different results than
aggregating individual message metrics.
…t into SampleAnalysisResult

- Create new SampleAnalysisResult class that combines message-level and conversation-level results
- Update SampleAnalyzer to return SampleAnalysisResult instead of DatasetAnalysisResult
- Refactor DatasetAnalysisResult to contain list of SampleAnalysisResult objects
- Update LengthAnalyzer and all test files to work with new structure
- Simplify data organization and make analyzer results more intuitive
- All tests passing and code formatting compliant
- Fix DatasetAnalyzer to automatically pass tokenizer from config to analyzer constructors
- Resolves issue where LengthAnalyzer with token_count=True failed without tokenizer
- Maintains backward compatibility for analyzers that don't need tokenizers
- Fix import issue in test file
- All tests passing and code formatting compliant
@ryan-arman ryan-arman force-pushed the ryan-arman-conversation-analysis branch from b0944cd to 7c9d4db Compare August 7, 2025 17:14
@ryan-arman ryan-arman requested a review from oelachqar August 8, 2025 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants