Enhance Benchmark Dataset by Adding Diverse Editing Traces #1289

@m4ushold

Description

What would you like to be added:
Currently, the benchmark suite for evaluating collaborative text editing algorithms relies on a single dataset: editing-trace.json. To address this limitation, I propose enhancing the suite with additional CRDT-related datasets, such as those used in the eg-walker paper (https://github.com/josephg/editing-traces) and those from the json-crdt-traces project. These datasets cover a variety of editing scenarios that would be valuable for testing and validating our algorithm's performance.
Specifically, the eg-walker paper’s datasets include:

  • Sequential Traces: No concurrency.
  • Concurrent Traces: Two users collaborating on writing tasks, with artificial latency to simulate real-time concurrency (C1, C2).
  • Asynchronous Traces: Editing traces reconstructed from Git repositories (A1, A2), mirroring the branching and merging behavior typical in version control.
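
To illustrate how such traces could be consumed in a benchmark, here is a minimal sketch of replaying a sequential trace. It assumes the patch layout documented in the josephg/editing-traces repository, where each transaction carries patches of the form [position, num_deleted, inserted_text]; the toy trace below is hand-made for illustration and is not an excerpt of any real dataset.

```python
def replay_sequential(trace):
    """Replay a sequential editing trace (no concurrency) against a plain string.

    Assumes the schema documented in josephg/editing-traces: a "txns" list,
    where each transaction has a "patches" list of
    [position, num_deleted, inserted_text] entries applied in order.
    """
    doc = []  # list of characters; slice assignment handles insert + delete at once
    for txn in trace["txns"]:
        for pos, num_deleted, inserted in txn["patches"]:
            doc[pos:pos + num_deleted] = list(inserted)
    return "".join(doc)

# Tiny hand-made trace for illustration (not a real dataset file):
toy_trace = {
    "txns": [
        {"patches": [[0, 0, "Hello"]]},   # insert "Hello" at 0
        {"patches": [[5, 0, " world"]]},  # append " world"
        {"patches": [[0, 5, "Howdy"]]},   # replace the first 5 chars
    ]
}
print(replay_sequential(toy_trace))  # -> "Howdy world"
```

Replaying the final document like this also gives a cheap correctness check: the repository's dataset files include the expected end content, so a benchmark can assert that the replayed result matches it before timing anything.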

Why is this needed:
Currently, we rely on just one dataset (editing-trace.json, with about 250,000 editing operations), which makes it difficult to tell whether our algorithm generalizes well or is overfitting to that single trace. Incorporating additional editing traces will let us assess the algorithm's robustness across a wider range of real-world editing scenarios.
