
Natural Language Interface to Trigger Relevant Recordings #951

Open

abrichr opened this issue May 23, 2025 · 0 comments
Labels: enhancement (New feature or request)

abrichr commented May 23, 2025

Feature request

Enable users to launch OpenAdapt workflows using natural language commands like “do my taxes,” replacing the current manual replay model. This aligns with modern AI UX expectations and leverages existing recordings more effectively.

Problem

Requiring users to manually select and replay a Recording is unintuitive and limits accessibility. A natural language interface would allow users to describe tasks in plain English and let the system infer the most relevant automation.

Goal

Let users initiate task automation by typing a natural language description. The system finds relevant past demonstrations and uses them to guide replay or plan next steps adaptively.

Components

1. UI Input

  • Launched from the system tray icon or a similar entry point.
  • Prompt: “What do you want help with today?”
  • Accepts free-form natural language input (see the sketch below).
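A minimal sketch of the input flow, assuming pystray for the tray icon and tkinter for the prompt dialog (both assumptions; OpenAdapt’s actual tray app may use a different toolkit, and `handle_query` is a hypothetical entry point into steps 2–4):

```python
# Sketch only: assumes pystray + Pillow + tkinter are available.
import threading
import tkinter as tk
from tkinter import simpledialog

import pystray
from PIL import Image


def handle_query(query: str) -> None:
    print(f"User asked: {query}")  # placeholder for steps 2-4


def prompt_user(icon, item):
    """Open a lightweight input box and hand the query to the pipeline."""
    def ask():
        root = tk.Tk()
        root.withdraw()  # hide the empty root window
        query = simpledialog.askstring(
            "OpenAdapt", "What do you want help with today?"
        )
        root.destroy()
        if query:
            handle_query(query)

    # Run the dialog off the tray callback thread so the icon stays responsive.
    threading.Thread(target=ask, daemon=True).start()


icon = pystray.Icon(
    "openadapt",
    Image.new("RGB", (64, 64), "white"),  # placeholder icon image
    menu=pystray.Menu(pystray.MenuItem("Ask OpenAdapt…", prompt_user)),
)
icon.run()
```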

2. Embedding Generation

  • Generate an embedding from the user’s input using a model like sentence-transformers/all-MiniLM-L6-v2, as sketched below.
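A minimal sketch using the sentence-transformers package:

```python
# Sketch: embed the user's query with a small sentence-transformer model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# encode() returns a numpy vector; for this model it is 384-dimensional,
# which also fixes the vector dimension of the vss0 table in step 3.
query_embedding = model.encode("do my taxes", normalize_embeddings=True)
print(query_embedding.shape)  # (384,)
```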

3. Embedding Search

  • Use sqlite-vss to find the nearest matches among stored Recording.description embeddings (sketched below).
  • Store embeddings in SQLite (RecordingEmbedding table or similar).
  • Leverage Recording.description as the initial source of semantic content.
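A minimal sketch of the storage and search path, assuming the sqlite-vss Python bindings; `recording_id`, `description_embedding`, and `query_embedding` are illustrative (the latter carries over from step 2):

```python
import json
import sqlite3

import sqlite_vss  # pip install sqlite-vss

db = sqlite3.connect("openadapt.db")
db.enable_load_extension(True)
sqlite_vss.load(db)

# vss0 virtual table holding one 384-dim vector per Recording row;
# rowid ties each vector back to its Recording.
db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS vss_recordings USING vss0(embedding(384))"
)
db.execute(
    "INSERT INTO vss_recordings(rowid, embedding) VALUES (?, ?)",
    (recording_id, json.dumps(description_embedding.tolist())),
)

# Nearest-neighbor search for the user's query embedding.
# (The LIMIT form requires a recent SQLite; older versions need
# vss_search_params(?, 5) instead.)
rows = db.execute(
    "SELECT rowid, distance FROM vss_recordings "
    "WHERE vss_search(embedding, ?) LIMIT 5",
    (json.dumps(query_embedding.tolist()),),
).fetchall()
```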

4. Reranking & Demonstration Injection (instead of hard selection)

  • Instead of selecting a single best match or displaying a list:

    • Retrieve top-N semantically similar recordings.
    • Use reranking (e.g. with a cross-encoder, as sketched below) to sort them.
    • Inject the top few (or most relevant subsections) as demonstrations into the model prompt.
  • At each step, the model sees:

    • Current GUI state (screenshot + accessible DOM or bounding box data).
    • Prior user query.
    • Retrieved demonstrations (replay logs or summaries).
  • The model then decides what to do next.
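A minimal sketch of the rerank-and-inject step, assuming sentence-transformers’ CrossEncoder; the `description` and `summary` fields on each candidate are illustrative stand-ins for whatever the Recording model exposes:

```python
from sentence_transformers import CrossEncoder

# Sketch: rerank retrieved recordings, then build the per-step prompt.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[dict], top_k: int = 3) -> list[dict]:
    """Sort candidate recordings by cross-encoder relevance to the query."""
    scores = reranker.predict([(query, c["description"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_k]]


def build_prompt(query: str, gui_state: str, demos: list[dict]) -> str:
    """Assemble the per-step prompt: GUI state + query + demonstrations."""
    demo_text = "\n\n".join(d["summary"] for d in demos)
    return (
        f"Current GUI state:\n{gui_state}\n\n"
        f"User request: {query}\n\n"
        f"Relevant past demonstrations:\n{demo_text}\n\n"
        "Decide the next action."
    )
```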

5. Hierarchical / Recursive Summarization

  • For each recording, generate multi-resolution summaries (see the sketch after this list):

    • High-level: “file taxes”
    • Mid-level: “log in to TurboTax”
    • Low-level: “click the ‘T4 Income’ tab”
  • Use these summaries to:

    • Improve retrieval accuracy.
    • Enable context-aware planning and prompt construction.
    • Eventually support segmentation and partial replays.
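A minimal sketch of the recursive summarization, where `complete` is a hypothetical wrapper around whatever LLM API OpenAdapt uses:

```python
# Sketch: build low -> mid -> high summaries, each level feeding the next.

def summarize(lines: list[str], level: str) -> str:
    instructions = {
        "low": "Describe each UI action as one short imperative phrase, one per line.",
        "mid": "Group these steps into a few sub-task descriptions, one per line.",
        "high": "State the overall task in a single phrase.",
    }
    prompt = instructions[level] + "\n\n" + "\n".join(lines)
    return complete(prompt)  # hypothetical LLM call


def summarize_recording(action_log: list[str]) -> dict[str, str]:
    low = summarize(action_log, "low")
    mid = summarize(low.splitlines(), "mid")
    high = summarize(mid.splitlines(), "high")
    # e.g. {"high": "file taxes", "mid": "log in to TurboTax\n...", ...}
    return {"low": low, "mid": mid, "high": high}
```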

UX Considerations

  • Lightweight, interruptible input box.
  • System should confirm before executing anything destructive.
  • If confidence is low, suggest clarifications or fallback paths.

Alternatives Considered

  • Keyword search: brittle and non-semantic.
  • Command line or dropdown replay: slower and less intuitive.
  • Static top-k display: less adaptive than demonstration-based inference.

Acceptance Criteria

  • User can enter a natural language query from a tray icon.
  • The system embeds the query and retrieves similar past demonstrations.
  • The model uses these demos to determine how to proceed.
  • Summaries and embeddings are auto-generated for future replays.

