Add RAG-based Robot Dataset Health Analysis (Cohere Hackathon) #2127

Sahanave · 2025-10-06T21:17:59Z

RAG-based Robot Dataset Health Analysis (Cohere Hackathon)

Labels: 🧠 feature 🔎 evaluation 🗃️ tooling
Scope: Adds RAG-based dataset QA, outlier detection over motor stats, and a conversational explainer.

What this does

This PR introduces a lightweight “Dataset Doctor” for LeRobot datasets:

RAG health analysis over per-episode motor statistics with FAISS indexing.
Outlier detection + scoring on initial joint/motor positions (means/variance) to surface suspicious episodes.
Conversational insights via Cohere to explain why an episode/segment looks unhealthy and suggest curation actions.
CLI tools for batch analysis and interactive Q&A.
Two new scripts:
- src/lerobot/scripts/collect_initpos.py: extracts per-episode first-frame motor statistics & thumbnails.
- src/lerobot/scripts/rag_robot_health.py: builds FAISS index, runs outlier scoring, and exposes a chat interface.

No training code or existing evaluation logic is modified. Default behavior is opt-in and isolated under src/lerobot/scripts/.

Why it matters

Robotic datasets are noisy and grow in breadth, not depth. Fast, explainable QA reduces wasted training cycles and helps contributors spot drift, recording mistakes, and hardware quirks before they poison experiments.

How it was tested

Unit-ish checks (local):
- Ran collect_initpos.py on small subsets of lerobot/pusht and a local ALOHA capture to confirm schema + I/O.
- Verified FAISS index build/dump/load round-trip.
- Sanity-checked outlier flags against injected anomalies (manually skewed joint-0 mean ⇒ correctly flagged).
E2E smoke:
- Built index from extracted stats, asked 10 representative prompts (e.g. “Which episodes have likely gripper miscalibration?”), and verified that responses referenced the flagged episodes and their stats.
Perf notes:
- 5k episodes: index build < 30s on laptop CPU; query latency ~10–40 ms (FAISS flat, float32).
Determinism: Fixed random seeds where relevant; pure CPU path.

No changes to core training; zero impact on existing pipelines unless scripts are invoked.

How to check out & try (reviewer quickstart)

# 0) Install minimal deps (no GPU needed)
uv pip install faiss-cpu cohere numpy pandas pillow tqdm

# 1) Extract initial-position stats from a dataset
python -m src.lerobot.scripts.collect_initpos \
  --dataset.repo_id lerobot/pusht \
  --dataset.revision main \
  --output ./artifacts/pusht_initpos.parquet \
  --thumbnails ./artifacts/pusht_thumbs

# 2) Build RAG index + run health analysis (non-interactive)
python -m src.lerobot.scripts.rag_robot_health \
  --stats_path ./artifacts/pusht_initpos.parquet \
  --index_path ./artifacts/pusht_faiss.index \
  --report_path ./artifacts/pusht_health_report.json

# 3) Optional: conversational explainer (set your Cohere key)
export COHERE_API_KEY=***your_key***
python -m src.lerobot.scripts.rag_robot_health \
  --stats_path ./artifacts/pusht_initpos.parquet \
  --index_path ./artifacts/pusht_faiss.index \
  --chat

What to look for

..._health_report.json → includes per-episode z-scores, outlier flags, and a dataset-level health score.
Chat mode → ask: “Top 5 episodes to exclude and why?” or “Any sensor drift patterns?”

Implementation notes

Indexing: FAISS flat L2 index (IndexFlatL2) over normalized feature vectors [motor_mean..., motor_std...].
Outlier detection: robust z-scores based on the median absolute deviation (MAD); configurable threshold (default 3.5).
Schema: Parquet with columns: episode_id, motor_mean_*, motor_std_*, timestamp_first_frame, thumb_path.
Safety: Scripts are read-only on dataset; artifacts written under user-provided paths.
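
For reviewers unfamiliar with MAD-based scoring, here is a minimal standard-library sketch of robust z-scores with a 3.5 cutoff. It is not the code in rag_robot_health.py: the function names, the 0.6745 scaling constant (the common "modified z-score" convention), and the sample joint-0 means are assumptions chosen for illustration.

```python
from statistics import median

# Assumption: 0.6745 (the 75th percentile of the standard normal) rescales
# the MAD so the score is comparable to an ordinary z-score.
MAD_SCALE = 0.6745


def robust_z_scores(values):
    """Return modified z-scores: MAD_SCALE * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: at least half of the values equal the median.
        return [0.0 if v == med else float("inf") for v in values]
    return [MAD_SCALE * (v - med) / mad for v in values]


def flag_outliers(values_by_episode, threshold=3.5):
    """Map episode_id -> True when |robust z| exceeds the threshold."""
    episode_ids = list(values_by_episode)
    scores = robust_z_scores([values_by_episode[e] for e in episode_ids])
    return {e: abs(s) > threshold for e, s in zip(episode_ids, scores)}


# Hypothetical joint-0 means per episode; episode 5 is deliberately skewed.
joint0_mean = {0: 0.10, 1: 0.12, 2: 0.11, 3: 0.09, 4: 0.13, 5: 0.95}
flags = flag_outliers(joint0_mean)
```

Because both the center and the spread are medians, a single skewed episode barely shifts the baseline, which is why this kind of score suits the injected-anomaly check described under "How it was tested".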

Backward compatibility

Purely additive. No existing CLI or configs changed. If you don’t run the scripts, nothing changes.

Trade-offs / Limitations

Only uses first-frame motor stats today (cheap signal). Future work: temporal windows, velocity/torque, vision embeddings.
Cohere is optional; without it you still get deterministic outlier reports, just no conversational layer.
FAISS uses a CPU flat (exhaustive) index by default; HNSW/IVF can be added if larger datasets need it.
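
To make the flat-index trade-off concrete, the sketch below shows in plain Python what an exhaustive L2 search computes: the query is compared against every stored vector, so results are exact but cost grows linearly with the number of episodes. It is a stand-in for illustration and does not call FAISS; the function names and feature values are hypothetical. Like FAISS's IndexFlatL2, it reports squared L2 distances.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def flat_l2_search(index_vectors, query, k=3):
    """Exhaustive search: compare the query against every stored vector.

    Returns up to k (squared_distance, row) pairs, nearest first.
    """
    scored = []
    for row, vec in enumerate(index_vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, row))
    scored.sort()
    return scored[:k]


# Hypothetical per-episode features: [motor_mean..., motor_std...]
features = [
    [0.10, 0.20, 0.01, 0.02],
    [0.11, 0.19, 0.01, 0.02],
    [0.90, 0.10, 0.30, 0.05],
]
index = [l2_normalize(v) for v in features]
hits = flat_l2_search(index, l2_normalize([0.10, 0.21, 0.01, 0.02]), k=2)
```

Approximate structures such as HNSW or IVF avoid scanning every vector, trading some recall for speed.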

Documentation

Added docstrings + --help for both scripts.
If maintainers want, I can add a short “Dataset QA” page under docs/.

Future work (follow-ups I can own)

Add tests/ with a tiny synthetic dataset to validate: schema, index round-trip, outlier thresholds.
Add temporal stats (Δpose, jerk) + simple visual heuristics (blur/over/under-exposure) from first frames.
Export HTML report with thumbnails and quick episode links.
Optional: integrate into lerobot eval suite behind a flag.

Security & Privacy

No PII; local file processing only.
API key (Cohere) read from env; never logged.
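
One way to satisfy both points is sketched below, assuming the key lives in the COHERE_API_KEY variable shown in the quickstart. The helper names are hypothetical and are not taken from the PR; the sketch only shows the pattern of reading the key once and logging its presence rather than its value.

```python
import os


def get_cohere_api_key(environ=os.environ):
    """Return the key from COHERE_API_KEY, or None when it is unset or blank."""
    key = environ.get("COHERE_API_KEY", "").strip()
    return key or None


def describe_key(key):
    """Log-safe description that never includes the key itself."""
    return "COHERE_API_KEY: set" if key else "COHERE_API_KEY: not set"


# A plain dict stands in for os.environ so the example has no side effects.
key = get_cohere_api_key({"COHERE_API_KEY": "example-not-a-real-key"})
status = describe_key(key)
chat_enabled = key is not None  # without a key, only the offline report runs
```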

Changelog

src/lerobot/scripts/collect_initpos.py — new
src/lerobot/scripts/rag_robot_health.py — new

Copilot

Pull Request Overview

This PR introduces a comprehensive RAG-based robot dataset health analysis system for the Cohere hackathon. The system analyzes robot motor data to identify outliers and provides conversational insights about dataset quality.

Adds RAG system for analyzing robot motor averages with outlier detection and health scoring
Implements data collection script for extracting initial position statistics from robot datasets
Provides conversational AI interface using Cohere for dataset health insights and explanations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/lerobot/scripts/rag_robot_health.py	Main RAG system with FAISS indexing, Cohere integration, and CLI for dataset health analysis
src/lerobot/scripts/collect_initpos.py	Data collection script for extracting motor averages and first frames from robot episodes

src/lerobot/scripts/rag_robot_health.py

src/lerobot/scripts/collect_initpos.py

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Copilot

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot · 2025-10-06T21:25:03Z

src/lerobot/scripts/rag_robot_health.py

+    Accepts motor-major JSON with floats/strings/lists.
+    Returns: motor -> {episode_id: float_mean}
+    """
+    raw = json.loads(Path(path).read_text())


JSON loading should include error handling to prevent potential security issues from malformed files. Consider using a try-except block around the JSON parsing.
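
A minimal sketch of what this suggestion could look like, written as a standalone helper rather than a patch to the actual function (whose full body is not shown here). The name load_json_file is hypothetical, and the check that the top level is a JSON object keyed by motor name is an assumption inferred from the docstring above.

```python
import json
import tempfile
from pathlib import Path


def load_json_file(path):
    """Read and parse a JSON file, raising ValueError with context on failure."""
    p = Path(path)
    try:
        text = p.read_text(encoding="utf-8")
    except OSError as exc:
        raise ValueError(f"Could not read {p}: {exc}") from exc
    try:
        raw = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{p} is not valid JSON: {exc}") from exc
    # Assumption: motor-major input is a JSON object keyed by motor name.
    if not isinstance(raw, dict):
        raise ValueError(f"{p}: expected a JSON object keyed by motor name")
    return raw


# Usage: one well-formed file and one malformed file in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text('{"joint_0": {"0": 0.1}}', encoding="utf-8")
    loaded = load_json_file(good)

    bad = Path(tmp) / "bad.json"
    bad.write_text("{not json", encoding="utf-8")
    try:
        load_json_file(bad)
        error = None
    except ValueError as exc:
        error = str(exc)
```

Wrapping both failure modes in a single exception type lets the CLI report a clear message and exit instead of surfacing a raw traceback.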

Sahana added 3 commits September 14, 2025 14:06

first version of inital-pos

1cbd56c

add rag_health

a6d8bf8

add

ea91b98

Copilot AI review requested due to automatic review settings October 6, 2025 21:18

Merge branch 'main' into cohore-hackathon

d5f8e1b

Copilot AI reviewed Oct 6, 2025

Sahanave and others added 4 commits October 6, 2025 22:19

Update src/lerobot/scripts/rag_robot_health.py

f1835bb

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Update src/lerobot/scripts/rag_robot_health.py

0affb9c

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Update src/lerobot/scripts/rag_robot_health.py

de11127

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Update src/lerobot/scripts/collect_initpos.py

ba2e37b

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Sahanave changed the title ~~Cohore hackathon~~ Add RAG-based Robot Dataset Health Analysis (Cohere Hackathon) Oct 6, 2025

Sahanave requested a review from Copilot October 6, 2025 21:24

Copilot AI reviewed Oct 6, 2025

Sahanave and others added 3 commits October 6, 2025 22:26

Update src/lerobot/scripts/rag_robot_health.py

39a2f26

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Update src/lerobot/scripts/rag_robot_health.py

ca41477

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>

Update src/lerobot/scripts/collect_initpos.py

b210ed9

Co-authored-by: Copilot <[email protected]> Signed-off-by: Sahana Venkatesh <[email protected]>
