@Sahanave Sahanave commented Oct 6, 2025

RAG-based Robot Dataset Health Analysis (Cohere Hackathon)

Labels: 🧠 feature 🔎 evaluation 🗃️ tooling
Scope: Adds RAG-based dataset QA, outlier detection over motor stats, and a conversational explainer.

What this does

This PR introduces a lightweight “Dataset Doctor” for LeRobot datasets:

  • RAG health analysis over per-episode motor statistics with FAISS indexing.

  • Outlier detection + scoring on initial joint/motor positions (means/variance) to surface suspicious episodes.

  • Conversational insights via Cohere to explain why an episode/segment looks unhealthy and suggest curation actions.

  • CLI tools for batch analysis and interactive Q&A.

  • Two new scripts:

    • src/lerobot/scripts/collect_initpos.py: extracts per-episode first-frame motor statistics & thumbnails.
    • src/lerobot/scripts/rag_robot_health.py: builds FAISS index, runs outlier scoring, and exposes a chat interface.

No training code or existing evaluation logic is modified. Default behavior is opt-in and isolated under src/lerobot/scripts/.

Why it matters

Robotic datasets are noisy and grow in breadth, not depth. Fast, explainable QA reduces wasted training cycles and helps contributors spot drift, recording mistakes, and hardware quirks before they poison experiments.

How it was tested

  • Unit-ish checks (local):

    • Ran collect_initpos.py on small subsets of lerobot/pusht and a local ALOHA capture to confirm schema + I/O.
    • Verified FAISS index build/dump/load round-trip.
    • Sanity-checked outlier flags against injected anomalies (manually skewed joint-0 mean ⇒ correctly flagged).
  • E2E smoke:

    • Built index from extracted stats, asked 10 representative prompts (“Which episodes have likely gripper miscalibration?”) and verified responses referenced flagged episodes + stats.
  • Perf notes:

    • 5k episodes: index build < 30s on laptop CPU; query latency ~10–40 ms (FAISS flat, float32).
  • Determinism: Fixed random seeds where relevant; pure CPU path.

No changes to core training; zero impact on existing pipelines unless scripts are invoked.

How to check out & try (reviewer quickstart)

# 0) Install minimal deps (no GPU needed)
uv pip install faiss-cpu cohere numpy pandas pillow tqdm

# 1) Extract initial-position stats from a dataset
python -m src.lerobot.scripts.collect_initpos \
  --dataset.repo_id lerobot/pusht \
  --dataset.revision main \
  --output ./artifacts/pusht_initpos.parquet \
  --thumbnails ./artifacts/pusht_thumbs

# 2) Build RAG index + run health analysis (non-interactive)
python -m src.lerobot.scripts.rag_robot_health \
  --stats_path ./artifacts/pusht_initpos.parquet \
  --index_path ./artifacts/pusht_faiss.index \
  --report_path ./artifacts/pusht_health_report.json

# 3) Optional: conversational explainer (set your Cohere key)
export COHERE_API_KEY=***your_key***
python -m src.lerobot.scripts.rag_robot_health \
  --stats_path ./artifacts/pusht_initpos.parquet \
  --index_path ./artifacts/pusht_faiss.index \
  --chat

What to look for

  • ..._health_report.json → includes per-episode z-scores, outlier flags, and a dataset-level health score.
  • Chat mode → ask: “Top 5 episodes to exclude and why?” or “Any sensor drift patterns?”

Implementation notes

  • Indexing: FAISS FlatL2 over normalized feature vectors [motor_mean..., motor_std...].
  • Outlier detection: robust z-scores with MAD; configurable threshold (default 3.5).
  • Schema: Parquet with columns: episode_id, motor_mean_*, motor_std_*, timestamp_first_frame, thumb_path.
  • Safety: Scripts are read-only on dataset; artifacts written under user-provided paths.
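
As a rough illustration of the MAD-based scoring named above, here is a minimal sketch. It is not the code in rag_robot_health.py: the function names, the 0.6745 scaling constant (the usual consistency factor for the modified z-score), the zero-MAD fallback, and the sample values are all assumptions made for the example.

```python
from statistics import median


def robust_z_scores(values, scale=0.6745):
    """Modified z-scores: scale * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        # Returning zeros is one possible choice (an assumption here).
        return [0.0 for _ in values]
    return [scale * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return indices whose absolute robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical per-episode means for one joint; episode 5 is deliberately skewed.
joint0_means = [0.10, 0.12, 0.09, 0.11, 0.10, 0.95, 0.13, 0.08]
print(flag_outliers(joint0_means))
```

Because the median and MAD are barely moved by a single extreme value, the skewed episode stands out clearly, whereas a mean/standard-deviation z-score would be inflated by the outlier itself.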

Backward compatibility

  • Purely additive. No existing CLI or configs changed. If you don’t run the scripts, nothing changes.

Trade-offs / Limitations

  • Only uses first-frame motor stats today (cheap signal). Future work: temporal windows, velocity/torque, vision embeddings.
  • Cohere is optional; without it you still get deterministic outlier reports, just no conversational layer.
  • FAISS uses a CPU flat index by default; HNSW/IVF can be added if we need to scale to larger datasets.
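
To make the flat-index trade-off concrete: a flat index answers each query by comparing it against every stored vector, so query cost grows linearly with the number of episodes. The sketch below is a plain-Python stand-in for that exhaustive scan, not FAISS itself; the helper name and the toy two-dimensional vectors are hypothetical.

```python
import heapq


def flat_l2_search(vectors, query, k=3):
    """Exhaustive k-nearest-neighbour search by squared L2 distance.

    Returns a list of (squared_distance, index) pairs, nearest first.
    """
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)


# Toy feature vectors (hypothetical): one [mean, std] pair per episode.
episodes = [
    [0.10, 0.02],
    [0.11, 0.02],
    [0.95, 0.30],
    [0.09, 0.03],
]
nearest = flat_l2_search(episodes, query=[0.10, 0.02], k=2)
print(nearest)
```

Approximate indexes such as HNSW or IVF avoid scanning every vector, trading exact results for speed on larger collections.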

Documentation

  • Added docstrings + --help for both scripts.
  • If maintainers want, I can add a short “Dataset QA” page under docs/.

Future work (follow-ups I can own)

  • Add tests/ with a tiny synthetic dataset to validate: schema, index round-trip, outlier thresholds.
  • Add temporal stats (Δpose, jerk) + simple visual heuristics (blur/over/under-exposure) from first frames.
  • Export HTML report with thumbnails and quick episode links.
  • Optional: integrate into lerobot eval suite behind a flag.

Security & Privacy

  • No PII; local file processing only.
  • API key (Cohere) read from env; never logged.

Changelog

  • src/lerobot/scripts/collect_initpos.py: new
  • src/lerobot/scripts/rag_robot_health.py: new

@Copilot Copilot AI review requested due to automatic review settings October 6, 2025 21:18

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive RAG-based robot dataset health analysis system for the Cohere hackathon. The system analyzes robot motor data to identify outliers and provides conversational insights about dataset quality.

  • Adds RAG system for analyzing robot motor averages with outlier detection and health scoring
  • Implements data collection script for extracting initial position statistics from robot datasets
  • Provides conversational AI interface using Cohere for dataset health insights and explanations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/lerobot/scripts/rag_robot_health.py | Main RAG system with FAISS indexing, Cohere integration, and CLI for dataset health analysis |
| src/lerobot/scripts/collect_initpos.py | Data collection script for extracting motor averages and first frames from robot episodes |

Sahanave and others added 4 commits October 6, 2025 22:19
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
@Sahanave Sahanave changed the title Cohore hackathon Add RAG-based Robot Dataset Health Analysis (Cohere Hackathon) Oct 6, 2025
@Sahanave Sahanave requested a review from Copilot October 6, 2025 21:24

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Accepts motor-major JSON with floats/strings/lists.
Returns: motor -> {episode_id: float_mean}
"""
raw = json.loads(Path(path).read_text())

Copilot AI Oct 6, 2025

JSON loading should include error handling to prevent potential security issues from malformed files. Consider using a try-except block around the JSON parsing.
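
One possible shape for that change, as a hedged sketch rather than the PR's actual function: the name load_motor_json, the choice to re-raise as ValueError, and the top-level object check (inferred from the "motor -> {episode_id: float_mean}" docstring) are assumptions.

```python
import json
from pathlib import Path


def load_motor_json(path):
    """Read a motor-major JSON file, raising ValueError with context on failure."""
    try:
        raw = json.loads(Path(path).read_text())
    except OSError as exc:
        # Missing file, permission problem, etc.
        raise ValueError(f"Could not read {path}: {exc}") from exc
    except json.JSONDecodeError as exc:
        # Truncated or otherwise malformed JSON.
        raise ValueError(f"Invalid JSON in {path}: {exc}") from exc
    if not isinstance(raw, dict):
        # Assumed: the file maps motor names to per-episode values.
        raise ValueError(f"Expected a JSON object at top level in {path}")
    return raw
```

Wrapping both failure modes in a single exception type lets the CLI report a clear message and exit instead of surfacing a raw traceback.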

Sahanave and others added 3 commits October 6, 2025 22:26
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Sahana Venkatesh <[email protected]>