
Augmenting Low-Cost Robotic Arms with YOLO-Based Perception and Vision-Language-Action Policies

📖 Introduction

In this project, we build an integrated robotic system that combines vision detection and robotic manipulation using the LeRobot framework. Our setup includes a XIAO ESP32S3 Sense module for wireless video streaming, a standard laptop webcam, and a UVC arm-mounted camera for multi-view perception.
We first tested and deployed YOLOv8 models for object detection on the laptop side, leveraging live video feeds streamed over a local Wi-Fi network.
The robotic manipulation is performed using LeRobot so100 arms, assembled from leader and follower units.
We trained a Vision-Language-Action (VLA) model and fine-tuned YOLOv8 to operate jointly, enabling the robot to detect, reason, and act in a dynamic environment.
To further enhance the VLA model’s generalization capability, we designed and implemented our own control strategies based on real-time YOLO detection results.
The complete technical steps, experimental results, and lessons learned are thoroughly discussed in our final report.
A video demonstration is available via our YouTube video link, or by clicking the setup photo below.

Watch the video
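The simplest family of YOLO-driven control strategies is proportional visual servoing: convert the detected object's offset from the image center into small joint corrections. The sketch below is illustrative only (the function name and gain are assumptions, not our exact controller):

```python
def center_offset_to_command(cx, cy, frame_w, frame_h, gain=0.1):
    """Map a detection's pixel-space center to small pan/tilt corrections
    that steer the arm toward the object (simple proportional control).
    `gain` is a hypothetical tuning constant."""
    # Offsets lie in [-0.5, 0.5]; both are zero when the object is centered.
    dx = cx / frame_w - 0.5
    dy = cy / frame_h - 0.5
    # Corrections scale linearly with the offset from the image center.
    return (gain * dx, gain * dy)
```

A centered detection yields a zero command, so the arm holds still once the object is in the middle of the frame.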

📋 Project To-Do List

  • Planning to implement YOLOv8 for vision detection and LeRobot so100 robotic arms as the actuators
  • Testing YOLOv8 model performance directly on laptop with sample images and webcam video stream
  • Deciding between two implementation structures:
    • Running YOLO Nano (tiny model) inference directly on ESP32 (onboard detection)
    • OR Streaming video to laptop and running full YOLOv8 detection locally
  • Finalizing the choice to use laptop-side YOLOv8 inference based on model size, speed, and ESP32 capabilities
  • Understanding and exploring the overall LeRobot framework structure: motor buses, control pipelines, dataset recording, training
  • Searching and evaluating possible alternative motors to replace Feetech STS3215 if needed
  • Waiting for hardware arrival: receiving and assembling LeRobot leader and follower arms
  • Assembling the mechanical structure and wiring motors according to the so100 configuration standard
  • Applying the LeRobot framework to:
    • Perform motor calibration for each joint using control_robot.py calibrate
    • Execute teleoperation through keyboard control and validate arm movement
    • Record teleoperation datasets using Hugging Face’s dataset standard (LeRobotDataset v2)
    • Visualize dataset frames locally and via visualize_dataset_html.py
    • Perform basic policy training (e.g., simple DiffusionPolicy or ACT baseline) using collected data
  • Setting up XIAO ESP32S3 Sense board as Wi-Fi camera module for video streaming
  • Configuring ESP32-S3 to join existing Wi-Fi network (STA mode), avoiding SoftAP creation
  • Debugging MJPEG streaming issues, including partial frames and Wi-Fi packet losses (send error 104)
  • Successfully setting up ESP32 camera to stream stable video feed to the laptop
  • Using ESP32 camera streaming as the live video source for YOLOv8 running on the laptop
  • Preparing a fine-tuning dataset for YOLOv8 by combining open-source data with our own recordings
  • Writing timestamp alignment tools to synchronize YOLO detection timestamps with LeRobot dataset frames
  • Integrating YOLO bounding box data into LeRobot dataset recording as an additional observation feature
  • Recording extended datasets combining robot action data and real-time vision detection results
  • Training improved policies (Diffusion / ACT) that utilize both proprioceptive and visual inputs
  • Debugging generalization issues by proposing data augmentation strategies or multi-object configurations
  • Running trained policies on hardware for evaluation through replay and teleoperation comparison
  • Preparing the final demo video showing complete project phases: hardware setup, calibration, teleop, YOLO integration, training, deployment
  • Writing and finalizing the technical report including system architecture diagrams, method explanations, experimental results, limitations, and future work discussions
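Several items above hinge on aligning YOLO detection timestamps with LeRobot dataset frames. A minimal sketch of nearest-neighbor alignment with a tolerance window (function and parameter names are illustrative, not the repository's actual tool):

```python
import bisect

def align_detections(frame_ts, det_ts, tolerance=0.05):
    """For each frame timestamp, return the index of the nearest YOLO
    detection timestamp, or None if no detection falls within `tolerance`
    seconds. Both input lists must be sorted ascending."""
    matches = []
    for t in frame_ts:
        i = bisect.bisect_left(det_ts, t)
        # Candidates: the detection just before and just after the frame time.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(det_ts)]
        best = min(candidates, key=lambda j: abs(det_ts[j] - t), default=None)
        if best is not None and abs(det_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches
```

Frames without a sufficiently recent detection get `None`, so downstream code can skip or interpolate them rather than pair a frame with a stale bounding box.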

ESP32 Mount Assembly in CAD

System Architecture Design

Project Structure

.
├── Images/                               # Photos used in README or reports (e.g., diagrams, sample outputs)
├── LeRobot/                              # LeRobot framework for robotic arm control and data processing
├── Other Scripts/                        # Utility scripts for data preprocessing, testing, and analysis
│   ├── file_manage.py                    # Script for cleaning and managing dataset files
│   ├── label_script.py                   # Script for updating class IDs in YOLO label files
│   ├── main_coord.py                     # Script for testing camera feed, YOLO detection, and center coordinate calculation
│   ├── random_script.py                  # Script for reading class orders from YOLO models (e.g., COCO class list)
│   └── tcp_udp_photo.py                  # Script for testing photo transmission via TCP/UDP protocols
├── PC Scripts/                           # Scripts running on the PC side for processing and control
│   ├── passing_data.py                   # Script for testing data reading from JSON output by YOLO
│   ├── yolo_display_flow.py              # YOLO detection with energy-saving logic (runs YOLO only on motion)
│   ├── yolo_display.py                   # Main script for displaying video feed with YOLO detection and sending data to LeRobot
│   └── yolov8n.pt                        # YOLOv8 Nano model used for inference on the PC side
├── Report_VideoDemo/                     # Directory for formal reports and video demonstrations
├── esp32-HighRes/                        # ESP32-S3 firmware project for camera streaming and YOLO inference
│   ├── main/                             # Main application source code for ESP32 firmware
│   │   ├── CMakeLists.txt                # Build system configuration for compiling main application
│   │   ├── idf_component.yml             # ESP-IDF component metadata (dependencies, versioning)
│   │   └── main.c                        # Core firmware logic: camera initialization, streaming, inference coordination
│   ├── managed_components/espressif__esp32-camera/  # ESP32 camera driver component (library for camera support)
│   ├── CMakeLists.txt                    # Top-level build configuration for the entire firmware project
│   ├── dependencie.lock                  # Dependency lock file ensuring consistent ESP-IDF component versions
│   └── sdkconfig                         # ESP-IDF project configuration (camera settings, Wi-Fi, etc.)
├── fine-tuning/                          # Folder for fine-tuning YOLO models
│   ├── dataset/                          # Dataset directory for fine-tuning
│   │   ├── train/                        # Training dataset (images and labels)
│   │   └── valid/                        # Validation dataset (images and labels)
│   ├── fine-tuning/fine_tuned_yolov5n/   # Output directory for storing fine-tuned YOLOv5n model results
│   ├── runs/                             # YOLO training output (logs, checkpoints, metrics)
│   ├── data.yaml                         # Dataset configuration for YOLO training (paths, class names)
│   ├── fine_tune.py                      # Script to fine-tune YOLO models using Ultralytics API
│   ├── yolo11n.pt                        # Pre-trained YOLOv11 Nano model (optional/custom model)
│   ├── yolov5nu.pt                       # YOLOv5 Nano model (used for fine-tuning)
│   └── yolov8n.pt                        # YOLOv8 Nano model (used for detection/testing)
├── .gitignore                            # Specifies files and directories to be ignored by Git version control
├── LICENSE                               # Licensing information for the project
└── README.md                             # Main project documentation (setup instructions, usage, architecture)
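The center-coordinate calculation that main_coord.py performs reduces to a few lines. A self-contained sketch (function names and box format are illustrative, assuming the xyxy layout YOLO detectors commonly emit):

```python
def box_center(x1, y1, x2, y2):
    """Center of an axis-aligned bounding box given in xyxy format."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def normalize_center(cx, cy, frame_w, frame_h):
    """Scale a pixel-space center into [0, 1] so downstream control code
    is independent of the camera resolution."""
    return (cx / frame_w, cy / frame_h)
```

Normalizing by the frame size means the same control logic works whether the stream arrives at the ESP32's native resolution or the laptop webcam's.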

Dependencies Setup

1. Install Ultralytics YOLO for Object Detection (PC Side)

The following command automatically installs all required dependencies, including torch and numpy:

pip install ultralytics

2. Install ESP-IDF Framework (ESP32 Firmware Development)

Follow the official ESP-IDF installation guide for your platform:
ESP-IDF Installation Guide

Alternatively, clone and install it manually:

git clone --recursive https://github.com/espressif/esp-idf.git
cd esp-idf
./install.sh

3. Install esp_camera Module (ESP32-S3 Camera Support)

The esp_camera module is included in the ESP-IDF component registry. Ensure it's integrated into your ESP32-S3 project:

idf.py add-dependency "espressif/esp32-camera"

Running Training & Deployment

1. Configure Wi-Fi and Server IP

  • ESP32 Firmware: Set your Wi-Fi SSID and password in the firmware source code (e.g., main.c).
  • PC Scripts: Update the server_ip variable in yolo_display.py to match your ESP32's IP address.

2. Build and Flash ESP32 Firmware

idf.py set-target esp32s3
idf.py build
idf.py flash monitor

3. Run YOLO Object Detection on PC

Start the PC-side YOLO detection and video stream receiver:

python3 yolo_display.py

This script will receive the video stream from the ESP32-S3 and run YOLO inference on each frame.
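The MJPEG debugging noted in the to-do list (partial frames, send error 104) comes down to cutting complete JPEG frames out of the raw byte stream before handing them to the detector. A self-contained sketch of that framing step (not the repository's exact parser):

```python
def extract_jpeg_frames(buffer: bytes):
    """Split a raw MJPEG byte stream into complete JPEG frames.

    Returns (frames, remainder): every complete frame found so far, plus
    the trailing bytes of any partially received frame, which should be
    prepended to the next network read."""
    frames = []
    while True:
        start = buffer.find(b"\xff\xd8")            # JPEG start-of-image marker
        end = buffer.find(b"\xff\xd9", start + 2)   # JPEG end-of-image marker
        if start == -1 or end == -1:
            break  # no complete frame left in the buffer
        frames.append(buffer[start:end + 2])
        buffer = buffer[end + 2:]
    return frames, buffer
```

Keeping the remainder and re-feeding it on the next read is what makes the receiver robust to frames split across Wi-Fi packets.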

4. Run LeRobot Framework (for Data Training)

  • Follow the LeRobot documentation to set up the environment:
    LeRobot GitHub Repository
  • Use LeRobot to fine-tune or train your data as needed.
  • Our scripts in the LeRobot folder are modified to consume the YOLO data passed from the yolo_display.py script
  • Make sure to run yolo_display.py in the same directory as control_robot from LeRobot
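On the LeRobot side, the handoff amounts to reading the JSON that the YOLO script writes and rejecting stale data, in the spirit of passing_data.py. A hedged sketch with illustrative file and field names (not the repository's exact schema):

```python
import json
import time

def read_latest_detection(path="yolo_output.json", max_age=0.5):
    """Load the most recent YOLO detection payload written by the PC-side
    script, returning None if it is missing, mid-write, or older than
    `max_age` seconds, so the arm never acts on an outdated bounding box."""
    try:
        with open(path) as f:
            payload = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # writer mid-update, or detection not started yet
    if time.time() - payload["timestamp"] > max_age:
        return None  # detection too old to act on
    return payload["detections"]
```

Treating a half-written or stale file the same as "no detection" keeps the control loop simple: it only ever branches on fresh, well-formed data.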


About

Open Source Robotic Arm Manipulation System
