In this project, we build an integrated robotic system that combines vision detection and robotic manipulation using the LeRobot framework.
Our setup includes a XIAO ESP32S3 Sense module for wireless video streaming, a standard laptop webcam, and a UVC arm-mounted camera for multi-view perception.
We first tested and deployed YOLOv8 models for object detection on the laptop side, leveraging live video feeds streamed over a local Wi-Fi network.
The robotic manipulation is performed using LeRobot so100 arms, assembled from leader and follower units.
We trained a Vision-Language-Action (VLA) model and fine-tuned YOLOv8 to operate jointly, enabling the robot to detect, reason, and act in a dynamic environment.
To further enhance the VLA model's generalization capability, we designed and implemented our own control strategies based on real-time YOLO detection results.
The complete technical steps, experimental results, and lessons learned are thoroughly discussed in our final report.
You can find the video demonstration via our YouTube video link, or click the setup photo below to watch it.
- Planning to implement YOLOv8 for vision detection and LeRobot so100 robotic arms as the actuators
- Testing YOLOv8 model performance directly on laptop with sample images and webcam video stream
- Deciding between two implementation structures:
  - Running YOLO Nano (tiny model) inference directly on the ESP32 (onboard detection)
  - Streaming video to the laptop and running full YOLOv8 detection locally
- Finalizing the choice to use laptop-side YOLOv8 inference based on model size, speed, and ESP32 capabilities
- Understanding and exploring the overall LeRobot framework structure: motor buses, control pipelines, dataset recording, training
- Searching and evaluating possible alternative motors to replace Feetech STS3215 if needed
- Waiting for hardware arrival: receiving and assembling LeRobot leader and follower arms
- Assembling the mechanical structure and wiring motors according to the so100 configuration standard
- Applying the LeRobot framework to:
  - Perform motor calibration for each joint using `control_robot.py calibrate`
  - Execute teleoperation through keyboard control and validate arm movement
  - Record teleoperation datasets using Hugging Face's dataset standard (LeRobotDataset v2)
  - Visualize dataset frames locally and via `visualize_dataset_html.py`
  - Perform basic policy training (e.g., a simple DiffusionPolicy or ACT baseline) using collected data
- Setting up XIAO ESP32S3 Sense board as Wi-Fi camera module for video streaming
- Configuring ESP32-S3 to join existing Wi-Fi network (STA mode), avoiding SoftAP creation
- Debugging MJPEG streaming issues, including partial frames and Wi-Fi packet losses (send error 104)
- Successfully setting up ESP32 camera to stream stable video feed to the laptop
- Using ESP32 camera streaming as the live video source for YOLOv8 running on the laptop
- Preparing a dataset from open-source sources and our own collected data for fine-tuning the YOLOv8 model
- Writing timestamp alignment tools to synchronize YOLO detection timestamps with LeRobot dataset frames (see the sketch after this list)
- Integrating YOLO bounding box data into LeRobot dataset recording as an additional observation feature
- Recording extended datasets combining robot action data and real-time vision detection results
- Training improved policies (Diffusion / ACT) that utilize both proprioceptive and visual inputs
- Debugging generalization issues by proposing data augmentation strategies or multi-object configurations
- Running trained policies on hardware for evaluation through replay and teleoperation comparison
- Preparing the final demo video showing complete project phases: hardware setup, calibration, teleop, YOLO integration, training, deployment
- Writing and finalizing the technical report including system architecture diagrams, method explanations, experimental results, limitations, and future work discussions
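For the timestamp-alignment item above, the sketch below illustrates one way to match each YOLO detection to the nearest recorded dataset frame by timestamp. It is a simplified illustration rather than our actual tool: the field names, tolerance value, and example data are assumptions.

```python
import bisect

def align_detections_to_frames(frame_timestamps, detections, tolerance=0.05):
    """Match each YOLO detection to the nearest dataset frame timestamp.

    frame_timestamps: sorted list of frame times in seconds (from the recorded dataset).
    detections: list of dicts with a "timestamp" key (from the YOLO side).
    tolerance: maximum allowed time difference in seconds (assumed value).
    Returns a list of (frame_index, detection) pairs for detections within tolerance.
    """
    aligned = []
    for det in detections:
        t = det["timestamp"]
        i = bisect.bisect_left(frame_timestamps, t)
        # Candidate frames: the one just before and the one just after the detection time.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_timestamps)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(frame_timestamps[j] - t))
        if abs(frame_timestamps[best] - t) <= tolerance:
            aligned.append((best, det))
    return aligned

# Example: frames recorded at 30 fps, one detection at t = 0.51 s -> matched to frame 15.
frames = [k / 30.0 for k in range(90)]
dets = [{"timestamp": 0.51, "label": "cube", "xyxy": [100, 80, 180, 160]}]
print(align_detections_to_frames(frames, dets))
```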
ESP32 Mount + Assembly of Mount in CAD
System Architecture Design
```
.
├── Images/                       # Photos used in README or reports (e.g., diagrams, sample outputs)
├── LeRobot/                      # LeRobot framework for robotic arm control and data processing
├── Other Scripts/                # Utility scripts for data preprocessing, testing, and analysis
│   ├── file_manage.py            # Script for cleaning and managing dataset files
│   ├── label_script.py           # Script for updating class IDs in YOLO label files
│   ├── main_coord.py             # Script for testing camera feed, YOLO detection, and center coordinate calculation
│   ├── random_script.py          # Script for reading class orders from YOLO models (e.g., COCO class list)
│   └── tcp_udp_photo.py          # Script for testing photo transmission via TCP/UDP protocols
├── PC Scripts/                   # Scripts running on the PC side for processing and control
│   ├── passing_data.py           # Script for testing data reading from JSON output by YOLO
│   ├── yolo_display_flow.py      # YOLO detection with energy-saving logic (runs YOLO only on motion)
│   ├── yolo_display.py           # Main script for displaying video feed with YOLO detection and sending data to LeRobot
│   └── yolov8n.pt                # YOLOv8 Nano model used for inference on the PC side
├── Report_VideoDemo/             # Directory for formal reports and video demonstrations
├── esp32-HighRes/                # ESP32-S3 firmware project for camera streaming and YOLO inference
│   ├── main/                     # Main application source code for ESP32 firmware
│   │   ├── CMakeLists.txt        # Build system configuration for compiling the main application
│   │   ├── idf_component.yml     # ESP-IDF component metadata (dependencies, versioning)
│   │   └── main.c                # Core firmware logic: camera initialization, streaming, inference coordination
│   ├── managed_components/espressif__esp32-camera/  # ESP32 camera driver component (library for camera support)
│   ├── CMakeLists.txt            # Top-level build configuration for the entire firmware project
│   ├── dependencies.lock         # Dependency lock file ensuring consistent ESP-IDF component versions
│   └── sdkconfig                 # ESP-IDF project configuration (camera settings, Wi-Fi, etc.)
├── fine-tuning/                  # Folder for fine-tuning YOLO models
│   ├── dataset/                  # Dataset directory for fine-tuning
│   │   ├── train/                # Training dataset (images and labels)
│   │   └── valid/                # Validation dataset (images and labels)
│   ├── fine-tuning/fine_tuned_yolov5n/  # Output directory for storing fine-tuned YOLOv5n model results
│   ├── runs/                     # YOLO training output (logs, checkpoints, metrics)
│   ├── data.yaml                 # Dataset configuration for YOLO training (paths, class names)
│   ├── fine_tune.py              # Script to fine-tune YOLO models using the Ultralytics API
│   ├── yolo11n.pt                # Pre-trained YOLO11 Nano model (optional/custom model)
│   ├── yolov5nu.pt               # YOLOv5 Nano model (used for fine-tuning)
│   └── yolov8n.pt                # YOLOv8 Nano model (used for detection/testing)
├── .gitignore                    # Specifies files and directories to be ignored by Git version control
├── LICENSE                       # Licensing information for the project
└── README.md                     # Main project documentation (setup instructions, usage, architecture)
```

Install the Ultralytics package on the PC side:

```bash
pip install ultralytics
```

This will automatically install all required dependencies, including torch, numpy, and others.
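As a quick sanity check that the installation works (a minimal sketch, not part of this repository), you can run the YOLOv8 Nano model on a single frame from the laptop webcam; the webcam index 0 is an assumption:

```python
import cv2
from ultralytics import YOLO

# Quick verification sketch: grab one frame from the default webcam and run YOLOv8n on it.
model = YOLO("yolov8n.pt")       # downloads the model automatically if it is not present
cap = cv2.VideoCapture(0)        # default laptop webcam (assumed index)
ok, frame = cap.read()
cap.release()

if ok:
    results = model(frame)       # returns a list with one Results object
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        conf = float(box.conf[0])
        print(results[0].names[cls_id], f"{conf:.2f}", box.xyxy[0].tolist())
else:
    print("Could not read a frame from the webcam")
```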
Follow the official ESP-IDF installation guide for your platform:
ESP-IDF Installation Guide
Alternatively, install it from source:
```bash
git clone --recursive https://github.com/espressif/esp-idf.git
cd esp-idf
./install.sh
```

The esp_camera module is included in the ESP-IDF component registry. Ensure it's integrated into your ESP32-S3 project:
```bash
idf.py add-dependency esp32-camera
```

- ESP32 Firmware: Set your Wi-Fi SSID and password in the firmware source code (e.g., `main.c`).
- PC Scripts: Update the `server_ip` variable in `yolo_display.py` to match your ESP32's IP address.
Build, flash, and monitor the ESP32-S3 firmware:

```bash
idf.py set-target esp32s3
idf.py build
idf.py flash monitor
```

Start the PC-side YOLO detection and video stream receiver:
```bash
python3 yolo_display.py
```

This script will receive the video stream from the ESP32-S3 and run YOLO inference on each frame.
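For orientation, the sketch below shows what such a PC-side loop can look like: it reads the MJPEG stream over Wi-Fi, runs YOLOv8 Nano on each frame, displays the annotated feed, and writes the latest detections to a JSON file for the robot-control side. The IP address, stream port/path, and the `detections.json` hand-off file are assumptions; the actual implementation lives in `yolo_display.py`.

```python
import json
import cv2
from ultralytics import YOLO

ESP32_IP = "192.168.1.50"                    # placeholder: set to your ESP32-S3's IP (server_ip)
STREAM_URL = f"http://{ESP32_IP}:81/stream"  # assumed MJPEG endpoint; adjust to your firmware
DETECTIONS_JSON = "detections.json"          # hypothetical hand-off file read by the LeRobot side

model = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(STREAM_URL)           # OpenCV can decode MJPEG-over-HTTP streams

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        continue                             # skip partial or dropped frames
    results = model(frame, verbose=False)[0]
    detections = [
        {
            "label": results.names[int(box.cls[0])],
            "conf": float(box.conf[0]),
            "xyxy": [float(v) for v in box.xyxy[0]],
        }
        for box in results.boxes
    ]
    with open(DETECTIONS_JSON, "w") as f:    # overwrite with the latest frame's detections
        json.dump(detections, f)
    cv2.imshow("YOLO on ESP32 stream", results.plot())
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```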
- Follow the LeRobot documentation to set up the environment: LeRobot GitHub Repository
- Use LeRobot to fine-tune or train on your own data as needed.
- Our scripts in the LeRobot folder are modified to work with the YOLO detection data passed from the yolo_display.py script.
- Make sure to run yolo_display.py in the same directory as control_robot.py from LeRobot.
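For reference, here is a minimal sketch of how the detection data could be consumed on the control side, assuming the JSON hand-off format from the sketch above. The file name, field names, and the "cube" label are assumptions; the actual logic lives in our modified LeRobot scripts and in `passing_data.py`.

```python
import json
from pathlib import Path

DETECTIONS_JSON = Path("detections.json")   # hypothetical hand-off file written by the YOLO side

def read_latest_detections():
    """Return the most recent YOLO detections, or an empty list if none are available yet."""
    if not DETECTIONS_JSON.exists():
        return []
    try:
        return json.loads(DETECTIONS_JSON.read_text())
    except json.JSONDecodeError:
        # The file may be mid-write; treat it as "no new data" and retry on the next cycle.
        return []

def target_center(detections, label="cube"):
    """Pick the highest-confidence detection of `label` and return its bounding-box center."""
    candidates = [d for d in detections if d["label"] == label]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["conf"])
    x1, y1, x2, y2 = best["xyxy"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

if __name__ == "__main__":
    center = target_center(read_latest_detections())
    print("target center:", center)
```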

