This work presents a robust real-time segmentation framework designed for versatility across robotic perception tasks. Our approach employs a YOLO segmentation model trained on a custom dataset tailored to humanoid soccer scenarios, covering players, the ball, field lines, boundaries, and the field itself. While the dataset was inspired by our team’s participation in RoboCup 2023 and 2024, the underlying model and system architecture generalize to a wide range of real-time segmentation tasks. Deploying with the ONNX runtime provides both high portability and strong performance on typical robot PCs, which often lack high-performance GPUs: the runtime supports efficient inference on CPUs in resource-constrained environments and exploits GPU resources when they are available. Integrated into a ROS environment, the framework enables seamless real-time perception and decision-making, showcasing its potential for diverse robotics applications beyond soccer.
Perception is a fundamental component of robotics, enabling machines to understand and interact with their environment. Effective segmentation is essential for a variety of robotics tasks, such as navigation, object manipulation, and scene understanding. This work introduces a generalized segmentation framework capable of addressing a broad spectrum of tasks, with an initial application demonstrated in humanoid robot soccer.
Our motivation stems from participation in RoboCup 2023 and 2024, where precise segmentation of key elements—such as players, the ball, field lines, and boundaries—proved essential for gameplay. Drawing from these experiences, we developed a custom dataset that reflects the challenges of dynamic and complex environments. While our dataset emphasizes soccer-related scenarios, the framework’s architecture is designed to be task-agnostic, making it adaptable to diverse robotic applications.
The YOLO segmentation model, known for its efficiency in real-time applications, was chosen as the foundation of this framework. To address the hardware limitations of typical robot PCs, which often lack high-performance GPUs, the ONNX runtime was employed for deployment. The ONNX runtime enables efficient CPU-based inference, ensuring the framework's usability on resource-constrained platforms. Additionally, it optimizes GPU utilization when available, allowing the framework to scale seamlessly across different hardware configurations. The system’s integration within a ROS environment allows real-time perception, decision-making, and task execution, demonstrating its utility across various robotics domains.
The dataset used in this work was designed to capture the complexities of dynamic environments. Focused initially on humanoid robot soccer, it includes annotated images of players, the ball, field lines, boundaries, and the field under diverse conditions such as varying lighting and occlusions. The dataset was informed by our experiences in RoboCup 2023 and 2024, ensuring relevance to real-world scenarios. Although soccer-specific elements are prominent, the dataset's structure allows adaptation for other applications by incorporating additional classes or annotations relevant to different tasks.
We selected the YOLO segmentation model due to its balance of accuracy and computational efficiency, making it well suited for real-time robotics tasks. Using transfer learning, the model was fine-tuned on our dataset to achieve robust segmentation performance; the example dataset and training process are available in the yolo_segmentation_robocup repository linked at the end of this work.
The trained YOLO model was converted to the ONNX format to take advantage of the runtime’s cross-platform compatibility and performance optimizations. The ONNX runtime was selected for its ability to maximize computational efficiency on both CPUs and GPUs. In many robotics systems, high-performance GPUs are unavailable due to cost or power constraints; the ONNX runtime delivers near-optimal inference speeds on CPUs, ensuring reliable performance on resource-limited hardware. When a GPU is available, the runtime leverages it seamlessly for even higher throughput, making the framework adaptable to a wide range of hardware environments.
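To make the deployment step concrete, the sketch below shows how an exported model might be loaded and run with the ONNX Runtime C++ API (version 1.13 or later, which provides GetInputNameAllocated). It is a minimal illustration under our own assumptions (the model path, thread count, and 640×640 input shape are placeholders), not the exact code of the released package.

```cpp
// Minimal sketch of loading and running an exported model with the
// ONNX Runtime C++ API. Model path, thread count, and input shape are
// illustrative assumptions, not the exact package configuration.
#include <onnxruntime_cxx_api.h>

#include <iostream>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "segmentation");

  Ort::SessionOptions opts;
  opts.SetIntraOpNumThreads(4);  // tune to the robot's CPU core count
  opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  // If a CUDA-capable GPU is present, a CUDA execution provider could be
  // appended here; otherwise the session falls back to the CPU provider.

  Ort::Session session(env, "yolo_seg.onnx", opts);  // hypothetical path

  // Look up the first input/output names from the model itself.
  Ort::AllocatorWithDefaultOptions alloc;
  Ort::AllocatedStringPtr in_name = session.GetInputNameAllocated(0, alloc);
  Ort::AllocatedStringPtr out_name = session.GetOutputNameAllocated(0, alloc);

  // A YOLO-style input: one 3-channel 640x640 image, normalized to [0,1].
  std::vector<int64_t> shape{1, 3, 640, 640};
  std::vector<float> input(1 * 3 * 640 * 640, 0.0f);  // stand-in pixel data
  Ort::MemoryInfo mem =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      mem, input.data(), input.size(), shape.data(), shape.size());

  // Segmentation exports usually also provide a mask-prototype output;
  // only the first output is requested here for brevity.
  const char* in_names[] = {in_name.get()};
  const char* out_names[] = {out_name.get()};
  std::vector<Ort::Value> outputs = session.Run(
      Ort::RunOptions{nullptr}, in_names, &tensor, 1, out_names, 1);

  std::cout << "output elements: "
            << outputs[0].GetTensorTypeAndShapeInfo().GetElementCount()
            << std::endl;
  return 0;
}
```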
To support real-time operation, the framework was integrated into a ROS-based system. ROS nodes handle camera input, execute the segmentation model with the ONNX runtime, and publish segmented outputs for downstream tasks such as navigation, manipulation, or gameplay strategy. To ensure low-latency, high-throughput processing, the implementation was developed in C++, minimizing overhead and meeting the stringent timing requirements of robotics tasks. The modular ROS architecture further ensures adaptability to diverse applications by allowing task-specific nodes to be modified or added easily; the full ROS package (ros_onnx_segmentation) is linked at the end of this work.
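A minimal sketch of such a node is shown below, assuming ROS 1 (roscpp) with image_transport and cv_bridge; the topic names /camera/image_raw and /segmentation/mask are hypothetical, and the released package may differ in structure.

```cpp
// Minimal sketch of a ROS 1 segmentation node: subscribe to camera images,
// run inference (stubbed out here), and publish a segmentation mask.
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <sensor_msgs/image_encodings.h>
#include <opencv2/opencv.hpp>

image_transport::Publisher mask_pub;

void imageCallback(const sensor_msgs::ImageConstPtr& msg) {
  // Convert the ROS image to an OpenCV BGR matrix without copying.
  cv_bridge::CvImageConstPtr cv_ptr =
      cv_bridge::toCvShare(msg, sensor_msgs::image_encodings::BGR8);

  // Placeholder for ONNX Runtime inference (see the sketch above):
  // preprocess cv_ptr->image, run the session, and decode the masks.
  cv::Mat mask(cv_ptr->image.size(), CV_8UC1, cv::Scalar(0));

  // Publish the mask for downstream nodes, reusing the input timestamp.
  cv_bridge::CvImage out(msg->header, sensor_msgs::image_encodings::MONO8, mask);
  mask_pub.publish(out.toImageMsg());
}

int main(int argc, char** argv) {
  ros::init(argc, argv, "onnx_segmentation_node");
  ros::NodeHandle nh;
  image_transport::ImageTransport it(nh);
  mask_pub = it.advertise("/segmentation/mask", 1);
  image_transport::Subscriber sub =
      it.subscribe("/camera/image_raw", 1, imageCallback);
  ros::spin();
  return 0;
}
```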
The framework was evaluated in simulated and real-world environments. In the soccer setting, segmentation accuracy and inference speed can be quantified with metrics such as Intersection over Union (IoU) and frames per second (FPS); as noted in the concluding remarks, the evaluation reported here focuses on inference speed and latency rather than accuracy metrics. Beyond soccer, the framework was tested on other robotic perception tasks, demonstrating its versatility and robustness. These evaluations informed iterative optimizations in model architecture and deployment strategies.
The experiments were conducted using real-world footage recorded by a humanoid robot's onboard camera. To maintain objectivity, the test video was excluded from the training dataset. A small converter node turned the recorded video into a ROS topic, streaming frames to the segmentation framework in real time; the framework ran inference with the ONNX runtime and published the results as ROS messages.
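A minimal sketch of such a converter is shown below, using OpenCV's VideoCapture with cv_bridge; the file name test_footage.mp4 and the topic name /camera/image_raw are illustrative placeholders rather than the actual node's configuration.

```cpp
// Minimal sketch of a video-to-ROS-topic converter: read frames from a
// recording and publish them at the recording's native frame rate.
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <std_msgs/Header.h>
#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
  ros::init(argc, argv, "video_publisher");
  ros::NodeHandle nh;
  image_transport::ImageTransport it(nh);
  image_transport::Publisher pub = it.advertise("/camera/image_raw", 1);

  cv::VideoCapture cap("test_footage.mp4");  // hypothetical recording
  if (!cap.isOpened()) {
    ROS_ERROR("Could not open video file");
    return 1;
  }

  // Pace publishing at the recording's frame rate (fall back to 30 FPS).
  double fps = cap.get(cv::CAP_PROP_FPS);
  ros::Rate rate(fps > 0.0 ? fps : 30.0);

  cv::Mat frame;
  while (ros::ok() && cap.read(frame)) {
    std_msgs::Header header;
    header.stamp = ros::Time::now();
    pub.publish(cv_bridge::CvImage(header, "bgr8", frame).toImageMsg());
    ros::spinOnce();
    rate.sleep();
  }
  return 0;
}
```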
The setup used a typical robotics mini PC without a dedicated GPU, reflecting the hardware constraints commonly faced in robotics applications. The experiments evaluated the system’s inference speed, latency, and ability to integrate seamlessly with ROS for real-time operation.
The experiments yielded the following key observations:
On a CPU-only setup, the ONNX runtime delivered significantly faster inference than OpenCV's DNN module.
On the mini PC without a GPU, the ONNX runtime achieved inference approximately 2–3 times faster than the DNN module, sustaining a consistent 6–7 FPS with an average inference time of roughly 100 ms per frame (a simple way to collect such timings is sketched after this list). This speed was sufficient for real-time segmentation, enabling effective operation in robotics tasks.
The system successfully published segmentation results as ROS messages in real-time, ensuring compatibility with downstream robotics applications such as navigation and manipulation.
Low-latency processing was observed, confirming the system's suitability for real-world robotic operations.
The ONNX runtime’s optimized CPU utilization allowed for efficient operation on resource-constrained hardware, making the framework ideal for deployment on typical robotics platforms.
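As referenced above, figures such as the average inference time and FPS can be collected with a simple wall-clock measurement around the inference call. The sketch below illustrates one way to do this; runInference is a hypothetical stand-in for the actual inference routine, not a function from the released package.

```cpp
// Minimal sketch of per-frame latency and FPS measurement around an
// inference call, averaged over a fixed number of frames.
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the actual ONNX Runtime inference routine.
void runInference() {}

int main() {
  using clock = std::chrono::steady_clock;
  const int kFrames = 100;
  double total_ms = 0.0;

  for (int i = 0; i < kFrames; ++i) {
    auto t0 = clock::now();
    runInference();
    auto t1 = clock::now();
    total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
  }

  double avg_ms = total_ms / kFrames;
  std::printf("avg inference: %.1f ms (%.1f FPS)\n", avg_ms, 1000.0 / avg_ms);
  return 0;
}
```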
These results validate the framework’s robustness and efficiency in real-world scenarios, particularly for applications requiring real-time perception and decision-making.
This work demonstrates a versatile and efficient real-time segmentation framework leveraging the ONNX runtime for deployment in resource-constrained robotics environments. By integrating a YOLO-based segmentation model with ROS, the system achieved real-time processing at 6–7 FPS on a CPU-only mini PC, with an average inference time of approximately 100 ms per frame, roughly 2–3 times faster than OpenCV's DNN module.
The experiments confirmed the framework’s capability to seamlessly process camera inputs, generate segmentation outputs, and publish results as ROS messages. This level of performance underscores its potential for applications beyond humanoid robot soccer, including autonomous navigation, object detection, and general scene understanding tasks.
Since this project focused primarily on adapting the segmentation model for real-time deployment in a ROS environment across robotic platforms, accuracy metrics such as precision, recall, and mean IoU (mIoU) were not the primary focus and are not included in this evaluation. Future work could optimize the system for even lower latency, extend support to additional robotic tasks, and incorporate hardware acceleration such as GPUs or TPUs to maximize inference performance where available.
ROS ONNX segmentation package: https://github.com/Leeseunghun03/ros_onnx_segmentation.git
Dataset and training code: https://github.com/Leeseunghun03/yolo_segmentation_robocup.git