Object detection and tracking are crucial components in the development of various applications and research endeavors within the computer science and robotics community. However, the diverse shapes and appearances of real-world objects, as well as the dynamic nature of scenes, pose significant challenges for these tasks. Existing object detection and tracking methods often require extensive data annotation and model re-training when applied to new objects or environments, diverting valuable time and resources from the primary research objectives. In this paper, we present IST-ROS, Interactive Segmentation and Tracking for ROS, a software solution that leverages the capabilities of the Segment Anything Model (SAM) and semi-supervised video object segmentation methods to enable flexible and efficient object segmentation and tracking. Its graphical interface allows interactive object selection and segmentation using various prompts, while integrated tracking ensures robust performance even under occlusions and object interactions. By providing a flexible solution for object segmentation and tracking, IST-ROS aims to facilitate rapid prototyping and advancement of robotics applications.
Article DOI: https://doi.org/10.1016/j.softx.2024.101979
Object detection and tracking are indispensable for computer vision and robotics applications, forming the basis for tasks like object manipulation, human–robot interaction, and augmented reality. However, the diverse appearances of real-world objects and dynamic scenes often pose challenges, prompting researchers to invest significant time into specialized detection and tracking techniques. This can divert attention from core research goals, especially when continuously streaming sensor data must be processed in real-time under varying lighting conditions and resolutions.
Video Object Segmentation (VOS) expands on detection and tracking by focusing on accurate object segmentation over time, handling occlusions, temporary disappearances, and shape changes. In parallel, automatic key-frame segmentation tools have emerged, simplifying object segmentation even in interactive scenarios, yet they often lack a temporal component to handle evolving scenes or multiple targets. To address these challenges, we present IST-ROS: Interactive Segmentation and Tracking for ROS. By integrating robust segmentation and real-time VOS capabilities, IST-ROS streamlines object detection and tracking without additional data collection or training. The modular, user-friendly design and included GUI encourage customization for diverse robotics applications, delivering precise segmentations across multiple targets even under partial occlusion or temporary disappearance.
IST-ROS employs a multi-threaded design, separating command, image, and GUI operations for real-time performance. Users interact through a GUI to select objects by drawing bounding boxes and refine the segmentation until it meets their needs. The segmented masks, generated using an interactive segmentation model, are passed to a memory-based VOS algorithm that assigns persistent IDs and maintains object identities even under partial occlusions. This method ensures robust and efficient tracking by storing only relevant features in memory, allowing smooth operation on diverse hardware configurations and real-time streams. The system also provides an offline script to annotate and process video data for later analysis or model training.
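To make this flow concrete, the sketch below mirrors the pipeline in plain rospy: a bounding-box prompt (supplied by the GUI in IST-ROS, hard-coded here) seeds SAM with an initial mask, and subsequent frames are handed to the memory-based tracker. The topic name, checkpoint path, and the `track_step` placeholder are illustrative assumptions, not the shipped implementation; in IST-ROS the tracking step is performed by XMem.

```python
#!/usr/bin/env python
"""Minimal sketch of the IST-ROS processing flow (not the shipped code):
an initial SAM mask seeded by a bounding-box prompt, then per-frame
propagation by a memory-based VOS tracker (XMem in IST-ROS)."""
import numpy as np
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from segment_anything import sam_model_registry, SamPredictor

bridge = CvBridge()
# Model size and checkpoint path are assumptions for this sketch.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
state = {"mask": None}

def segment_with_box(frame, box):
    """Run SAM once with a user-drawn (x0, y0, x1, y1) box prompt."""
    predictor.set_image(frame)
    masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
    return masks[0]

def track_step(frame, prev_mask):
    """Placeholder for the memory-based VOS update. IST-ROS calls XMem
    here; this stub simply propagates the previous mask unchanged."""
    return prev_mask

def on_image(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
    if state["mask"] is None:
        # In IST-ROS the box comes from the GUI; hard-coded for brevity.
        state["mask"] = segment_with_box(frame, [100, 100, 300, 300])
    else:
        state["mask"] = track_step(frame, state["mask"])

rospy.init_node("ist_ros_pipeline_sketch")
rospy.Subscriber("/camera/color/image_raw", Image, on_image, queue_size=1)
rospy.spin()
```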
In robotic teleoperation scenarios, particularly object manipulation tasks, the detection system must handle partial occlusions and maintain distinct object identities despite dynamic interactions. As shown in the figure below, we demonstrate IST-ROS's performance across three diverse setups: inserting a credit card into a wallet with a parallel gripper, manipulating a toy plane with two 5-DoF robotic hands, and a suturing task with 3-DoF forceps. Analysis of failure cases indicates that consistent tracking becomes challenging when objects overlap, disappear temporarily, or vary significantly in size (e.g., a small suture needle). Nevertheless, IST-ROS effectively differentiates the end effector from the target in most frames. When errors do occur, such as part of a suture thread being tracked as the needle, users can pause and refine the selection in the GUI. This interactive feedback loop enhances performance in real-world tasks and streamlines data collection for further system development.
Figure 1: GUI overview: the top row shows various user prompts for target selection, while the bottom row displays the corresponding SAM-based segmentation results. (a) Target selection using foreground points. (b) Target selection using a combination of foreground and background points. (c) Target selection using a bounding box. (d) Target selection using a combination of bounding box and point prompts.
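The prompt types in Figure 1 map directly onto SAM's predictor interface. The sketch below shows all four combinations; the frame and pixel coordinates are illustrative stand-ins, not values from the paper.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # path is an assumption
predictor = SamPredictor(sam)

rgb_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
predictor.set_image(rgb_frame)

# (a) Foreground points only (label 1 = foreground).
masks, _, _ = predictor.predict(
    point_coords=np.array([[320, 240]]), point_labels=np.array([1]))

# (b) Foreground plus background points (label 0 = background).
masks, _, _ = predictor.predict(
    point_coords=np.array([[320, 240], [100, 80]]),
    point_labels=np.array([1, 0]))

# (c) Bounding box only, in (x0, y0, x1, y1) pixel coordinates.
masks, _, _ = predictor.predict(box=np.array([200, 150, 440, 330]))

# (d) Bounding box combined with a refining foreground point.
masks, _, _ = predictor.predict(
    point_coords=np.array([[320, 240]]), point_labels=np.array([1]),
    box=np.array([200, 150, 440, 330]))
```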
Figure 2: Visualization examples showing results with different output options: image masks only, barycenters (centers) for each segmented mask, or both masks and barycenters together. These visualizations help evaluate tracking performance and provide user feedback, enabling selection or re-selection of targets through the GUI.
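The barycenter option reduces each mask to a single point, the mean of its foreground pixel coordinates, which is convenient for downstream control or logging. A minimal numpy sketch of that computation (an illustration, not the shipped implementation):

```python
import numpy as np

def mask_barycenter(mask: np.ndarray):
    """Return the (x, y) barycenter of a binary mask, or None if empty.
    The barycenter is the mean pixel coordinate of the foreground."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # object fully occluded or out of frame
    return float(xs.mean()), float(ys.mean())

# Example: a 5x5 square region centered at pixel (4, 4).
mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 2:7] = True
print(mask_barycenter(mask))  # -> (4.0, 4.0)
```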
Figure 3: Robotic teleoperation use case scenarios. The first column shows target selection using bounding boxes and selection prompts: the red bounding box corresponds to the first target, blue to the second, and white to the third. The second column illustrates the initial mask output generated from the GUI prompts, which is used for the subsequent segmentation and tracking shown in the third through fifth columns.
The current performance of the IST-ROS framework depends significantly on the precision of the user's initial mask selection via the GUI. When the selection is off-target, or when objects are small or of low resolution, segmentation quality degrades and tracking accuracy suffers. Because the initial segmentation seeds the tracker, such errors cascade through the entire tracking process.
IST-ROS currently relies on two pre-trained models: SAM for interactive segmentation and XMem for VOS. Adding support for further VOS models would improve the system's adaptability, while enabling additional GUI prompts (text input, scribbles, or free-drawing) could refine mask selection and enhance user interaction. Looking ahead, extending compatibility to ROS 2 is crucial for meeting modern robotics requirements.
IST-ROS aims to offer researchers and developers a straightforward solution for object segmentation and tracking, minimizing the time spent on data annotation and model retraining. Its real-time processing, modular structure, and easy integration with ROS make it well-suited for numerous robotic tasks. By leveraging SAM and XMem, the system delivers near state-of-the-art results in real time, maintaining the interactive capabilities essential for efficient development.
This publication is based on the following article: Khusniddin Fozilov, Yutaro Yamada, Jacinto Colan, Yaonan Zhu, Yasuhisa Hasegawa, IST-ROS: A flexible object segmentation and tracking framework for robotics applications, SoftwareX, Volume 29, 2025, 101979, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2024.101979.