Nov 16, 2024●17 reads●MIT License

3D Object Detection with Monocular Depth Visualization

a
Ivan Apedo

Abstract

This project focuses on developing a 3D object detection system using a single (monocular) camera. Leveraging state-of-the-art models like YOLO for object detection and transformers for depth estimation, this system visualizes detected objects in both a conventional 2D view and a bird’s-eye-view (BEV) with depth perception. The solution includes tracking objects across frames and using depth estimation to provide a sense of 3D spatial awareness.

Introduction

3D object detection is crucial in fields like autonomous driving, robotics, and augmented reality, where understanding spatial relationships is essential. Traditionally, stereo cameras or LiDAR are used for depth perception. However, these methods can be expensive or cumbersome. This project addresses the challenge by using a single RGB camera to estimate depth while detecting and tracking objects.

Key features of the project :

Object Detection using a pre-trained YOLO model.
Object Tracking across frames to maintain object consistency.
Depth Estimation using a transformer-based depth estimation model.
3D Visualization with bounding boxes and a BEV visualization to illustrate object placement in a 3D space.

Methodology

Object Detection

Object detection is performed using a YOLOv8 model, a lightweight and efficient neural network architecture known for its real-time capabilities. The object detection step identifies relevant objects within a video frame and estimates their bounding boxes.

# Initialization in object_detector.py
from ultralytics import YOLO

class ObjectDetector:
    def __init__(self):
        self.model = YOLO('yolov8n.pt')  # Load YOLOv8 model

Relevant Classes: Only certain classes (e.g., persons, vehicles) are considered for tracking, ensuring that irrelevant objects do not clutter the visualization.

Object Tracking

Object tracking is achieved by calculating the centroids of detected objects. If a detected object's centroid in the current frame is within a threshold distance of a tracked object's centroid from the previous frame, the object is considered to be the same.

# Tracking objects across frames
def track_objects(self, detections):
    centroid_curr_frame = [(det['centroid'], det['class']) for det in detections]

    # Update tracks if previous frame centroids are available
    for obj_id, (prev_centroid, class_name) in self.tracking_objects.copy().items():
        # Calculate distance to determine if it's the same object
        dist = np.hypot(prev_centroid[0] - centroid[0], prev_centroid[1] - centroid[1])
        if dist < 50:  # Distance threshold
            self.tracking_objects[obj_id] = (centroid, curr_class)

Depth Estimation

Depth estimation is handled using a transformer-based model from the transformers library. This model, trained on large datasets, estimates depth from a single image. The output is a depth map, which is normalized and visualized in color.

# Depth Estimation
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_name = "Intel/dpt-hybrid-midas"
self.processor = AutoImageProcessor.from_pretrained(model_name)
self.depth_model = AutoModelForDepthEstimation.from_pretrained(model_name)
self.depth_model.to(self.device)  # Use GPU if available

The estimated depth is visualized using a color map, with closer objects appearing in warmer colors and distant objects in cooler tones.

Visualization

3D Bounding Boxes: Detected objects are annotated with a 3D bounding box using OpenCV, adding a sense of depth.
Bird's-Eye View (BEV): The depth map is used to visualize the scene from a top-down perspective, giving a better understanding of the spatial layout of objects.

# 3D Bounding Box Drawing
def draw_3d_box(self, frame, bbox):
    # 3D effect parameters
    depth = 20  # Depth parameter for 3D effect
    # Draw bounding box
    for i in range(4):
        cv2.line(frame, bottom_rect[i], top_rect[i], (255, 0, 0), 1)

Experiments

Dataset

The primary source of input data is a video file or a live webcam feed.
The detection, tracking, and depth estimation algorithms were tested on various scenes with object densities, and movement complexities.

To replicate the experiment :

Install the required dependencies :

pip install -r requirements.txt

Run :

python main.py

Results

Object Detection

The YOLOv8 model by Ultralytics demonstrated high accuracy in detecting and classifying objects, achieving an average precision above 85% for common objects like cars and pedestrians.

Object Tracking

The tracking system maintained stable IDs for moving objects across frames, effectively distinguishing between objects even when they moved closely.

Depth Visualization

The depth estimation provided a clear visualization of the scene’s depth structure, with closer objects clearly distinguishable from further ones.

The BEV (Bird's-Eye View) visualization was particularly helpful for understanding object placement in the 3D space, enhancing spatial awareness.

Demo

Conclusion

This project successfully combines object detection, tracking, and depth estimation using monocular images, demonstrating that effective 3D visualization can be achieved without expensive stereo cameras or LiDAR systems. The system can be further improved by:

Enhancing the tracking algorithm with more sophisticated methods like Kalman Filters.
Optimizing the depth estimation for faster processing.
Integrating more visualization options, such as 3D object models instead of simple bounding boxes.

This approach is promising for applications requiring real-time processing and spatial awareness, such as robotics and autonomous navigation.

References

YOLOv8
Intel DPT Hybrid Depth Estimation Model : MiDaS
OpenCV

Models

There are no models linked

Datasets

There are no datasets linked

Files

Nov 16, 2024●17 reads●MIT License

3D Object Detection with Monocular Depth Visualization

a
Ivan Apedo

Abstract

Introduction

Key features of the project :

Object Detection using a pre-trained YOLO model.
Object Tracking across frames to maintain object consistency.
Depth Estimation using a transformer-based depth estimation model.
3D Visualization with bounding boxes and a BEV visualization to illustrate object placement in a 3D space.

Methodology

Object Detection

# Initialization in object_detector.py
from ultralytics import YOLO

class ObjectDetector:
    def __init__(self):
        self.model = YOLO('yolov8n.pt')  # Load YOLOv8 model

Relevant Classes: Only certain classes (e.g., persons, vehicles) are considered for tracking, ensuring that irrelevant objects do not clutter the visualization.

Object Tracking

# Tracking objects across frames
def track_objects(self, detections):
    centroid_curr_frame = [(det['centroid'], det['class']) for det in detections]

    # Update tracks if previous frame centroids are available
    for obj_id, (prev_centroid, class_name) in self.tracking_objects.copy().items():
        # Calculate distance to determine if it's the same object
        dist = np.hypot(prev_centroid[0] - centroid[0], prev_centroid[1] - centroid[1])
        if dist < 50:  # Distance threshold
            self.tracking_objects[obj_id] = (centroid, curr_class)

Depth Estimation

# Depth Estimation
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_name = "Intel/dpt-hybrid-midas"
self.processor = AutoImageProcessor.from_pretrained(model_name)
self.depth_model = AutoModelForDepthEstimation.from_pretrained(model_name)
self.depth_model.to(self.device)  # Use GPU if available

The estimated depth is visualized using a color map, with closer objects appearing in warmer colors and distant objects in cooler tones.

Visualization

3D Bounding Boxes: Detected objects are annotated with a 3D bounding box using OpenCV, adding a sense of depth.
Bird's-Eye View (BEV): The depth map is used to visualize the scene from a top-down perspective, giving a better understanding of the spatial layout of objects.

# 3D Bounding Box Drawing
def draw_3d_box(self, frame, bbox):
    # 3D effect parameters
    depth = 20  # Depth parameter for 3D effect
    # Draw bounding box
    for i in range(4):
        cv2.line(frame, bottom_rect[i], top_rect[i], (255, 0, 0), 1)

Experiments

Dataset

The primary source of input data is a video file or a live webcam feed.
The detection, tracking, and depth estimation algorithms were tested on various scenes with object densities, and movement complexities.

To replicate the experiment :

Install the required dependencies :

pip install -r requirements.txt

Run :

python main.py

Results

Object Detection

The YOLOv8 model by Ultralytics demonstrated high accuracy in detecting and classifying objects, achieving an average precision above 85% for common objects like cars and pedestrians.

Object Tracking

The tracking system maintained stable IDs for moving objects across frames, effectively distinguishing between objects even when they moved closely.

Depth Visualization

The depth estimation provided a clear visualization of the scene’s depth structure, with closer objects clearly distinguishable from further ones.

The BEV (Bird's-Eye View) visualization was particularly helpful for understanding object placement in the 3D space, enhancing spatial awareness.

Demo

Conclusion

Enhancing the tracking algorithm with more sophisticated methods like Kalman Filters.
Optimizing the depth estimation for faster processing.
Integrating more visualization options, such as 3D object models instead of simple bounding boxes.

This approach is promising for applications requiring real-time processing and spatial awareness, such as robotics and autonomous navigation.

References

YOLOv8
Intel DPT Hybrid Depth Estimation Model : MiDaS
OpenCV

Models

There are no models linked

Datasets

There are no datasets linked