This project focuses on developing a 3D object detection system using a single (monocular) camera. Leveraging state-of-the-art models like YOLO for object detection and transformers for depth estimation, this system visualizes detected objects in both a conventional 2D view and a bird’s-eye-view (BEV) with depth perception. The solution includes tracking objects across frames and using depth estimation to provide a sense of 3D spatial awareness.
3D object detection is crucial in fields like autonomous driving, robotics, and augmented reality, where understanding spatial relationships is essential. Traditionally, stereo cameras or LiDAR are used for depth perception. However, these methods can be expensive or cumbersome. This project addresses the challenge by using a single RGB camera to estimate depth while detecting and tracking objects.
Key features of the project :
Object detection is performed using a YOLOv8 model, a lightweight and efficient neural network architecture known for its real-time capabilities. The object detection step identifies relevant objects within a video frame and estimates their bounding boxes.
# Initialization in object_detector.py from ultralytics import YOLO class ObjectDetector: def __init__(self): self.model = YOLO('yolov8n.pt') # Load YOLOv8 model
Relevant Classes: Only certain classes (e.g., persons, vehicles) are considered for tracking, ensuring that irrelevant objects do not clutter the visualization.
Object tracking is achieved by calculating the centroids of detected objects. If a detected object's centroid in the current frame is within a threshold distance of a tracked object's centroid from the previous frame, the object is considered to be the same.
# Tracking objects across frames def track_objects(self, detections): centroid_curr_frame = [(det['centroid'], det['class']) for det in detections] # Update tracks if previous frame centroids are available for obj_id, (prev_centroid, class_name) in self.tracking_objects.copy().items(): # Calculate distance to determine if it's the same object dist = np.hypot(prev_centroid[0] - centroid[0], prev_centroid[1] - centroid[1]) if dist < 50: # Distance threshold self.tracking_objects[obj_id] = (centroid, curr_class)
Depth estimation is handled using a transformer-based model from the transformers library. This model, trained on large datasets, estimates depth from a single image. The output is a depth map, which is normalized and visualized in color.
# Depth Estimation from transformers import AutoImageProcessor, AutoModelForDepthEstimation model_name = "Intel/dpt-hybrid-midas" self.processor = AutoImageProcessor.from_pretrained(model_name) self.depth_model = AutoModelForDepthEstimation.from_pretrained(model_name) self.depth_model.to(self.device) # Use GPU if available
The estimated depth is visualized using a color map, with closer objects appearing in warmer colors and distant objects in cooler tones.
# 3D Bounding Box Drawing def draw_3d_box(self, frame, bbox): # 3D effect parameters depth = 20 # Depth parameter for 3D effect # Draw bounding box for i in range(4): cv2.line(frame, bottom_rect[i], top_rect[i], (255, 0, 0), 1)
pip install -r requirements.txt
python main.py
The YOLOv8 model by Ultralytics demonstrated high accuracy in detecting and classifying objects, achieving an average precision above 85% for common objects like cars and pedestrians.
The tracking system maintained stable IDs for moving objects across frames, effectively distinguishing between objects even when they moved closely.
The depth estimation provided a clear visualization of the scene’s depth structure, with closer objects clearly distinguishable from further ones.
The BEV (Bird's-Eye View) visualization was particularly helpful for understanding object placement in the 3D space, enhancing spatial awareness.
This project successfully combines object detection, tracking, and depth estimation using monocular images, demonstrating that effective 3D visualization can be achieved without expensive stereo cameras or LiDAR systems. The system can be further improved by:
This approach is promising for applications requiring real-time processing and spatial awareness, such as robotics and autonomous navigation.
There are no models linked
There are no datasets linked
There are no datasets linked