The goal of this project is to implement real-time video filters on an Nvidia Jetson Nano embedded device. The work involves applying a range of image processing methods, each chosen to achieve a specific task.
We aimed to support the following video filters in our real-time interface:
1. Background Blur
2. Background replacement
3. Face Distortion
4. Face Filter/ Replacement
5. Creative Filter
Through this project, we study various aspects of embedded systems and image processing. During the implementation phase we experimented with different operations applied to video frames to achieve specific visual effects and enhance particular characteristics.
This project is demonstrated on a Jetson kit containing all the required hardware. We flash a pre-built Nano image based on the JetPack OS, a customized operating system that bundles the important software libraries.
The Jetson Nano is equipped with a powerful Maxwell architecture GPU, featuring 128 CUDA cores. CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by Nvidia, and the inclusion of CUDA cores in the GPU enables parallel processing, a key aspect for handling computationally intensive tasks.
The Jetson Nano is equipped with 4GB of LPDDR4 RAM, providing ample memory for data storage and retrieval during processing. The memory architecture is designed to support the high-throughput demands of parallel processing, ensuring efficient handling of large datasets.
We flashed the pre-built JetPack OS image onto an SD card using balenaEtcher (https://etcher.balena.io/) and used it as the baseline image for our DIY implementation. The environment came set up with Python 3.6.9, OpenCV 4.5.3 built with CUDA support, the Torch libraries for YOLOv5, and example scripts.
OpenCV, a widely used computer vision library, is configured with CUDA support on the Nvidia Jetson Nano. CUDA is a parallel computing platform and application programming interface created by Nvidia; when integrated with OpenCV, it enables accelerated processing on the GPU. This means image and video processing tasks can leverage the parallel processing capabilities of the Nvidia GPU, resulting in faster and more efficient computations. We used the OpenCV DNN module for our real-time processing use cases.
The jetson_inference library is specifically designed for Nvidia Jetson platforms, providing a collection of pre-trained deep learning models and utilities. This library facilitates tasks such as image classification, object detection, and segmentation. By leveraging jetson_inference, developers can streamline the implementation of advanced AI functionalities, saving time and effort in model development.
GStreamer is a powerful and flexible open-source multimedia framework that facilitates the construction of pipelines for handling multimedia data, including audio and video. In this project, GStreamer is utilized to enhance the video processing workflow on the Nvidia Jetson Nano.
Let's break down the pipeline string (a representative full string is sketched after this list):
1. nvarguscamerasrc: This is the source element for Nvidia cameras. It captures video frames from Nvidia camera devices (e.g., the CSI camera on the Nano).
2. video/x-raw(memory:NVMM): Specifies that the raw video frames reside in NVMM (Nvidia Memory Management) memory. This format is optimized for Nvidia hardware.
3. width, height, format, framerate: Sets the width, height, pixel format (NV12), and framerate of the captured video frames. The values are provided as placeholders to be replaced by actual values during runtime.
4. nvvidconv flip-method: Applies video conversion and a flip operation. flip-method determines the flip operation to be applied to the video frames.
5. video/x-raw, width, height, format: Specifies the format for the video frames after the flip operation. In this case, it's set to BGRx.
6. videoconvert: Converts the video frames to a different format if necessary.
7. video/x-raw, format: Sets the final format for the video frames as BGR.
8. appsink drop=True: Configures the sink element as an appsink, which allows the application to receive the processed video frames. The drop=True parameter indicates that if a buffer cannot be pushed to the appsink in time, it is dropped.
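The sketch below shows a representative pipeline string built from these elements and handed to OpenCV via cv2.VideoCapture. The resolution, frame rate, and flip method used here are illustrative placeholders, not necessarily the values from our final configuration.

```python
import cv2

def gstreamer_pipeline(width=1280, height=720, fps=30, flip=0):
    # Representative pipeline string assembling the elements described above.
    return (
        f"nvarguscamerasrc ! "
        f"video/x-raw(memory:NVMM), width={width}, height={height}, "
        f"format=NV12, framerate={fps}/1 ! "
        f"nvvidconv flip-method={flip} ! "
        f"video/x-raw, width={width}, height={height}, format=BGRx ! "
        f"videoconvert ! "
        f"video/x-raw, format=BGR ! "
        f"appsink drop=True"
    )

# OpenCV reads frames from the appsink as ordinary BGR numpy arrays.
cap = cv2.VideoCapture(gstreamer_pipeline(), cv2.CAP_GSTREAMER)
ok, frame = cap.read()
```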
Person segmentation is a critical component of the system, and for this purpose we employ the OpenCV DNN module together with a TensorFlow-based Mask R-CNN model. This model, pretrained on the COCO dataset, follows a two-stage architecture that extends the Faster R-CNN framework, predicting bounding boxes and pixel-level segmentation masks simultaneously. We used the Inception v2 backbone variant of the model, which keeps inference cost manageable on embedded hardware. The implementation relies on the OpenCV DNN module, enabling seamless integration and efficient deployment.
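To illustrate how this model is wired up, the sketch below loads the frozen TensorFlow graph through the OpenCV DNN module. The file names and output layer names follow the standard OpenCV Mask R-CNN sample and are assumptions here, not a verbatim copy of our script.

```python
import cv2

# File names assumed to match the OpenCV Mask R-CNN sample assets.
net = cv2.dnn.readNetFromTensorflow(
    "frozen_inference_graph.pb",
    "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # run inference on the Nano's GPU
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

frame = cv2.imread("frame.jpg")
blob = cv2.dnn.blobFromImage(frame, swapRB=True, crop=False)
net.setInput(blob)
# Two outputs: detection boxes and per-detection mask logits.
boxes, masks = net.forward(["detection_out_final", "detection_masks"])
```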
As part of our person segmentation strategy, we explored the semantic segmentation support in jetson-inference (its segNet class), which assigns a class label to each pixel in the input image, making it a powerful tool for semantic segmentation. For our application, we used the fcn-resnet18-voc-320x320 model with pretrained weights trained on the Pascal VOC dataset, encompassing 21 classes. The deployment is optimized through the Jetson Inference framework, which applies TensorRT optimization for Jetson devices. This ensures real-time inference capabilities, a crucial aspect for dynamic video streams.
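A minimal sketch of this deployment path, assuming the jetson-inference Python API and a CSI camera reachable at csi://0 (newer releases expose the same classes under jetson_inference/jetson_utils), looks roughly like this:

```python
import jetson.inference
import jetson.utils

# Model name taken from the text; TensorRT builds an optimized engine on first load.
net = jetson.inference.segNet("fcn-resnet18-voc-320x320")
camera = jetson.utils.videoSource("csi://0")        # CSI camera URI (assumed)
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()
    # Separate buffer for the class-colored overlay output.
    overlay = jetson.utils.cudaAllocMapped(width=img.width, height=img.height,
                                           format=img.format)
    net.Process(img)       # run segmentation inference
    net.Overlay(overlay)   # render the input blended with the class mask
    display.Render(overlay)
```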
Central to our system is the PP-Human Segmentation model, a part of the OpenCV DNN module. This model is specifically chosen for its effectiveness in real-time scenarios, accurately delineating individuals from complex backgrounds. Leveraging a DNN-based approach, we efficiently use pre-trained neural network models for rapid and accurate person segmentation within video streams. The PP-Human Segmentation model serves as the foundational step for subsequent background and foreground segmentation. By quantizing the model to ONNX runtime, we achieve a high accuracy of 0.9581 and mIoU of 0.8996, ensuring reliable performance in our system.
Object Detection using Haar feature-based cascade classifiers, introduced by Paul Viola and Michael Jones in their 2001 paper, Rapid Object Detection using a Boosted Cascade of Simple Features, is a robust machine learning approach. The method involves training a cascade function with a substantial dataset of positive and negative samples, where Haar features, akin to convolutional kernels, are extracted as the initial step. These Haar features represent single values obtained by subtracting the sum of pixels under a white rectangle from the sum of pixels under a black rectangle. The algorithm explores all possible sizes and locations for each kernel, and integral images are introduced to streamline the process of finding the sum of pixels under different rectangles, ensuring computational efficiency even with a large number of pixels.
With over 160,000 features generated, AdaBoost is employed to determine the significance of each feature, addressing the challenge of selecting the most relevant ones. To reduce the computational load, a cascade of classifiers organizes the selected features into stages, so that most image regions are rejected early and only promising ones reach the later, more expensive stages. The combination of Haar features, integral images, and the staged cascade gives the detector the speed and accuracy that make it well suited to real-time face detection, as sketched below.
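In practice, OpenCV ships pre-trained Haar cascades, and face detection reduces to a few calls; the sketch below uses the default frontal-face cascade (the bundled path is an assumption about the local OpenCV build).

```python
import cv2

# Pre-trained frontal-face cascade shipped with OpenCV (path assumed).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Returns one (x, y, w, h) rectangle per detected face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```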
The input frame is first converted from the BGR color space to the RGB color space using cv2.cvtColor. This is a common preprocessing step before the camera frame is passed to the model.
The resized image (image) is obtained by resizing the RGB frame to a fixed size of 192x192 pixels using cv2.resize. This step standardizes the input size for further processing.
The resized image is then fed into the pre-trained PP-Human Segmentation model for inference, performing image segmentation to identify the region of interest (the person). A sketch of this sequence follows.
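This is a minimal sketch of the mask-generation step, assuming the PP-HumanSeg ONNX model from the OpenCV model zoo (the file name, normalization constants, and output shape are assumptions):

```python
import cv2
import numpy as np

# ONNX model file name assumed from the OpenCV model zoo.
net = cv2.dnn.readNet("human_segmentation_pphumanseg_2021oct.onnx")

def person_mask(frame):
    # BGR -> RGB, then resize to the model's fixed 192x192 input.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    image = cv2.resize(rgb, (192, 192))
    # Normalize roughly to [-1, 1] (normalization constants assumed).
    blob = cv2.dnn.blobFromImage(image, scalefactor=1.0 / 127.5,
                                 mean=(127.5, 127.5, 127.5))
    net.setInput(blob)
    out = net.forward()            # two-class score map (shape assumed 1x2x192x192)
    result = (out[0].argmax(axis=0) * 255).astype(np.uint8)
    # Scale the 192x192 mask back up to the original frame size.
    return cv2.resize(result, (frame.shape[1], frame.shape[0]))
```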
1. Thresholding with Otsu's method is common to both background replacement and background blur: cv2.threshold(result, 100, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU) produces the binary mask (binary_mask), whose datatype is converted with astype(np.uint8) for faster calculations.
2. The binary mask is then merged into a 3-channel mask so that it can be applied to the BGR frame: cv2.merge([binary_mask, binary_mask, binary_mask]).
Prior to applying the thresholded mask, the original frame is subjected to a Gaussian blur (cv2.GaussianBlur) with a kernel size of (91, 91). This blurs the entire frame (blur_input), but the effect is applied selectively through the binary mask.
The original frame is bitwise ANDed with the binary mask (cv2.bitwise_and(frame, mask)), keeping only the person region sharp. The complement of the mask then extracts the blurred background from the Gaussian-blurred frame (cv2.bitwise_and(blur_input, cv2.bitwise_not(mask))). The final blurred frame is obtained by adding the two results together (cv2.add(result_blur_frame, RS2_blur)). A compact sketch follows.
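Putting these steps together, a compact sketch of the background-blur path (reusing the person_mask helper assumed in the earlier sketch) might look like this:

```python
import cv2
import numpy as np

def background_blur(frame, raw_mask):
    # Otsu thresholding on the segmentation output, then a 3-channel mask.
    _, binary_mask = cv2.threshold(raw_mask, 100, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.merge([binary_mask, binary_mask, binary_mask]).astype(np.uint8)

    blur_input = cv2.GaussianBlur(frame, (91, 91), 0)
    person = cv2.bitwise_and(frame, mask)                            # sharp foreground
    background = cv2.bitwise_and(blur_input, cv2.bitwise_not(mask))  # blurred background
    return cv2.add(person, background)

# Usage: blurred = background_blur(frame, person_mask(frame))
```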
For background replacement, the preprocessing and masking steps are identical to those used for background blur. Instead of a Gaussian blur, we read a background image with cv2.imread("beach.jpg") and use this image (img_bcg) as the replacement background, while the mask again comes from the PP-Human Segmentation model inference.
The original frame is bitwise ANDed with the binary mask (cv2.bitwise_and(frame, mask)), isolating the person. The replacement background (img_bcg) is resized to match the dimensions of the original frame, and the bitwise NOT of the mask extracts the complementary region (cv2.bitwise_and(img_bcg, cv2.bitwise_not(mask))). The two results are then combined with an ADD operation, producing a new frame with the background replaced (cv2.add(result_frame, RS2)), as sketched below.
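The replacement path differs only in what fills the background; a short sketch (background file name from the text, other names assumed) follows:

```python
import cv2

img_bcg = cv2.imread("beach.jpg")  # replacement background from the text

def background_replace(frame, mask):
    # mask: the same 3-channel binary mask produced for background blur.
    bcg = cv2.resize(img_bcg, (frame.shape[1], frame.shape[0]))
    person = cv2.bitwise_and(frame, mask)
    background = cv2.bitwise_and(bcg, cv2.bitwise_not(mask))
    return cv2.add(person, background)
```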
The transparentOverlay function takes a source image (src), an overlay image (overlay), a position tuple (pos), and a scale factor (scale). It resizes the overlay image, extracts the region of interest (ROI) from the source image, and blends the two images using the alpha channel of the overlay for transparency.
The second function, face_blur, handles the face replacement effect. Given the coordinates (x, y, w, h) of a detected face and the original frame (frame), it extracts the region around the face, enlarges it slightly, and applies a transparent overlay of an Iron Man image (ironman). The enlargement ensures that the overlay covers the entire face region.
Here's a brief breakdown of the face_blur function:
The coordinates of the face (x, y, w, h) are used to define a region around the detected face in the original frame (frame).
The Ironman image (ironman) is resized to match the dimensions of the enlarged face region.
The Batman image (batman) is resized to match the dimensions of the enlarged face region.
The transparentOverlay function is called with the face region, the Iron Man or Batman overlay, and the position tuple, blending the chosen character image onto the face region with transparency; a sketch of this overlay logic follows.
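A sketch of how such an alpha-blended overlay can be implemented is shown here; it assumes the character PNGs carry an alpha channel and that the caller keeps the overlay inside the frame. Variable names mirror the description, but the exact implementation is illustrative.

```python
import cv2

def transparentOverlay(src, overlay, pos=(0, 0), scale=1.0):
    # Resize the overlay, then alpha-blend it into the ROI of the source image.
    overlay = cv2.resize(overlay, (0, 0), fx=scale, fy=scale)
    h, w = overlay.shape[:2]
    x, y = pos
    roi = src[y:y + h, x:x + w]                 # view into src; must fit inside the frame
    alpha = overlay[:, :, 3:4] / 255.0          # alpha channel as a 0..1 weight
    roi[:] = (alpha * overlay[:, :, :3] + (1 - alpha) * roi).astype(src.dtype)
    return src

# Overlays are loaded with their alpha channel, e.g.:
# ironman = cv2.imread("ironman.png", cv2.IMREAD_UNCHANGED)
```

face_blur then resizes the chosen character image to the enlarged face box and calls transparentOverlay at that position.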
This function creates a cartoon-style representation of the input frame; a sketch of one common approach is shown below.
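The exact steps of our implementation are not reproduced here; the sketch below shows a typical OpenCV cartoonization recipe (bilateral smoothing plus adaptive-threshold edges) as an assumption of what such a filter can look like.

```python
import cv2

def cartoonize(frame):
    # Smooth colors while keeping edges sharp.
    color = cv2.bilateralFilter(frame, d=9, sigmaColor=75, sigmaSpace=75)
    # Detect edges on a blurred grayscale copy.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)
    edges = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, blockSize=9, C=2)
    # Keep the smoothed colors only where no edge was detected.
    return cv2.bitwise_and(color, color, mask=edges)
```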
A separate helper generates the person mask from the pre-trained segmentation model, following the same preprocessing and inference sequence described earlier.
The program captures video through a GStreamer pipeline, applies the selected effect, and displays the processed video in Streamlit. This Streamlit-based Python script ties together the functions above, offering face distortion (radial and swirl), face masking (Iron Man and Batman), background blur, background replacement, and a creative cartoon filter. Users choose a functionality from a Streamlit sidebar, creating an interactive experience.
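A minimal sketch of the Streamlit control flow is shown below; the widget labels are assumptions, and the helper functions (gstreamer_pipeline, person_mask, background_blur, cartoonize) refer to the earlier sketches rather than the exact script.

```python
import cv2
import streamlit as st

st.title("Jetson Nano Real-time Video Filters")
effect = st.sidebar.selectbox(
    "Choose an effect",
    ["None", "Background Blur", "Background Replacement",
     "Face Mask", "Face Distortion", "Cartoon"])
frame_slot = st.empty()  # placeholder updated with each processed frame

cap = cv2.VideoCapture(gstreamer_pipeline(), cv2.CAP_GSTREAMER)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if effect == "Background Blur":
        frame = background_blur(frame, person_mask(frame))
    elif effect == "Cartoon":
        frame = cartoonize(frame)
    # ...remaining effects dispatched similarly...
    frame_slot.image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
```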
We conducted extensive testing with various person segmentation methodologies to evaluate their performance on the Nvidia Jetson Nano. The tested methodologies and references include:
1. https://github.com/opencv/opencv_zoo
2. https://docs.opencv.org/4.x/d2/d99/tutorial_js_face_detection.html
3. https://github.com/opencv/opencv/tree/master/samples/dnn/dnn_model_runner/dnn_conversion/paddlepaddle
4. https://github.com/dusty-nv/jetson-inference
5. https://docs.opencv.org/3.4/d6/d0f/group__dnn.html
6. https://github.com/spmallick/learnopencv/tree/7285b1f8f663edb21c0dc54c47b2dc307c97ba38/Mask-RCNN