This article aims to develop and compare different
computer vision methods to achieve semantic segmentation of
natural environment images. We tested the segmentation
performance of these methods in natural environment images
using the 2D portion of the WildScenes dataset. Specifically, this
article reproduced the DeepLabV3 model and conducted
experiments using ResNet-50 and MobileNetV3 as its backbone
networks. Meanwhile, we also adopted the SegNet model and set
its backbone network to ResNet-50. Through experiments, we
compared the mean intersection over union (mIoU) values of these
three methods across different categories and found that the
DeepLabV3-ResNet50 model performed better overall than the other two
models. We analyzed and compared the results to identify
possible reasons for this outcome.
Semantic segmentation is an important task in computer
vision, which involves classifying each pixel in an image into
different semantic categories. This requires not only
recognizing the objects in an image, but also precisely
classifying and labeling every pixel, which yields a deep
understanding of the image content. For
example, in autonomous driving, semantic segmentation can
help vehicles recognize roads, pedestrians, traffic signs, etc.,
thereby improving driving safety. Nowadays, autonomous
driving is gradually making breakthroughs in urban
environments, but how to accurately navigate in natural
environments is still full of challenges. In natural environments,
objects of different categories may have very similar appearance
features, such as different types of trees or flowers. Objects in
natural environments also frequently occlude one another, for
example leaves covering animals or stones covering plants. The
goal of this project is to develop and compare different computer
vision methods for semantic segmentation of natural
environment images. This project used the 2D portion of the
WildScenes dataset to test the segmentation performance of
different methods on images in natural environments. The
WildScenes dataset is a benchmark dataset specifically designed
for large-scale 2D and 3D semantic segmentation tasks in
natural environments. This dataset provides high-resolution 2D
images and high-density 3D LiDAR point clouds, accompanied
by precise six degrees of freedom (6-DoF) pose information
(Kavisha et al. (2023)). The data was collected in two different
natural environments in Australia (Venman and Karawatha)
(Kavisha et al. (2023)). We replicated the DeepLabV3 model
used in the dataset's accompanying paper and also trained a
variant with a different backbone. In addition, we used the
SegNet method, and we compared the IoU values produced by
the various methods to assess their quality.
A. DeepLabV3-ResNet50
We replicated the DeepLabV3 model mentioned in the
dataset paper, with ResNet50 as the encoder backbone.
DeepLabV3 is an advanced image semantic segmentation model.
It mainly consists of Atrous Convolution, Atrous Spatial
Pyramid Pooling (ASPP), Batch Normalization, and Depthwise
Separable Convolution. Atrous Convolution expands the
receptive field of the convolution kernel by inserting "holes"
into it without increasing computational cost, which enables the
model to capture a larger range of contextual information while
maintaining high-resolution feature maps. ASPP uses multi-scale
dilated convolution and global average pooling to fuse features
of different scales, enhancing the model's ability to detect
multi-scale targets. Batch Normalization and Depthwise
Separable Convolution further improve computational efficiency
and reduce the computational burden on the model
(Chen et al. (2018)).
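As a concrete illustration, the following is a minimal sketch of how this variant can be instantiated with torchvision (it is not necessarily the exact code used in our experiments); NUM_CLASSES is a placeholder whose value depends on the label mapping, and the final dilated convolution only illustrates how atrous convolution enlarges the receptive field.

```python
# Minimal sketch, assuming torchvision's DeepLabV3-ResNet50 implementation.
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 16  # placeholder: set to the number of classes after label mapping

model = deeplabv3_resnet50(weights="DEFAULT")  # pre-trained weights (pretrained=True on older torchvision)
# Replace the final 1x1 convolutions so the output has NUM_CLASSES channels.
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)
if model.aux_classifier is not None:
    model.aux_classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Atrous (dilated) convolution: a 3x3 kernel with dilation=2 covers a 5x5 area
# with no extra parameters, which is how DeepLabV3 enlarges the receptive field.
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
```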
B. DeepLabV3-MobileNetV3
Afterwards, we replaced the backbone in the DeepLabV3
model with MobileNetV3 and conducted training. MobileNetV3
is a lightweight convolutional neural network that is suitable as
a backbone for various computer vision tasks and performs
efficiently on mobile and embedded devices. MobileNetV3
achieves higher computational efficiency and lower latency
through optimized depthwise separable convolutions and new
activation functions such as h-swish (Howard et al. (2019)). It
also balances the accuracy and performance of the model:
MobileNetV3 optimizes its network structure through Neural
Architecture Search (NAS), which automatically searches for
the optimal network structure to ensure efficient performance
across devices with limited resources (Howard et al. (2019)).
In resource-constrained settings (such as drones and autonomous
vehicles), MobileNetV3 is therefore a good choice, and when
combined with methods that improve semantic segmentation
accuracy, such as the DeepLabV3 framework we use, it can
further enhance performance on semantic segmentation tasks.
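A correspondingly small sketch (same assumptions and placeholder NUM_CLASSES as above) shows how the MobileNetV3-Large variant is obtained from torchvision; the parameter counts in the comments are rough orders of magnitude, not measurements from our runs.

```python
# Sketch only: the MobileNetV3-Large backbone variant of DeepLabV3 from torchvision.
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

NUM_CLASSES = 16  # placeholder, as in the ResNet50 sketch

mobile_model = deeplabv3_mobilenet_v3_large(weights="DEFAULT")
mobile_model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

def param_count(m: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(param_count(mobile_model))  # roughly 11M parameters, vs. roughly 42M for the ResNet50 variant
```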
C. SegNet-ResNet50
SegNet is a deep learning neural network model used for
image segmentation, particularly suitable for semantic
segmentation tasks. Its core segmentation engine consists of an
encoder network, a matching decoder network, and a pixel-level
classification layer (Badrinarayanan et al. (2017)). The
innovation of SegNet lies in the way the decoder upsamples its
lower-resolution input feature maps. Concretely, the decoder
uses the pooling indices computed in the max-pooling step of the
corresponding encoder to perform non-linear upsampling, so the
upsampling itself does not need to be learned. The upsampled
maps are sparse and are then convolved with trainable filters to
generate dense feature maps (Badrinarayanan et al. (2017)).
We have chosen ResNet50 as the backbone of this
model, since ResNet50 solves the problem of gradient vanishing
in deep networks through residual connections, allowing the
network to go deeper and extract richer features (He et al.
(2016)).
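The pooling-index mechanism described above can be illustrated with a short, self-contained sketch (illustrative only, not our full SegNet-ResNet50 implementation): max pooling in the encoder returns the indices of the retained values, and MaxUnpool2d in the decoder reuses them, so the upsampling step itself has no learned parameters.

```python
# Illustration of SegNet-style upsampling with stored pooling indices.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)                      # an encoder feature map

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)    # trainable filter that densifies the map

pooled, indices = pool(x)                             # encoder: downsample, remember positions
upsampled = unpool(pooled, indices)                   # decoder: non-learned, sparse upsampling
dense = conv(upsampled)                               # dense feature map
print(dense.shape)                                    # torch.Size([1, 64, 128, 128])
```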
In our project task, we selected AutoDL, a low-cost GPU
computing platform. AutoDL is a cloud-based machine learning
service platform that helps users quickly build and deploy
machine learning models. We rented a GPU server on AutoDL
for model training and computation.
The experimental setup we used is as follows.
A. Dataset
•Data sources: The dataset to be used in the group project is
called WildScenes (see links and references at the end of this
document). This is a recently released multimodal dataset
consisting of five sequences of 2D images recorded with a
normal video camera during traversals through two forests:
Venman National Park and Karawatha Forest Park, Brisbane,
Australia.
•Data description: The dataset has 9,306 images of size
2,016 × 1,512 pixels. Every image has been manually annotated.
•Data preprocessing: Data normalization, data augmentation
and transformation, including resizing, random horizontal
flipping, random rotation, and color adjustment (a sketch of
these transforms is given after this list).
•Label preprocessing: Including type conversion, resizing,
and label mapping.
•Training/validation/testing split: Randomly divided into
70%, 5%, and 25%, respectively.
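The sketch below illustrates the joint image/label preprocessing described in the bullets above; the target size, rotation range, and brightness jitter are illustrative assumptions rather than the exact values used in our experiments, and labels are resized with nearest-neighbour interpolation so that class indices are not corrupted.

```python
# Illustrative sketch of joint image/label preprocessing (assumed parameter values).
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def joint_transform(image, label, size=(512, 512), max_rotation=10):
    """Apply the listed augmentations to a PIL image and its PIL label map together,
    so that geometric changes stay aligned between the two."""
    image = TF.resize(image, size)
    label = TF.resize(label, size, interpolation=InterpolationMode.NEAREST)
    if random.random() < 0.5:                               # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    angle = random.uniform(-max_rotation, max_rotation)     # random rotation
    image = TF.rotate(image, angle)
    label = TF.rotate(label, angle, fill=255)               # new border pixels get the ignore index
    image = TF.adjust_brightness(image, 1.0 + random.uniform(-0.2, 0.2))  # color adjustment
    image = TF.normalize(TF.to_tensor(image),               # data normalization
                         mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    label = TF.pil_to_tensor(label).squeeze(0).long()       # type conversion for the loss
    return image, label
```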
B. Experimental environment
•Hardware configuration: AutoDL cloud computing GPU
Processor
•Software environment: Windows operating system, Python
programming language, Python IDE (PyCharm), machine
learning libraries (PyTorch), Deep learning models (DeepLab,
ResNet, MobileNet), etc.
C. Experimental steps
•Environment settings and logging: Set the log file path and
configure the logger to record important information during the
training process.
•Define category metadata: Define the names and color
palettes of various categories in the dataset for easy analysis and
visualization in the future.
•CUDA memory configuration: Set CUDA memory to be
expandable to improve memory utilization efficiency.
•Label Mapping and Conversion: Define a label mapping
function to convert specific category labels into target labels or
ignore them; Data augmentation and image/label conversion,
including resizing, random transformation, and standardization.
•Load the dataset: Instantiate training and validation datasets,
specify paths and transformation methods for images and labels.
•Check the balance of the dataset: Count and record the
sample size of each category to understand the balance of the
dataset.
•Configure data loader: Create training and validation data
loaders, set batch size and number of parallel processing worker
threads.
•Model settings: Load the various pre-trained models and adjust
the final classification layer to fit the number of categories in the
dataset; use the cross-entropy loss function (ignoring pixels with
index 255); choose the Adam optimizer and set the learning rate
to 0.001 (a configuration sketch is given after this list).
•Define loss function and optimizer: Use CrossEntropyLoss
as the loss function and set ignore_index to ignore specific labels;
Use Adam optimizer to optimize model parameters.
•Training and Verification Cycle: Conduct multiple epochs
of training and validation.
•Model saving: After each epoch, save the model's state
dictionary for subsequent loading and use.
•Visualization and Analysis: Draw the change curves of loss,
accuracy, and IoU during the training and validation process,
and save them as image files.
•Save results to CSV file: Calculate the final validation
results, including the IoU for each category, and save them to a
CSV file for further analysis.
•Draw plots and visualize the results.
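The configuration and loop described in the steps above can be condensed into the following sketch. Assumptions: train_dataset and val_dataset are the dataset objects instantiated in the loading step, model is one of the three models above, and NUM_CLASSES, the batch size, the number of epochs, and the checkpoint file names are placeholders; indexing the output with "out" applies to the torchvision models and would differ for a custom SegNet.

```python
# Condensed sketch of the training/validation cycle (placeholder names noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                                    # one of the three models above

criterion = nn.CrossEntropyLoss(ignore_index=255)           # unlabeled pixels are ignored
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, learning rate 0.001

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, num_workers=4)

num_epochs = 20                                             # placeholder
for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)["out"]                      # torchvision models return a dict
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation: accumulate per-class intersections and unions for IoU/mIoU.
    model.eval()
    inter = torch.zeros(NUM_CLASSES)
    union = torch.zeros(NUM_CLASSES)
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images)["out"].argmax(dim=1)
            valid = labels != 255                           # mask out ignored pixels
            for c in range(NUM_CLASSES):
                pred_c = (preds == c) & valid
                label_c = (labels == c) & valid
                inter[c] += (pred_c & label_c).sum().item()
                union[c] += (pred_c | label_c).sum().item()
    iou = inter / union.clamp(min=1)                        # per-class IoU
    miou = iou.mean().item()                                # mean IoU over classes

    torch.save(model.state_dict(), f"model_epoch_{epoch}.pth")  # save after each epoch
```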
The data obtained indicate that DeepLabV3-ResNet50 achieves the highest mIoU. This model demonstrates good learning ability, but fluctuations in its validation metrics suggest possible slight overfitting. DeepLabV3-MobileNetV3 achieves the highest accuracy, and all of its indicators are relatively stable. SegNet-ResNet50 performs reasonably in all aspects, but shows significant fluctuations and weaker generalization ability.
In summary, DeepLabV3-ResNet50 appears to perform the best overall, with the highest mIoU and relatively stable validation metrics, indicating good generalization ability.
A. DeepLabV3-ResNet50
What worked and did not work: The training and validation losses decreased steadily, but the validation loss fluctuates considerably, which may indicate some overfitting. The training accuracy increased to about 82%, while the validation accuracy reached around 78%, indicating that the model learns well but still has room for improvement. Both training and validation IoU increased steadily, but the fluctuations in validation IoU indicate some instability when the model deals with different samples.
Future work: Adopt stronger regularization, such as higher L2 regularization coefficients or increased Dropout rates, to alleviate overfitting. Further expand the types of data augmentation, such as scaling and padding, to improve the robustness of the model.
B. DeepLabV3-MobileNetV3
What worked and did not work: The training and validation losses both show a good decreasing trend, and the validation loss stays close to the training loss, indicating good generalization ability. The training accuracy reached about 86% and the validation accuracy about 82%, the highest among the three models, indicating that the model achieves a good balance between accuracy and efficiency. Both training and validation IoU are high, and the validation IoU shows only small fluctuations, demonstrating stable performance on diverse data.
Future work: Trying different optimizers or adjusting the learning rate strategy, for example with a learning rate scheduler, may further improve performance. The lightweight design of MobileNetV3 makes it suitable for devices with limited resources, but it may lead to insufficient feature expression ability; multi-scale feature fusion or hybrid feature extraction networks could be attempted.
C. SegNet-ResNet50
What worked and did not work: The training and validation losses decreased steadily, but the validation loss fluctuated significantly in the early stages. The training accuracy reached 82.5%, and the validation accuracy was about 80%; this performance demonstrates the adaptability of the model in complex tasks. The training and validation IoU are relatively low, and the fluctuations in validation IoU show that the model performs differently on different samples.
Future work: The performance of SegNet-ResNet50 can be improved by adding feature fusion layers, especially for fusing multi-scale features. Joint learning with related tasks such as edge detection can help improve semantic segmentation performance. Adding Dropout layers or using mixed regularization methods can suppress overfitting.
For all of these models, we can use Bayesian optimization or grid search to tune hyperparameters, including the learning rate and regularization coefficients. If possible, obtaining more diverse data, especially data covering more scene variations, would enhance the models' generalization ability. We can also consider using ensemble learning techniques to combine the advantages of the different models and improve overall performance.