The main objective of this work is the analysis of videos from the public test set of Trash-ICRA, which show underwater waste, and the construction of improved datasets for each video present. The videos were examined frame by frame, applying the SAM algorithm to compute segmentation masks for the objects in the images. The accuracy of the masks generated by SAM was then evaluated. In cases of errors, such as incomplete segmentations of objects or the masking of irrelevant shadows, manual intervention was performed using the Labelbox software to correct the annotations. The manually corrected masks were collected into new datasets specific to each analyzed video. Subsequently, the obtained results were analyzed by calculating various statistics from the JSON files exported by Labelbox.
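By way of illustration only, the following sketch shows how SAM's automatic mask generator can be applied to a single extracted frame; the checkpoint file, model variant and frame name are placeholders and not the exact configuration used in this work.

# Minimal sketch: generating SAM masks for one video frame.
# Assumes the "segment-anything" package and a downloaded ViT-H checkpoint;
# paths and the model type are placeholders, not the exact setup used here.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Load a frame and convert it to RGB, as expected by the mask generator.
frame = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(frame)  # list of dicts with 'segmentation', 'area', ...

print(f"{len(masks)} masks generated for this frame")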
The central part of the thesis delves into the Segment Anything Model (SAM), an advanced model that leverages deep learning techniques to segment any type of image. It places SAM in context alongside related models such as CLIP (Contrastive Language-Image Pre-training) and ALIGN (A Large-scale ImaGe and Noisy-text embedding). The fundamental concepts of SAM, its tasks, the model, the data engine, and the dataset used, as well as its limitations and possible solutions, are discussed.
The described process enabled the creation of accurate and reliable datasets for the segmentation of underwater debris, enhancing the effectiveness of marine and coastal monitoring through the application of deep learning techniques. The results demonstrate how the integration of manual annotations can significantly improve the performance of segmentation algorithms, contributing to more efficient and sustainable management of marine and coastal resources.
This research provides a significant contribution to the use of deep learning techniques for environmental monitoring, highlighting the importance of collaboration between automation and human intervention to achieve optimal results.
Marine debris poses a growing threat to the health of our planet. It derives from a variety of sources, including discarded fishing gear, improperly recycled packaging, and discarded plastics such as shopping bags or plastic bottles [62]. These wastes end up in our oceans through various means and remain there, polluting virtually every corner of the earth. Despite recycling efforts and other initiatives to reduce the impact of marine litter, the situation remains critical. The amount of rubbish already in the sea is so vast that it requires specific interventions to be effectively addressed.
In this context, the Trash-ICRA19 dataset emerges as a fundamental resource. Created for the International Conference on Robotics and Automation (ICRA) in 2019, this dataset represents a milestone in the field of applied artificial intelligence. With an extensive collection of annotated and categorised images, Trash-ICRA19 provides a solid basis for the development and evaluation of computer vision algorithms dedicated to the automatic recognition of marine litter.
The growing awareness of the dangers of marine litter has catalysed the interest of the scientific community in finding innovative solutions to this problem. Trash-ICRA19 not only offers high-quality data, but also serves as a bridge for collaboration between researchers, companies, and government agencies. This dataset not only enables the development of waste detection algorithms, but also promotes innovative strategies to tackle the challenge of marine litter.
Thus, Trash-ICRA19 represents a crucial starting point for advancing research and innovation in marine waste management. Its potential impact extends far beyond the scientific community, offering concrete opportunities to radically transform the way in which we address this global challenge and to contribute to a more sustainable future for generations to come. Despite the importance of this problem, there are few large-scale efforts to combat it, in part due to the manpower required. It is proposed that a key element of an effective strategy to remove debris from marine environments is the use of autonomous underwater vehicles (AUVs) for the detection and removal of litter.
The basic question is whether Deep Learning-based visual detection of underwater debris is plausible in real time, and how current methods behave in this field. The detection of marine debris through purely visual means is a difficult, not to say impossible, problem. As with many visual object detection problems, small changes in the environment can cause huge changes in the appearance of an object, and nowhere is this more true than in underwater environments.
It is not only changes in light that affect surface waters: the changing turbidity of the water can make objects difficult or completely impossible to detect [63] [64]. Moreover, marine debris is rarely in perfect condition and degrades over time, so detectors must be able to recognise different types of rubbish in any condition. This problem is difficult to solve simply because of the huge variety of objects that are considered marine debris.
As a starting point for the vast problem of detecting all marine litter,
we focus on the detection of plastic, one of the most widespread and harmful types of litter
in the oceans. Even in the limited group of plastic objects, the variety is surprising.
Figure 5.1a and Figure 5.1b show two examples of real marine litter with completely different appearances. The plastic bottle in Figure 5.1a is just one
example of the many thousands of different styles of plastic bottles that can be
found on the ocean floor, not to mention plastic bags, containers and other
objects. Plastic is also particularly destructive to the environment, with plastic shopping bags and packaging frequently causing the death of marine animals that try to eat them.
To be useful for the goal of removing plastic and other waste, object detection algorithms must be able to run in near real time on robotic platforms. To assess their readiness for such a deployment, all networks and models are tested on three different devices, approximating the capabilities of an offline data-processing machine, a high-powered robotic platform, and a low-powered robotic platform.
Several studies have addressed the issue of marine debris detection, with
approaches ranging from assessing the presence of litter in the ocean following
catastrophic events such as tsunamis, to the use of remotely controlled vehicles (ROVs) to
detect and remove debris. Some studies have also explored the use of sonar and
other technologies for underwater debris detection.
However, to enable the visual or sensory detection of underwater debris, a large annotated dataset of underwater debris is necessary. Fortunately, some such datasets exist, although most are not annotated for use in Deep Learning. Annotating these datasets can allow the development of Deep Learning based models for the detection of marine debris in real time, although their applicability in natural marine environments still needs to be fully understood.
Some examples of known datasets include that of the Monterey Bay Aquarium Research Institute (MBARI), which has collected a dataset spanning over 22 years to monitor debris scattered on the seabed off the west coast of the United States of America, particularly plastic and metal in and around the Monterey submarine canyon, which serves to trap and transport debris to the deep ocean floor. Another similar example is the work of the Global Oceanographic Data Center, part of the Japan Agency for Marine-Earth Science and Technology (JAMSTEC). JAMSTEC has made a dataset of deep-sea debris available online as part of the larger J-EDI dataset (JAMSTEC E-Library of Deep-sea Images). This dataset contains images dating back to 1982 and provides specific data on debris types in the form of short video clips.
The work presented in this thesis benefited from the annotation of these data. The dataset for this work was obtained from the J-EDI dataset of marine debris, described in detail in section 5.2.1. The videos that make up this dataset vary considerably in quality, depth, objects in the scenes, and cameras used. They hold images of many different types of marine debris, captured from real environments, providing a variety of objects in different states of deterioration, occlusion, and overgrowth. In addition, water clarity and light quality vary significantly from video to video. This made it possible to create a training dataset that conforms closely to real-world conditions, unlike previous contributions, which were mainly based on internally generated datasets. The training data were extracted from videos labelled as containing debris, between the years 2000 and 2017. From that part of the data, all the videos that appeared to contain some kind of plastic were selected. This was done in part to reduce the problem to a manageable size for the purposes of the project, but also because plastic is an important type of marine debris [66]. At this point, each video was sampled at a rate of three frames per second to produce images that could be annotated and prepared for use in learning models (a minimal sketch of this sampling step is given below). This sampling produced over 240,000 frames, which were searched manually to obtain good examples of plastic marine debris and then annotated. The annotation process was completed by a number of volunteers, who used the freely available tool LabelImg [67]. The final training dataset used in this work consisted of 5,720 images, with dimensions of 480x320.
The four network architectures selected for this project were chosen from among the most popular and successful object detection networks currently in use. Each of them has its own advantages and disadvantages, with varying levels of accuracy and execution speed.
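As a minimal sketch of the sampling step described above, the following script extracts roughly three frames per second from a video with OpenCV; the input file name and output directory are illustrative placeholders rather than the actual paths used.

# Minimal sketch: sampling a video at roughly three frames per second with OpenCV.
# The input file name and output directory are illustrative placeholders.
import cv2
import os

video_path = "debris_clip.mp4"
out_dir = "frames"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if FPS metadata is missing
step = max(int(round(fps / 3.0)), 1)          # keep roughly 3 frames per second

index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
        saved += 1
    index += 1
cap.release()
print(f"Saved {saved} frames")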
Out of 1130 frames extracted (see section 6.3.1), 390 frames were saved for re-annotation, corresponding to approximately 34.51%. The annotations break down according to the following statistics:
Number of ‘envelope’ type annotations: 1977 (78.20%)
Number of ‘fish’ type annotations: 450 (17.80%)
Number of ‘bottle’ type annotations: 101 (3.99%)
Number of ‘ball’ type annotations: 0 (0.0%)
Number of ‘miscellaneous’ type annotations: 0 (0.0%)
TOTAL ANNOTATIONS: 2528
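Figures of this kind can be reproduced directly from the Labelbox export. The sketch below counts annotations per class and computes their percentages; since the exact export schema depends on the project configuration, the field names used to extract the class labels are assumptions.

# Minimal sketch: counting annotations per class from a Labelbox JSON export.
# The export structure varies by project, so the extraction step is an assumption:
# one entry per image, each holding a list of labelled objects with a "value" field.
import json
from collections import Counter

with open("export.json") as f:   # hypothetical export file
    export = json.load(f)

labels = [obj["value"] for item in export for obj in item["objects"]]

counts = Counter(labels)
total = sum(counts.values())
for name, n in counts.most_common():
    print(f"{name}: {n} ({100 * n / total:.2f}%)")
print("TOTAL ANNOTATIONS:", total)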
This data representation clearly shows the distribution of the different categories of objects annotated in the several.mp4 video. The considerable concentration of annotations in the envelope category compared to the other categories may suggest a prevalence of this type of object in the frames of the analysed video. This imbalance in the annotations is reflected in the histogram, where the bar relating to envelopes is significantly higher than the others.
Distribution of annotations The average number of annotations per image was calculated, providing an overview of the density of annotations present. This value represents a measure of the annotation complexity for each image. In addition, the minimum and maximum numbers of annotations per image were calculated, highlighting the variation in the extent of annotations within the dataset.
Average number of annotations per image: 6.482051282051282
Minimum number of annotations per image: 1
Maximum number of annotations per image: 9
These statistics offer an initial picture of the distribution of annotations, providing crucial information for the design and optimisation of image analysis algorithms.
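Given the number of annotations in each frame, these density measures follow from elementary arithmetic, as in the following sketch (the list of counts is illustrative only; in practice it would be derived from the Labelbox export).

# Minimal sketch: per-image annotation density from a list of counts,
# one entry per annotated frame (the values below are illustrative).
counts_per_image = [7, 5, 9, 6, 1]

average = sum(counts_per_image) / len(counts_per_image)
print("Average annotations per image:", average)
print("Minimum annotations per image:", min(counts_per_image))
print("Maximum annotations per image:", max(counts_per_image))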
Embeddings statistics An embedding is a numerical representation of an object or concept in a multidimensional space. Embeddings are commonly used to represent complex data, such as words, images, or abstract concepts, so that they can be processed by machine learning algorithms.
In the case of images, and thus of this work, embeddings represent visual features of the image. These features can be extracted from intermediate layers of convolutional neural networks trained on image datasets. For example, an image embedding could include information on colour, shape, texture, and other visual aspects of the image.
Thus, embeddings are numerical representations that capture the salient characteristics of objects or concepts in a multidimensional space, allowing machine learning models to process and understand such data more efficiently and meaningfully. In the context of this work, the statistics of the embeddings provide an overview of the central distribution and variability of the data in the image dataset.
To extract the embeddings, the Labelbox software uses the CLIP ViT-B/32 model (a specific discussion of the CLIP model is given in section 4.2.1). This model combines Contrastive Language-Image Pre-training (CLIP) with the Vision Transformer (ViT) architecture, both Deep Learning approaches widely used in image processing. CLIP was designed for learning generic visual representations across a wide range of data, while ViT adapts this representation capability to images via transformer mechanisms. The CLIP ViT-B/32 model was pre-trained on a large collection of images and text, enabling the extraction of high-quality embeddings for a wide variety of images. Starting from this assumption, the average and standard deviation of the embeddings of the frames composing the video under examination were calculated.
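As a minimal sketch of this computation, assuming the CLIP ViT-B/32 embeddings of the frames have already been exported to a NumPy array (one 512-dimensional row per frame; the file name is a placeholder), the per-dimension mean and standard deviation can be obtained as follows.

# Minimal sketch: per-dimension mean and standard deviation of frame embeddings.
# Assumes the CLIP ViT-B/32 embeddings (512-d) are stored in a NumPy array
# with one row per frame; the file name is a placeholder.
import numpy as np

embeddings = np.load("frame_embeddings.npy")   # shape: (num_frames, 512)

mean_vec = embeddings.mean(axis=0)   # average embedding across frames
std_vec = embeddings.std(axis=0)     # per-dimension variability across frames

print("Embedding dimensionality:", embeddings.shape[1])
print("Mean of per-dimension means:", float(mean_vec.mean()))
print("Mean of per-dimension standard deviations:", float(std_vec.mean()))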