Person re-identification is a crucial computer vision task, particularly for surveillance systems. It aims to retrieve the identity of individuals across non-overlapping camera views. Despite significant advances, the manual annotation process makes real-world deployments less scalable. To address this challenge, data augmentation techniques such as Generative Adversarial Networks (GANs) have proven effective in various studies. However, there is limited research combining unsupervised learning with GAN-based data augmentation for person re-identification. This study proposes leveraging unsupervised learning and GANs to enhance person re-identification. We generate images of the same person in new poses using GANs and use these images in several independent methods to train the person re-identification model. The goal is to improve model robustness and generalization by exposing it to a wider range of poses and appearance variations. We explore four distinct methods to enhance model learning through data augmentation and evaluate their performance. The results demonstrate that our approach consistently outperforms the baseline, with improvements of 4.1 percent in Rank-1 and 13.1 percent in mAP.
The primary objective of person re-identification (Re-ID) is to locate and retrieve images of specific individuals captured from different camera views at various locations and times. Specifically, given a video sequence, image, or textual description of a person’s identity, the goal is to accurately match and retrieve this identity from a database known as the gallery set. Most studies define Re-ID as the process of matching pedestrian images recorded by non-overlapping cameras. However, some research extends this definition to include matching images captured by the same camera or overlapping camera views. Person re-identification methods have enormous potential for a wide range of practical applications, from security and surveillance to retail and healthcare. Due to the strong discriminative abilities of deep neural networks, supervised
methods have shown outstanding results in person re-identification (re-ID). However, these methods rely heavily on large amounts of labeled data, which involve expensive annotation processes, making them less feasible for large-scale, real-world re-ID applications. To overcome this limitation, unsupervised approaches that extract discriminative features from unlabeled data have gained popularity. Earlier works in unsupervised person re-ID have leveraged pseudo-labels generated through k-nearest neighbor search [1] or unsupervised clustering [2] for model training. These methods follow a two-stage process: one stage generates pseudo-labels, while the other uses these labels to train the model. Among the different strategies, clustering-based methods [3] have proven particularly effective, achieving state-of-the-art results. However, noise in the pseudo-labels poses a significant challenge and limits the performance of these unsupervised methods. To this end, we use a label refinement baseline for
person re-identification. Beyond the learning paradigm, recent studies [4] have demonstrated that data augmentation helps a network learn view-invariant features by introducing augmented perspectives of a person, which are crucial for creating strong representations. These methods typically employ standard data augmentation techniques such as random flipping, cropping, and color jittering. A more advanced approach to data augmentation is the use of Generative Adversarial Networks (GANs) [5]. Unlike traditional methods, GANs can significantly alter features unrelated to identity while preserving identity-related features, making them particularly useful in re-ID. In this paper, we use a GAN architecture to generate new images of each person in various poses and then use these images in four different methods to augment the baseline network. Our results show that these methods lead to better learning of the re-ID model and improve the evaluation metrics.
In our approach, we aim to tackle the unsupervised person re-identification problem by leveraging GANs for data augmentation. As discussed earlier, data augmentation is a powerful strategy for mitigating overfitting and improving system performance, especially when dealing with the small datasets often used in supervised re-identification. Our proposed approach employs GANs in four distinct methods, enhancing a baseline model by incorporating auxiliary features generated through GANs. This augmentation boosts the learning capacity of the model for person re-identification. Below, we summarize the baseline network and describe our proposed methods in more detail.
##Baseline network
The baseline architecture [6] we employ for unsupervised learning represents images by combining global and part-based information. In the clustering phase, we extract both global and part features and assign pseudo-labels through DBSCAN clustering of the global features. Because these pseudo-labels contain noise, a scoring mechanism is adopted to refine them, and the refined pseudo-labels are used to learn global and part-based features. To address the unreliability of the global and part features, the score is defined as the Jaccard similarity between the k-nearest neighbors of the global and part features; these scores are then used to refine the noisy labels. During the training phase, the baseline applies two pseudo-label refinement techniques based on this cross-agreement: agreement-aware label smoothing (AALS) for part features and part-guided label refinement (PGLR) for global features. The features learned by the trained model are then used to update the pseudo-labels in the subsequent clustering phase and in the re-identification process.
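As a rough illustration of this scoring step, the sketch below computes the Jaccard similarity between the k-nearest-neighbor sets of global and part features. The function names, the choice of cosine similarity, and the single part-feature tensor are our simplifications; the baseline computes one such score per body part.

```python
import torch
import torch.nn.functional as F

def knn_indices(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbors (by cosine similarity) of every sample."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                       # pairwise cosine similarity
    return sim.topk(k + 1, dim=1).indices[:, 1:]  # drop the self-match

def cross_agreement(global_feats: torch.Tensor, part_feats: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Per-image Jaccard similarity between the k-NN sets of global and part features."""
    g_knn = knn_indices(global_feats, k)
    p_knn = knn_indices(part_feats, k)
    scores = []
    for g, p in zip(g_knn.tolist(), p_knn.tolist()):
        inter = len(set(g) & set(p))
        union = len(set(g) | set(p))
        scores.append(inter / union)
    return torch.tensor(scores)
```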
##Generator Module
As data augmentation is critical for improving re-identification, we utilize GANs to generate diverse views of individuals by creating new perspectives through 3D human mesh recovery and rotation [8]. Unlike traditional augmentation methods, this approach enhances the feature learning process by preserving both identity-related and structural features. Our pipeline consists of three independent modules: a generator for creating new images, and two additional units for clustering and re-identification. These modules work together to improve the re-identification process. The generator module, inspired by [7], consists of four networks: an identity encoder, a structure (pose) encoder, an image decoder, and a discriminator.
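The sketch below only illustrates the data flow we rely on when synthesizing a rotated view; the module names are hypothetical placeholders, and the actual encoder, decoder, and discriminator designs follow [7].

```python
import torch
import torch.nn as nn

class PoseGenerator(nn.Module):
    """Schematic generator: the identity code of the source image is fused with the
    structure code of a rotated 3D mesh rendering to synthesize the same person in a new pose."""
    def __init__(self, id_encoder: nn.Module, struct_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.id_encoder = id_encoder          # appearance / identity branch
        self.struct_encoder = struct_encoder  # pose / structure branch
        self.decoder = decoder                # fuses both codes into an image
        # A discriminator (not shown) is only needed for adversarial training of the generator.

    def forward(self, image: torch.Tensor, rotated_mesh: torch.Tensor) -> torch.Tensor:
        id_code = self.id_encoder(image)                 # identity-related features
        struct_code = self.struct_encoder(rotated_mesh)  # pose from the rotated mesh rendering
        return self.decoder(id_code, struct_code)        # generated image: same identity, new pose
```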
##Simple Augmentation Method
In this method, we use the generator module to create four new images for each image in the training dataset, with rotated poses of 90, 135, 180, and 270 degrees. After generating the new images, we simply combine them with the original images and pass the combined set through the baseline network for clustering, pseudo-label assignment, and model training with refined pseudo-labels to complete the person re-identification process. The goal is to increase data diversity using GANs and improve feature extraction for identities. By training the model on these images, it learns to recognize individuals regardless of pose or camera angle, enhancing generalization. The expectation is that generated and original data of the same identity will cluster together during training, leading to more distinctive features for different identities.
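A minimal sketch of this step, assuming the generated images have been saved to disk with a `_rot<angle>` suffix; the folder layout and the `ImagePathDataset` wrapper are illustrative rather than part of the baseline code.

```python
from glob import glob
from PIL import Image
from torch.utils.data import ConcatDataset, Dataset

class ImagePathDataset(Dataset):
    """Minimal dataset over a list of image paths (transforms omitted for brevity)."""
    def __init__(self, paths):
        self.paths = paths
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        return Image.open(self.paths[idx]).convert("RGB"), self.paths[idx]

# Hypothetical layout: generated images are named <original>_rot<angle>.jpg.
original = ImagePathDataset(sorted(glob("market1501/bounding_box_train/*.jpg")))
generated = ImagePathDataset(sorted(
    p for angle in (90, 135, 180, 270)
    for p in glob(f"market1501/generated_poses/*_rot{angle}.jpg")))

# The merged set is clustered, pseudo-labeled, and trained on exactly like the original data.
train_set = ConcatDataset([original, generated])
```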
##GAN Injection Method
In this method, we use the newly generated images during the clustering stage. Initially, only original images are used for clustering. After clustering and assigning pseudo-labels, the generated images are added to the same clusters as their originals. The goal is to group original and generated images of the same identity together, addressing the issue in the previous method where they were separated into different clusters. This approach aims to improve the clustering process by ensuring that related images are clustered together. The architecture of this method for assigning pseudo-labels is shown in Figure 1.
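A simplified sketch of the injection step, assuming DBSCAN is run directly on the original images' features (the actual pipeline clusters a re-ranked distance matrix); all names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def inject_generated_labels(orig_feats, orig_names, gen_source_names,
                            eps=0.5, min_samples=4):
    """Cluster only the original images, then copy each original's pseudo-label to its
    generated variants. gen_source_names[j] is the original image from which generated
    image j was synthesized."""
    # Step 1: DBSCAN over original features only, so generated images cannot influence clustering.
    orig_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(orig_feats)

    # Step 2: propagate each original's pseudo-label to its generated counterparts.
    name_to_label = dict(zip(orig_names, orig_labels))
    gen_labels = np.array([name_to_label[src] for src in gen_source_names])
    return orig_labels, gen_labels
```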
##Sequential Augmentation Method
After generating images with the generator module and adding more of them in the simple augmentation method, the number of generated images becomes several times that of the originals. The model may then become biased towards the generated images and ignore the discriminative features of the original images that are lost during generation, which can reduce the learning efficiency for person re-identification. To tackle this issue, inspired by curriculum learning, which organizes training samples in a meaningful sequence, the generated images are fed into the model in a specific order over multiple training epochs (e.g., using 90-degree rotations in the first epochs, then 180 degrees, and so on). This structured input helps optimize the learning process for re-identification.
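A minimal sketch of the rotation schedule, using the example phase boundaries described in the experiments below (25 epochs per rotation angle); the names and schedule format are ours.

```python
# (start epoch, rotation angle): each 25-epoch phase pairs the originals with one rotation's images.
ROTATION_SCHEDULE = [(0, 135), (25, 90), (50, 270), (75, 180)]

def rotation_for_epoch(epoch):
    """Rotation angle whose generated images accompany the originals at this epoch."""
    angle = ROTATION_SCHEDULE[0][1]
    for start, a in ROTATION_SCHEDULE:
        if epoch >= start:
            angle = a
    return angle

# Example over a 100-epoch run: epochs 0-24 use 135, 25-49 use 90, 50-74 use 270, 75-99 use 180.
assert rotation_for_epoch(10) == 135 and rotation_for_epoch(60) == 270
```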
##Mesh Combination Method
In this method, both the generated data and meshes extracted by human mesh recovery algorithms are used to improve the learning of the person re-identification model. The approach utilizes the mesh of each image, which represents the individual in a specific pose while removing the background. By combining these meshes with the original images, new images are produced that show individuals in a standing pose without any background; this process yields a mask image for each original image. Features are extracted for both the mask images and the original images using the previously mentioned re-identification model. By combining the weighted features of the original images with those of the mask images, new features are obtained, which are then used for clustering, agreement scoring, and enhancing the learning of the person re-identification model. We combine mesh images in various ways along with the other methods described above. For example, we combine the mask-image features with the features of the original images and use them in the clustering stages and model training. Alternatively, we incorporate the sequential data augmentation method in combination with the mesh technique. The architecture is shown in Figure 2.
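A minimal sketch of the two weighting schemes used for the combined features: the fixed 2:1 weighting from our experiments and a learnable variant; the function and class names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def combine_fixed(orig_feat: torch.Tensor, mask_feat: torch.Tensor,
                  w_orig: float = 2.0, w_mask: float = 1.0) -> torch.Tensor:
    """Fixed-weight combination of original-image and mask-image features."""
    fused = w_orig * orig_feat + w_mask * mask_feat
    return F.normalize(fused, dim=1)

class LearnedCombination(nn.Module):
    """Variable-weight variant: the mixing weights are learned along with the model."""
    def __init__(self):
        super().__init__()
        self.w_orig = nn.Parameter(torch.tensor(1.0))
        self.w_mask = nn.Parameter(torch.tensor(1.0))

    def forward(self, orig_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        fused = self.w_orig * orig_feat + self.w_mask * mask_feat
        return F.normalize(fused, dim=1)
```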
In these experiments, we use the Market-1501 dataset, collected in front of a supermarket at Tsinghua University, which contains 12,936 images from 751 identities in the training set and 19,732 images from 750 identities in the test set, with an average of 17.2 images per identity in the training set. For evaluation, we extract global features from the query and gallery sets, apply normalization, and then use these features to assess the model’s performance. Evaluation metrics include Rank-1, Rank-5, and Rank-10 (Cumulative Matching Characteristic, CMC) and mean Average Precision (mAP). CMC measures matching accuracy in re-identification, with Rank-1 being the most intuitive metric when there is only one ground truth per
query. mAP is more appropriate when multiple matching gallery images exist for each query.
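For clarity, the sketch below computes the Rank-k hit and Average Precision for a single query against the gallery; Market-1501's same-camera and junk-image filtering are omitted, and all names are illustrative. Rank-k is then the fraction of queries with a hit, and mAP is the mean of the per-query APs.

```python
import numpy as np

def rank_k_and_ap(sim_row: np.ndarray, gallery_pids: np.ndarray, query_pid: int, k: int = 1):
    """CMC Rank-k hit and Average Precision for one query, given its similarity to every gallery image."""
    order = np.argsort(-sim_row)                  # gallery sorted by decreasing similarity
    matches = gallery_pids[order] == query_pid    # boolean relevance in ranked order

    rank_k_hit = bool(matches[:k].any())          # CMC: is a true match among the top-k?

    # Average Precision: mean of the precision values at each true-match position.
    hits = np.flatnonzero(matches)
    if len(hits) == 0:
        return rank_k_hit, 0.0
    precisions = (np.arange(len(hits)) + 1) / (hits + 1)
    return rank_k_hit, float(precisions.mean())
```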
###Implementation details
We first replicate the experiments from the baseline method [6], resizing input images to 128x384. The backbone network is ResNet-50, pre-trained on ImageNet to accelerate training. Following previous work, we apply standard data augmentation methods such as random horizontal flipping, random cropping, and random erasing. For clustering, the DBSCAN distance threshold is set to 0.5, with a minimum of 4 samples required to form a cluster. The batch size for training is 64, with 16 identities and 4 random images per identity. We train our model using the Adam optimizer with a learning rate of 0.00035 and a weight decay of 5e-4 for 50 epochs. Each person image is divided uniformly into 3 parts. Our model is implemented in the PyTorch framework. In the baseline paper, experiments were performed with 4 Titan RTX GPUs in parallel; due to different available resources, we obtained our results with a single NVIDIA 1080 Ti or 3090 Ti GPU. If possible, it would be preferable to re-run all experiments on four GPUs and compare the results. Training on the Market-1501 dataset takes approximately 4 hours on a single 1080 Ti and 2 hours on a 3090 Ti. The Sequential Augmentation method took longer, in some cases up to a day.
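For reproducibility, these hyper-parameters correspond roughly to the following PyTorch setup (a sketch assuming a recent torchvision release, not the baseline's actual training script).

```python
import torch
import torchvision

# Backbone: ImageNet-pretrained ResNet-50 (its classification head is replaced by the re-ID heads).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Optimizer settings reported above.
optimizer = torch.optim.Adam(backbone.parameters(), lr=3.5e-4, weight_decay=5e-4)

# Clustering, sampling, and input hyper-parameters reported above.
DBSCAN_EPS, DBSCAN_MIN_SAMPLES = 0.5, 4
NUM_EPOCHS = 50
BATCH_SIZE = 64            # 16 identities x 4 images per identity
IMAGE_SIZE = (384, 128)    # (height, width), i.e. 128x384 as reported
NUM_PARTS = 3              # uniform horizontal parts per person image
```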
##Simple Augmentation Method
The first experiment investigates using the data generated by the Simple Augmentation method in the training process of the person re-identification model. The results for the baseline network, as well as for the simple augmentation method with different generated images, are given in Table 1 in the rows labeled "Simple Aug". In each row, a different combination of images was evaluated, and the best results are reported in the table. Through various experiments and analyses, we observed that adding a limited number of generated images improves the evaluation metrics compared to the baseline method. However, when examining the number of clusters during training without any generated images, the clusters increase throughout the epochs, with the dataset containing 751 identities. When combining the original images with one generated image per original, the number of clusters continues to rise, and by the end of training the clusters nearly double compared to training without generated images. This indicates that adding generated images increases the number of clusters, meaning more images are identified as distinct identities. However, the number of true identities remains 751, and this method only increases the number of images per identity. Because the distribution of the generated images differs from that of the originals, the model treats the generated images as new identities during training, failing to group them with the original images into the same clusters.
##GAN Injection Method
In the previous experiment, we observed that after combining original images with generated images and performing clustering, the model failed to group original images and their generated variants into the same cluster. To address this challenge, we implemented the GAN Injection method. The results of this method with different numbers of generated images are shown in Table 1 in the rows labeled "GAN Injection". The goal of this experiment was to ensure that both types of images of the same identity are grouped into the same cluster. However, this method introduces potential noise. When pseudo-labels are assigned to original images through clustering, the same labels are also given to their generated counterparts. This can lead to noise in early training epochs, where images from different identities (e.g., individuals wearing similar clothing) are incorrectly clustered together and assigned identical pseudo-labels. As a result, generated images from distinct identities may end up in the same cluster, further amplifying the noise rather than preventing it. While the previous method dealt with incorrect pseudo-labels on a smaller scale, this approach increases the number of incorrect labels, confusing the model more than the original method. Although the noise decreases through the label correction process outlined in the baseline paper, the results table shows that adding even a small number of generated images (one image per cluster) improves performance compared to the baseline. However, as the number of generated images in the clustering process increases, the noise also rises, making it difficult for the label correction process to overcome this issue and ultimately reducing overall performance.
##Sequential Augmentation Method
In the previous experiments, we observed that increasing the number of generated images led to diminishing improvements. To address this, we implemented a Sequential Augmentation approach inspired by curriculum learning. The implementation setup was similar to before, with the key difference being an increase in epochs from 50 to 100. The training process was structured as follows: for example, in the first 25 epochs the model was trained using a combination of original images and those generated with a 135-degree rotation; for epochs 25 to 50, the images generated with a 90-degree rotation were introduced; during epochs 50 to 75, the 270-degree rotated images were used; and the final 25 epochs incorporated the images generated with a 180-degree rotation. This sequential scheme aims to improve the model's ability to learn distinctive features across different poses and perspectives. Most of the possible orderings of the generated images were evaluated, and for brevity Table 1 reports a subset of the results in the rows labeled "seq" (the order is from left to right). As discussed in the analysis of the simple augmentation method, increasing the number of generated images combined with the original images did not lead to improvements; in fact, it resulted in poorer performance than combining one or two generated images with the originals. The reason is that when the number of generated images exceeds the original ones, the likelihood of a training batch containing more generated than original images increases, so the training process is influenced more by features from the generated images. Since the distribution of the generated images differs from that of the originals, the model tends to learn the noise present in the generated images and neglects the distinctive features of the originals, which reduces performance. However, we observed in previous experiments that combining the original images with one generated image improved model learning. To address the distribution mismatch and the larger number of generated images, we injected a combination of the original images and one generated image per original into the model sequentially. The results of applying different sequences of generated images are shown in Table 1. As the results indicate, using the four generated images sequentially improved the model's learning compared to injecting all four at once, which reduced the model's ability to learn distinctive features. This suggests that injecting generated data sequentially helps prevent the model from overfitting on these data. Consequently, the model can learn additional features from the diverse poses of the generated images without neglecting the key distinctive features present in the original images.
##Mesh Combination Method
Several experiments were conducted using a combination of mesh-image features, original images, and generated images. The implementation settings were consistent with those described before; as in the sequential augmentation method, the number of epochs was increased from 50 to 100. The experimental results are divided into two groups, fixed-weight and variable-weight methods, with each variant shown in a separate row of Table 1. For the fixed-weight approach, the weight assigned to the original images was set to 2 and the weight for the masked images to 1. The experiments were structured with both fixed and variable weight settings, and the generated images were added sequentially. The rows of Table 1 labeled "fixed combination" present the results of the feature combination method with fixed weights, both with and without generated images. The row labeled "usual fixed combination" shows the results of using only the original images with fixed weights, which improved all evaluation metrics. This is because combining original-image features with mesh-image features puts more focus on the individual in the image, leading to better clustering by grouping same-identity images into the same cluster. In the next experiment, generated images were included. As observed in the previous data augmentation experiments, combining one generated image with the originals during training further improved the evaluation metrics, so the generated images were applied to the model sequentially. The results indicate improvement in certain evaluation metrics with this sequential approach. The row of Table 1 labeled "var combination" reports the results of the feature combination method with variable weights learned during training, together with sequentially added generated images. As seen, accuracy increased compared to both the baseline method and the fixed-weight approach discussed previously. As expected, using the optimal sequence of generated images (135-270-90-180 degrees) produced better results than the baseline. This suggests that learning the weights of the features from the original and mesh images can further improve model performance for identifying same-identity persons. The results show that, during training, the model learns to assign more importance to the relevant features from either the original or the mesh images, reducing the loss and enhancing person re-identification accuracy. In contrast, with fixed weights, the same importance is given to both sets of features throughout training, which can increase the loss and reduce the model’s ability to learn distinctive features. This observation is consistent with the expectation that variable weighting improves performance by better capturing the distinguishing characteristics in the images.
Table 1: Evaluation results (Rank-1, Rank-5, Rank-10, and mAP) on Market-1501 for the baseline and the proposed augmentation methods.
In this research, we explored the unsupervised person re-identification problem using GANs. The integration of GANs with unsupervised learning for this task is still in its early stages, with most studies focusing on domain adaptation or fully unsupervised methods. We proposed a fully unsupervised approach that uses GANs to generate new data for augmentation and designed new features to enhance person re-identification model performance. We observed that each method yields different improvements in terms of the CMC metrics and mAP. The proposed approach is easy to implement, and a number of further extensions can be made. For instance, we could use pre-trained weights from the CLIP model instead of ImageNet pre-trained weights for the ResNet backbone. CLIP is a neural network trained on 400 million diverse text-image pairs, while the ImageNet dataset contains only about 1 million images; models trained on larger and more diverse datasets tend to perform better and are more robust to various challenges.
[1] Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, and Qi Tian. 2020. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3390–3399.
[2] Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S. Huang. 2019. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6112–6121.
[3] Hao Chen, Benoit Lagadec, and Francois Bremond. 2021. ICE: Inter-instance contrastive encoding for unsupervised person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14960–14969.
[4] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020).
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
[6] Yoonki Cho, Woo Jae Kim, Seunghoon Hong, and Sung-Eui Yoon. 2022. Part-based pseudo label refinement for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7308–7318.
[7] Hao Chen, Yaohui Wang, Benoit Lagadec, Antitza Dantcheva, and Francois Bremond. 2021. Joint generative and contrastive learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2004–2013.
[8] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7122–7131.