This project implements an automated pipeline for face detection, embedding generation, and clustering using deep learning and unsupervised machine learning techniques. The pipeline uses MTCNN (Multi-task Cascaded Convolutional Neural Networks) for face detection and InceptionResNetV1, pre-trained on the VGGFace2 dataset, to generate 512-dimensional facial embeddings. The embeddings are then clustered with K-Means to group similar faces. The pipeline also evaluates the clustering with the Silhouette Score, Davies-Bouldin Score, and Calinski-Harabasz Score, and visualizes the embeddings with t-SNE and PCA. The results show that the pipeline can organize a large set of face images into clusters, although the metrics indicate considerable overlap between them. Potential applications include facial recognition, surveillance, and dataset preprocessing.
The dataset used for this project originates from Kaggle’s Face Detection Dataset. This dataset comprises a diverse collection of face images, exhibiting variations in pose, lighting, and background. It’s divided into training and validation sets. The training set comprises 13,386 images, while the validation set contains 2,000 images. These datasets are employed for training and evaluating the face detection and clustering pipelines, respectively.
Face detection is performed with Multi-task Cascaded Convolutional Neural Networks (MTCNN), a cascaded detector that proposes candidate face regions and then refines them. The MTCNN model processes each image and returns bounding boxes around the detected faces. Each bounding box is used to crop the face, and the crop is resized to 160x160 pixels so that every face has the same dimensions before further processing.
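The crop-and-resize step can be sketched as follows. This is a minimal illustration using only NumPy, not the project's actual code: the `crop_face` helper, the placeholder image, and the placeholder box are all hypothetical, and the box is assumed to arrive as `(x1, y1, x2, y2)` pixel coordinates from the detector. A nearest-neighbour resize is used to keep the sketch dependency-free; a real pipeline would more likely use an image library's interpolating resize.

```python
import numpy as np


def crop_face(image: np.ndarray, box, size: int = 160) -> np.ndarray:
    """Crop box (x1, y1, x2, y2) from an H x W x C image and resize to size x size."""
    height, width = image.shape[:2]
    # Detectors can return boxes that extend past the image border, so clamp first.
    x1 = max(int(round(box[0])), 0)
    y1 = max(int(round(box[1])), 0)
    x2 = min(int(round(box[2])), width)
    y2 = min(int(round(box[3])), height)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bounding box does not overlap the image")
    face = image[y1:y2, x1:x2]
    # Nearest-neighbour resize: choose a source row/column for every output pixel.
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    return face[rows][:, cols]


image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image (assumption)
box = (-12.4, 50.0, 210.7, 300.2)                # placeholder box, partly off-image
face = crop_face(image, box)
print(face.shape)  # (160, 160, 3)
```

Clamping the box before slicing matters because a negative index would silently wrap around in NumPy rather than raise an error.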
To generate a feature embedding for each detected face, the pipeline uses InceptionResNetV1, a deep learning model pre-trained on the VGGFace2 dataset. For every face crop, the model outputs a 512-dimensional vector that summarizes the face's distinguishing characteristics. These embeddings place faces in a high-dimensional feature space in which similar faces are expected to lie close together.
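The idea that similar faces lie close together in the embedding space can be illustrated with a small sketch. The vectors below are synthetic stand-ins, not real model outputs: two noisy copies of one random 512-dimensional vector play the role of two photos of the same person, and an independent random vector plays the role of a different person. The noise level and the use of Euclidean distance on L2-normalized vectors are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so distances are comparable."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)


# Synthetic "identities": in the real pipeline these would be model embeddings.
identity_a = rng.normal(size=512)
identity_b = rng.normal(size=512)

# Two "photos" of identity A and one of identity B, each with small random noise.
photo_a1 = l2_normalize(identity_a + 0.1 * rng.normal(size=512))
photo_a2 = l2_normalize(identity_a + 0.1 * rng.normal(size=512))
photo_b = l2_normalize(identity_b + 0.1 * rng.normal(size=512))

same_person = np.linalg.norm(photo_a1 - photo_a2)
different_person = np.linalg.norm(photo_a1 - photo_b)
print(f"same person: {same_person:.3f}, different person: {different_person:.3f}")
```

K-Means relies on exactly this property: it assigns each embedding to the nearest cluster center by Euclidean distance.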
The embeddings are clustered using the K-Means algorithm, which partitions them into a predetermined number of clusters (K) by assigning each embedding to its nearest cluster center. The number of clusters is chosen with the Elbow Method: K-Means is run for a range of K values, the within-cluster sum of squared distances is plotted against K, and the value at which the curve flattens is selected. The pipeline outputs the cluster centers, along with a cluster assignment for each image in the training and validation datasets.
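The Elbow Method and the final fit can be sketched with scikit-learn, assuming that is the library in use (the source does not name it). The embeddings here are synthetic blobs standing in for face embeddings, the train/validation split sizes are arbitrary, and `chosen_k` is a hypothetical value that would normally be read off the inertia curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for face embeddings (assumption: 16 dimensions, 4 true groups).
data, _ = make_blobs(n_samples=600, n_features=16, centers=4, random_state=0)
train_embeddings, val_embeddings = data[:500], data[500:]

# Elbow Method: record the within-cluster sum of squares (inertia) for each K.
candidate_ks = list(range(2, 9))
inertias = []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_embeddings)
    inertias.append(model.inertia_)

for k, inertia in zip(candidate_ks, inertias):
    print(f"K={k}: inertia={inertia:.1f}")

# Hypothetical choice: in practice, pick the K where the curve stops dropping sharply.
chosen_k = 4
final_model = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit(train_embeddings)

centers = final_model.cluster_centers_            # one center per cluster
train_labels = final_model.labels_                # assignments for the training set
val_labels = final_model.predict(val_embeddings)  # assignments for held-out data
```

Inertia always tends to fall as K grows, so the method looks for the point of diminishing returns rather than the minimum.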
To assess the clustering outcomes, three metrics are computed. The Silhouette Score measures how much closer each point is to its own cluster than to the nearest other cluster. The Davies-Bouldin Score compares the scatter within each cluster to its distance from the most similar other cluster. The Calinski-Harabasz Score is the ratio of between-cluster dispersion to within-cluster dispersion. In addition, the dimensionality reduction techniques t-SNE and PCA are used to project the embeddings into 2D, which makes cluster separations and overlaps easier to see.
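A minimal sketch of this evaluation step is shown below, assuming scikit-learn (the source does not name the library). The data is synthetic and well separated, so the scores it produces say nothing about the face embeddings; the point is only the shape of the computation. The t-SNE perplexity is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for face embeddings and their cluster labels.
embeddings, _ = make_blobs(n_samples=300, n_features=16, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

scores = {
    "silhouette": silhouette_score(embeddings, labels),                # higher is better
    "davies_bouldin": davies_bouldin_score(embeddings, labels),        # lower is better
    "calinski_harabasz": calinski_harabasz_score(embeddings, labels),  # higher is better
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")

# 2D projections for plotting; each row is the (x, y) position of one embedding.
pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

The two projections can then be drawn as scatter plots colored by `labels`. PCA is linear and preserves global variance, while t-SNE emphasizes local neighborhoods, so distances between far-apart groups in a t-SNE plot should not be over-interpreted.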
The pipeline processed all 13,386 images in the training set of the Kaggle Face Detection Dataset and detected and embedded 11,912 faces. Counting one face per image, this is a success rate of about 89%, which suggests that MTCNN handles most of the dataset well. The remaining 11% or so of images (1,474) failed, with likely causes including occlusion, extreme poses, and poor image quality.
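The detection rate follows directly from the two reported counts. The short calculation below assumes that each successfully processed image contributes exactly one embedded face, which the source implies but does not state.

```python
total_images = 13_386    # training images processed
embedded_faces = 11_912  # faces detected and embedded

# Assumption: one embedded face per successfully processed image.
missed = total_images - embedded_faces
success_rate = embedded_faces / total_images
failure_rate = missed / total_images

print(f"embedded: {success_rate:.2%}, missed: {missed} ({failure_rate:.2%})")
# embedded: 88.99%, missed: 1474 (11.01%)
```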
The embeddings were clustered using K-Means, with the number of clusters (K) chosen by the Elbow Method. Each resulting cluster grouped faces that look broadly similar. The cluster sizes were imbalanced, however: Cluster 1 was underrepresented, holding only 9.54% of the dataset, while the remaining clusters were fairly even at 20% to 25% of the dataset each.
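Cluster shares like those reported above can be computed from the array of cluster assignments. The sketch below uses a tiny hypothetical label array and a hypothetical helper, `cluster_shares`; note that labels produced by libraries such as scikit-learn start at 0, and the source does not say how its cluster numbers map to those labels.

```python
import numpy as np


def cluster_shares(labels: np.ndarray, n_clusters: int) -> np.ndarray:
    """Return the percentage of samples assigned to each cluster."""
    # minlength keeps empty clusters in the result with a count of zero.
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum() * 100


# Hypothetical assignments for ten images across five clusters.
labels = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 4])
shares = cluster_shares(labels, n_clusters=5)

for cluster_id, share in enumerate(shares):
    print(f"Cluster {cluster_id}: {share:.2f}%")
```

The same percentages can feed a bar plot of the cluster distribution.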
To assess the quality of the clusters, three metrics were calculated. The Silhouette Score was 0.0306 on a scale from -1 to 1; a value this close to 0 indicates that the clusters overlap heavily in the high-dimensional embedding space. The Davies-Bouldin Score, for which lower is better and 0 is the minimum, was 4.3044, which points to large within-cluster scatter relative to the distance between cluster centers. The Calinski-Harabasz Score was 328.9357; higher values indicate denser, better-separated clusters, but the score has no fixed scale, so it is mainly useful for comparing this result against other clusterings of the same embeddings. Taken together, the metrics show substantial room for improvement, which is not surprising given the variation in pose, lighting, and background in the dataset.
To gain deeper insight into the clustering, visualizations were generated using t-SNE and PCA. The t-SNE visualization revealed overlapping regions among the clusters, although Clusters 3 and 4 showed some degree of separation. The PCA visualization displayed the same pattern, suggesting that the embedding space could benefit from further refinement. The bar plot of cluster distributions highlighted the imbalance, with Cluster 1 containing far fewer images than the other clusters.
Overall, the pipeline successfully demonstrated the ability to cluster similar faces. However, the clustering quality could be enhanced by experimenting with alternative clustering algorithms, fine-tuning the embedding model, or addressing the cluster imbalance.