NEXTGENRESUME: Resume Classification Using DistilBERT and Cosine Similarity

🧭 Executive Summary

NEXTGENRESUME is an AI-powered resume classification engine that matches candidate profiles to job descriptions using natural language understanding. The project uses DistilBERT embeddings and cosine similarity to automate early-stage screening: it processes a resume in roughly 3.2 seconds and ranks candidates by semantic relevance rather than keyword overlap.

🔍 Current State & Gap Identification

Manual resume screening is time-consuming, subjective, and inconsistent. Traditional keyword-based ATS (Applicant Tracking Systems) often fail to understand semantic context and overlook high-potential candidates due to rigid filters. There's a clear need for:

Context-aware matching
Scalability in bulk resume processing
Reduction in recruiter workload

📌 Methodology

Solution Approach

Extract resume content from PDFs using pdfplumber
Preprocess text to isolate skills, education, and project details
Generate embeddings for resumes and job descriptions via DistilBERT
Compute cosine similarity scores to rank relevance
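
The embedding step above can be sketched as follows. This README does not state which checkpoint or pooling strategy the project uses, so the `distilbert-base-uncased` checkpoint, the mean pooling over tokens, and the function names `masked_mean_pool` and `embed_texts` are all assumptions made for illustration.

```python
import numpy as np


def masked_mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per text, ignoring padding positions.

    token_vectors: shape (batch, tokens, hidden)
    attention_mask: shape (batch, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts


def embed_texts(texts, model_name="distilbert-base-uncased"):
    """Return one vector per text. Checkpoint name and pooling are assumptions."""
    # Heavy dependencies are imported lazily so the pooling helper works on its own.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # DistilBERT accepts at most 512 tokens, so longer resumes are truncated here.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)

    return masked_mean_pool(
        output.last_hidden_state.numpy(),
        batch["attention_mask"].numpy(),
    )
```

Calling `embed_texts([resume_text, job_text])` would return a two-row array, one vector per input. Because of the 512-token limit, a long resume may need to be split into sections and the section vectors averaged; that choice is not described in this README.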

Design Decisions

Chose DistilBERT over BERT for faster inference (3.2s vs. 7.8s per document) at a small cost in precision (91.3% vs. 92.1%)
Used cosine similarity for lightweight, effective matching
CSV-based skill mapping for flexibility and extensibility
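
To illustrate why a CSV-based skill mapping is easy to extend, here is a minimal sketch of loading one. The column names (`skill`, `alias`, `category`) and the sample rows are hypothetical; the project's actual file layout is not shown in this README.

```python
import csv
import io

# Hypothetical CSV layout: one row per (canonical skill, alias, category).
SAMPLE_CSV = """skill,alias,category
Python,python,programming
Python,py,programming
Machine Learning,machine learning,data science
Machine Learning,ml,data science
SQL,sql,databases
"""


def load_skill_map(csv_file):
    """Return {lowercased alias: canonical skill} from an open CSV file."""
    alias_to_skill = {}
    for row in csv.DictReader(csv_file):
        alias_to_skill[row["alias"].strip().lower()] = row["skill"].strip()
    return alias_to_skill


skill_map = load_skill_map(io.StringIO(SAMPLE_CSV))
print(skill_map["ml"])  # Machine Learning
```

Under this layout, adding a new skill or alias means appending a row to the CSV, with no code change.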

🧪 Evaluation Framework

Performance Metrics

Metric	Value
Avg. Similarity @Top-5	0.84
Processing Time/Resume	~3.2s
Skill Extraction Precision	91.3%
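
The README does not define how these metrics were computed. One plausible reading, sketched below with toy values that are not the project's data, is that Avg. Similarity @Top-5 is the mean of the five highest similarity scores for a job description, and that precision is the share of extracted skills that are correct.

```python
def avg_similarity_at_k(scores, k=5):
    """Mean of the k highest similarity scores for one job description."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def precision(predicted, relevant):
    """Fraction of predicted items that are correct: TP / (TP + FP)."""
    predicted, relevant = set(predicted), set(relevant)
    if not predicted:
        return 0.0
    return len(predicted & relevant) / len(predicted)


# Toy values for illustration only.
scores = [0.91, 0.62, 0.88, 0.79, 0.85, 0.40, 0.83]
print(round(avg_similarity_at_k(scores, k=5), 3))  # 0.852

extracted = {"python", "sql", "excel", "docker"}
labelled = {"python", "sql", "docker", "pandas"}
print(precision(extracted, labelled))  # 0.75
```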

Comparative Analysis

Model	Precision	Time (per doc)
BERT	92.1%	7.8s
DistilBERT	91.3%	3.2s
TF-IDF	74.6%	1.1s

📊 Dataset Sources & Description

Resumes: Scraped, anonymized academic and industry resumes
Job Descriptions: Curated from LinkedIn, Glassdoor, and job portals

Dataset Summary

Component	Count
Resumes Processed	100+
Unique Skills	700+
Job Roles	10

🧹 Data Processing Pipeline

import pdfplumber
import re

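with pdfplumber.open("resume.pdf") as pdf:
    # Pages without a text layer may yield None or ''; joining with newlines
    # keeps words at page breaks from running together.
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)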

Skills: Matched using regex and a predefined skill dictionary
Education: Extracted using patterns (e.g., B.Tech, M.Sc., MBA)
Cleaned & stored in CSV for traceability
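
A minimal sketch of the dictionary and pattern matching described above. The skill list and the degree pattern are illustrative stand-ins, since the project's actual dictionary and regexes are not shown here.

```python
import re

# Illustrative entries only; not the project's skill dictionary.
SKILLS = ["Python", "SQL", "Machine Learning", "C++"]

# Degree abbreviations with optional dots, e.g. B.Tech, M.Sc, MBA.
EDUCATION_PATTERN = re.compile(
    r"(?<!\w)(B\.?Tech|M\.?Tech|B\.?Sc|M\.?Sc|MBA|Ph\.?D)(?!\w)",
    re.IGNORECASE,
)


def extract_skills(text, skills=SKILLS):
    """Return dictionary skills that appear in the text as whole terms."""
    found = []
    for skill in skills:
        # Lookarounds instead of \b so that terms like "C++" match correctly
        # and "SQL" does not match inside "MySQL".
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


def extract_education(text):
    """Return degree mentions found in the text."""
    return [match.group(1) for match in EDUCATION_PATTERN.finditer(text)]


sample = "B.Tech in CSE. Skills: python, MySQL, C++ and machine learning."
print(extract_skills(sample))     # ['Python', 'Machine Learning', 'C++']
print(extract_education(sample))  # ['B.Tech']
```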

🛠 Implementation Stack

Tool / Library	Purpose
pdfplumber	PDF parsing
Transformers (Hugging Face)	Embedding generation
scikit-learn	Cosine similarity computation
Pandas	Data storage and manipulation
Jupyter	Development environment

🚀 Deployment Considerations

Suitable for deployment via Flask/FastAPI backend
Model can be containerized using Docker
Supports batch processing or RESTful resume uploads

Monitoring & Maintenance

Track similarity trends via logging
Periodically update the skill dictionary and regenerate embeddings
Can integrate with recruiter feedback for active learning
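
As a rough sketch of the logging idea above, the helper below records summary statistics for each job's similarity scores so that drift can be spotted over time. The logger name, the function name, and the choice of statistics are assumptions, not part of the project.

```python
import logging
import statistics

logger = logging.getLogger("resume_matcher")  # hypothetical logger name


def log_similarity_summary(job_id, scores):
    """Log and return summary statistics for one job's similarity scores."""
    if not scores:
        logger.warning("no similarity scores for job %s", job_id)
        return {"job_id": job_id, "n_resumes": 0}

    summary = {
        "job_id": job_id,
        "n_resumes": len(scores),
        "mean": statistics.fmean(scores),
        "max": max(scores),
    }
    logger.info("similarity summary: %s", summary)
    return summary
```

A falling mean or maximum across runs for the same job role could be a cue to refresh the skill dictionary or review recent inputs.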

📈 Key Results & Interpretations

Top-ranked resumes correlated strongly with job-fit ratings from the HR test set
Recommender performance was consistent across multiple domains
System favored resumes with strong project and skills alignment

⚠️ Limitations

Does not extract work experience duration (future NER-based solution planned)
Struggles with heavily styled resumes or scanned documents
Domain-specific job roles (e.g., healthcare, legal) need fine-tuned models

🧠 Insights & Implications

Semantic search in HRTech is achievable at scale
Lightweight transformer models like DistilBERT offer real-time usability
This model architecture is adaptable for LinkedIn Job Matching, Gig Platforms, or University Placements

🏢 Industry Relevance

This solution addresses a core recruitment pain point in:

Startups and SMEs without dedicated HR teams
Campus Placement Cells
Online Job Portals seeking AI-based shortlisting

📚 Source Credibility

Model: DistilBERT on Hugging Face
Dataset: Scraped via job portals and anonymized academic CVs
Inspiration: CV-JD Matching Repo

🌟 Uncommon Insights

Resumes with project narratives outperform generic ones
Cosine similarity boosts effectiveness when embedding length is normalized
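
The normalization point can be read in more than one way. Under one reading, that embeddings are scaled to unit L2 norm before comparison, the sketch below shows two facts: cosine similarity is unchanged when a vector is rescaled, and after normalization it reduces to a plain dot product. The helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
scaled_a = [10 * x for x in a]

# Rescaling a vector does not change its cosine similarity to another vector.
print(math.isclose(cosine(a, b), cosine(scaled_a, b)))  # True

# For unit-length vectors, the dot product is the cosine similarity.
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b)))  # True
```

In practice this means embeddings can be normalized once when stored, after which each comparison is a single dot product.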

📌 Visual Architecture

[Resume PDFs] + [Job Descriptions]
         ↓
 [Text Extraction → Cleaning]
         ↓
 [DistilBERT Embeddings]
         ↓
 [Cosine Similarity Matrix]
         ↓
 [Top-N Recommendations]
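
The last two stages of the diagram can be sketched with NumPy as below. The project lists scikit-learn for the similarity computation; this version spells out the arithmetic instead, and the function names and toy vectors are illustrative only.

```python
import numpy as np


def cosine_similarity_matrix(resume_vecs, job_vecs):
    """Return a (n_resumes, n_jobs) matrix of cosine similarities."""
    resume_norms = np.linalg.norm(resume_vecs, axis=1, keepdims=True)
    job_norms = np.linalg.norm(job_vecs, axis=1, keepdims=True)
    # Clip the norms so an all-zero vector gives similarity 0 rather than NaN.
    resumes = resume_vecs / np.clip(resume_norms, 1e-12, None)
    jobs = job_vecs / np.clip(job_norms, 1e-12, None)
    return resumes @ jobs.T


def top_n_resumes(similarity, job_index, n=5):
    """Return (resume_index, score) pairs for one job, best match first."""
    column = similarity[:, job_index]
    order = np.argsort(column)[::-1][:n]
    return [(int(i), float(column[i])) for i in order]


# Toy 2-D vectors standing in for real embeddings.
resume_vecs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
job_vecs = np.array([[1.0, 0.0]])

similarity = cosine_similarity_matrix(resume_vecs, job_vecs)
for index, score in top_n_resumes(similarity, job_index=0, n=2):
    print(index, round(score, 3))
```

With these toy inputs, resume 0 points in the same direction as the job vector and ranks first, followed by resume 1.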

🧾 Summary of Findings

Resume-job alignment benefits greatly from context-aware embeddings
Embedding-based similarity scores outperformed the keyword-style TF-IDF baseline (91.3% vs. 74.6% precision)
This project showcases a plug-and-play model for intelligent resume screening

💥 Conclusion: Why It Matters

This project is a practical, low-cost implementation of AI in HR that changes how organizations can approach talent acquisition. It is scalable, explainable, and adaptable for real-world deployment, making it a strong entry point into semantic recruitment solutions.

🔗 Repository

👉 GitHub: NEXTGENRESUME by Bhavya Srujana