NEXTGENRESUME is an AI-powered resume classification engine that matches candidate profiles to job descriptions using natural language understanding. This project leverages DistilBERT embeddings and cosine similarity to automate early-stage recruitment, drastically reducing screening time while improving relevance.
Current State & Gap Identification
Manual resume screening is time-consuming, subjective, and inconsistent. Traditional keyword-based Applicant Tracking Systems (ATS) often miss semantic context and overlook high-potential candidates because of rigid filters. There is a clear need for:
Context-aware matching
Scalability in bulk resume processing
Reduction in recruiter workload
Methodology
Solution Approach
Extract resume content from PDFs using pdfplumber
Preprocess text to isolate skills, education, and project details
Generate embeddings for resumes and job descriptions via DistilBERT
Compute cosine similarity scores to rank relevance
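The embedding and ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `distilbert-base-uncased` checkpoint, the mean pooling over tokens, and the function names `embed_texts`, `cosine_similarity`, and `rank_resumes` are all assumptions. Cosine similarity is written in plain Python here to keep the sketch dependency-free; scikit-learn's `cosine_similarity` would serve the same purpose.

```python
import math


def embed_texts(texts, model_name="distilbert-base-uncased"):
    """Return one mean-pooled DistilBERT vector (list of floats) per text.

    Assumption: the checkpoint name and mean pooling are illustrative choices;
    the project does not state which ones it uses.
    """
    # Imported lazily so the ranking helpers below work without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average the token vectors, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_resumes(job_vector, resume_vectors):
    """Return (resume_id, score) pairs sorted from most to least similar."""
    scores = [(rid, cosine_similarity(job_vector, vec))
              for rid, vec in resume_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

In use, one would embed the job description and each resume with `embed_texts`, then pass the vectors to `rank_resumes`. Texts longer than 512 tokens are truncated in this sketch, which may matter for long resumes.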
Design Decisions
Chose DistilBERT over BERT for faster inference with a small accuracy trade-off (3.2 s vs. 7.8 s per document, 91.3% vs. 92.1% precision)
Used cosine similarity for lightweight, effective matching
Used a CSV-based skill mapping for flexibility and extensibility
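As an illustration of the CSV-based skill mapping, the sketch below loads alias-to-skill rows into a dictionary. The column names `alias` and `skill`, the sample rows, and the function name `load_skill_map` are hypothetical; the project's real CSV layout is not described here.

```python
import csv
import io


def load_skill_map(csv_file):
    """Read (alias, skill) rows into a lowercase alias -> canonical skill dict."""
    reader = csv.DictReader(csv_file)
    skill_map = {}
    for row in reader:
        alias = row["alias"].strip().lower()
        if alias:
            skill_map[alias] = row["skill"].strip()
    return skill_map


# Inline sample standing in for a skills CSV on disk (hypothetical contents).
SAMPLE_CSV = """alias,skill
py,Python
python,Python
sklearn,scikit-learn
"""

skill_map = load_skill_map(io.StringIO(SAMPLE_CSV))
```

Keeping the mapping in a CSV means new skills or aliases can be added by editing a file rather than changing code, which is presumably the flexibility the design decision refers to.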
Evaluation Framework
Performance Metrics
| Metric | Value |
| --- | --- |
| Avg. Similarity @Top-5 | 0.84 |
| Processing Time/Resume | ~3.2 s |
| Skill Extraction Precision | 91.3% |
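For reference, metrics of this kind could be computed along the following lines. The definitions are assumptions, since the exact formulas are not given: top-k similarity is taken as the mean of the k highest scores per job, averaged across jobs, and precision as the share of extracted skills that appear in a hand-labelled set. Both function names are illustrative.

```python
def mean_top_k_similarity(scores_per_job, k=5):
    """Average the k highest similarity scores for each job, then across jobs."""
    per_job = []
    for scores in scores_per_job.values():
        top = sorted(scores, reverse=True)[:k]
        if top:
            per_job.append(sum(top) / len(top))
    return sum(per_job) / len(per_job) if per_job else 0.0


def extraction_precision(predicted, gold):
    """Share of predicted skills that also appear in the hand-labelled set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)
```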
Comparative Analysis
| Model | Precision | Time (per doc) |
| --- | --- | --- |
| BERT | 92.1% | 7.8 s |
| DistilBERT | 91.3% | 3.2 s |
| TF-IDF | 74.6% | 1.1 s |
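Per-document timings like those above could be measured with a small harness such as the one below. It is a sketch under assumptions: `score_fn` stands for whichever pipeline is being timed, and the best-of-several-runs approach is one common choice, not necessarily how the reported figures were obtained.

```python
import time


def seconds_per_document(score_fn, documents, repeats=3):
    """Return the best-of-`repeats` average wall-clock time per document."""
    if not documents:
        raise ValueError("documents must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            score_fn(doc)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(documents))
    return best
```

Timing each model on the same documents and hardware keeps the comparison fair. Model loading should happen before the call so that only inference is measured.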
Dataset Sources & Description
Resumes: Scraped, anonymized academic and industry resumes
Job Descriptions: Curated from LinkedIn, Glassdoor, and job portals
Dataset Summary
| Component | Count |
| --- | --- |
| Resumes Processed | 100+ |
| Unique Skills | 700+ |
| Job Roles | 10 |
Data Processing Pipeline
import pdfplumber
import re

# extract_text() can return None for a page with no text layer, so fall back to ""
with pdfplumber.open("resume.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
Skills: Matched using regex and a predefined skill dictionary
Education: Extracted using patterns (e.g., B.Tech, M.Sc., MBA)
Cleaned & stored in CSV for traceability
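The extraction and storage steps above might look roughly like this. The skill list, the degree pattern (built from the examples above), the CSV columns, and all function names are hypothetical stand-ins for the project's own dictionary and schema.

```python
import csv
import io
import re

# Hypothetical skill dictionary; the real one would come from the project's CSV.
SKILLS = ["Python", "SQL", "Machine Learning", "C++"]

# Degree patterns taken from the examples above; extend as needed.
EDUCATION_RE = re.compile(r"(?<!\w)(?:B\.?Tech|M\.?Sc\.?|MBA)(?!\w)", re.IGNORECASE)


def extract_skills(text, skills=SKILLS):
    """Return dictionary skills found in text as whole terms, case-insensitively."""
    found = []
    for skill in skills:
        # Lookarounds instead of \b so terms ending in symbols (e.g. "C++") match.
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


def extract_education(text):
    """Return degree mentions in the order they appear."""
    return [match.group(0) for match in EDUCATION_RE.finditer(text)]


def write_rows(rows, csv_file):
    """Write one row per resume so every extraction can be traced later."""
    writer = csv.DictWriter(csv_file, fieldnames=["resume_id", "skills", "education"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


sample = "B.Tech graduate skilled in python, SQL and C++. Pursuing MBA."
buffer = io.StringIO()  # stands in for an output file opened with newline=""
write_rows([{
    "resume_id": "r1",
    "skills": ";".join(extract_skills(sample)),
    "education": ";".join(extract_education(sample)),
}], buffer)
```

A plain dictionary lookup like this is easy to audit, but it only finds skills that are already listed, so recall depends on how complete the dictionary is.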
Implementation Stack
| Tool / Library | Purpose |
| --- | --- |
| pdfplumber | PDF parsing |
| Transformers (Hugging Face) | Embedding generation |
| scikit-learn | Cosine similarity computation |
| Pandas | Data storage and manipulation |
| Jupyter | Development environment |
Deployment Considerations
Suitable for deployment via Flask/FastAPI backend
Model can be containerized using Docker
Supports batch processing or RESTful resume uploads
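Batch processing could be organized along these lines. This is only a sketch: `score_batch` and its `read_text` and `score_text` parameters are hypothetical, standing in for the PDF extraction and similarity scoring steps described earlier.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(resume_paths, read_text, score_text, max_workers=4):
    """Score many resumes concurrently and return {path: score or None}.

    read_text and score_text are caller-supplied callables (hypothetical here):
    one turns a path into text, the other turns text into a similarity score.
    """
    def process(path):
        try:
            return path, score_text(read_text(path))
        except Exception:  # one unreadable file should not abort the batch
            return path, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process, resume_paths))
```

Threads mainly help while files are being read. For the model itself, embedding several resumes in one tokenizer and model call is likely the more effective way to raise throughput. The same function could sit behind a Flask or FastAPI upload endpoint.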
Monitoring & Maintenance
Track similarity trends via logging
Periodically update the skill dictionary and regenerate embeddings
Can integrate with recruiter feedback for active learning
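Tracking similarity trends through logging might be done with a helper like the one below. The logger name, the drift threshold of 0.1, and the idea of comparing against a stored baseline mean are illustrative assumptions rather than part of the project.

```python
import logging
import statistics

logger = logging.getLogger("resume_matcher")  # hypothetical logger name


def log_similarity_summary(job_id, scores, baseline_mean=None, drift_threshold=0.1):
    """Log mean/min/max similarity for one job and flag drift from a baseline."""
    if not scores:
        raise ValueError("scores must not be empty")
    summary = {
        "job_id": job_id,
        "count": len(scores),
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
    }
    logger.info("similarity summary: %s", summary)

    # Drift is only checked when a baseline from earlier runs is supplied.
    drifted = (baseline_mean is not None
               and abs(summary["mean"] - baseline_mean) > drift_threshold)
    if drifted:
        logger.warning("mean similarity for %s moved from %.2f to %.2f",
                       job_id, baseline_mean, summary["mean"])
    return summary, drifted
```

A sustained shift in the mean score for a role can be a cue to review the skill dictionary or the job description text before trusting the rankings.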
Key Results & Interpretations
Top-ranked resumes correlated strongly with job-fit ratings from the HR test set
Recommender performance was consistent across multiple domains
System favored resumes with strong project and skills alignment
Limitations
Does not extract work experience duration (future NER-based solution planned)
Struggles with heavily styled resumes or scanned documents
Domain-specific job roles (e.g., healthcare, legal) need fine-tuned models
Insights & Implications
Semantic search in HRTech is achievable at scale
Lightweight transformer models like DistilBERT offer real-time usability
This model architecture is adaptable to LinkedIn job matching, gig platforms, or university placements
Industry Relevance
This solution addresses a core recruitment pain point:
Resume-job alignment benefits greatly from context-aware embeddings
Simple similarity scores over embeddings outperformed the keyword-style TF-IDF baseline (91.3% vs. 74.6% precision)
This project showcases a plug-and-play model for intelligent resume screening
Conclusion: Why It Matters
This project is a practical, low-cost implementation of AI in HR, transforming how organizations approach talent acquisition. It's scalable, explainable, and adaptable for real-world deployment, marking a strong entry into semantic recruitment solutions.