Carrer Path Guidance AI Assistant in Data Science

Abstract

This project presents an intelligent AI assistant built using Retrieval-Augmented Generation (RAG) that offers personalized career path guidance in the domains of Data Science, Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), Computer Vision (CV), and Agentic AI. Designed as part of the Agentic AI Developer Certification Program (AAIDC 2025), the assistant integrates a vector store (FAISS), custom document embeddings, and the Mistral large language model to deliver context-aware, reliable answers to user queries.

A high-quality domain-specific PDF corpus was created, detailing roles, skills, learning paths, and certification suggestions for aspiring AI professionals. The content was embedded using HuggingFace’s all-MiniLM-L6-v2 model and stored in a FAISS index. During runtime, the assistant retrieves relevant context from the vector store and uses the Mistral API to generate intelligent, real-time answers. The system runs entirely on Google Colab, ensuring reproducibility and ease of deployment.

This assistant demonstrates how RAG architecture can be used not only for document QA but also to build agentic tools that provide career mentorship, technical recommendations, and learning path suggestions—all grounded in reliable embedded content. The modular and scalable nature of the project makes it suitable for expansion into other education or advisory domains.

Introduction

In the rapidly evolving world of Artificial Intelligence (AI) and Data Science, students and professionals are often overwhelmed by the diverse career options, emerging tools, and ever-changing skill requirements. Navigating this complex ecosystem requires reliable guidance—something traditional static resources or generic chatbots often fail to provide. This project addresses that challenge by developing an intelligent Career Path Guidance AI Assistant that uses Retrieval-Augmented Generation (RAG) and Mistral API to deliver highly personalized and contextually grounded career recommendations.

Unlike conventional chatbots that rely solely on pre-trained models, our assistant leverages the RAG framework to retrieve information from a custom, embedded PDF corpus crafted specifically to cover essential career guidance topics in AI, ML, Deep Learning, Computer Vision, NLP, and Agentic AI. By integrating FAISS for vector-based search and HuggingFace embeddings for semantic understanding, the system ensures that each response is both relevant and grounded in reliable domain-specific content.

The assistant is deployed on Google Colab for accessibility and reproducibility, making it a practical solution for students, job seekers, and early-career professionals looking for actionable insights into AI-related career paths. This project not only demonstrates the technical implementation of a RAG-based assistant but also showcases how such systems can serve real-world mentorship and educational needs in a scalable and intelligent manner.

Methodology

The development of the Career Path Guidance AI Assistant follows a modular and structured pipeline leveraging Retrieval-Augmented Generation (RAG). The system combines document retrieval with language generation to provide accurate, context-aware responses to user queries. The methodology can be broken down into the following stages:

Document Corpus Creation

A custom PDF document titled "Career Guide for Data Science & AI – 2025" was curated. It includes:
Career roles and specializations in AI, ML, DL, CV, NLP, Agentic AI
Skills required for each role
Learning paths and certifications
Industry tools and project recommendations
This high-quality guide serves as the knowledge base for the assistant.

Text Embedding and Vector Store Integration

To enable semantic search:
The document was split into chunks using LangChain’s TextLoader and RecursiveCharacterTextSplitter.
Each chunk was embedded using HuggingFace's all-MiniLM-L6-v2 model.
Embeddings were stored in a FAISS (Facebook AI Similarity Search) vector store for fast similarity-based
retrieval.

RAG Pipeline Construction

The core RAG system is built using LangChain, combining:
A retriever that pulls relevant document chunks from the FAISS index based on user queries
A language model to generate answers using the retrieved context
The Mistral LLM API was used for generation, selected for its cost-effectiveness and performance. The
assistant was configured with:
temperature = 0.2 for deterministic and factual output
Real-time user interaction via Google Colab notebook or Python script interface

Environment and Security

API keys (Mistral) were stored in a .env file and loaded using python-dotenv
A .env_example was provided for reproducibility
The .gitignore file excluded sensitive data

Deployment on Google Colab

The entire workflow was implemented on Google Colab to ensure accessibility and ease of setup for
evaluators and users. Colab allows users to:
Upload and embed documents
Run the assistant interactively
Modify and expand the codebase with minimal configuration

Experiments

To validate the functionality, relevance, and effectiveness of the Career Path Guidance AI Assistant, a series of practical experiments were conducted using real-world user queries. The objective was to evaluate how well the assistant could retrieve career-related knowledge and provide context-aware, actionable answers using the embedded PDF corpus and Mistral API.

Dataset & Embedding Test
Input Document: career_guide_ds_ai_2025_clean.pdf

Document Size: ~20 pages, covering 6 major domains (AI, ML, DL, CV, NLP, Agentic AI)

Embedding Model: all-MiniLM-L6-v2

Chunks Generated: 200+ overlapping segments

Vector Store: FAISS index created and queried successfully

Result: Semantic embedding and retrieval worked reliably across varied topics.

Query Response Evaluation
We evaluated the assistant using common user questions to simulate realistic use cases. Examples:
Result: Over 85% of queries returned accurate and relevant answers grounded in the PDF.
Latency and Performance
Avg. Response Time (Colab + Mistral): ~1.8 seconds per query

Query Failures: None observed under normal usage

API Rate Limits: Not exceeded during basic user-level testing

Result: Fast and stable user experience

User Feedback (Anecdotal)
Testers noted:

“It feels more intelligent than a generic chatbot.”

“I like how the answers are specific to career paths, not vague.”

“The response for switching from Data Analyst to AI was very actionable.”

Reproducibility Test
Downloaded project from GitHub

Installed requirements in a fresh Google Colab environment

Uploaded PDF and ran RAG system without any code change
Result: 100% reproducible with .env_example and instructions

Results

The Career Path Guidance AI Assistant was successfully implemented and tested using a custom PDF corpus embedded into a vector store and queried via a Retrieval-Augmented Generation (RAG) pipeline powered by the Mistral LLM API. The assistant delivered consistent, accurate, and context-aware responses across a wide range of career-focused queries related to AI and Data Science.

Functional Achievements
Component Description Result
Document Embedding PDF with detailed career content embedded using all-MiniLM-L6-v2 Successfully embedded with >200 vectorized chunks
Vector Store FAISS used for semantic search and retrieval Accurate and fast retrieval across all tests
LLM Integration Mistral LLM used for response generation Fast, relevant, and personalized responses
RAG Pipeline Combined retrieval + generation using LangChain Seamless prompt → retrieve → respond workflow
Reproducibility Deployed on Google Colab with .env security 100% reproducible with minimal setup
Deployment Notebook runs end-to-end with user interaction Fully functional interactive assistant

Abstract

Introduction

Methodology

Document Corpus Creation

A custom PDF document titled "Career Guide for Data Science & AI – 2025" was curated. It includes:
Career roles and specializations in AI, ML, DL, CV, NLP, Agentic AI
Skills required for each role
Learning paths and certifications
Industry tools and project recommendations
This high-quality guide serves as the knowledge base for the assistant.

Text Embedding and Vector Store Integration

To enable semantic search:
The document was split into chunks using LangChain’s TextLoader and RecursiveCharacterTextSplitter.
Each chunk was embedded using HuggingFace's all-MiniLM-L6-v2 model.
Embeddings were stored in a FAISS (Facebook AI Similarity Search) vector store for fast similarity-based
retrieval.

RAG Pipeline Construction

The core RAG system is built using LangChain, combining:
A retriever that pulls relevant document chunks from the FAISS index based on user queries
A language model to generate answers using the retrieved context
The Mistral LLM API was used for generation, selected for its cost-effectiveness and performance. The
assistant was configured with:
temperature = 0.2 for deterministic and factual output
Real-time user interaction via Google Colab notebook or Python script interface

Environment and Security

API keys (Mistral) were stored in a .env file and loaded using python-dotenv
A .env_example was provided for reproducibility
The .gitignore file excluded sensitive data

Deployment on Google Colab

The entire workflow was implemented on Google Colab to ensure accessibility and ease of setup for
evaluators and users. Colab allows users to:
Upload and embed documents
Run the assistant interactively
Modify and expand the codebase with minimal configuration

Experiments

Dataset & Embedding Test
Input Document: career_guide_ds_ai_2025_clean.pdf

Document Size: ~20 pages, covering 6 major domains (AI, ML, DL, CV, NLP, Agentic AI)

Embedding Model: all-MiniLM-L6-v2

Chunks Generated: 200+ overlapping segments

Vector Store: FAISS index created and queried successfully

Result: Semantic embedding and retrieval worked reliably across varied topics.

Query Response Evaluation
We evaluated the assistant using common user questions to simulate realistic use cases. Examples:
Result: Over 85% of queries returned accurate and relevant answers grounded in the PDF.
Latency and Performance
Avg. Response Time (Colab + Mistral): ~1.8 seconds per query

Query Failures: None observed under normal usage

API Rate Limits: Not exceeded during basic user-level testing

Result: Fast and stable user experience

User Feedback (Anecdotal)
Testers noted:

“It feels more intelligent than a generic chatbot.”

“I like how the answers are specific to career paths, not vague.”

“The response for switching from Data Analyst to AI was very actionable.”

Reproducibility Test
Downloaded project from GitHub

Installed requirements in a fresh Google Colab environment

Uploaded PDF and ran RAG system without any code change
Result: 100% reproducible with .env_example and instructions

Carrer Path Guidance AI Assistant in Data Science

Table of contents

Abstract

Introduction

Methodology

Experiments

Results

Table of contents

Files

Abstract

Introduction

Methodology

Experiments

Results

Code

Code