Study Abroad AI Assistant - Project Documentation
Executive Summary
The Study Abroad AI Assistant is a RAG (Retrieval-Augmented Generation) powered application designed to help Ethiopian students navigate international study opportunities, with a current focus on Czech Republic government scholarships. The system combines web scraping, vector database technology, and large language models to provide personalized guidance for scholarship applications.
Project Overview
Purpose
This project addresses the challenge Ethiopian students face when seeking international study opportunities by providing an intelligent, context-aware AI assistant that can answer specific questions about scholarship programs, requirements, deadlines, and application procedures.
Target Users
- Ethiopian students seeking international study opportunities
- Students interested in Czech Republic government scholarships
- Educational advisors and counselors
- International education organizations
Technical Architecture
Core Technologies
Backend Framework
- FastAPI: Modern, high-performance web framework for building APIs
- SQLAlchemy: Python SQL toolkit and Object-Relational Mapping (ORM)
- Alembic: Database migration tool for SQLAlchemy
Database Layer
- PostgreSQL: Primary relational database for user data and chat history
- ChromaDB: Vector database for storing document embeddings
- asyncpg: Asynchronous PostgreSQL driver for high-performance database operations
AI/ML Stack
- LangChain: Framework for developing applications with LLMs
- Groq LLM: High-performance language model for generating responses
- OpenAI Embeddings: Text embedding model for semantic search
- RecursiveCharacterTextSplitter: Document chunking for optimal retrieval
Data Collection
- BeautifulSoup4: Web scraping library for extracting scholarship data
- Requests: HTTP library for making web requests
- Wikipedia API: Collecting comprehensive country information
Infrastructure
- Docker: Containerization for consistent deployment
- Docker Compose: Multi-service orchestration
- pgAdmin: Database administration interface
System Architecture

Data Pipeline
- Data Collection Phase
Czech Government Scholarship Scraping
- Program types (Bachelor's, Master's, PhD)
- Application requirements
- Deadlines and timelines
- Funding details
- Eligibility criteria
- Application procedures
- Education System: University structure, academic calendar, degree recognition
- Work Opportunities: Job market analysis, visa requirements, employment statistics
- Cities: Major cities with universities, cost of living, transportation
- Economy: Economic indicators, currency, inflation rates
- Data Processing Phase
Text Chunking
- Documents are split into optimal chunks using RecursiveCharacterTextSplitter
- Chunk size: 1000 characters with 200 character overlap
- Ensures context preservation while maintaining search efficiency
Embedding Generation
- OpenAI's text-embedding-ada-002 model for semantic embeddings
- 1536-dimensional vectors for each document chunk
- Enables semantic similarity search
Vector Storage
- ChromaDB for persistent vector storage
- Separate collections for different data types:
- scholarships_collection: Scholarship program data
- country_info_collection: General country information
- universities_collection: University-specific data
- cities_collection: City-specific information
- Retrieval Phase
Query Processing
- User questions are embedded using the same model
- Semantic similarity search against relevant collections
- Top-k retrieval (k=5) for context generation
Context Assembly
- Retrieved documents are ranked by relevance
- Context is assembled with source attribution
- Confidence scores calculated based on similarity
API Design
RESTful Endpoints
Session Management
POST /api/v1/new-chat
Content-Type: application/json
{
"title": "Czech Scholarship Inquiry",
"country": "czech_republic"
}
Chat Interface
POST /api/v1/chat
Content-Type: application/json
{
"message": "What are the requirements for Czech government scholarships?",
"country": "czech_republic",
"user_background": {
"nationality": "Ethiopian",
"education_level": "Bachelor",
"field_of_study": "Computer Science"
},
"session_id": "8355e5cf-5558-4680-ba59-f24d578ea569"
}
{
"response": "Based on the Czech government scholarship program...",
"query": "What are the requirements for Czech government scholarships?",
"country": "czech_republic",
"user_background": {...},
"session_id": "8355e5cf-5558-4680-ba59-f24d578ea569",
"sources": [
{
"type": "scholarship_program",
"source": "scholarship_scraper",
"country": "Czech Republic",
"category": "program_details",
"program_name": "Master's Degree"
}
],
"confidence": 0.85,
"error": false
}
Database Schema
Users Table
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
full_name VARCHAR(255),
is_active BOOLEAN DEFAULT TRUE,
is_superuser BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE
);
Sessions Table
CREATE TABLE sessions (
id SERIAL PRIMARY KEY,
session_id VARCHAR(255) UNIQUE NOT NULL,
user_id INTEGER REFERENCES users(id),
title VARCHAR(500),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE
);
Chat Messages Table
CREATE TABLE chat_messages (
id SERIAL PRIMARY KEY,
session_id INTEGER REFERENCES sessions(id),
role VARCHAR(50) NOT NULL, -- 'user', 'assistant', 'system'
content TEXT NOT NULL,
timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
AI Prompt Engineering
System Prompt Design
The system prompt is carefully crafted to:
- Define the AI's role as a study abroad advisor
- Establish behavioral guidelines (encouraging, supportive)
- Set clear boundaries and safety measures
- Prevent prompt injection attacks
- Maintain ethical guidelines
Context Integration
- Chat history is included for conversation continuity
- User background information is incorporated for personalized responses
- Retrieved documents are formatted with clear source attribution
- Confidence scores help users understand response reliability
Security and Ethics
Data Privacy
- User data is stored securely in PostgreSQL
- Session IDs are UUIDs for anonymity
- No personally identifiable information is logged
Ethical Guidelines
- AI responses are designed to be encouraging but realistic
- Clear disclaimers when information is incomplete
- No assistance with unethical or illegal activities
- Transparent source attribution for all information
Prompt Injection Prevention
- System instructions are protected from manipulation
- AI refuses to reveal internal prompts or instructions
- Robust handling of attempts to bypass safety measures
Response Quality
- Confidence Scoring: 0.0-1.0 scale based on document relevance
- Source Attribution: Clear identification of information sources
- Context Preservation: Chat history maintained across sessions
- Response Time: < 3 seconds for typical queries
- Vector Search: Sub-second retrieval from ChromaDB
- Database Operations: Asynchronous queries for scalability
Deployment Architecture
Docker Containerization
services:
app:
build: .
ports:
- "8000:8000"
environment:
- GROQ_API_KEY=${GROQ_API_KEY}
- OPENAI_API_KEY=${OPENAI_API_KEY}
depends_on:
- db
- pgadmin
db:
image: postgres:15
environment:
- POSTGRES_DB=study_abroad_db
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=password
volumes:
- postgres_data:/var/lib/postgresql/data
pgadmin:
image: dpage/pgadmin4
environment:
- PGADMIN_DEFAULT_EMAIL=admin@admin.com
- PGADMIN_DEFAULT_PASSWORD=admin
ports:
- "5050:80"
Environment Configuration
- Development: Local Docker setup with hot reloading
- Production: Container orchestration with proper secrets management
- Scaling: Horizontal scaling capability through stateless design
Future Enhancements
-
Phase 2: Enhanced Features
- Multi-Country Support: Expand beyond Czech Republic
- User Authentication: Secure user accounts and profiles
- File Upload: Document processing for application materials
- Application Tracking: Progress monitoring for scholarship applications
- Multi-Modal Support: Image and document analysis
- Personalized Recommendations: ML-based scholarship matching
- Predictive Analytics: Success probability assessment
- Natural Language Processing: Advanced query understanding
- Mobile Application: Native iOS/Android apps
- Integration APIs: Connect with university systems
- Analytics Dashboard: Usage insights and improvements
- Community Features: Peer-to-peer support network