JapaPolicy AI: Your Intelligent UK Immigration Assistant

JapaPolicy AI simplifies the complexities of UK immigration rules for professionals, students, families, and employers, particularly those relocating from Nigeria. Built using a robust Retrieval-Augmented Generation (RAG) pipeline, it provides precise, context-aware answers grounded in a comprehensive library of official GOV.UK immigration documents. Ask natural language questions and receive accurate, cited answers directly drawn from the latest rules and visa guidance.

The primary interface is an intuitive Streamlit web application, making trustworthy immigration information easily accessible.

The Problem Solved

UK immigration rules are extensive, frequently updated, and spread across numerous PDF documents. Finding specific, reliable answers often requires sifting through hundreds of pages of dense legal text. JapaPolicy AI addresses this by allowing users to query the official documentation directly using everyday language.

System Architecture

JapaPolicy AI employs a classic RAG pipeline:

Document Loading & Preprocessing:

Official UK immigration PDFs (Skilled Worker, Graduate, Global Talent, Appendix rules, Sponsor Guidance, etc.) are loaded from the data/ directory.
Documents are intelligently split into smaller, overlapping text chunks using RecursiveCharacterTextSplitter. The splitting strategy respects meaningful boundaries like paragraphs, headings, and list items to preserve context.
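The boundary-aware, overlapping split described above can be pictured with a small pure-Python sketch. This is a simplified stand-in for RecursiveCharacterTextSplitter, not the library's implementation; the function name split_with_overlap, the separator order, and the sizes are assumptions chosen for illustration.

```python
def split_with_overlap(text, chunk_size=200, overlap=40,
                       separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Each cut prefers the coarsest separator found in the current window
    (paragraph, then line, then sentence, then word) and falls back to a
    hard cut. Each new chunk starts `overlap` characters before the
    previous cut, so neighbouring chunks share some context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                # Accept only cuts beyond the overlap so the loop always advances.
                if cut > overlap:
                    end = start + cut + len(sep)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = end - overlap
    return chunks
```

Calling split_with_overlap(page_text, chunk_size=1000, overlap=150) on the text of one PDF page would return a list of strings ready for embedding; the real chunk size and overlap used by the project are not stated here.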

Vector Database & Indexing (Offline Process):

The build_db.py script handles the one-time indexing process:

Text chunks are converted into vector embeddings using the powerful sentence-transformers/all-mpnet-base-v2 model.
These embeddings are stored persistently in a local ChromaDB vector collection (uk_immigration_docs) configured for Cosine Similarity, ensuring efficient semantic search.
A parallel keyword index is built using BM25Okapi for hybrid search capabilities.

This pre-computation step ensures the Streamlit application starts quickly.
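To make the keyword side of the index more concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the BM25Okapi class itself: the class name TinyBM25, the whitespace tokenization, the parameters k1=1.5 and b=0.75, and the smoothed IDF variant are all assumptions, and a library implementation may compute IDF slightly differently.

```python
import math
from collections import Counter


class TinyBM25:
    """Illustrative Okapi BM25 scorer over pre-tokenized chunks (non-empty corpus)."""

    def __init__(self, corpus_tokens, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokens) for tokens in corpus_tokens]
        self.doc_len = [len(tokens) for tokens in corpus_tokens]
        self.avgdl = sum(self.doc_len) / len(self.doc_len)
        n = len(corpus_tokens)
        # Document frequency: number of chunks that contain each term.
        df = Counter(term for doc in self.docs for term in doc)
        # Smoothed IDF; the "+ 1.0" inside the log keeps every weight positive.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
                    for t, f in df.items()}

    def scores(self, query_tokens):
        """Return one BM25 score per chunk, in corpus order."""
        out = []
        for tf, dl in zip(self.docs, self.doc_len):
            # Length normalization: longer-than-average chunks are penalized.
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            out.append(sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in query_tokens if t in tf
            ))
        return out
```

Sorting chunk indices by these scores gives the keyword ranking that is later fused with the semantic ranking.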

Retrieval (Runtime):

User queries are preprocessed by expanding key terms (e.g., "spouse" becomes "spouse partner husband wife...") to improve recall.
A hybrid search strategy is employed:
Semantic Search: The query is embedded, and ChromaDB retrieves the most similar chunks based on cosine similarity.
Keyword Search: BM25 scores and ranks chunks by how well they match the exact terms in the query.
Reciprocal Rank Fusion (RRF): Results from semantic and keyword searches are combined and re-ranked to produce the most relevant context passages.
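Reciprocal Rank Fusion can be sketched in a few lines: each chunk receives 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The function name and the constant k=60 (a commonly used default) are assumptions; the value used by this project is not stated.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    Returns chunk IDs sorted by fused score, highest first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["c1", "c2", "c3"]   # hypothetical chunk IDs from vector search
keyword = ["c3", "c1", "c4"]    # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
# fused == ["c1", "c3", "c2", "c4"]
```

Because only ranks are used, the cosine similarities and BM25 scores never need to be put on a common scale.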

The system includes a fallback mechanism: if hybrid search yields too few results, it reverts to pure semantic search to ensure sufficient context is provided.
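Query expansion and the fallback can be combined in one small retrieval entry point, sketched below. The expansion table contains only the example terms mentioned in this README; the names expand_query and retrieve, the min_results threshold of 3, and the two search callables are hypothetical.

```python
import re

# Example expansions taken from this README; the real table is likely larger.
EXPANSIONS = {
    "spouse": ["partner", "husband", "wife"],
    "ilr": ["indefinite leave to remain", "settlement"],
}


def expand_query(query, expansions=EXPANSIONS):
    """Append related terms for every known keyword found in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    extra = [term for w in words for term in expansions.get(w, [])]
    return " ".join([query] + extra)


def retrieve(query, hybrid_search, semantic_search, min_results=3):
    """Run hybrid search; fall back to semantic search if too few results.

    hybrid_search and semantic_search are callables that take a query
    string and return a list of chunks (stand-ins for the real searches).
    """
    expanded = expand_query(query)
    results = hybrid_search(expanded)
    if len(results) < min_results:
        results = semantic_search(expanded)
    return results
```

For example, expand_query("Can my spouse work?") yields "Can my spouse work? partner husband wife", which gives both the embedding model and BM25 more terms to match against.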

Generation (Runtime):

The top-ranked, relevant text chunks (with source citations) are formatted into a context block.
This context, along with the original user question, is passed to the Google Gemini 2.5 Pro Large Language Model via a carefully crafted prompt template (prompts/prompt_template.md).
The LLM generates a clear, structured answer strictly based on the provided context, citing the source document and page number for each key piece of information.
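The context-building step might look like the sketch below. The chunk dictionary keys (source, page, text) and the {context} and {question} placeholders are assumptions, since the contents of prompts/prompt_template.md are not shown here; the call to the LLM itself is omitted.

```python
def format_context(chunks):
    """Render retrieved chunks as a numbered block with source and page."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] Source: {chunk['source']}, page {chunk['page']}"
        parts.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(parts)


def build_prompt(template, question, chunks):
    """Fill a template that has {context} and {question} placeholders."""
    return template.format(context=format_context(chunks), question=question)


example_chunks = [  # hypothetical retrieved chunk
    {"source": "Skilled Worker visa.pdf", "page": 4, "text": "Example passage."},
]
prompt = build_prompt(
    "Answer only from the context.\n\nContext:\n{context}\n\nQuestion: {question}",
    "Can my partner work?",
    example_chunks,
)
```

Keeping the source file and page number next to each passage is what lets the model cite them in its answer. Note that str.format treats any other braces in a template as placeholders, so a template containing literal braces would need them doubled.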

Technical Highlights

Grounded Generation: The LLM is instructed to answer only from the loaded GOV.UK PDFs, which limits hallucination and keeps answers tied to official sources.
Hybrid Search: Combines dense vector (semantic) search with sparse keyword (BM25) search using Reciprocal Rank Fusion (RRF) to improve retrieval relevance. Falls back to pure semantic search if hybrid results are too sparse.
Query Expansion: Automatically expands user queries with relevant synonyms and related terms (e.g., "ILR" -> "Indefinite Leave to Remain settlement...") to improve search recall.
Optimized Indexing: Uses sentence-transformers/all-mpnet-base-v2 for high-quality embeddings and ChromaDB for persistent local storage, enabling fast application startup. Configured explicitly for Cosine Similarity.
Confidence Scoring: Provides a High/Medium/Low confidence score based on the semantic similarity of the retrieved documents, helping users gauge answer reliability.
Structured Output: LLM is prompted to provide answers in a consistent "Eligibility / Requirements / Next Steps" format where applicable, enhancing readability.
Clear Source Attribution: Every answer includes citations pointing to the specific source PDF and page number used by the LLM.
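As an illustration of the confidence scoring highlight, the sketch below maps retrieval similarities to a High/Medium/Low label. The thresholds (0.75 and 0.5), the choice to use the best match rather than an average, and the function names are assumptions, not the project's actual rule. The distance conversion assumes a cosine-distance collection, where distance is 1 minus cosine similarity.

```python
def distance_to_similarity(cosine_distance):
    """Convert a cosine distance (1 - cosine similarity) back to a similarity."""
    return 1.0 - cosine_distance


def confidence_label(similarities, high=0.75, medium=0.5):
    """Label an answer by the similarity of its best retrieved chunk."""
    if not similarities:
        return "Low"
    top = max(similarities)
    if top >= high:
        return "High"
    if top >= medium:
        return "Medium"
    return "Low"
```

With these illustrative thresholds, confidence_label([0.8, 0.4]) gives "High" and confidence_label([0.2]) gives "Low".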

Key Features

Natural Language Queries: Ask questions conversationally (e.g., "Can my wife work if she joins me on a Skilled Worker visa?").
Accurate & Cited Answers: Get responses grounded in official documents, complete with source references.
Comprehensive Knowledge Base: Indexed from 50+ key GOV.UK immigration PDFs (configurable by adding files to data/ and rebuilding the index).
User-Friendly Web Interface: Built with Streamlit for easy interaction, including chat history and a 'Clear Chat' option.
Developer Contact: Easily accessible contact information for feedback or inquiries.
(Potentially) Low Cost: Leverages high-quality open-source embedding models and potentially free tiers of LLM APIs (subject to usage limits and API provider terms).

Project Structure

JapaPolicy/
│
├── streamlit_app.py        # Main Streamlit application script
├── build_db.py             # Script to build/update the vector database (run once, or again when PDFs change)
│
├── src/                    # Core Python modules
│   ├── app.py              # Contains RAGAssistant class, query processing
│   └── vectordb.py         # Handles ChromaDB interactions, embedding, search
│
├── chroma_db/              # Persistent vector database storage (Created by build_db.py)
│   └── ... (database files)
│
├── data/                   # Directory for input PDF documents
│   ├── Skilled Worker visa.pdf
│   └── ... (other GOV.UK PDFs)
│
├── prompts/                # Prompt templates for the LLM
│   └── prompt_template.md
│
├── .env                    # Local environment variables (API Keys - DO NOT COMMIT)
├── .gitignore              # Specifies files/folders for Git to ignore
├── requirements.txt        # Python dependencies
└── README.md               # This file

Setup and Installation

Clone the Repository:

git clone https://github.com/Ojey-egwuda/JapaPolicy-AI-UK-Immigration-Policy-Assistant.git
cd JapaPolicy-AI-UK-Immigration-Policy-Assistant

Create a Virtual Environment (Recommended):

python -m venv venv
venv\Scripts\activate          # Windows
source venv/bin/activate       # macOS/Linux

Install Dependencies:

pip install -r requirements.txt

Set Up Environment Variables:
Create a file named .env in the project root directory and add your Google API key:

GOOGLE_API_KEY="YOUR_ACTUAL_GOOGLE_API_KEY_HERE"
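A small startup check can catch a missing or placeholder key before the first LLM call. The sketch below assumes the .env file has already been loaded into the process environment (for example with python-dotenv's load_dotenv, if the project uses it); the function name require_api_key is hypothetical.

```python
import os


def require_api_key(name="GOOGLE_API_KEY", env=None):
    """Return the API key, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value or value == "YOUR_ACTUAL_GOOGLE_API_KEY_HERE":
        raise RuntimeError(f"{name} is missing or still set to the placeholder value")
    return value
```

Passing a plain dictionary as env makes the check easy to exercise without touching real environment variables.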

Add PDF Documents:
Place all the official GOV.UK PDF documents you want to index into the data/ folder. Ensure the folder exists in the project root.
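Since indexing can take a while, it may help to confirm that the folder exists and actually contains PDFs before starting a build. This is a minimal sketch using only the standard library; the function name find_pdfs is hypothetical and is not part of build_db.py.

```python
from pathlib import Path


def find_pdfs(data_dir="data"):
    """Return the PDF files directly inside data_dir, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} does not exist; create it and add PDFs")
    # Match the extension case-insensitively so ".PDF" files are included.
    return sorted(p for p in root.iterdir() if p.suffix.lower() == ".pdf")
```

Running find_pdfs() from the project root and printing the file names gives a quick view of what would be indexed.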

Running the Application

There are two main steps:

Build the Vector Database (Run ONCE or when PDFs change):

Ensure your virtual environment is active.
Run the build script from the project root directory:

python build_db.py

This process loads the PDFs from data/, chunks them, generates embeddings, and saves the index to the chroma_db/ folder. It may take a significant amount of time (e.g., 20+ minutes) depending on the number of documents and your computer's speed. You only need to run it initially and whenever you add or update PDFs in the data/ folder.

Run the Streamlit Web Application:

Ensure your virtual environment is active.
Run the Streamlit app from the project root directory:

streamlit run streamlit_app.py

Your web browser should automatically open to the application's interface. The app will connect to the pre-built chroma_db index for fast startup.

Evaluation

The system was tested on a suite of common immigration questions covering visa switching, dependent eligibility, salary thresholds, and ILR requirements. Key findings:

Retrieval Accuracy: Consistently retrieved relevant sections from the correct GOV.UK documents.
Answer Grounding: Generated answers stayed strictly within the context provided by the retrieved documents. Zero hallucinations observed.
Structured Output: Successfully followed the requested Eligibility/Requirements/Next Steps format for applicable queries.
Citation Correctness: Source citations (file and page) accurately reflected the documents used for generation.

Use Cases

Prospective Immigrants (e.g., from Nigeria): Quickly understand eligibility for visas like Skilled Worker or Graduate routes, including salary and documentation requirements.
Current Visa Holders: Clarify rules for extensions, switching visa categories, bringing dependents, or applying for Indefinite Leave to Remain (ILR).
Employers/Sponsors: Verify sponsorship duties, check eligible occupation codes, and understand rules for Certificates of Sponsorship (CoS).
Students: Understand post-study work options (Graduate visa) and pathways to sponsored work.
Families: Navigate requirements for partners and children joining or accompanying main applicants.

Future Features

Real-Time GOV.UK Updates: Integrate with official APIs (if available) or web scraping for automatic detection and indexing of rule changes.
Multimodal Queries: Extend capabilities to understand information within tables or potentially visa application form screenshots.
Localized Advice: Add features specifically helpful for users from certain regions (e.g., currency conversions, specific document requirements).
User Feedback Loop: Allow users to rate answer helpfulness to identify areas for improvement.
Analytics Dashboard: Provide insights into common queries and potential gaps in the knowledge base.

Contributing

Contributions, feedback, and suggestions are welcome! Please feel free to open an issue on GitHub to report bugs or propose new features. Pull requests for improvements are also appreciated.

Summary

JapaPolicy AI aims to democratize access to complex UK immigration information. By leveraging a robust RAG pipeline grounded in official GOV.UK documents, it provides users with reliable, cited, and easy-to-understand answers through a conversational web interface, saving significant time and effort compared to manual document searches.
