
Implementing AI systems—whether generative AI, large language models (LLMs), autonomous agents, or machine learning models—introduces significant new cybersecurity risks beyond traditional IT vulnerabilities. As of early 2026, reports from sources like the World Economic Forum, NIST, SentinelOne, OWASP, and industry forecasts highlight that AI-related vulnerabilities are among the fastest-growing cyber risks.
This project applies Retrieval-Augmented Generation to a curated subset of text and PDF documents focused on cybersecurity for AI implementations and NIST guidelines.
Retrieval-Augmented Generation (RAG) improves AI accuracy by anchoring responses in trusted, external documents rather than relying on pre-trained, static data. By first retrieving relevant information and then generating answers based on that context, RAG minimizes hallucinations and provides reliable and domain-specific information. Users can search for contextually relevant information by submitting a query.
Organizations implementing AI in 2026 face an overwhelming volume of cybersecurity guidance. Key sources include the OWASP Top 10 for LLM Applications, the new OWASP Top 10 for Agentic Applications, and NIST's preliminary Cybersecurity Framework Profile for Artificial Intelligence, among others. This proliferation creates challenges: information overload, difficulty applying controls to specific use cases, time-consuming cross-referencing, hallucination risks from general-purpose LLMs, and compliance needs for traceable sources.
The goal of this project was to build a Retrieval-Augmented Generation (RAG) bot that addresses these issues by indexing the guidance documents and grounding responses in verified sources, with citations for accuracy. It delivers fast, auditable answers, dramatically reducing research time, while adapting to new guidance and supporting compliance.
Every RAG system operates in two phases: document ingestion, where the knowledge base is organized and indexed, and retrieval, where user questions are answered from that index.

The first step is to place the knowledge-base documents in the /data directory. This project supports .pdf and .txt files.

Modern text embedding models have strict limits on maximum input tokens. Long documents often exceed 20,000–100,000 tokens, so they cannot be embedded whole. Chunking keeps each input within the limit while preparing the data for vector search.
This project uses a Hugging Face sentence-transformers embedding model, which has a 512-token input limit, while research papers often exceed 50,000 tokens.
The documents are split into 500-character chunks with a 50-character overlap using LangChain's RecursiveCharacterTextSplitter to improve retrieval accuracy.
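As a rough illustration, the fixed-size overlapping split can be sketched in plain Python. This is a simplified stand-in for RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries before falling back to a hard character cut:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character windows that overlap by `overlap` chars."""
    step = chunk_size - overlap  # advance 450 characters per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1200))  # stand-in for a loaded document
chunks = chunk_text(doc)
# the last 50 characters of each chunk repeat as the first 50 of the next,
# so a sentence cut at a boundary still appears intact in one of the chunks
```

The overlap is the reason retrieval stays accurate at chunk boundaries: context straddling a cut is never lost entirely.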
Vector Embeddings are dense numerical representations of data—such as words, sentences, paragraphs, images, audio, or even user profiles—turned into a list (array) of numbers, called a vector.
In this project, each chunk is converted into a 384-dimensional vector that captures its semantic meaning, using the sentence-transformers/all-MiniLM-L6-v2 model.
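The key property is that semantically similar text maps to nearby vectors. A toy example with hand-made 3-dimensional vectors (the real model produces 384 dimensions; these values are invented for illustration) shows how cosine similarity measures that closeness:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings: the first two "texts" are about the same topic.
threat_model = [0.9, 0.1, 0.2]   # "AI threat modeling"
risk_assess  = [0.8, 0.2, 0.1]   # "AI risk assessment"
recipe       = [0.1, 0.9, 0.8]   # unrelated text

assert cosine(threat_model, risk_assess) > cosine(threat_model, recipe)
```

The same arithmetic, applied to real 384-dimensional model outputs, is what drives the similarity search described below.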
A vector database (also called a vector store or vector search engine) is a specialized database designed to store, index, and query high-dimensional vector embeddings efficiently.
In this project embeddings and corresponding metadata are stored in a ChromaDB vector database.
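Conceptually, a vector store keeps (embedding, text, metadata) records and answers nearest-neighbour queries over them. A minimal in-memory sketch of what ChromaDB does for this project (brute-force search; a real vector database uses approximate indexes such as HNSW to scale):

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Illustrative in-memory stand-in for a vector database such as ChromaDB."""

    def __init__(self):
        self.rows = []  # list of (embedding, text, metadata)

    def add(self, embedding, text, metadata=None):
        self.rows.append((embedding, text, metadata or {}))

    def query(self, embedding, k=3):
        """Return the k stored texts whose embeddings are closest to `embedding`."""
        ranked = sorted(self.rows, key=lambda r: _cos(embedding, r[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "AI threat modeling basics", {"source": "guide.pdf"})
store.add([0.0, 1.0], "Unrelated cooking notes", {"source": "misc.txt"})
top = store.query([0.9, 0.1], k=1)  # a query vector near the first document
```

Storing metadata (here, the source filename) alongside each embedding is what lets the assistant cite where an answer came from.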

The user's question is converted to a vector using the same embedding model used for the documents. This ensures the question and documents exist in the same vector space.
The system then searches the ChromaDB vector database to find chunks whose embeddings are closest to the question's embedding. This will return sections from one or multiple publications that discuss concepts related to the question.
The most relevant chunks are retrieved from the database, typically the 3–5 sections that best match the user's question.
These retrieved chunks, along with a system prompt, are sent to an LLM (Gemini, OpenAI, or Groq). The LLM generates a response based on this focused, relevant context.
The LLM then has everything it needs to answer questions covered by the retrieved context; questions outside that context are declined.
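The hand-off to the LLM can be sketched as simple prompt assembly. The exact system prompt wording below is an assumption for illustration; any of the configured Gemini, OpenAI, or Groq clients would receive the resulting string:

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context does not cover the question, say you don't know."
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the system prompt, retrieved chunks, and question into one LLM prompt."""
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "What are AI supply-chain risks?",
    ["Model weights can be tampered with in transit.",
     "Third-party datasets may be poisoned."],
)
```

Numbering the sources in the prompt is what allows the model to cite which chunk supports each part of its answer, and the instruction to refuse uncovered questions is how out-of-scope queries get declined.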
-Python 3.11
-A virtual environment
-One API key (Google Gemini)

Quick Start

Create and activate a virtual environment.
macOS/Linux:
python3.11 -m venv .venv
source .venv/bin/activate
Windows:
python -m venv .venv
.venv\Scripts\activate

Install dependencies:
pip install -r requirements.txt
Add your API key: create a .env file at the root directory and add the following:
GOOGLE_API_KEY=your_gemini_api_key_here
GOOGLE_MODEL=gemini-2.5-flash
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4o-mini
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama-3.1-8b-instant
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
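A sketch of how these variables might be read at startup. In the real project, load_dotenv() from python-dotenv would populate os.environ from the .env file; here a value is set directly so the snippet is self-contained, and the variable names match the .env example above:

```python
import os

# load_dotenv() would normally fill os.environ from .env; set a value
# directly here so this sketch runs standalone.
os.environ.setdefault("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")

embedding_model = os.environ["EMBEDDING_MODEL"]
# Fall back to the documented default model name if the variable is unset.
google_model = os.getenv("GOOGLE_MODEL", "gemini-2.5-flash")
```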
The first run will build embeddings and may take a minute.
ChromaDB data is stored in chroma_db/.
Document Loading and Chunking

Query with context

Out of scope query

Short term memory verification

Streamlit UI for chat
To launch the Streamlit chat UI, run: python src/streamlit_app.py (the script auto-launches the app).
The number of sources queried can be adjusted with the settings slider.

This RAG assistant is focused on Cybersecurity for AI. Example questions:
What is Cybersecurity for AI?
What are best practices for implementing AI?
What are some pitfalls when implementing AI?
Principles for the Secure Integration of Artificial Intelligence in Operational Technology
Santander Pelaez, Manuel Humberto. “Controlling Network Access to ICS Systems.” Diaries (blog), SANS Technology Institute Internet Storm Center, July 3, 2023. https://isc.sans.edu/diary/30000.
“Overview of IEC 61508 & Functional Safety.” International Electrotechnical Commission (IEC), PowerPoint, 2022. https://assets.iec.ch/public/acos/IEC%2061508%20&%20Functional%20Safety2022.pdf?2023040501.
Schulhoff, Sander. “Basic LLM Settings.” Learn Prompting. Last modified March 10, 2025. https://learnprompting.org/docs/intermediate/configuration_hyperparameters.
Guidelines for secure AI system development
This document is published by the UK National Cyber Security Centre (NCSC), the US Cybersecurity and Infrastructure Security Agency (CISA), and the following international partners:
National Security Agency (NSA)
Federal Bureau of Investigation (FBI)
Australian Signals Directorate’s Australian Cyber Security Centre (ACSC)
Canadian Centre for Cyber Security (CCCS)
New Zealand National Cyber Security Centre (NCSC-NZ)
Chile's Government CSIRT
National Cyber and Information Security Agency of the Czech Republic (NUKIB)
Information System Authority of Estonia (RIA)
National Cyber Security Centre of Estonia (NCSC-EE)
French Cybersecurity Agency (ANSSI)
Germany’s Federal Office for Information Security (BSI)
Israeli National Cyber Directorate (INCD)
Italian National Cybersecurity Agency (ACN)
Japan’s National center of Incident readiness and Strategy For Cybersecurity (NISC)
Japan’s Secretariat of Science, Technology and Innovation Policy, Cabinet Office (CSTI)
Nigeria's National Information Technology Development Agency (NITDA)
Norwegian National Cyber Security Centre (NCSC-NO)
Poland Ministry of Digital Affairs
Poland’s NASK National Research Institute (NASK)
Republic of Korea National Intelligence Service (NIS)
Cyber Security Agency of Singapore (CSA)
Integrating LangChain and ChromaDB, this project delivers a functional RAG-based question-answering research assistant that grounds LLM responses in its own document set to ensure accuracy. You can now customize and experiment with different documents, alternative models (Groq and OpenAI), and different chunk sizes and overlaps. The architecture provides a scalable framework, serving as a launchpad for future agentic workflows and multi-tool integrations.