This paper describes the development of a production-ready, rule-grounded conversational AI system that answers cricket law questions based solely on the official MCC 2017 Laws of Cricket. The system uses a minimal Retrieval-Augmented Generation (RAG) pipeline with persistent vector storage, semantic retrieval, and tightly constrained generation, and it prevents hallucination through mandatory retrieval, carefully controlled prompting, and fixed fallback responses. The project illustrates how grounded AI systems can deliver high reliability, transparency, and correctness in rule-bound domains without relying on external knowledge or inference.

1. Introduction
   1.1 Background and Motivation
   1.2 Problem Statement
   1.3 Objectives
   1.4 Scope
2. System Architecture
   2.1 Architectural Overview
   2.2 Component Layers
3. Methodology
   3.1 Document Ingestion
   3.2 Text Processing
   3.3 Chunking Strategy
   3.4 Embedding Pipeline
   3.5 Vector Storage
   3.6 Retrieval Process
   3.7 Generation Control
   3.8 Failure Handling
Although Large Language Models (LLMs) have revolutionized conversational AI, they are unreliable in domains that require very high levels of strict factual correctness. Rulebooks, legal texts, and regulatory literature require grounded answers, not probabilistic outputs. In sports law, for example, incorrect answers can propagate misinformation and misinterpretation of official rules.
This project describes a grounded AI system capable of answering cricket rule questions based solely on the official MCC 2017 Laws of Cricket. The system avoids inference, speculation, and external knowledge, ensuring that every answer is directly traceable to the official source.
General-purpose LLMs are prone to hallucination when answering structured, rule-based queries: they may fabricate rules, fall back on outdated training data, or infer information not explicitly stated in the official text, making them unsuitable for domains that demand strict correctness.
The core problem with current AI systems is the absence of grounded generation and source-bounded reasoning.
The system follows a modular pipeline architecture that separates data ingestion, vector storage, retrieval, and generation into independent components. This ensures clarity, reproducibility, and maintainability.
The knowledge source is a single PDF containing the entire official rulebook:
"Laws of Cricket, 2017 Edition, Marylebone Cricket Club (MCC)"
No other documents, summaries, or external data sources are used. This constraint ensures traceability and limits ambiguity in retrieved answers.
- PDF loader
- Recursive text splitter
- Sentence-transformer embedding model
- Chroma persistent vector database
- Semantic similarity search
- LLM with strict grounding prompt
- CLI application
The MCC 2017 PDF is loaded using a PDF document loader.
Extracted text is normalized for segmentation.
Recursive character-based chunking is applied with:
- Chunk size: 500 characters
- Overlap: 100 characters
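The project delegates chunking to a recursive text splitter; as a minimal sketch of the size/overlap behaviour, the pure-Python function below splits text into fixed character windows with the same parameters. (The real recursive splitter also prefers paragraph and sentence boundaries; the function name here is illustrative, not the project's actual code.)

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    Simplified stand-in for a recursive character splitter: fixed windows
    of `chunk_size` characters, each sharing `overlap` characters with its
    predecessor, so no sentence is lost at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment already fully contained in the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

The overlap means a rule that straddles a boundary still appears intact in at least one chunk, which matters for retrieval quality.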
Each chunk is embedded using the sentence-transformers/all-mpnet-base-v2 model.
Embeddings are stored in a persistent Chroma database, so the index is built once and reused across sessions without re-embedding the corpus.
### Document Ingestion and Vectorization Pipeline

+---------------------+
|       MCC PDF       |
+---------------------+
           |
           v
+---------------------+
|     PDF Loader      |
+---------------------+
           |
           v
+---------------------+
|    Text Chunking    |
| (Recursive Splitter)|
+---------------------+
           |
           v
+-----------------------------+
| Sentence-Transformer Embeds |
|  (all-mpnet-base-v2 model)  |
+-----------------------------+
           |
           v
+-----------------------------+
| Chroma Persistent Vector DB |
+-----------------------------+
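Persistence means the expensive embedding step runs once and later sessions reload the index from disk. Chroma manages this internally when given a persist directory; the JSON round-trip below is only an illustration of that build-once, reuse-later property (the file layout and function names are hypothetical, not the project's code).

```python
import json
from pathlib import Path

def save_index(path: str, records: list[dict]) -> None:
    """Persist (id, embedding, chunk_text) records to disk.

    Chroma stores its collections this way conceptually; JSON is used
    here purely to illustrate persistence, not as the real storage format.
    """
    Path(path).write_text(json.dumps(records))

def load_index(path: str) -> list[dict]:
    """Reload a previously built index without re-embedding anything."""
    return json.loads(Path(path).read_text())
```

On startup the system can check whether the persisted index exists and skip ingestion entirely if it does.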
Semantic similarity search retrieves the top-k most relevant chunks.
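Under the hood, semantic search ranks stored chunk embeddings by cosine similarity to the query embedding. A dependency-free sketch of that ranking follows; the real system delegates this lookup to Chroma, and the function names here are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    """Return the k chunk texts most similar to the query embedding.

    `index` is a list of (embedding, chunk_text) pairs; Chroma performs
    the equivalent nearest-neighbour lookup over its persisted collection.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

Because both chunks and queries are embedded with the same model, geometric closeness in the vector space approximates semantic relevance.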
LLM generation is constrained by a strict, context-only grounding prompt.
If retrieval returns no relevant chunks, the system replies with a fixed refusal:
"I don't know. No relevant information found."
### Query Pipeline

+------------------+
|    User Input    |
+--------+---------+
         |
         v
+---------------------+
|  Query Validation   |
+--------+------------+
         |
         v
+-----------------+
|    Retriever    |
+--------+--------+
         |
         v
+----------------------+
|    LLM Generation    |
+--------+-------------+
         |
         v
+-----------------------+
|   Response Output     |
+-----------------------+
- A strict system instruction enforces context-only answers.
- Generation cannot proceed without retrieval.
- A fixed refusal response is returned when no relevant data exists.
- A zero-temperature model configuration ensures deterministic output.
Technology stack:
- Python
- LangChain
- ChromaDB
- HuggingFace embeddings
- Groq LLM API
- dotenv
The ingestion module handles document loading, chunking, embedding, and storage.
The query module handles retrieval, prompting, and generation.
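The CLI layer can be as small as a read-answer loop. This sketch assumes a `query(text) -> str` callable wired to the retrieval-and-generation pipeline; the name and the injected I/O functions are illustrative, not the project's actual interface.

```python
def run_cli(query, input_fn=input, output_fn=print) -> None:
    """Minimal REPL: read a question, print the grounded answer, exit on 'quit'.

    `input_fn`/`output_fn` are injected so the loop can be exercised
    without a terminal; `query` is the retrieval+generation entry point.
    """
    output_fn("Cricket Laws assistant (type 'quit' to exit)")
    while True:
        try:
            line = input_fn("> ").strip()
        except EOFError:
            break
        if line.lower() in {"quit", "exit"}:
            break
        if line:
            output_fn(query(line))
```

Keeping the loop free of retrieval logic preserves the modular pipeline: the CLI only shuttles text in and out.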
This project demonstrates a grounded, minimal RAG architecture for rule-based AI systems. By enforcing retrieval-first generation and strict context grounding, the system achieves high reliability and hallucination resistance. It serves as a reference model for safe AI design in correctness-critical domains.
Project Repository: https://github.com/m-noumanfazil/cricket-rag-cli-using-langchain