This article presents a comprehensive methodology for automatically generating high-quality, domain-specific question-answer datasets for Large Language Model (LLM) fine-tuning. The proposed pipeline combines intelligent web scraping, document processing, automated Q&A generation, and rigorous quality assessment to create training datasets that are both cheaper to produce and better matched to the target domain than manually annotated or generic alternatives.
The fine-tuning of Large Language Models for domain-specific applications presents a fundamental challenge: acquiring high-quality training data. Traditional approaches either rely on expensive manual annotation processes or utilize generic datasets that fail to capture domain-specific nuances. This work introduces an automated pipeline that addresses these limitations while maintaining dataset quality standards suitable for production deployments.
The proposed system implements a six-stage pipeline, summarized in the data-flow diagram later in this article.
A critical innovation of this approach is a carefully calibrated distribution between domain-expertise examples and boundary-recognition examples (off-topic questions paired with polite refusals).
This distribution ratio was determined empirically through extensive testing and balances domain expertise against appropriate boundary recognition.
The web scraping component utilizes LangGraph to implement an intelligent agent-based approach:
```python
def extract_web_content(topic: str, chunk_count: int) -> List[str]:
    """
    Extracts domain-specific content chunks from web sources.

    Args:
        topic: Domain or subject area for content extraction
        chunk_count: Target number of content chunks to generate

    Returns:
        List of filtered, high-quality content chunks
    """
    # Implementation details in agent_webscraper/agent.py
```
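The agent's post-scrape filtering can be illustrated with a minimal, stdlib-only sketch. This is not the LangGraph agent itself (that logic lives in agent_webscraper/agent.py); the function name `filter_chunks` and the length/duplicate heuristics are assumptions for illustration.

```python
import re

def filter_chunks(raw_chunks, min_words=40, max_words=400):
    """Hypothetical post-scrape filter: drop short, overlong, and duplicate chunks.

    The real agent-based scraper applies its own filtering; this sketch only
    demonstrates the kind of normalization such a stage performs.
    """
    seen = set()
    kept = []
    for chunk in raw_chunks:
        text = re.sub(r"\s+", " ", chunk).strip()  # collapse whitespace runs
        word_count = len(text.split())
        if not (min_words <= word_count <= max_words):
            continue  # too short to be informative, or too long to chunk well
        key = text.lower()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(text)
    return kept
```

Deduplicating on the normalized, lowercased text catches the common case of the same paragraph scraped from multiple pages with differing whitespace.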
Key Features:
PDF documents are processed using Docling for structured content extraction:
```python
def process_pdf_documents(data_directory: str) -> List[str]:
    """
    Extracts structured content chunks from PDF documents.

    Args:
        data_directory: Path to directory containing PDF files

    Returns:
        List of processed content chunks ready for Q&A generation
    """
    # Implementation in chunk_generation.py
```
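After Docling extracts the text, it must be split into chunks. A minimal sketch of overlapping, word-based chunking is shown below; the function name `chunk_text` and the overlap parameter are assumptions, and the real logic in chunk_generation.py may differ.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted document text into overlapping word-based chunks.

    A simplified stand-in for the pipeline's chunking stage; the real
    pipeline extracts structured text with Docling before this step.
    """
    words = text.split()
    step = chunk_size - overlap  # advance leaves `overlap` words of context
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # final window already reached the end of the document
    return chunks
```

The overlap preserves context across chunk boundaries so that a Q&A pair generated from one chunk is less likely to depend on a sentence that was cut off.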
The core Q&A generation utilizes Google Gemini 2.0 Flash with carefully crafted prompts:
```python
def generate_qa_pairs(chunks: List[str], domain: str) -> List[Dict]:
    """
    Generates domain-specific question-answer pairs from content chunks.

    Args:
        chunks: List of content chunks for processing
        domain: Target domain for contextualized generation

    Returns:
        List of structured Q&A pairs with metadata
    """
    # Centralized prompts in prompts.py
    # Implementation in syntheticdatageneration.py
```
Prompt Engineering Strategy:
The system employs domain-agnostic prompt templates that can be customized for any subject area:
```python
def generation_prompt_template(domain: str) -> str:
    return f"""
    You are an expert in {domain}. Generate question-answer pairs
    that demonstrate deep understanding.

    Requirements:
    - Questions should be specific and detailed
    - Answers must be comprehensive and accurate
    - Include practical examples where relevant
    - Maintain professional tone throughout
    - For off-topic questions, politely decline and redirect

    Format each response as structured JSON...
    """
```
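Because the prompt requests structured JSON, the model's raw response must be parsed and validated before it enters the dataset. The sketch below assumes a response shaped as a JSON list of `{"question": ..., "answer": ...}` objects; the exact schema and the helper name `parse_qa_response` are assumptions, and the real parsing lives in syntheticdatageneration.py.

```python
import json

def parse_qa_response(raw_response, domain):
    """Parse a model response expected to contain a JSON list of Q&A objects.

    Illustrative only: the assumed schema is [{"question": ..., "answer": ...}].
    Malformed or incomplete generations are dropped rather than repaired.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # unparseable generation: discard and move on
    pairs = []
    for item in data if isinstance(data, list) else []:
        question = item.get("question", "").strip()
        answer = item.get("answer", "").strip()
        if question and answer:  # skip entries with an empty field
            pairs.append({"question": question, "answer": answer, "domain": domain})
    return pairs
```

Dropping malformed generations (instead of attempting repair) keeps the pipeline simple; the quality-assessment stage downstream provides a second line of defense.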
Each generated Q&A pair undergoes dual-dimensional evaluation:
```python
def assess_quality(qa_pair: Dict) -> Tuple[int, int]:
    """
    Evaluates Q&A pairs on accuracy and style dimensions.

    Returns:
        Tuple of (accuracy_score, style_score) both on 1-10 scale
    """
    # Implementation in dataquality_check.py
```
Evaluation Criteria:
Accuracy Score (1-10):
Style Score (1-10):
Quality Threshold: Both scores must exceed 6 for inclusion in the final dataset.
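The threshold rule above (both scores must exceed 6) can be expressed as a simple filter. The field names `accuracy_score` and `style_score` are assumed for illustration; the actual filtering implementation is in dataquality_check.py and the downstream formatting step.

```python
def filter_by_quality(qa_pairs, threshold=6):
    """Keep only pairs whose accuracy AND style scores both exceed the threshold.

    Field names are assumptions; the stated rule is a strict inequality,
    so a pair scoring exactly 6 on either dimension is excluded.
    """
    return [
        pair for pair in qa_pairs
        if pair.get("accuracy_score", 0) > threshold
        and pair.get("style_score", 0) > threshold
    ]
```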
PDF Documents + Web Sources
↓
Content Chunking
↓
Q&A Generation → dataset/raw.json
↓
Data Flattening → dataset/unfiltered.json
↓
Quality Assessment → dataset/quality_results.json
↓
Filtering & Formatting → final_dataset/filtered.json
↓
Model Training (QLoRA)
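The data-flattening step in the flow above (dataset/raw.json → dataset/unfiltered.json) can be sketched as follows. The nested per-chunk layout of raw.json is an assumption inferred from the per-chunk generation step, not a documented schema.

```python
import json

def flatten_dataset(raw_path, out_path):
    """Flatten per-chunk Q&A lists into one flat list of pairs.

    Assumes raw.json holds a list of lists (one inner list per content
    chunk); the real pipeline's file layout may differ.
    """
    with open(raw_path) as f:
        nested = json.load(f)  # e.g. [[{...}, {...}], [{...}], ...]
    flat = [pair for chunk_pairs in nested for pair in chunk_pairs]
    with open(out_path, "w") as f:
        json.dump(flat, f, indent=2)
    return flat
```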
```python
def create_chat_template(domain: str) -> str:
    """
    Generates domain-specific chat templates for consistent model behavior.
    """
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert assistant specializing in {domain}.
Provide accurate, detailed responses within your domain expertise.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{{{{ user_message }}}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
```
Traditional approaches to dataset generation face several limitations:
This methodology addresses these limitations through:
All system prompts and parameters are centralized in prompts.py for easy customization:
```python
# Domain-specific customization
DOMAIN = "your_target_domain"
CHUNK_SIZE = 500
QUALITY_THRESHOLD = 6
GENERATION_RATE_LIMIT = 4  # seconds between API calls
```
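One way the `GENERATION_RATE_LIMIT` setting could be enforced is a simple pacing loop around the API calls. This is a minimal sketch under that assumption; the pipeline's actual scheduling may be implemented differently.

```python
import time

def rate_limited(calls, min_interval=4):
    """Invoke a sequence of zero-argument callables, pausing between them.

    `min_interval` plays the role of GENERATION_RATE_LIMIT: seconds to
    wait between successive API requests to stay under provider quotas.
    """
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            time.sleep(min_interval)  # space out successive requests
        results.append(call())
    return results
```

Sleeping only between calls (not before the first) means a single request incurs no delay, while a batch of N requests is spread over roughly (N - 1) x min_interval seconds.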
This automated pipeline demonstrates that high-quality, domain-specific datasets can be generated efficiently without manual annotation. The combination of intelligent content extraction, rigorous quality assessment, and optimized training procedures produces models that exhibit strong domain expertise while maintaining appropriate response boundaries.
The methodology is particularly valuable for organizations requiring specialized AI assistants but lacking the resources for extensive manual dataset creation. Future work will focus on expanding the quality assessment framework and exploring multi-modal content integration.
The complete implementation is available as an open-source template:
Repository: Custom LLM Dataset Generation Template
The repository includes all source code, configuration files, and documentation necessary to reproduce this methodology for domain-specific fine-tuning projects.
This work was conducted as part of ongoing research into automated dataset generation for specialized LLM applications. All code and methodologies are provided under open-source licensing for research and commercial use.