This article presents a comprehensive methodology for automatically generating high-quality, domain-specific question-answer datasets for Large Language Model (LLM) fine-tuning. The proposed pipeline combines intelligent web scraping, document processing, automated Q&A generation, and rigorous quality assessment to create training datasets that are both cheaper to produce and better matched to the target domain than manually annotated or generic alternatives.
The fine-tuning of Large Language Models for domain-specific applications presents a fundamental challenge: acquiring high-quality training data. Traditional approaches either rely on expensive manual annotation processes or utilize generic datasets that fail to capture domain-specific nuances. This work introduces an automated pipeline that addresses these limitations while maintaining dataset quality standards suitable for production deployments.
The proposed system implements a six-stage pipeline, summarized in the data-flow diagram later in this article.
A critical innovation of this approach is a carefully calibrated distribution between domain-expertise examples and boundary-recognition examples (off-topic questions paired with polite refusals).
This distribution ratio was determined empirically through extensive testing and balances domain expertise against appropriate boundary recognition.
The web scraping component utilizes LangGraph to implement an intelligent agent-based approach:
```python
def extract_web_content(topic: str, chunk_count: int) -> List[str]:
    """
    Extracts domain-specific content chunks from web sources.

    Args:
        topic: Domain or subject area for content extraction
        chunk_count: Target number of content chunks to generate

    Returns:
        List of filtered, high-quality content chunks
    """
    # Implementation details in agent_webscraper/agent.py
```
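The agent's post-scrape filtering can be illustrated with a minimal, stdlib-only sketch. This is not the LangGraph agent itself (that logic lives in agent_webscraper/agent.py); the function name `filter_chunks` and the length/duplicate heuristics are assumptions for illustration.

```python
import re

def filter_chunks(raw_chunks, min_words=40, max_words=400):
    """Hypothetical post-scrape filter: drop short, overlong, and duplicate chunks.

    The real agent-based scraper applies its own filtering; this sketch only
    demonstrates the kind of normalization such a stage performs.
    """
    seen = set()
    kept = []
    for chunk in raw_chunks:
        text = re.sub(r"\s+", " ", chunk).strip()  # collapse whitespace runs
        word_count = len(text.split())
        if not (min_words <= word_count <= max_words):
            continue  # too short to be informative, or too long to chunk well
        key = text.lower()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(text)
    return kept
```

Deduplicating on the normalized, lowercased text catches the common case of the same paragraph scraped from multiple pages with differing whitespace.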
Key Features:
PDF documents are processed using Docling for structured content extraction:
```python
def process_pdf_documents(data_directory: str) -> List[str]:
    """
    Extracts structured content chunks from PDF documents.

    Args:
        data_directory: Path to directory containing PDF files

    Returns:
        List of processed content chunks ready for Q&A generation
    """
    # Implementation in chunk_generation.py
```
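After Docling extracts the text, it must be split into chunks. A minimal sketch of overlapping, word-based chunking is shown below; the function name `chunk_text` and the overlap parameter are assumptions, and the real logic in chunk_generation.py may differ.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted document text into overlapping word-based chunks.

    A simplified stand-in for the pipeline's chunking stage; the real
    pipeline extracts structured text with Docling before this step.
    """
    words = text.split()
    step = chunk_size - overlap  # advance leaves `overlap` words of context
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # final window already reached the end of the document
    return chunks
```

The overlap preserves context across chunk boundaries so that a Q&A pair generated from one chunk is less likely to depend on a sentence that was cut off.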
The core Q&A generation utilizes Google Gemini 2.0 Flash with carefully crafted prompts:
```python
def generate_qa_pairs(chunks: List[str], domain: str) -> List[Dict]:
    """
    Generates domain-specific question-answer pairs from content chunks.

    Args:
        chunks: List of content chunks for processing
        domain: Target domain for contextualized generation

    Returns:
        List of structured Q&A pairs with metadata
    """
    # Centralized prompts in prompts.py
    # Implementation in syntheticdatageneration.py
```
Prompt Engineering Strategy:
The system employs domain-agnostic prompt templates that can be customized for any subject area:
```python
def generation_prompt_template(domain: str) -> str:
    return f"""
    You are an expert in {domain}. Generate question-answer pairs
    that demonstrate deep understanding.

    Requirements:
    - Questions should be specific and detailed
    - Answers must be comprehensive and accurate
    - Include practical examples where relevant
    - Maintain professional tone throughout
    - For off-topic questions, politely decline and redirect

    Format each response as structured JSON...
    """
```
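Because the prompt requests structured JSON, the model's raw response must be parsed and validated before it enters the dataset. The sketch below assumes a response shaped as a JSON list of `{"question": ..., "answer": ...}` objects; the exact schema and the helper name `parse_qa_response` are assumptions, and the real parsing lives in syntheticdatageneration.py.

```python
import json

def parse_qa_response(raw_response, domain):
    """Parse a model response expected to contain a JSON list of Q&A objects.

    Illustrative only: the assumed schema is [{"question": ..., "answer": ...}].
    Malformed or incomplete generations are dropped rather than repaired.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # unparseable generation: discard and move on
    pairs = []
    for item in data if isinstance(data, list) else []:
        question = item.get("question", "").strip()
        answer = item.get("answer", "").strip()
        if question and answer:  # skip entries with an empty field
            pairs.append({"question": question, "answer": answer, "domain": domain})
    return pairs
```

Dropping malformed generations (instead of attempting repair) keeps the pipeline simple; the quality-assessment stage downstream provides a second line of defense.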
Each generated Q&A pair undergoes dual-dimensional evaluation:
```python
def assess_quality(qa_pair: Dict) -> Tuple[int, int]:
    """
    Evaluates Q&A pairs on accuracy and style dimensions.

    Returns:
        Tuple of (accuracy_score, style_score) both on 1-10 scale
    """
    # Implementation in dataquality_check.py
```
Evaluation Criteria:
Accuracy Score (1-10):
Style Score (1-10):
Quality Threshold: Both scores must exceed 6 for inclusion in the final dataset.
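The threshold rule above (both scores must exceed 6) can be expressed as a simple filter. The field names `accuracy_score` and `style_score` are assumed for illustration; the actual filtering implementation is in dataquality_check.py and the downstream formatting step.

```python
def filter_by_quality(qa_pairs, threshold=6):
    """Keep only pairs whose accuracy AND style scores both exceed the threshold.

    Field names are assumptions; the stated rule is a strict inequality,
    so a pair scoring exactly 6 on either dimension is excluded.
    """
    return [
        pair for pair in qa_pairs
        if pair.get("accuracy_score", 0) > threshold
        and pair.get("style_score", 0) > threshold
    ]
```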
PDF Documents + Web Sources
↓
Content Chunking
↓
Q&A Generation → dataset/raw.json
↓
Data Flattening → dataset/unfiltered.json
↓
Quality Assessment → dataset/quality_results.json
↓
Filtering & Formatting → final_dataset/filtered.json
↓
Model Training (QLoRA)
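The data-flattening step in the flow above (dataset/raw.json → dataset/unfiltered.json) can be sketched as follows. The nested per-chunk layout of raw.json is an assumption inferred from the per-chunk generation step, not a documented schema.

```python
import json

def flatten_dataset(raw_path, out_path):
    """Flatten per-chunk Q&A lists into one flat list of pairs.

    Assumes raw.json holds a list of lists (one inner list per content
    chunk); the real pipeline's file layout may differ.
    """
    with open(raw_path) as f:
        nested = json.load(f)  # e.g. [[{...}, {...}], [{...}], ...]
    flat = [pair for chunk_pairs in nested for pair in chunk_pairs]
    with open(out_path, "w") as f:
        json.dump(flat, f, indent=2)
    return flat
```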
```python
def create_chat_template(domain: str) -> str:
    """
    Generates domain-specific chat templates for consistent model behavior.
    """
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert assistant specializing in {domain}.
Provide accurate, detailed responses within your domain expertise.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{{{{ user_message }}}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
```
Traditional approaches to dataset generation face several limitations:
This methodology addresses these limitations through:
All system prompts and parameters are centralized in prompts.py for easy customization:
```python
# Domain-specific customization
DOMAIN = "your_target_domain"
CHUNK_SIZE = 500
QUALITY_THRESHOLD = 6
GENERATION_RATE_LIMIT = 4  # seconds between API calls
```
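One way the `GENERATION_RATE_LIMIT` setting could be enforced is a simple pacing loop around the API calls. This is a minimal sketch under that assumption; the pipeline's actual scheduling may be implemented differently.

```python
import time

def rate_limited(calls, min_interval=4):
    """Invoke a sequence of zero-argument callables, pausing between them.

    `min_interval` plays the role of GENERATION_RATE_LIMIT: seconds to
    wait between successive API requests to stay under provider quotas.
    """
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            time.sleep(min_interval)  # space out successive requests
        results.append(call())
    return results
```

Sleeping only between calls (not before the first) means a single request incurs no delay, while a batch of N requests is spread over roughly (N - 1) x min_interval seconds.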
This automated pipeline demonstrates that high-quality, domain-specific datasets can be generated efficiently without manual annotation. The combination of intelligent content extraction, rigorous quality assessment, and optimized training procedures produces models that exhibit strong domain expertise while maintaining appropriate response boundaries.
The methodology is particularly valuable for organizations requiring specialized AI assistants but lacking the resources for extensive manual dataset creation. Future work will focus on expanding the quality assessment framework and exploring multi-modal content integration.
The complete implementation is available as an open-source template:
Repository: Custom LLM Dataset Generation Template
The repository includes all source code, configuration files, and documentation necessary to reproduce this methodology for domain-specific fine-tuning projects.
This work was conducted as part of ongoing research into automated dataset generation for specialized LLM applications. All code and methodologies are provided under open-source licensing for research and commercial use.