This document presents a Retrieval-Augmented Generation (RAG) framework optimized for scientific and technical information using Korean-language academic datasets. Our system integrates state-of-the-art retrieval and generation techniques to enhance accuracy, reliability, and usability in scientific question answering (QA) tasks.
Key contributions include:
Hybrid Retriever Architecture: Combining a BM25 retriever (using the Kiwi Korean tokenizer) with a sentence-embedding retriever in an ensemble retriever, improving retrieval accuracy over either component alone.
LLM Evaluation for Answer Quality: Incorporating GPT-based evaluation metrics to assess contextual relevance and factual accuracy.
Optimization of Chunking Methods: Identifying a 500-token chunk size with a 50-token overlap as the best configuration for retrieval and generation.
LLM Selection & Prompt Optimization: Selecting Qwen2-7B-Instruct for its balanced performance across retrieval and answer generation tasks.
Advanced RAG Enhancements: Integrating a reranker and HyDE to refine retrieval accuracy, with notable improvements in answer generation.
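The contributions above do not specify how the ensemble retriever fuses the BM25 and embedding rankings; one common choice for combining heterogeneous retrievers (used, for example, by LangChain's EnsembleRetriever) is Reciprocal Rank Fusion (RRF). A minimal sketch with hypothetical document ids:

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is the value from the original RRF paper).
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each retriever contributes 1 / (k + rank + 1) for the doc.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked outputs from a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d3"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Documents ranked highly by both retrievers float to the top; RRF needs only ranks, so the incomparable BM25 and cosine-similarity scores never have to be normalized against each other.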
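The 500/50 chunking configuration reported above corresponds to a standard sliding-window splitter over token sequences; the tokenizer itself is assumed. A minimal sketch:

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token sequence into fixed-size windows with overlap.

    Defaults match the best configuration reported in this work
    (500-token chunks, 50-token overlap).
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each window starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```

Consecutive chunks share `overlap` tokens, so a sentence cut at a chunk boundary is still fully contained in at least one chunk.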
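The HyDE step works by having an LLM draft a hypothetical answer passage, then retrieving real documents similar to that draft rather than to the short query. The sketch below illustrates only the control flow: `fake_llm` is a stand-in for the generator, and the bag-of-words `embed`/`cosine` pair is a toy substitute for the sentence encoder used in the actual system.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def fake_llm(query):
    """Stand-in for the LLM that drafts a hypothetical answer passage."""
    return f"A passage answering the question: {query}"

def hyde_retrieve(query, corpus, top_k=1):
    # 1. Draft a hypothetical document for the query.
    hypothetical = fake_llm(query)
    # 2. Embed the hypothetical document instead of the raw query.
    q_vec = embed(hypothetical)
    # 3. Rank real documents by similarity to the hypothetical one.
    ranked = sorted(corpus, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:top_k]
```

The intuition is that a fluent (even partly wrong) draft answer lives closer to relevant passages in embedding space than a terse question does.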
The developed framework provides a robust and scalable solution for scientific literature QA, offering domain-specific reliability and enhanced retrieval precision.
1. Introduction
1.1 Research Motivation & Problem Statement
Large Language Models (LLMs) often hallucinate, relying solely on pre-trained data without incorporating external factual references. This is a critical issue for scientific and technical information retrieval, where accuracy and credibility are paramount. To address this, we develop a RAG system specialized for scientific information, integrating structured retrieval with generative AI to ensure high-fidelity, evidence-backed answers.
1.2 Scope & Contributions
This study explores RAG optimizations for scientific datasets, specifically:
Evaluating open-source RAG frameworks for performance on Korean-language scientific corpora.
Integrating KONI-Llama models for further Korean-language fine-tuning.
6. References
Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, Xipeng Qiu, "LLatrieval: LLM-Verified Retrieval for Verifiable Generation," Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), 2024.
Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels," Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023.
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour, "FineSurE: Fine-grained Summarization Evaluation using LLMs," Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.