
The project aims to create, train, and use a language model tailored for working with lecture materials in PDF format. It covers a complete data processing pipeline, from text extraction to model fine-tuning using modern techniques such as LoRA, as well as preparation for semantic search. The speed and quality of training depend directly on the power of your PC. For example, I have a GTX 1070 and 32 GB of RAM, which is not the most powerful hardware. The only problem I've encountered on it is very slow training and response generation.
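extract_text.py – extracts text from PDFs using the pdfminer library and saves it in a text format. This process prepares the data for subsequent processing stages.
split_text.py – splits the text into chunks.
augment_text.py – augments the text using the nlpaug library. The script generates several variants of each text file by randomly swapping words.
create_dataset.py – creates the dataset and splits it using the train_test_split function from scikit-learn. This structuring facilitates efficient model training.
train_llama.py – fine-tunes the language model using Trainer from Hugging Face Transformers.
create_faiss_index.py – prepares a FAISS index for semantic search.

data/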
Contains:
... datasets (train, val, and test)

scripts/
This directory contains all scripts for data processing and model training:
extract_text.py – text extraction from PDFs
split_text.py – splitting text into chunks
augment_text.py – text augmentation
create_dataset.py – dataset creation and splitting
train_llama.py – language model fine-tuning
chat_interface_web.py – interacting with the AI in a web browser (locally); a very simple interface, designed for convenience

models/
Contains saved models, training logs, and fine-tuning results.
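
The data-preparation steps described above (chunking, word-swap augmentation, and the train/val/test split) can be pictured with a small standard-library sketch. This is not the code from the project's scripts: the function names, the word-based chunk size and overlap, and the split fractions are all assumptions made for illustration, and the plain random swap only stands in for what nlpaug and scikit-learn's train_test_split do in the real pipeline.

```python
import random


def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words.

    chunk_size and overlap are illustrative defaults, not project settings.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def swap_augment(text, n_swaps=1, rng=None):
    """Return a variant of text with n_swaps random pairs of words swapped.

    A simplified stand-in for library-based augmentation, not nlpaug itself.
    """
    rng = rng or random.Random()
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle items and cut them into train, val, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    lecture = " ".join(f"word{i}" for i in range(120))  # placeholder text
    chunks = split_into_chunks(lecture, chunk_size=50, overlap=10)
    rng = random.Random(42)
    variants = [swap_augment(c, n_swaps=2, rng=rng) for c in chunks for _ in range(3)]
    train, val, test = train_val_test_split(chunks + variants, val_frac=0.25, test_frac=0.25)
    print(len(chunks), len(variants), len(train), len(val), len(test))
```

Passing a seeded random.Random makes the augmented variants and the split reproducible, which helps when comparing training runs on slow hardware.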
Additional details and installation instructions are provided in the README.md file.
The project represents a comprehensive solution for building a language model capable of processing lecture materials. It automates data preparation—from extracting text from PDFs to splitting, augmenting, and structuring the dataset—and then employs modern fine-tuning techniques (such as LoRA) to adapt the LLaMA model. Potential integration with FAISS opens up prospects for implementing semantic search and interactive systems, making the project a valuable tool for educational and research purposes.
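
To make the semantic search idea concrete, here is a minimal sketch of the underlying operation: embed every chunk, embed the query, and return the chunks with the highest cosine similarity. It deliberately does not use FAISS and is not the project's create_faiss_index.py. The bag-of-words embed function is a toy stand-in for a real embedding model, and the brute-force loop only mirrors, at small scale, the exact nearest-neighbour search that a vector index would perform. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Placeholder lecture chunks, invented for this example.
    lecture_chunks = [
        "gradient descent updates the weights of a model",
        "the mitochondria is the powerhouse of the cell",
        "LoRA adds low rank matrices to frozen weights",
    ]
    for score, chunk in search("how are model weights updated", lecture_chunks):
        print(f"{score:.3f}  {chunk}")
```

With real sentence embeddings in place of word counts, the same query-then-rank flow is what a chat interface could use to pull relevant lecture chunks before generating an answer.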