The project aims to create, train, and use a language model tailored to lecture materials in PDF format. It covers a complete data processing pipeline, from text extraction to model fine-tuning with modern techniques such as LoRA, as well as preparation for semantic search. Training speed and quality depend directly on the power of your PC. For example, I have a GTX 1070 and 32 GB of RAM, which is far from the most powerful hardware. The only problem I've encountered is very slow training and response generation.
extract_text.py – extracts text from PDF files using the pdfminer library and saves it in a text format. This process prepares the data for subsequent processing stages.
split_text.py – splits the extracted text into chunks.
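To make the first two stages (extract_text.py and split_text.py) more concrete, here is a minimal sketch of what they could look like. It is not the project's actual code. It assumes the pdfminer.six package and its `extract_text` function from `pdfminer.high_level`. The word-based chunking, the chunk sizes, the output file naming, and the helper names `extract_pdf_text`, `split_into_chunks`, and `process_pdf` are all assumptions made for illustration.

```python
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> str:
    """Return the plain text of one PDF using pdfminer.six."""
    # Imported lazily so the chunking helper below works without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def split_into_chunks(text: str, chunk_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks; neighbouring chunks share `overlap` words.

    The sizes are hypothetical defaults, not values taken from the project.
    """
    if chunk_words <= overlap:
        raise ValueError("chunk_words must be larger than overlap")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def process_pdf(pdf_path: str, out_dir: str) -> int:
    """Extract one PDF, write each chunk to its own .txt file, return the chunk count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks = split_into_chunks(extract_pdf_text(pdf_path))
    stem = Path(pdf_path).stem
    for i, chunk in enumerate(chunks):
        (out / f"{stem}_{i:04d}.txt").write_text(chunk, encoding="utf-8")
    return len(chunks)
```

The small overlap between neighbouring chunks is a design choice to keep sentences that straddle a chunk boundary visible in both chunks. Set `overlap=0` if that is not wanted.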
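augment_text.py – augments the text using the nlpaug library. The script generates several variants of each text file by randomly swapping words.
create_dataset.py – creates the dataset and splits it using the train_test_split function from scikit-learn. This structuring facilitates efficient model training.
train_llama.py – fine-tunes the language model using Trainer from Hugging Face Transformers.
create_faiss_index.py – prepares a FAISS index for semantic search.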
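The augmentation and dataset-splitting steps (augment_text.py and create_dataset.py) can be sketched roughly as follows. This is an illustration under assumptions, not the project's code. `swap_variants` is a plain-Python stand-in that shows the word-swap idea. `nlpaug_variants` assumes the script uses nlpaug's `RandomWordAug` with `action="swap"`, which is a guess. The 80/10/10 proportions and all function names are hypothetical.

```python
import random


def swap_variants(text: str, n_variants: int = 3, swaps: int = 2, seed: int = 0) -> list[str]:
    """Create variants of `text` by swapping randomly chosen adjacent word pairs."""
    rng = random.Random(seed)
    words = text.split()
    variants = []
    for _ in range(n_variants):
        w = words.copy()
        # A text with fewer than two words has nothing to swap.
        for _ in range(swaps if len(w) > 1 else 0):
            i = rng.randrange(len(w) - 1)
            w[i], w[i + 1] = w[i + 1], w[i]
        variants.append(" ".join(w))
    return variants


def nlpaug_variants(text: str, n_variants: int = 3):
    """The same idea via nlpaug (assumed augmenter; needs `pip install nlpaug`)."""
    import nlpaug.augmenter.word as naw
    return naw.RandomWordAug(action="swap").augment(text, n=n_variants)


def split_dataset(samples, val_size: float = 0.1, test_size: float = 0.1, seed: int = 42):
    """Split samples into train/val/test with two calls to train_test_split."""
    from sklearn.model_selection import train_test_split
    n = len(samples)
    # Absolute counts avoid rounding surprises when fractions are applied twice.
    n_val, n_test = round(n * val_size), round(n * test_size)
    rest, test = train_test_split(samples, test_size=n_test, random_state=seed)
    train, val = train_test_split(rest, test_size=n_val, random_state=seed)
    return train, val, test
```

One thing to watch: if the split is done after augmentation, variants of the same source text can end up in both train and test. Splitting the original chunks first and augmenting only the training part avoids that.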
data/
Contains the project data, including the dataset splits (train, val, and test).
scripts/
This directory contains all scripts for data processing and model training:
extract_text.py – text extraction from PDFs
split_text.py – splitting text into chunks
augment_text.py – text augmentation
create_dataset.py – dataset creation and splitting
train_llama.py – language model fine-tuning
chat_interface_web.py – interacting with the AI in a web browser (locally); a very simple interface, designed for convenience
models/
Contains saved models, training logs, and fine-tuning results.
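For illustration, here is a rough sketch of how a LoRA fine-tuning run along the lines of train_llama.py could write its results into models/. It is a sketch under stated assumptions, not the project's script. It assumes the peft library for the LoRA adapters (the project only names LoRA and the Hugging Face Trainer), datasets that are already tokenized, and a hypothetical output path `models/lora`. The hyperparameters are placeholders. `lora_param_counts` only does the arithmetic that explains why LoRA is attractive on modest hardware.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d_out x d_in weight matrix.

    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in) instead.
    """
    return d_in * d_out, rank * (d_in + d_out)


def finetune_with_lora(base_model: str, train_ds, val_ds, out_dir: str = "models/lora"):
    """Attach LoRA adapters to a causal LM and train it with the Hugging Face Trainer.

    `train_ds` and `val_ds` are assumed to be tokenized datasets with `input_ids`.
    """
    # Heavy imports are kept local; this needs transformers, peft and torch.
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # reuse EOS when no pad token is defined
    model = AutoModelForCausalLM.from_pretrained(base_model)

    # Placeholder hyperparameters; the target module names fit LLaMA-style attention layers.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    args = TrainingArguments(output_dir=out_dir, per_device_train_batch_size=1,
                             gradient_accumulation_steps=8, num_train_epochs=1,
                             logging_steps=10)
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
                      data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False))
    trainer.train()
    model.save_pretrained(out_dir)  # for a PEFT model this saves only the adapter weights
    return trainer
```

For a single 4096 x 4096 projection, `lora_param_counts(4096, 4096, 8)` gives 16,777,216 parameters for full fine-tuning against 65,536 for LoRA, about 0.4% of the original. A batch size of 1 with gradient accumulation is a common way to fit training into limited GPU memory, at the cost of speed.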
Additional details and installation instructions are provided in the README.md file.
The project represents a comprehensive solution for building a language model capable of processing lecture materials. It automates data preparation (extracting text from PDFs, then splitting, augmenting, and structuring the dataset) and then employs modern fine-tuning techniques such as LoRA to adapt the LLaMA model. Potential integration with FAISS opens up prospects for semantic search and interactive systems, making the project a useful tool for educational and research purposes.
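As a rough idea of what the FAISS integration could look like, the sketch below builds an exact cosine-similarity index over chunk embeddings and queries it. It is only an illustration. It assumes the faiss-cpu and numpy packages, and it assumes the embeddings come from some separate embedding model that is not specified here. The function names `build_index` and `search` are hypothetical.

```python
def build_index(embeddings):
    """Build an exact inner-product FAISS index over L2-normalised vectors."""
    import faiss  # pip install faiss-cpu
    import numpy as np
    # Copy to float32 so the caller's array is not modified by the in-place normalisation.
    vectors = np.array(embeddings, dtype="float32", order="C")
    faiss.normalize_L2(vectors)  # after this, inner product equals cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index


def search(index, query, k: int = 3):
    """Return (scores, ids) of the k stored vectors most similar to one query vector."""
    import faiss
    import numpy as np
    q = np.array(query, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return scores[0], ids[0]
```

The returned ids are positions in the embedding array, so the chunk texts need to be kept in the same order to map results back to lecture passages. `faiss.write_index` can persist the index to disk between runs.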