A variety of programs already exist to convert OCR'd PDF files to the EPUB format. However, the vast majority of them rely on hand-written, rule-based conversion logic. Although such conversions are usually less resource intensive and can run on lower-powered computers, they come with a number of issues.
Because of the way large language models work under the hood, they are often far better at deciding how the text on a page should be converted, without any of that complex logic having to be built into the program itself.
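As a rough sketch of the idea (the function name and prompt wording here are my own illustration, not the program's actual prompt), an LLM-based converter can push all of the formatting decisions into a single natural-language instruction, rather than into branching detection logic:

```python
def build_prompt(raw_page_text: str) -> str:
    """Wrap raw OCR text in an instruction for the LLM.

    One instruction replaces the hand-written heuristics (heading
    detection, hyphenation repair, header/footer stripping, etc.)
    that a purely rule-based converter would need to implement.
    """
    return (
        "Reformat the following OCR text into clean paragraphs. "
        "Join words hyphenated across line breaks, remove page "
        "headers and page numbers, and do not add any commentary.\n\n"
        + raw_page_text
    )
```

The LLM then does the "differentiating" described above; the surrounding program only has to send the prompt and collect the reply.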
Using a simple tkinter interface, the user selects a PDF file on their computer to be scanned, then enters the chapter name along with the starting and ending pages of that chapter. When the user runs the program, the text from those pages is analysed by the LLM (currently only Meta's Llama 3.1 8-billion-parameter model is supported), which returns the correctly formatted text in a form that can be inserted into an EPUB file.
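The data flow just described can be sketched as a single function. This is a simplified illustration, not the program's actual code: the function names are hypothetical, and the LLM call is passed in as a plain callable (in the real program that step goes through Ollama, and page text comes from pymupdf):

```python
from typing import Callable

def convert_chapter(
    pages: list[str],
    chapter_title: str,
    llm_format: Callable[[str], str],
) -> str:
    """Join the selected pages' text, send it through the LLM,
    and wrap the result as an XHTML chapter body.

    With ebooklib, the returned string would become the content of
    an epub.EpubHtml item added to the book.
    """
    raw_text = "\n".join(pages)
    formatted = llm_format(raw_text)
    paragraphs = "".join(
        f"<p>{line}</p>" for line in formatted.splitlines() if line.strip()
    )
    return f"<h1>{chapter_title}</h1>{paragraphs}"
```

Keeping the LLM behind a callable like this also makes the pipeline easy to exercise without a model running, by substituting a stand-in function.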
State of the program
The program currently works, albeit in a limited capacity: PDF files can be analysed and converted into EPUB files. There are a number of improvements I plan to make in the future, including refining the GUI and shrinking the codebase by removing duplicated code.
The program's main limitation is the quality of the LLMs it runs on. In my own experiments with more complex prompts on models such as Claude 3 and GPT-4o, using simple texts, the output has always been flawless, with zero errors, and has always been returned in a form that an EPUB file can interpret and handle correctly. Running the Llama 3.1 8-billion-parameter model locally, however, although it performs very well most of the time with the simple prompt, issues occasionally crop up, such as unnecessary footnotes being added.
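One way to blunt that kind of failure is a post-processing pass over the model's output. The heuristic below is my own illustration of the idea, not something the program currently does, and the patterns it matches are guesses at what an invented footnote might look like:

```python
import re

def strip_invented_footnotes(text: str) -> str:
    """Drop lines that look like model-invented footnotes,
    e.g. '[1] ...' or 'Note: ...'.

    Purely heuristic: a real filter would need tuning against
    the footnote styles the local model actually produces.
    """
    kept = [
        line for line in text.splitlines()
        if not re.match(r"^\s*(\[\d+\]|Note:)", line)
    ]
    return "\n".join(kept)
```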
Until we get more powerful models that can work locally and that produce results akin to those of the currently more powerful models, this program will be artificially limited by the LLM itself, and not the external programming and logic surrounding it.
This is mostly a fun personal project that lets me experiment with Python's default GUI toolkit, tkinter; some of the third-party Python libraries for working with PDF and EPUB files (pymupdf and ebooklib); and, briefly, Ollama and LangChain for the AI side.
I was inspired to do this project after browsing many of the free PDF books on the Internet Archive's website, archive.org. Thinking about the incredible amount of knowledge in those books, and about how much longer and more complicated it would be for humans to convert them to another file format with current methods, convinced me there had to be a better way.
If this kind of content could be extracted far more easily in the future, access to it could be sped up significantly. Additional features could include adding metadata that isn't currently present. There is significant work that can and must be done to make current LLMs handle this kind of content well, but I believe that if the will were there, it would get done.
There are no datasets linked
There are no models linked