Manual labeling of narrative clinical documents not only burdens healthcare workers but also increases the risk of classification errors and delays. Integrating artificial intelligence (AI) into the medical field enables the automation of this classification process. This research aims to identify the optimal combination of feature engineering technique and classification model for accurate narrative document classification. A robust pre-processing pipeline is also introduced to enhance data quality and model performance. The analysis covers three categories of feature engineering techniques (feature representation, word embeddings, and contextual embeddings) and three types of classification models (traditional machine learning algorithms, sequence models, and BERT transformers). The results indicate that a voting classifier with TF-IDF representation outperforms the other combinations, achieving an accuracy of 89% and an F1-score of 88.6%. These findings demonstrate the potential of AI to streamline clinical documentation processes and suggest that integrating such tools can enhance healthcare providers' performance.
Healthcare workers spend approximately 27% of their working hours on direct patient care and 49% on desk work, including labeling clinical documents. Despite this significant investment of time, clinical documents are often mislabeled, misplaced, and lost, which leads to delays in medical work. In 2019, the World Health Organization (WHO) reported that around 2.6 million deaths in low- and middle-income countries were due to medical errors, with approximately 58% of reported medication errors linked to usability issues in clinical documents.
To address these challenges, artificial intelligence (AI) can assist the healthcare system by helping medical staff work more effectively and reducing their workload in tasks like automated document classification. Since almost 80% of clinical data is locked in unstructured formats, Natural Language Processing (NLP), a sub-field of AI designed to process and analyze unstructured text, is well suited to this task.
There is still limited research focused on classifying narrative clinical documents. These documents, which contain free-text descriptions of patients' conditions, are crucial for clinical decision-making. Their unstructured format makes them difficult to automatically process without effective pre-processing techniques.
Our research objective is to identify the optimal combination of feature engineering techniques and classification models that enhances accuracy and efficiency in classifying narrative clinical documents, thereby reducing the manual workload of healthcare workers.
Our key contributions to the medical field include: a robust pre-processing pipeline tailored to narrative clinical documents, a systematic comparison of three categories of feature engineering techniques against three categories of classification models, a data augmentation strategy for small clinical datasets, and a demo web application showcasing the selected model.
The methodology follows a systematic approach to transforming clinical documents into predicted labels through a series of well-defined steps: text pre-processing, feature engineering, classification, and validation & evaluation.
Obtaining clinical datasets while preserving patient privacy is challenging, which is why previous research relying on real documents tends to refrain from sharing them. We therefore used an open-source dataset from Kaggle. It comprises 500 clinical documents in plain-text format, with an average length of 2,300 words. The documents lack a defined or specific format, and the variance in document length is significant, necessitating advanced techniques for analysis. Originally in paper format, the documents were digitized using Optical Character Recognition (OCR). The dataset is divided into five medical categories: Neurology, Radiology, Discharge Summary, General Medicine, and Gastroenterology, with approximately 100 documents per category.
Because the dataset is small, data augmentation was necessary to increase its size and enhance model performance, robustness, and generalization. ChatGPT prompts were used to generate free-text documents inspired by the layout and writing style of documents in the dataset; procedure names for the neurology, radiology, and gastroenterology classes were supplied to ChatGPT from linked procedure lists. A total of 250 documents were generated, 50 per class, ensuring balanced representation across all categories.
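As a rough illustration, the following is a minimal sketch of how such augmentation prompts could be scripted with the OpenAI Python client; the original documents may have been generated directly through the ChatGPT interface, and the model name and prompt wording here are assumptions.

```python
# Hypothetical sketch of scripted augmentation via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Write a free-text clinical document for the '{category}' specialty, "
    "describing the procedure '{procedure}'. Mimic the layout and writing "
    "style of a scanned, OCR-digitized clinical note."
)

def generate_document(category: str, procedure: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not the one used in the study
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            category=category, procedure=procedure)}],
    )
    return response.choices[0].message.content

# e.g. repeated 50 times per augmented class with different procedure names
doc = generate_document("neurology", "lumbar puncture")
```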
The pre-processing stage aims to enhance the quality of the input data by addressing the various linguistic irregularities present in unstructured documents, ensuring that subsequent stages operate on standardized data and thereby improving classification performance. The following five techniques are applied sequentially to address data quality and consistency issues (a combined code sketch follows the list):
Acronym Expansion: is achieved by creating a dictionary of commonly used acronyms within each class and mapping them to their expanded forms in lowercase. The dictionary contains 195 acronyms with their expansions. Since abbreviations are very common in clinical text, this step helps the model interpret expressions it would otherwise not recognize, enhancing clinical document processing.
Regular Expression (RE) Matching: uses Python's re module to cleanse the text of various forms of noise that could mislead the model during analysis. To ensure the anonymity of individuals mentioned in the documents, patients', doctors', and hospitals' names and titles were removed. Irrelevant elements that do not contribute to the semantic content, such as digits, dates, extra spaces, newline characters, single characters, and special characters, were also eliminated.
Tokenization: is the process of breaking a text down into units (tokens). Unlike simple whitespace splitting, proper tokenization handles the ambiguity between words and special cases. This step is essential for initiating any NLP task and ensuring precise text analysis.
Lemmatization: is a technique that reduces words to their base (dictionary) form, mapping verbs to their infinitives and plural nouns to their singular forms. Its benefits include reducing the vocabulary size and consolidating related word forms into one, thereby facilitating better classification.
Stop-word Removal: involves eliminating commonly used words that do not carry meaningful information, such as articles, prepositions, and conjunctions. The main aim is to reduce the dimensionality of the text data and improve computational efficiency. A total of 233 words, covering frequent adjectives, adverbs, and medical expressions used in clinical contexts, were added to the predefined stop-word lists of the spaCy and NLTK libraries.
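The sketch below ties the five steps together using spaCy; the acronym dictionary and extra stop-words shown are small illustrative samples, not the 195-entry and 233-word lists used in the study, and name/title removal is omitted for brevity.

```python
# Minimal sketch of the five-step pre-processing pipeline.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

ACRONYMS = {"pt": "patient", "hx": "history", "c/o": "complains of"}  # sample entries
EXTRA_STOPWORDS = {"normal", "mild", "noted"}  # sample clinical stop-word additions

def preprocess(text: str) -> list[str]:
    text = text.lower()
    # 1. Acronym expansion via dictionary lookup
    for short, long in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", long, text)
    # 2. Regular-expression cleansing: digits, special characters,
    #    single characters, and extra whitespace
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\b[a-z]\b", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    tokens = []
    for token in nlp(text):        # 3. Tokenization
        lemma = token.lemma_       # 4. Lemmatization
        # 5. Stop-word removal (spaCy defaults plus domain-specific additions)
        if not token.is_stop and lemma not in EXTRA_STOPWORDS and lemma.strip():
            tokens.append(lemma)
    return tokens
```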
Below is a comparison of a discharge summary document before and after applying pre-processing steps. The average document size was significantly reduced from 2300 to 750 words, indicating successful data refinement.
Discharge Summary Document Before Pre-processing
Discharge Summary Document After Pre-processing
Feature engineering focuses on transforming raw text into a numerical format suitable for model learning. The choice of technique was aligned with the classifier type to optimize overall model performance.
The techniques used fall into three categories: feature representation (Bag of Words, TF-IDF, and Doc2Vec), word embeddings (static GloVe and FastText embeddings as well as dynamic embeddings), and contextual embeddings; a short sketch of the TF-IDF and Doc2Vec routes follows the summary table below.
We likewise classified the algorithms into three categories: traditional machine learning algorithms, sequence models, and BERT transformers.
Our goal was to identify the top-performing model within each category. The table below provides a summary of the classification models across these categories, along with the feature engineering techniques applied.
Category | Feature Engineering | Classification Models |
---|---|---|
Category 1: Traditional ML | Bag of Words, TF-IDF, Doc2Vec | RF, SVM, NB, KNN, and MLP |
Category 2: Sequence Models | Word embeddings: static (GloVe & FastText) and dynamic | RNN, LSTM, and GRU |
Category 3: Transformers | Contextual embeddings | BERT Base, BioClinicalBERT, PubMedBERT, SciBERT, ClinicalBERT |
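As a brief sketch of the two feature-representation routes compared later, the following shows how TF-IDF and Doc2Vec features could be produced with Scikit-learn and Gensim; vector sizes, epochs, and the `preprocessed_docs` variable are illustrative assumptions.

```python
# Sketch of the TF-IDF and Doc2Vec feature-representation routes.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer

# `preprocessed_docs`: list of token lists from the pre-processing pipeline
texts = [" ".join(tokens) for tokens in preprocessed_docs]

# TF-IDF: sparse matrix of term weights
tfidf = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf.fit_transform(texts)

# Doc2Vec: dense per-document vectors learned from the corpus
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(preprocessed_docs)]
d2v = Doc2Vec(vector_size=100, min_count=2, epochs=40)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
X_d2v = [d2v.infer_vector(tokens) for tokens in preprocessed_docs]
```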
The dataset was split into 80% for training and 20% for testing. Hyperparameter optimization techniques, including grid search for ML models, Keras Tuner for sequence models, and k-fold cross-validation, were used to optimize the performance of all models.
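A minimal sketch of the split and grid-search setup with Scikit-learn follows; `documents` and `labels` stand in for the pre-processed texts and their classes, and the parameter grid values are assumptions, not the study's exact grid.

```python
# Illustrative 80/20 split and grid search with 5-fold cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.20, stratify=labels, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC()),
])
param_grid = {                       # hypothetical search space
    "tfidf__max_features": [5000, 10000],
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```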
Evaluation metrics include accuracy, precision, recall, and F1-score.
Our main focus is accuracy, since the balanced dataset makes it a reliable indicator of overall performance across classes. We also consider the F1-score, which balances precision and recall across classes, to ensure that the model identifies each class accurately and avoids bias.
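Building on the grid-search sketch above, these metrics can be computed with Scikit-learn as follows; macro averaging is assumed here since the classes are balanced.

```python
# Computing the four reported metrics on the held-out test split.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_pred = search.predict(X_test)  # `search` from the grid-search sketch above
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro")
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
```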
We used the Python programming language to implement the classification models. The NLP and deep learning libraries and frameworks employed are the Natural Language Toolkit (NLTK), spaCy, the re module, Scikit-learn, Gensim, TensorFlow, PyTorch, and Hugging Face Transformers.
The table below shows the optimized performance of each algorithm using TF-IDF and Doc2Vec. Given the limited data, TF-IDF generally achieved better results than Doc2Vec, which needs more training data to learn reliable document vectors. Using the TF-IDF technique, KNN demonstrates the highest accuracy and precision, while SVM achieves the highest recall and F1-score.
Model | TF-IDF Accuracy | TF-IDF Precision | TF-IDF Recall | TF-IDF F1-Score | Doc2Vec Accuracy | Doc2Vec Precision | Doc2Vec Recall | Doc2Vec F1-Score |
---|---|---|---|---|---|---|---|---|
RF | 0.80 | 0.81 | 0.81 | 0.80 | 0.86 | 0.86 | 0.859 | 0.858 |
NB | 0.85 | 0.85 | 0.84 | 0.844 | 0.77 | 0.77 | 0.77 | 0.77 |
KNN | 0.89 | 0.885 | 0.87 | 0.876 | 0.76 | 0.77 | 0.76 | 0.75 |
SVM | 0.88 | 0.884 | 0.88 | 0.88 | 0.84 | 0.84 | 0.85 | 0.84 |
MLP | 0.85 | 0.85 | 0.846 | 0.848 | 0.81 | 0.81 | 0.81 | 0.81 |
Voting Classifier | 0.89 | 0.893 | 0.885 | 0.886 | 0.858 | 0.858 | 0.855 | 0.854 |
Since KNN and SVM excel on complementary metrics, we decided to combine them so that their strengths complement each other. As a result, the voting classifier achieved the best performance.
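A sketch of such an ensemble over TF-IDF features follows; the study's exact estimator list, hyperparameters, and voting scheme are assumptions.

```python
# Soft-voting ensemble combining KNN and SVM over TF-IDF features.
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(kernel="linear", probability=True)),  # probability=True enables soft voting
    ],
    voting="soft",
)
voter.fit(X_train_tfidf, y_train)  # X_train_tfidf: TF-IDF matrix from the earlier sketch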
The architecture of each sequence model consists of an embedding layer, followed by recurrent layers with defined dropout and recurrent dropout, and a dense layer of 5 neurons (one per class) with a softmax activation function to classify the document. The table below reports the performance of each model on the predefined metrics. GRU performs best among the sequence models, as it captures long-term dependencies better than RNN and converges faster than LSTM, before overfitting sets in.
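A Keras sketch of this architecture for the GRU variant is shown below; the vocabulary size, embedding dimension, and unit count are illustrative assumptions.

```python
# Sketch of the embedding -> recurrent -> dense-softmax sequence model.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GRU

model = Sequential([
    Embedding(input_dim=20000, output_dim=100),  # could load GloVe/FastText weights here
    GRU(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(5, activation="softmax"),              # one neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```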
To fine-tune the BERT models, we used the PyTorch library to perform additional data pre-processing steps that put the data into a format suitable for the transformer. This pipeline included tokenization, padding, and truncation to ensure uniform formatting. During training, a dense layer was added on top of the pre-trained BERT model, whose layers were frozen, to classify the document.
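The following is a minimal PyTorch sketch of this frozen-BERT-plus-dense-head setup; the checkpoint name is one plausible choice from the models listed below, and the sample input is hypothetical.

```python
# Frozen pre-trained BERT encoder with a trainable dense classification head.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

class BertClassifier(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        for param in self.bert.parameters():   # freeze the pre-trained layers
            param.requires_grad = False
        self.head = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] token representation
        return self.head(cls)

# Tokenization with padding and truncation for uniform formatting
batch = tokenizer(["sample discharge summary ..."], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
logits = BertClassifier()(batch["input_ids"], batch["attention_mask"])
```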
The table below shows the performance of each model; the ClinicalBERT transformer outperforms all other transformer variants because it was trained on discharge summary documents, which share the characteristics of our dataset.
Type | Model | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|---|
Sequence Models | RNN | 0.61 | 0.55 | 0.50 | 0.52 |
 | LSTM | 0.80 | 0.80 | 0.78 | 0.79 |
 | GRU | 0.88 | 0.88 | 0.88 | 0.88 |
Transformers | BERT Base | 0.70 | 0.75 | 0.71 | 0.70 |
 | BioClinicalBERT | 0.80 | 0.80 | 0.80 | 0.80 |
 | PubMedBERT | 0.81 | 0.83 | 0.82 | 0.81 |
 | SciBERT | 0.81 | 0.84 | 0.82 | 0.82 |
 | ClinicalBERT | 0.85 | 0.85 | 0.84 | 0.84 |
The voting classifier achieved the best performance across the three categories, with an accuracy of 0.89 and an F1-score of 0.886. The GRU model came close, with an accuracy of 0.88 and an F1-score of 0.878. The lowest-scoring of the three category winners was ClinicalBERT, with an accuracy of 0.85 and an F1-score of 0.84. Deep learning models and transformers did not outperform machine learning because they require a large amount of data for training or fine-tuning; nevertheless, they produced comparable results, which suggests that increasing the dataset size would improve their performance. Comparing the best results before and after applying all the model steps, accuracy improved by 35% and the F1-score by 32%, highlighting the critical role of the introduced pre-processing pipeline in improving model performance.
Effect of Pre-processing
We integrated the top-performing model into the 'CliniDocPredictor' web application demo, which provides three core features: document upload, content display, and label prediction. The frontend is built with React.js, while the backend is powered by Django. The primary goal of the demo is to showcase the model's capabilities, particularly on newly generated documents that the model has not seen during either training or testing. A recording of the application in action can be viewed in the following YouTube video.
The study evaluated and compared various classification models and feature engineering techniques to identify the best combination for accurately labeling narrative clinical documents. To address the small dataset size, data augmentation increased the dataset by 50%, and a robust pre-processing pipeline improved model performance. The top-performing model was a voting classifier with TF-IDF features, achieving an accuracy of 89% and an F1-score of 88.6%. This work serves as a proof of concept and could be a viable solution for larger-scale adoption to enhance healthcare decision-making. By keeping all relevant patient information accurately organized, such a system has the potential to reduce the workload on healthcare providers and, most importantly, save patients' lives.
After conducting our analysis, it is essential to address the research limitations. Firstly, the dataset size is a significant constraint. Access to a larger dataset is needed to improve the model’s generalizability. Additionally, incorporating more categories to include all departments in the hospital or healthcare unit, as well as documents outside the medical field, would allow the model to differentiate between medical-related documents and unrelated content, classifying the latter as 'other.' Lastly, the limited computational resources provided by the free version of Colab affected the fine-tuning of BERT models, which require substantial computational power.
A practical application is to integrate the model with a system database to automate the data entry of all clinical documents. The model takes unlabeled documents as input, labels each with its corresponding class, and feeds it into the hospital system. As a result, entering a patient's identification number retrieves the patient's history with all relevant clinical documents. The figure below explains the pipeline of this solution and how its parts interact.
Additionally, researchers will benefit from well-organized documents to achieve more effective research outcomes. Finally, with minor adjustments to pre-processing techniques and fine-tuning of the relevant BERT models, the approach can be applied to various domains, not just the medical field.