FDA Drug Marketing Category Classifier

💊 FDA Product Classification using Machine Learning and Deep Learning Models

This project presents a multi-class classification pipeline that predicts the marketing category of FDA-listed pharmaceutical products. It trains both classical ML models and deep learning models on structured product metadata and assigns each product to one of five classes: NDA, ANDA, OTC, BLA, or UNAPPROVED.

📌 Project Highlights

Task: Multi-class classification of FDA drug entries by marketing category
Input Data: Structured metadata (product name, route, substance name, year, etc.)
Models Used: Logistic Regression, Random Forest, XGBoost, Deep Learning (Keras)
Techniques: Label Encoding, Normalization, Hyperparameter Tuning, Embedding Layers
Evaluation: Confusion Matrix, ROC Curve, Accuracy, AUC
Visuals: SHAP values, Feature Importance, Distribution plots

🧪 Dataset

Source: Public FDA database (processed version available below)
📂 Dataset on Hugging Face

🛠️ Preprocessing Overview

The dataset includes text, categorical, and numeric fields. Preprocessing included:

Grouping categories (e.g., combining subtypes under main classes like NDA)
Label Encoding of categorical fields (e.g., route, substance name)
Normalization of numeric features (year, name lengths)
Missing value handling
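
The grouping and missing-value steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's actual code: the prefix-based grouping rule, the `UNKNOWN` placeholder, and the helper names `group_category` and `fill_missing` are all assumptions made for the example.

```python
from collections import Counter

# Assumed target classes; raw labels are matched by prefix (hypothetical rule).
MAIN_CLASSES = ("NDA", "ANDA", "OTC", "BLA", "UNAPPROVED")


def group_category(raw):
    """Map a raw label such as 'NDA AUTHORIZED GENERIC' to a main class."""
    text = (raw or "").strip().upper()
    for main in MAIN_CLASSES:
        # Prefix match, so 'ANDA' is never mistaken for 'NDA'.
        if text.startswith(main):
            return main
    return None  # unknown or missing label; the caller decides what to do


def fill_missing(rows, column, placeholder="UNKNOWN"):
    """Replace None or empty strings in a categorical column with a placeholder."""
    return [{**row, column: row.get(column) or placeholder} for row in rows]


# Toy rows for illustration only.
rows = [
    {"route": "ORAL", "category": "NDA AUTHORIZED GENERIC"},
    {"route": None, "category": "OTC MONOGRAPH FINAL"},
    {"route": "TOPICAL", "category": "unapproved drug other"},
]
rows = fill_missing(rows, "route")
labels = [group_category(row["category"]) for row in rows]
print(labels)           # ['NDA', 'OTC', 'UNAPPROVED']
print(Counter(labels))  # class counts after grouping
```

Counting the grouped labels is a quick way to see how much imbalance remains after subtypes are merged.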

✨ Feature Engineering

Categorical variables encoded using LabelEncoder
Numerical features standardized using mean-std scaling (subtract the mean, divide by the standard deviation)
Feature importance visualized using SHAP and built-in model metrics
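
To make the two encoding steps concrete, here is a small sketch of what label encoding and mean-std scaling compute. It is written in plain Python for illustration rather than with the scikit-learn classes the project uses; the helper names `label_encode` and `standardize` and the sample values are assumptions.

```python
import math


def label_encode(values):
    """Assign each distinct value an integer code, numbered in sorted order.

    scikit-learn's LabelEncoder also numbers classes by sorted order.
    """
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return [index[v] for v in values], classes


def standardize(values):
    """Mean-std scaling: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # constant columns: avoid dividing by zero
    return [(v - mean) / std for v in values]


# Toy values for illustration only.
routes = ["ORAL", "TOPICAL", "ORAL", "INTRAVENOUS"]
codes, classes = label_encode(routes)
print(classes)  # ['INTRAVENOUS', 'ORAL', 'TOPICAL']
print(codes)    # [1, 2, 1, 0]

years = [2000, 2010, 2020]
scaled_years = standardize(years)
print(scaled_years)  # centred on 0 with unit variance
```

In practice the mean and standard deviation would be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.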

🎯 Model Training

Logistic Regression

Baseline model for reference

Random Forest

Tuned using GridSearchCV

XGBoost Classifier

Strong performance, especially for BLA and NDA

Deep Learning (Keras)

Initial model built from dense layers, later improved with Embedding layers

Improved Deep Learning

Adds embedding layers and a tuned architecture; generalizes better than the initial dense model
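
As a rough illustration of what an embedding layer contributes, the sketch below shows the lookup it performs: each integer category code selects one row of a small weight table, and the resulting vectors are concatenated with the numeric features before the dense layers. This is plain Python rather than Keras, the table values are random stand-ins for learned weights, and the table sizes and helper names are assumptions, not the project's architecture.

```python
import random


def make_embedding_table(num_categories, dim, seed=0):
    """Create a (num_categories x dim) table of small random weights.

    In a trained network these rows would be learned; here they are
    random placeholders.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]


def embed_row(category_codes, tables, numeric_features):
    """Look up one vector per categorical field, then append numeric features."""
    vector = []
    for code, table in zip(category_codes, tables):
        vector.extend(table[code])  # embedding lookup is row selection
    vector.extend(numeric_features)
    return vector


# Assumed sizes: 3 distinct routes -> 2-d vectors, 5 substances -> 3-d vectors.
route_table = make_embedding_table(num_categories=3, dim=2, seed=1)
substance_table = make_embedding_table(num_categories=5, dim=3, seed=2)

features = embed_row(
    category_codes=[1, 4],         # route code 1, substance code 4
    tables=[route_table, substance_table],
    numeric_features=[0.0, -1.2],  # e.g. standardized year and name length
)
print(len(features))  # 7 = 2 + 3 + 2
```

Compared with feeding raw label-encoded integers to dense layers, the lookup avoids implying an ordering between categories, which is one common reason embeddings help on high-cardinality fields.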

📊 Distribution Plots

📁 Model & Dataset Access

📦 Dataset on Hugging Face
🤖 All Trained Models on Hugging Face

📈 Evaluation Summary

Model	Accuracy	ROC AUC	Notes
Logistic Regression	~0.83	0.81	Baseline
Random Forest	~0.96	0.97	Best for UNAPPROVED/NDA
XGBoost	~0.97	0.98	Consistently strong
Deep Learning	~0.96	0.97	Good with tuning
DL (Improved)	~0.99+	0.99+	Best overall generalization
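
For readers who want to see how the accuracy and confusion-matrix figures are derived, here is a minimal sketch in plain Python. The labels below are toy data for illustration only, not the project's predictions, and the helper names are assumptions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix, labels):
    """Recall per class: diagonal count divided by the row total."""
    return {
        label: (matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0)
        for i, label in enumerate(labels)
    }


# Toy labels for illustration only.
labels = ["NDA", "ANDA", "OTC"]
y_true = ["NDA", "NDA", "ANDA", "OTC", "OTC", "OTC"]
y_pred = ["NDA", "ANDA", "ANDA", "OTC", "OTC", "NDA"]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)            # [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
print(accuracy(cm))  # 0.6666666666666666
print(per_class_recall(cm, labels))
```

Per-class recall is worth reporting alongside overall accuracy because, with imbalanced categories, a high accuracy can hide weak performance on a small class.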

🔍 Key Takeaways & Future Work

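The improved deep learning model, which uses embedding layers, outperformed the traditional ML models; the initial dense model scored slightly below XGBoost
Category imbalance was addressed by grouping subtypes into main classes; future work could explore SMOTE oversampling
SHAP values showed that the route and proprietary name features had a strong impact on predictions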

Future Directions:

Experiment with BERT-style tabular encoders or TabNet
Integrate external drug description texts for enrichment
Add explainability dashboard (Gradio or Streamlit)

🧠 Summary

This project demonstrates an end-to-end pipeline for classifying real-world regulatory data with both traditional ML and deep learning. It covers preprocessing, feature importance analysis, visual evaluation, and publication of the trained models.

🎯 Deployed models and data are hosted on Hugging Face for public access and further experimentation.
