I propose a robust hierarchical machine learning pipeline that classifies e-commerce products into both top-level and fine-grained bottom-level categories. This is based on multimodal metadata (text, categorical, numeric). The pipeline leverages advanced feature engineering, dimensionality reduction, and boosting-based classification techniques.
Hereโs a clean academic-style flowchart of the full pipeline:

The project started by aggregating over 350+ Parquet files (train + test), performing cleaning and enrichment on product metadata.
import glob import pandas as pd files = glob.glob("/data/train/*.parquet") etsy_train_df = pd.concat([pd.read_parquet(f) for f in files]) # Cleaning text fields def clean_text(text): text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()) return " ".join([lemmatizer.lemmatize(w) for w in word_tokenize(text) if w not in stop_words]) etsy_train_df['description_clean'] = etsy_train_df['description'].apply(clean_text)
I process three types of features:
Then I perform dimensionality reduction using TruncatedSVD:
from sklearn.decomposition import TruncatedSVD X_text = tfidf_vectorizer.fit_transform(etsy_train_df['combined_text']) svd = TruncatedSVD(n_components=300, random_state=42) X_text_svd = svd.fit_transform(X_text)
I first predict the top category using XGBoost with 5-Fold Cross Validation:
model = xgb.XGBClassifier(tree_method='gpu_hist', n_estimators=300, max_depth=15) scores = cross_val_score(model, X_train_svd, y_top, cv=5, scoring='f1_macro')



For each predicted top category, I train a dedicated classifier (XGBoost/LightGBM) using the same reduced feature space plus top category probabilities as features.
for top_cat in unique_top_categories: X_top = X_train_enhanced[y_top_train == top_cat] y_bottom = y_bottom_train[y_top_train == top_cat] clf = lgb.LGBMClassifier(...) clf.fit(X_top, y_bottom)
This shows how predicted top categories relate to bottom categories:

| Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Top Category (XGBoost) | 80.88% | 0.7778 | 0.8052 |
| Bottom Category (Hierarchical) | ~25.19% | ~0.2464 | Varies |


๐ฎ In future work, I aim to:
Due to NDA constraints, the Etsy dataset cannot be shared. For access, please request permission from the original data provider.
โ ๏ธ Dataset not available due to NDA restrictions with Etsy Inc.
This project delivers a scalable and interpretable hierarchical product categorization system. By breaking down the classification task into top and bottom predictions, I align with the way marketplaces organize products and demonstrate strong performance across diverse product types