This project presents a machine learning approach to sales forecasting using a Random Forest Classifier. The primary objective is to predict sales categories from historical sales data and key business features such as product type, seasonality, promotions, and regional performance. The model relies on feature engineering and exploratory data analysis (EDA) to identify patterns and trends in the dataset. Random Forest was selected for its robustness and its ability to handle non-linear relationships and feature interactions. Performance was assessed using accuracy, precision, recall, and the confusion matrix. The results suggest that the model can provide useful insight into future sales performance, supporting more informed business planning and decision-making.
Data Collection
The dataset was collected from historical sales records, including features such as product categories, store locations, promotional activity, seasonality indicators, and past sales performance.
Data Preprocessing
Missing values were identified and filled using imputation.
Categorical variables were encoded using label encoding and one-hot encoding where necessary.
Outliers were detected and treated to reduce skewness in the data.
Numerical features were scaled to ensure uniformity and improve model performance.
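The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names, the toy data, the IQR clipping rule, and the median / most-frequent imputation strategies are all assumptions, and only one-hot encoding is shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative, not from the project.
df = pd.DataFrame({
    "region": ["North", "South", np.nan, "East", "North", "South"],
    "promo_spend": [100.0, 120.0, np.nan, 90.0, 5000.0, 110.0],
    "units_last_month": [20.0, 25.0, 22.0, np.nan, 21.0, 24.0],
})
numeric_cols = ["promo_spend", "units_last_month"]
categorical_cols = ["region"]


def clip_outliers_iqr(frame, cols, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (one common outlier treatment)."""
    out = frame.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])  # NaN values are skipped
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


clipped = clip_outliers_iqr(df, numeric_cols)

# Numeric columns: impute with the median, then standardize.
# Categorical columns: impute with the most frequent value, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(clipped)
print(X.shape)
```

In practice the transformer would be fitted on the training split only and then applied to the test split, so that imputation values and scaling statistics do not leak information from the held-out data.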
Exploratory Data Analysis (EDA)
Statistical summaries and visualizations (such as heatmaps, bar charts, and box plots) were used to understand the distribution and relationships among variables.
Correlation analysis helped in identifying the most influential features affecting sales.
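As an illustration of the correlation step, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The synthetic data and the column names (`promo_spend`, `price`, `unrelated`, `sales`) are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Synthetic data in which sales depend strongly on promo_spend and weakly on price.
rng = np.random.default_rng(0)
n = 200
promo = rng.uniform(0, 100, n)
price = rng.uniform(5, 20, n)
df = pd.DataFrame({
    "promo_spend": promo,
    "price": price,
    "unrelated": rng.normal(0, 1, n),
    "sales": 3.0 * promo - 2.0 * price + rng.normal(0, 5, n),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank features by strength of linear association with the target.
ranked = corr["sales"].drop("sales").abs().sort_values(ascending=False)
print(ranked)
```

The `corr` matrix is also what a heatmap would display. Note that Pearson correlation only captures linear association, so a feature with a weak correlation can still matter to a tree-based model.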
Feature Engineering
New features were derived to capture seasonality (e.g., month, day of week, holiday flags).
Interaction features between products and stores were created to enhance the model’s predictive power.
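The derived features described above might look like the following sketch. The column names, the example dates, and the holiday calendar are hypothetical; a real project would use its own schema and calendar.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-24", "2023-12-25", "2024-01-01", "2024-01-02"]),
    "store_id": ["S1", "S1", "S2", "S2"],
    "product_type": ["toys", "toys", "food", "toys"],
})

# Hypothetical holiday calendar.
holidays = pd.to_datetime(["2023-12-25", "2024-01-01"])

# Seasonality features derived from the date.
sales["month"] = sales["date"].dt.month
sales["day_of_week"] = sales["date"].dt.dayofweek  # Monday=0 ... Sunday=6
sales["is_holiday"] = sales["date"].isin(holidays).astype(int)

# Simple product-store interaction: one categorical value per (store, product) pair.
sales["store_product"] = sales["store_id"] + "_" + sales["product_type"]

print(sales)
```

The combined `store_product` column would still need to be encoded before training, like any other categorical feature.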
Model Selection
The Random Forest Classifier was chosen for its robustness, its ability to handle large feature sets, and its relative resistance to overfitting compared with a single decision tree.
The classification approach was used to categorize sales into different predefined levels (e.g., Low, Medium, High).
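One way to turn continuous sales figures into such levels is to bin them. The sketch below uses terciles via `pandas.qcut`, which is an assumption: the source does not state how the Low, Medium, and High thresholds were defined. Fixed business thresholds could be applied with `pandas.cut` instead.

```python
import pandas as pd

# Hypothetical sales values.
sales = pd.Series([120, 340, 560, 90, 410, 780, 230, 650, 300], name="sales")

# Split into three equally populated bins and label them.
sales_level = pd.qcut(sales, q=3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"sales": sales, "level": sales_level}))
```

Quantile-based bins give balanced classes by construction, which keeps accuracy easy to interpret; fixed thresholds may be more meaningful to the business but can produce imbalanced classes.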
Model Training and Validation
The dataset was split into training and testing sets (e.g., 80/20 split).
Cross-validation was applied to fine-tune hyperparameters such as the number of estimators and maximum depth.
Performance was evaluated using accuracy, precision, recall, F1-score, and the confusion matrix.
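The training and validation procedure above could be sketched with scikit-learn as follows. The synthetic dataset, the parameter grid, the five-fold setting, and the macro-F1 scoring choice are assumptions for illustration, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for Low / Medium / High sales levels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# 80/20 split, stratified so each class keeps its proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validated search over the number of trees and the maximum depth.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("best params:", search.best_params_)
print("accuracy:", acc)
print(cm)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```

Keeping the test set out of the hyperparameter search means the reported test metrics are not influenced by the tuning step.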
Model Evaluation
The final model was tested on unseen data to evaluate generalization performance.
Feature importance was extracted from the Random Forest to interpret the most significant predictors.
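Extracting feature importance from a fitted forest might look like the sketch below. The data is synthetic and the feature names are hypothetical; the target is built to depend only on `promo_flag`, so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; only promo_flag actually determines the target.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "promo_flag": rng.integers(0, 2, n),
    "month": rng.integers(1, 13, n),
    "noise": rng.normal(size=n),
})
y = np.where(X["promo_flag"] == 1, "High", "Low")

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.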
Conclusion and Insights
The model showed high accuracy in classifying sales categories and highlighted key drivers influencing sales trends.
These insights can be used by businesses to optimize inventory planning, marketing strategies, and resource allocation.
To predict continuous sales values, the sales forecasting model was developed using a Random Forest Regressor and evaluated through K-Fold Cross-Validation to assess its robustness and generalization capability. The key evaluation metrics, averaged across folds, are summarized below:
Average Mean Squared Error (MSE):
4247.37
Taking the square root gives a root mean squared error (RMSE) of about 65.17 units, which indicates that the model’s predictions are generally close to the actual sales values, with few large errors.
Average R-squared (R²) Score:
0.9911
The high R² score demonstrates that the model explains approximately 99.1% of the variance in the sales data, suggesting excellent predictive performance.
Average Mean Absolute Error (MAE):
36.64
On average, the model's predictions deviate from the actual values by around 36.64 units, reflecting strong accuracy.
These results confirm that the Random Forest model effectively captures complex relationships within the data and can be reliably used for forecasting future sales. The high R-squared and low error values make it a valuable tool for data-driven business decision-making.
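The K-Fold evaluation of the regressor could be reproduced in outline as follows. The data here is synthetic, and the number of folds (5), the shuffling, and the number of trees are assumptions, since the source does not state them; the printed values will therefore not match the figures reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data standing in for the sales dataset.
X, y = make_regression(
    n_samples=300, n_features=6, n_informative=4, noise=10.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports errors as negated scores so that higher is always better.
scoring = {
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
scores = cross_validate(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring=scoring,
)

# Average each metric across folds, flipping the sign of the error metrics.
avg_mse = -scores["test_mse"].mean()
avg_mae = -scores["test_mae"].mean()
avg_r2 = scores["test_r2"].mean()
avg_rmse = avg_mse ** 0.5

print(f"Average MSE:  {avg_mse:.2f}")
print(f"Average RMSE: {avg_rmse:.2f}")
print(f"Average MAE:  {avg_mae:.2f}")
print(f"Average R^2:  {avg_r2:.4f}")
```

Reporting RMSE alongside MSE puts the error back in the same units as the sales values, which makes it directly comparable with the MAE.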