Project Description
The Time Series Regression Analysis project gave me hands-on experience combining Python data analysis and machine learning to predict future sales for the assigned company, Corporation Favorita. The company wants to optimize its inventory management by accurately forecasting demand for its products across its stores in Ecuador, so that each store holds the right quantity of stock to meet customer demand while minimizing overstocking and stockouts. To support this, I built a series of machine learning models that forecast product demand at the various store locations, using data provided by the marketing and sales team.
Introduction
The task was to forecast product sales for Corporacion Favorita, an Ecuadorean grocery chain with hundreds of stores and thousands of unique products.
As part of the project, the following processes were followed:
Initial data import, cleanup, and overview.
Exploratory Data Analysis (EDA), which provided more insight and uncovered the need for further data cleaning.
Model development (data processing, algorithm selection, model building, and choosing the best-performing model).
Model evaluation and comparison.
Project Objective
The objective is to build machine learning models that can predict unit sales for different product families at Favorita stores accurately. These models will help optimize inventory levels, improve sales forecasting accuracy, and ultimately enhance customer satisfaction by ensuring product availability.
Hypothesis & Questions
The following hypotheses and questions were developed to achieve the project objective.
Hypothesis
Null Hypothesis (Ho): Holidays do not have a significant effect on the sales.
Alternate Hypothesis (Ha): Holidays have a significant effect on the sales.
Questions
The following questions guided the regression analysis:
Is the train dataset complete (has all the required dates)?
Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
Compare the sales for each month across the years and determine which month of which year had the highest sales.
Did the earthquake impact sales?
Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
Are sales affected by promotions, oil prices and holidays?
What analysis can we get from the date and its extractable features?
Which product families and stores did the promotions affect?
What is the difference between RMSLE, RMSE, and MSE (or why is the MAE greater than all of them)?
Does the payment of public-sector wages on the 15th and last day of the month influence store sales?
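The first question, whether the train dataset contains every required date, can be checked by comparing the observed dates against a full daily calendar. The snippet below is a minimal sketch on made-up data: the `train` frame, its `date` and `sales` columns, and the values in it are assumptions for illustration, not the project's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the training data: daily rows with one date missing.
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-04", "2017-01-05"]),
    "sales": [120.0, 95.5, 130.25, 110.0],
})

# Full daily calendar between the first and last observed dates.
expected_dates = pd.date_range(train["date"].min(), train["date"].max(), freq="D")

# Calendar dates that never appear in the data.
missing_dates = expected_dates.difference(pd.DatetimeIndex(train["date"]))

print(list(missing_dates.strftime("%Y-%m-%d")))
```

Any dates reported as missing can then be inspected to decide whether the store was closed or whether rows need to be added and imputed.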
Data Import, Clean-up, and Exploratory Data Analysis (EDA)
The following libraries were imported to aid EDA, analysis, and machine learning models.
Feature Engineering
To optimize stock management and predict product demand accurately, the data must first be processed and new features engineered from it. Feature processing covers cleaning, transforming, and preparing the raw data, including checking that the data is complete and imputing any missing values. Feature engineering creates new features or variables from the existing data that are more useful for modeling, such as features extracted from the date. These features enable the model to learn seasonal trends and adjust its predictions accordingly.
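As an illustration of engineering features from the date, the sketch below derives a few calendar columns with pandas. The function name `add_date_features`, the column names, and the sample rows are hypothetical; the payday flag follows the question above about wages paid on the 15th and the last day of the month.

```python
import pandas as pd

def add_date_features(df, date_col="date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day_of_month"] = dates.dt.day
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = dates.dt.dayofweek >= 5
    # Flag the 15th and the last day of each month (public-sector paydays).
    out["is_payday"] = (dates.dt.day == 15) | dates.dt.is_month_end
    return out

# Made-up sample rows for demonstration only.
sample = pd.DataFrame({
    "date": ["2017-08-14", "2017-08-15", "2017-08-31"],
    "sales": [10.0, 12.0, 15.0],
})
features = add_date_features(sample)
print(features[["date", "day_of_week", "is_weekend", "is_payday"]])
```

Features like these give tree-based and linear models a way to pick up weekly and monthly patterns that a raw date column does not expose.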
Hypothesis Testing
Null Hypothesis (Ho): Holidays do not have a significant effect on the sales.
Alternate Hypothesis (Ha): Holidays have a significant effect on the sales.
I used the Mann-Whitney U test to test the hypothesis because the sales data has a high positive skewness and is not normally distributed. The test rejected the null hypothesis, which indicates that holidays do have a significant effect on sales.
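A Mann-Whitney U test compares two samples by rank and does not assume normality, which suits skewed sales data. The sketch below shows the mechanics with `scipy.stats.mannwhitneyu` on synthetic, right-skewed samples; the data, the sample sizes, and the 0.05 significance level are assumptions for illustration and do not reproduce the project's result.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic right-skewed daily sales (log-normal), NOT the Favorita data.
holiday_sales = rng.lognormal(mean=6.3, sigma=0.5, size=200)
regular_sales = rng.lognormal(mean=6.0, sigma=0.5, size=800)

# Two-sided test: do the two samples come from the same distribution?
statistic, p_value = mannwhitneyu(holiday_sales, regular_sales, alternative="two-sided")

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"U statistic: {statistic:.1f}, p-value: {p_value:.3g}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

A p-value below the chosen significance level leads to rejecting the null hypothesis; otherwise the data do not provide enough evidence of a holiday effect.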
Machine Learning and Modeling
At this stage of the project, five models were trained and validated: linear regression, XGBRegressor, ARIMA, SARIMA, and Prophet.
The following evaluation metrics were used to evaluate the performance of the models.
— Mean Absolute Error (MAE)
— Mean Squared Error (MSE)
— Root Mean Squared Error (RMSE)
— Root Mean Squared Logarithmic Error (RMSLE)
A results DataFrame was created to store the evaluation scores. After each model was trained, it was evaluated and its score on each metric was saved to the results DataFrame, so that by the end of the process the scores for all models were collected in one place.
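The four metrics and the results table can be sketched as follows. This is a minimal illustration using NumPy and pandas, not the project's actual code: the `evaluate` helper, the model names, and the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

def evaluate(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and RMSLE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = np.mean(errors ** 2)
    # RMSLE works on log(1 + value), so negative predictions are clipped to 0.
    log_errors = np.log1p(np.clip(y_pred, 0, None)) - np.log1p(y_true)
    return {
        "MAE": np.mean(np.abs(errors)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "RMSLE": np.sqrt(np.mean(log_errors ** 2)),
    }

# Made-up actuals and predictions for two hypothetical models.
y_true = [100.0, 200.0, 300.0]
rows = [
    {"Model": "model_a", **evaluate(y_true, [110.0, 190.0, 330.0])},
    {"Model": "model_b", **evaluate(y_true, [100.0, 200.0, 360.0])},
]

# One row per model, best (lowest) RMSLE first.
results = pd.DataFrame(rows).sort_values("RMSLE").reset_index(drop=True)
print(results)
```

MAE and RMSE are in the same units as sales, and on the same predictions MAE never exceeds RMSE. MSE is in squared units and RMSLE is on a logarithmic scale, so their values are not directly comparable with MAE.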
Conclusion
The evaluation results indicate that the SARIMA model outperforms both the ARIMA and Prophet models in predicting sales for Corporation Favorita on all three metrics: MAE, RMSE, and RMSLE.
Recommendations
Consider exploring hybrid models that combine the strengths of SARIMA, ARIMA, and Prophet models to further improve forecasting accuracy.
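One simple way to explore such a hybrid is a weighted average of the individual forecasts, with weights inversely proportional to each model's validation error. The sketch below is only one possible approach and was not part of the project: the helper names and the forecast values are hypothetical, and the errors passed in are the RMSLE scores reported in this article.

```python
import numpy as np

def inverse_error_weights(validation_errors):
    """Weights proportional to 1 / error, normalized to sum to 1."""
    errors = np.asarray(validation_errors, dtype=float)
    inverse = 1.0 / errors
    return inverse / inverse.sum()

def blend(forecasts, weights):
    """Weighted average of several forecasts of equal length."""
    return np.average(np.vstack(forecasts), axis=0, weights=weights)

# RMSLE scores for SARIMA, ARIMA, and Prophet, in that order.
weights = inverse_error_weights([0.09, 0.15, 0.16])

# Made-up forecasts for two future periods, one array per model.
sarima_forecast = np.array([100.0, 110.0])
arima_forecast = np.array([120.0, 130.0])
prophet_forecast = np.array([90.0, 95.0])

hybrid_forecast = blend([sarima_forecast, arima_forecast, prophet_forecast], weights)
print(weights.round(3), hybrid_forecast.round(2))
```

Whether a blend like this beats SARIMA alone would need to be verified on held-out data.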
Key insight
1.SARIMA Model:
MAE: The SARIMA model exhibits a MAE of approximately $57,429.31. This metric indicates the average absolute difference between the predicted and actual sales values, reflecting the model’s accuracy in forecasting.
RMSE: With an RMSE of around $82,988.77, the SARIMA model has the smallest error of the three models. Because RMSE squares the errors before averaging, it penalizes large deviations from actual sales more heavily than MAE does.
RMSLE: The SARIMA model achieves an RMSLE of 0.09, highlighting the relative error between predicted and actual sales values on a logarithmic scale. A lower RMSLE suggests better accuracy in predicting sales trends.
2.ARIMA Model:
MAE: The ARIMA model yields a MAE of approximately $111,904.55, indicating a higher average error in predicting sales for Corporation Favorita compared to the SARIMA model.
RMSE: With an RMSE of around $131,392.51, the ARIMA model demonstrates a larger magnitude of error, suggesting less accuracy in capturing sales fluctuations compared to the SARIMA model.
RMSLE: The ARIMA model achieves an RMSLE of 0.15, indicating a higher relative error compared to the SARIMA model, which signifies less accuracy in sales prediction on a logarithmic scale.
3.Prophet Model
MAE: The Prophet model records the highest MAE of approximately $121,510.38, suggesting it has the highest average error in predicting sales among the three models.
RMSE: With an RMSE of around $136,012.85, the Prophet model shows the largest error in capturing sales variations, indicating it is the least effective at predicting precise sales values.
RMSLE: The Prophet model achieves an RMSLE of 0.16, which is the highest among the three models, suggesting it has the least accurate prediction of sales trends on a logarithmic scale.
Finally
Thank you so much for taking the time to read.
Find me on LinkedIn: www.linkedin.com/in/alice-mbera
Link to the project repo on GitHub: https://github.com/alicembera/Time-Series-Regression-Analysis-