Machine Learning for Retail Sales Forecasting — Features Engineering
Understand the impacts of additional features related to stock-out, store closing date, or cannibalization on a Machine Learning model for sales forecasting.
Discover the power of Machine Learning for retail sales forecasting with features engineering.
As a data scientist, how can you improve your company's forecasts?
According to feedback from the latest Makridakis Forecasting Competitions, machine learning models can reduce forecasting errors by 20% to 60% compared to benchmark statistical models.
Their major advantage is the capacity to include external features that heavily impact the variability of your sales.
For example, e-commerce cosmetics sales are driven by special events (promotions) and how you advertise a reference on the website (first page, second page, etc.).
How can you use this additional information to improve the accuracy?
Feature engineering is based on analytical concepts and business insights to understand what could drive your sales.
In this article, we will try to understand the impact of several features on the accuracy of a model using the M5 Forecasting competition dataset.
SUMMARY
I. Introduction
1. Data set
2. Initial Solution using LGBM
3. Features Analysis
II. Experiment
1. Additional features
2. Results
III. Conclusion
1. Generative AI
2. Next Steps
M5 Forecasting Dataset
Dataset of Retail Sales Transactions
This analysis will be based on the M5 Forecasting dataset of Walmart store sales records.
- 1,913 days for the training set and 28 days for the evaluation set
- 10 stores in 3 states (USA)
- 3,049 unique items in 10 stores
- 3 main categories and 7 departments (sub-category)
The objective is to predict sales for all products in each store in the following 28 days, right after the available dataset. We have to perform 30,490 forecasts for each day in the prediction horizon.
We’ll use the validation set to measure the performance of our model.
Initial Solution using Machine Learning Algorithm LGBM
As a base model, we will use a clear and concise notebook shared by Anshul Sharma on Kaggle.
The idea is to understand how we can improve the accuracy of the model only by adding additional features (without touching the hyperparameters or changing the algorithm).
In this notebook, you will find all the different steps to build a quite good model with a reasonable computing time:
- Import and processing of raw data
- Exploratory Data Analysis
- Features Engineering
  - i) Seasonality: week number, day, month, day of the week
  - ii) Pricing: the weekly price of an item in each store, special events
  - iii) Trends: sales lags (n-p days), average volume per {item, (item + store)}, …
  - iv) Categorical Variables encoding: item, store, department, category, state
- Model Training: 1 LightGBM model per store
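The "one model per store" step can be sketched as a simple loop. This is a minimal sketch, not the notebook's code: train_per_store, make_model and MeanRegressor are hypothetical names, and MeanRegressor is only a stand-in so the example runs without LightGBM installed.

```python
import numpy as np
import pandas as pd


class MeanRegressor:
    """Stand-in model (assumption): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def train_per_store(df, features, target="sold", make_model=MeanRegressor):
    """Fit one independent model per store and return them keyed by store_id."""
    models = {}
    for store_id, store_df in df.groupby("store_id"):
        model = make_model()
        model.fit(store_df[features], store_df[target])
        models[store_id] = model
    return models


# Toy data, not taken from the M5 dataset
toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "TX_1", "TX_1"],
    "sell_price": [1.0, 1.0, 2.0, 2.0],
    "sold": [2, 4, 10, 20],
})
models = train_per_store(toy, features=["sell_price"])
```

To train real models, you could pass something like make_model=lambda: lightgbm.LGBMRegressor(), which exposes the same fit and predict methods.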
Features Engineering to improve the model
To emphasize the impact of features engineering, we will not change the model and only look at which features we use.
Let us split the features used in this notebook into different buckets.
—
Bucket 1: Transactional Data
# Item id
'id', 'item_id',
# Store, Category, Department
'dept_id', 'cat_id', 'store_id', 'state_id'
# Transaction time
'd', 'wm_yr_wk', 'weekday', 'wday', 'month', 'year'
# Sales Qty, price and promotional events
'sold', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'sell_price'
events and sell_price
Capture the impact on sales of a special event on an item of selling price XXX.
What could be the impact of a special event with -20% reduction on sales of baby formula the second week of the month?
Open Question
What would be the impact on the accuracy if we do one-hot encoding for the categorical features?
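To make the question concrete, here is a minimal sketch of the two encodings on toy data (the values are made up for illustration). It only shows how the feature matrix changes; it does not answer the accuracy question.

```python
import pandas as pd

# Toy data, not taken from the M5 dataset
toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_2", "TX_1", "CA_1"],
    "sold": [3, 0, 5, 2],
})

# Option A: integer codes, one compact column per categorical feature
label_encoded = toy.assign(
    store_code=toy["store_id"].astype("category").cat.codes
)

# Option B: one-hot encoding, one binary column per category level
one_hot = pd.get_dummies(toy, columns=["store_id"], prefix="store")
```

With 3,049 unique items, one-hot encoding item_id alone would add thousands of columns, so the trade-off is memory and training time against a representation without an artificial ordering.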
—
Bucket 2: Sales Lags and Average
# Sales lag n = sales quantity of day - n
'sold_lag_1', 'sold_lag_2', 'sold_lag_3', 'sold_lag_7', 'sold_lag_14', 'sold_lag_28'
# Sales average by
'item_sold_avg', 'state_sold_avg',
'store_sold_avg', 'cat_sold_avg', 'dept_sold_avg',
# Sales average by XXX and YYY
'cat_dept_sold_avg', 'store_item_sold_avg', 'cat_item_sold_avg',
'dept_item_sold_avg', 'state_store_sold_avg',
'state_store_cat_sold_avg', 'store_cat_dept_sold_avg'
lags
Measure the week-on-week or month-on-month (7 days, 28 days) similarities to capture the periodicity of sales due to people shopping at these frequencies.
Do you have relatives going to the hypermarket every Saturday to shop for the whole week?
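A minimal sketch of how such lag features can be built with pandas, assuming a long table with one row per series (id) and day (d). The toy values and the short lag list are for illustration only; the article uses lags of 1, 2, 3, 7, 14 and 28 days.

```python
import pandas as pd

# Toy data: two series (A and B) over four days
toy = pd.DataFrame({
    "id": ["A"] * 4 + ["B"] * 4,
    "d": [1, 2, 3, 4] * 2,
    "sold": [1, 2, 3, 4, 10, 20, 30, 40],
})

toy = toy.sort_values(["id", "d"])
for lag in [1, 2]:
    # Shift within each series so a lag never leaks from one item to another
    toy[f"sold_lag_{lag}"] = toy.groupby("id")["sold"].shift(lag)
```

The first rows of each series have no history, so their lags are missing values.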
Find the full code in my GitHub repository.
Features Engineering Strategies
Additional features
Based on business insights or common sense, we will add additional features built with existing ones to help our model to capture all the key factors impacting your customer demand.
—
Bucket 3: Rolling Mean and Rolling Mean applied on lag
# Rolling mean on actual sales
'rolling_sold_mean', 'rolling_sold_mean_3', 'rolling_sold_mean_7',
'rolling_sold_mean_14', 'rolling_sold_mean_21', 'rolling_sold_mean_28'
# Rolling mean on lag sales
'rolling_lag_7_win_7', 'rolling_lag_7_win_28', 'rolling_lag_28_win_7', 'rolling_lag_28_win_28'
rolling_sold_mean_n
Measure the average sales of the last n days.
Rolling mean is sometimes used alone as a benchmark model for statistical forecasting.
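A minimal sketch of a rolling mean feature, assuming one row per series (id) and day (d). The helper add_rolling_mean is a hypothetical name, and shifting by one day so the window excludes the day being predicted is our assumption, not necessarily what the original notebook does.

```python
import pandas as pd

# Toy data: a single series over five days
toy = pd.DataFrame({
    "id": ["A"] * 5,
    "d": [1, 2, 3, 4, 5],
    "sold": [2, 4, 6, 8, 10],
})


def add_rolling_mean(df, window, group_col="id", target="sold"):
    """Add the mean of the previous `window` days of sales, per series."""
    out = df.sort_values([group_col, "d"]).copy()
    out[f"rolling_sold_mean_{window}"] = (
        out.groupby(group_col)[target]
        # shift(1) keeps the current day's sales out of its own feature
        .transform(lambda s: s.shift(1).rolling(window).mean())
    )
    return out


features = add_rolling_mean(toy, window=3)
```

On day 4 the feature is the mean of days 1 to 3, and it stays empty until a full window of history is available.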
rolling_lag_n_win_p
Measure the average sales over a p-day window ending n days ago.
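A minimal sketch of this feature: shift the series by n days, then average over a window of p days. The helper add_rolling_lag is a hypothetical name and the toy data is for illustration only.

```python
import pandas as pd

# Toy data: a single series over eight days
toy = pd.DataFrame({
    "id": ["A"] * 8,
    "d": list(range(1, 9)),
    "sold": [1, 2, 3, 4, 5, 6, 7, 8],
})


def add_rolling_lag(df, lag, window, group_col="id", target="sold"):
    """Add the mean of a `window`-day period that ends `lag` days ago."""
    out = df.sort_values([group_col, "d"]).copy()
    out[f"rolling_lag_{lag}_win_{window}"] = (
        out.groupby(group_col)[target]
        .transform(lambda s: s.shift(lag).rolling(window).mean())
    )
    return out


features = add_rolling_lag(toy, lag=2, window=3)
```

With lag=2 and window=3, the value on day 5 is the mean of days 1, 2 and 3.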
BUSINESS INSIGHTS
Sunglasses seasonality
If the rolling mean of the last 7 days is 35% higher than the average sales of the week before, that means you have started the summer season.
—
Bucket 4: Sales Trend and Rolling Maximum
# Selling Trend
'selling_trend', 'item_selling_trend',
# Rolling max
'rolling_sold_max', 'rolling_sold_max_1', 'rolling_sold_max_2', 'rolling_sold_max_7', 'rolling_sold_max_14', 'rolling_sold_max_21',
'rolling_sold_max_28'
Selling trend
Measure the gap between the daily sales and the average.
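A minimal sketch of a selling trend, assuming it is the daily sales minus the average daily sales of the same series. The exact grouping used for selling_trend and item_selling_trend in the original code is not shown in the article, so the grouping by id here is an assumption.

```python
import pandas as pd

# Toy data: two series over three days
toy = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "d": [1, 2, 3, 1, 2, 3],
    "sold": [1, 2, 3, 10, 10, 40],
})

# Average daily sales of each series over the available history
avg_sold = toy.groupby("id")["sold"].transform("mean")

# Positive when the day sold more than the series average, negative otherwise
toy["selling_trend"] = toy["sold"] - avg_sold
```

Because this version uses same-day sales, a production forecast would have to build it from lagged sales, and compute the average on the training period only.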
Rolling max
What were the maximum daily sales in the last n days?
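A minimal sketch of a rolling maximum, assuming one row per series (id) and day (d). The helper add_rolling_max is a hypothetical name, and excluding the current day with a one-day shift is our assumption.

```python
import pandas as pd

# Toy data: a single series over six days
toy = pd.DataFrame({
    "id": ["A"] * 6,
    "d": [1, 2, 3, 4, 5, 6],
    "sold": [1, 5, 2, 3, 0, 4],
})


def add_rolling_max(df, window, group_col="id", target="sold"):
    """Add the maximum daily sales over the previous `window` days, per series."""
    out = df.sort_values([group_col, "d"]).copy()
    out[f"rolling_sold_max_{window}"] = (
        out.groupby(group_col)[target]
        # shift(1) keeps the current day's sales out of its own feature
        .transform(lambda s: s.shift(1).rolling(window).max())
    )
    return out


features = add_rolling_max(toy, window=3)
```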
Spoiler: this feature will have an important impact on your accuracy.
—
Bucket 5: Stock-Out and Store Closed
# Stock-out id
'stock_out_id'
# Store closed
'store_closed'
stock-out
Flag the days when zero sales are explained by stock availability issues.
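The article does not show how these two flags are derived, and this summary of the dataset lists no inventory records. The sketch below therefore relies on two assumptions of ours: a stock-out is a run of at least min_run consecutive zero-sales days, and a store is closed when it sold nothing at all on a given day. Both helper names are hypothetical.

```python
import pandas as pd


def flag_stock_out(sold, min_run=3):
    """Return 1 on days inside a run of >= min_run consecutive zero-sales days."""
    is_zero = sold.eq(0)
    # A new run starts whenever the zero / non-zero state changes
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    run_len = is_zero.groupby(run_id).transform("size")
    return (is_zero & (run_len >= min_run)).astype(int)


def flag_store_closed(df):
    """Return 1 on days where the whole store sold nothing."""
    store_day_total = df.groupby(["store_id", "d"])["sold"].transform("sum")
    return store_day_total.eq(0).astype(int)


# Toy series: one three-day run of zeros and one isolated zero
sold = pd.Series([3, 0, 0, 0, 2, 0, 1])
stock_out_id = flag_stock_out(sold)

# Toy store: two items, nothing sold on day 1
store_sales = pd.DataFrame({
    "store_id": ["CA_1"] * 4,
    "d": [1, 1, 2, 2],
    "sold": [0, 0, 0, 3],
})
store_closed = flag_store_closed(store_sales)
```

On a full table, flag_stock_out would be applied per series, for example with groupby("id")["sold"].transform(flag_stock_out). The threshold min_run is a tuning choice: too low and slow movers look like permanent stock-outs.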
—
Bucket 6: Price Relative to the same item in other stores or other items in the sub-category
# Relative delta price with the same item in other stores
'delta_price_all_rel'
# Relative delta price with the previous week
'delta_price_weekn-1'
# Relative delta price with the other items of the sub-category
'delta_price_cat_rel'
delta_price_weekn-1
Capture the price evolution week by week.
BUSINESS INSIGHTS
Promotions for Slow Movers
In order to reduce their inventory and purge slow movers, stores may apply aggressive pricing to boost sales.
delta_price_all_rel: Sales Cannibalization at store level
Several stores competing for sales of the same item because of price difference.
delta_price_cat_rel: Sales Cannibalization at sub-category level
Several items of the same sub-category competing for sales.
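A minimal sketch of the three price features on toy weekly prices. The reference price in each case (mean across stores, mean across the department) and the division by that reference are assumptions; the original formulas are not shown in the article.

```python
import pandas as pd

# Toy weekly prices: item X in two stores, item Y in one store
prices = pd.DataFrame({
    "item_id": ["X", "X", "X", "X", "Y", "Y"],
    "store_id": ["CA_1", "CA_1", "TX_1", "TX_1", "CA_1", "CA_1"],
    "dept_id": ["FOODS_1"] * 6,
    "wm_yr_wk": [1, 2, 1, 2, 1, 2],
    "sell_price": [2.0, 1.0, 2.0, 3.0, 4.0, 4.0],
})
prices = prices.sort_values(["item_id", "store_id", "wm_yr_wk"])

# Relative change versus the previous week for the same item and store
prev = prices.groupby(["item_id", "store_id"])["sell_price"].shift(1)
prices["delta_price_weekn-1"] = (prices["sell_price"] - prev) / prev

# Relative gap to the mean price of the same item across stores that week
all_mean = prices.groupby(["item_id", "wm_yr_wk"])["sell_price"].transform("mean")
prices["delta_price_all_rel"] = (prices["sell_price"] - all_mean) / all_mean

# Relative gap to the mean price of the department in the same store and week
cat_mean = (
    prices.groupby(["store_id", "dept_id", "wm_yr_wk"])["sell_price"]
    .transform("mean")
)
prices["delta_price_cat_rel"] = (prices["sell_price"] - cat_mean) / cat_mean
```

A negative delta_price_all_rel means the item is cheaper in this store than on average elsewhere, which is the situation where cannibalization between stores could show up.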
Results
After running a training loop over the six buckets, adding one bucket at each step (with the same hyperparameters as the Kaggle notebook), we have the following results:
—
STEP 1 to STEP 2: -29% RMSE Error
Sales lags improve the accuracy of your model.
BUSINESS INSIGHTS
Your sales of today are highly impacted by previous days' sales.
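The percentages in this section compare the RMSE of two consecutive steps. The article does not state the exact formula, so the sketch below assumes the relative change is (new - old) / old. The predictions are toy values, not the experiment's outputs.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def relative_change(previous, current):
    """Relative change from one step to the next (negative = improvement)."""
    return (current - previous) / previous


# Toy actuals and two toy sets of predictions
y_true = [3.0, 0.0, 4.0, 0.0]
rmse_step_1 = rmse(y_true, [0.0, 0.0, 0.0, 0.0])
rmse_step_2 = rmse(y_true, [3.0, 0.0, 0.0, 0.0])
change = relative_change(rmse_step_1, rmse_step_2)
```

Under this definition a reduction can never exceed 100%, so any figure beyond that would imply a different convention, for example dividing by the new RMSE instead of the old one.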
—
STEP 2 to STEP 3: -118% RMSE Error
BUSINESS INSIGHTS
The top 3 features are all related to the sales of the last three days.
Question
Based on this insight, what could be the performance of a model like Exponential Smoothing, which uses a weighted sum of previous sales to compute the forecasts?
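For reference, here is a minimal sketch of simple exponential smoothing in plain Python. The smoothing factor alpha=0.5, the initialisation with the first observation and the toy sales are assumptions chosen for illustration; this was not part of the experiment.

```python
def simple_exponential_smoothing(sales, alpha=0.5):
    """Return one-step-ahead forecasts, including one for the next period.

    forecasts[t] is the forecast for period t made from sales before t:
    forecasts[t + 1] = alpha * sales[t] + (1 - alpha) * forecasts[t]
    """
    forecasts = [float(sales[0])]  # initialise with the first observation
    for actual in sales:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


# Toy sales for three days; the last value is the forecast for day 4
forecasts = simple_exponential_smoothing([10, 20, 30], alpha=0.5)
```

The weight given to a past day shrinks geometrically with its age, so with a high alpha the forecast depends mostly on the last few days, much like the top features found here.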
—
STEP 3 to STEP 4: -12% RMSE Error
Rolling max features take the lead in the feature importance ranking.
—
STEP 4 to STEP 5: -0.1% RMSE Error
BAM!
I am devastated to see that the potential main added value of this article, showing the impact of stock-out or store closing, has a limited impact on the accuracy of the model.
—
STEP 5 to STEP 6: -1.75% RMSE Error
The model accuracy is slightly better, but none of the added features appear in the top 20.
Conclusion
This analysis shows the positive impact of sales lags, rolling max, and other features on the model’s accuracy.
Understand the results
The results in terms of model accuracy are quite satisfying.
However, it is frustrating to see that some of the newly added features have almost no effect on the model's performance.
Therefore, the next step will be to work on these features and the model (let’s remember that we did not touch the initial model here) to see if there is any possibility of using these features to forecast your sales better.
Implement Inventory Management Rules
Now that you have your forecasting model, you need to implement an inventory management rule to manage store replenishment.
You can take inspiration from these three articles, where we try to implement rules assuming a deterministic or stochastic demand.
Generative AI: Machine Learning x GPT
After the recent adoption of Large Language Models (LLMs) like GPT, we can enhance the user experience of analytics products with intelligent agents.
In this article, I shared my first experiment, the design of a LangChain Agent connected to a TMS.
The outputs are impressive, as we have an agent that can answer operational questions by querying a database autonomously.
What if we create a super agent for Inventory Management?
My objective is to equip a GPT agent with:
- Python Scripts of Inventory Rules and Light Forecasting Models
- Context, articles, and knowledge about Forecasting, Demand Planning, and Inventory Management
The result would be an agent that can find the proper inventory rules, set the safety stock, and test it with a light demand forecasting model.
About Me
Let’s connect on LinkedIn and Twitter. I am a Supply Chain Engineer using data analytics to improve logistics operations and reduce costs.
For consulting or advice on analytics and sustainable supply chain transformation, feel free to contact me via Logigreen Consulting.
If you are interested in Data Analytics and Supply Chain, have a look at my website.