Time Series Analysis is a widely used method in business to extract useful information such as demand forecasts, seasonal product identification, and demand pattern categorization. Here we are going to focus on Time Series forecasting (using Statistical / Machine Learning / Deep Learning models to predict future values) & demand pattern categorization (categorizing products based on quantity and time).
In this blog, I am going to explain how we can fit many (1,000+) time series models using Statistical (Classical), Machine Learning & Deep Learning models, along with time series feature engineering & demand pattern categorization. This series will have the following 5 parts:
Part 1: Data Cleaning & Demand categorization.
Part 2: Fit Statistical Time Series models (ARIMA, ETS, Croston, etc.) using the fpp3 (tidy forecasting) R package.
Part 3: Time Series Feature Engineering using timetk R Package.
Part 4: Fit Machine Learning models (XGBoost, Random Forest, etc.) & Hyperparameter tuning using modeltime & tidymodels R packages.
Part 5: Fit Deep Learning models (NBeats & DeepAR) & Hyperparameter tuning using the modeltime & modeltime.gluonts R packages.
Let’s get started!
PS: This is not the ONLY method to tackle this problem; it is just one way of doing it.
The Data
The data I’m using is from the Food Demand Forecasting hackathon on AnalyticsVidhya. The goal of this hackathon is to forecast the number of orders for each meal/centre combo of a food delivery service. We have a total of 3,548 meal/centre combos (across 77 centres & 51 meals; not every meal is sold in every centre), meaning that 3,548 time series models will have to be fitted. In a business environment, this technique is also known as Scalable Forecasting.
Let’s import libraries.
pacman::p_load(tidyverse, magrittr) # data wrangling packages
pacman::p_load(lubridate, tsintermittent, fpp3, modeltime, timetk, modeltime.gluonts, tidymodels, modeltime.ensemble, modeltime.resample) # time series model packages
pacman::p_load(foreach, future) # parallel functions
pacman::p_load(viridis, plotly) # visualization packages
theme_set(hrbrthemes::theme_ipsum()) # set default theme
Now read in the train data (used to fit the time series models) and the submission data (the future periods we need to predict).
meal_demand_tbl <- read.csv(unz("data/raw/train_GzS76OK.zip", "train.csv")) # read the train data
new_tbl <- read.csv("data/raw/test_QoiMO9B.csv") # the data we need to forecast

skimr::skim(meal_demand_tbl %>%
              # remove id
              select(-id) %>%
              # make center & meal id factors
              mutate(center_id = factor(center_id),
                     meal_id = factor(meal_id))) # summary of the data
The Data Preprocessing
In this stage, the data preprocessing steps are performed and the data is transformed into time series data (i.e. a tsibble object: the special data structure that the fpp3 package uses to handle time series models).
The above summary shows that there are 51 types of meals sold in 77 centres, making a total of 3,548 time series, each spanning up to 145 weeks. Here we need to forecast the number of orders (num_orders) for each meal/centre combo. Furthermore, the complete_rate column shows that there are no missing values in any of the variables.
The week column holds numbers from 1–145 (the submission data continues to week 155), so we need to convert these week numbers to dates; this is why the lookup table below covers 155 weeks. We will also remove the combos (meal/centre) that do not require forecasting.
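As a quick sanity check, the number of distinct combos can be counted directly (a sketch on my part; the raw training file may contain combos beyond the 3,548 that need forecasting):
meal_demand_tbl %>%
  distinct(center_id, meal_id) %>%
  nrow() # fewer than the 77 x 51 = 3,927 possible crossings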
date_list <- tibble(id = seq(1, 155, 1),
                    week_date = seq(from = as.Date("2016-01-02"), by = "week", length.out = 155))

master_data_tbl <- meal_demand_tbl %>%
  left_join(date_list, by = c("week" = "id")) %>% # joining the dates
  inner_join(distinct(new_tbl, meal_id, center_id), by = c("meal_id", "center_id")) %>% # keep only the combos we need to forecast
  select(week_date, num_orders, everything(), -c(week, id))
Now let’s transform the train and submission data into complete data, i.e. turn the irregular time series into regular time series by inserting the missing date rows. These newly created rows have missing values for num_orders and the other variables. Hence, zero is imputed for num_orders (assuming that no sales occurred in those weeks), and the other variables are filled with their corresponding previous-week values.
For example, the following time series data (Table 1) shows that data is missing between the 4th and the 7th week. Table 2 shows the completed data with new entries for those missing weeks (i.e. weeks 5 & 6).
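The same logic can be seen on a minimal toy series (a sketch with hypothetical values; weeks 5 & 6 missing):
toy_tbl <- tsibble(
  week_date  = as.Date("2016-01-02") + 7 * c(0:3, 6), # weeks 1-4 and 7
  num_orders = c(10, 12, 9, 11, 13),
  base_price = c(120, 120, 125, 125, 130),
  index = week_date
)

toy_tbl %>%
  fill_gaps(num_orders = 0) %>% # weeks 5 & 6 appear with num_orders = 0
  fill(base_price, .direction = "down") # weeks 5 & 6 inherit week 4's base_price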
Then the emailer_for_promotion & homepage_featured variables are converted to factors.
master_data_tbl <- master_data_tbl %>%
  as_tsibble(key = c(meal_id, center_id), index = week_date) %>%
  ## num_orders missing value imputation ----
  fill_gaps(num_orders = 0, .full = end()) %>% # complete each series up to the max week date
  ## X variables missing value imputation ----
  group_by_key() %>%
  fill(emailer_for_promotion, homepage_featured, base_price, checkout_price, .direction = "down") %>% # fill the other variables with the previous week's values
  ungroup() %>%
  ## change variables to factor ----
  mutate(emailer_for_promotion = factor(emailer_for_promotion),
         homepage_featured = factor(homepage_featured))
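To verify that no implicit gaps remain after this step, tsibble's has_gaps() can be used (a quick check of my own, not part of the original pipeline):
master_data_tbl %>%
  has_gaps(.full = end()) %>%
  filter(.gaps) # expect zero rows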
A similar operation is carried out with the submission file.
## New Table (Submission file) data wrangling ----
new_tbl <- new_tbl %>%
  left_join(date_list, by = c("week" = "id")) %>% # joining the dates
  full_join(new_data(master_data_tbl, n = 10), by = c("center_id", "meal_id", "week_date")) %>%
  as_tsibble(key = c(meal_id, center_id), index = week_date) %>%
  group_by_key() %>%
  fill(emailer_for_promotion, homepage_featured, base_price, checkout_price, .direction = "down") %>% # filling other variables
  ungroup() %>%
  # change variables to factor
  mutate(emailer_for_promotion = factor(emailer_for_promotion),
         homepage_featured = factor(homepage_featured))
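Here new_data() (from tsibble) generates the next n time steps for every key, which is how the 10 future weeks (146–155) enter the submission table. A minimal sketch of what it returns:
new_data(master_data_tbl, n = 2) # the next 2 week_date rows for every meal/centre combo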
Visualizing the Time Series Food Data
Plot 1: Number of Orders by Centres
master_data_tbl %>%
  # randomly pick 4 centres
  distinct(center_id) %>%
  sample_n(4) %>%
  # join the transaction data
  left_join(master_data_tbl, by = "center_id") %>%
  group_by(week_date, center_id) %>% # aggregate to centre level
  summarise(num_orders = sum(num_orders, na.rm = T)) %>%
  as_tsibble(key = center_id, index = week_date) %>%
  fill_gaps(num_orders = 0, .full = end()) %>%
  autoplot(num_orders) +
  scale_color_viridis(discrete = T)
The above plot shows that the number of orders for Centre #24 is zero for the first few weeks; these leading zero weeks are removed. The continuous transactions after this period are kept for fitting the models.
master_data_tbl <- master_data_tbl %>%
  filter(center_id != 24) %>%
  bind_rows(master_data_tbl %>%
              filter(center_id == 24 & week_date > as.Date("2016-07-16"))) # remove entries before 2016-07-16 for centre 24
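The filter above hard-codes Centre #24. As a more general alternative (a sketch of my own, not part of the original approach), the leading all-zero weeks of every series could be trimmed programmatically:
# hypothetical alternative: drop leading zero weeks for every meal/centre series,
# assuming zeros before the first sale mean the combo was not yet on offer
trimmed_tbl <- master_data_tbl %>%
  group_by_key() %>%
  filter(cumsum(num_orders) > 0) %>%
  ungroup()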
Plot 2: Number of Orders by Meal IDs
master_data_tbl %>%
  # randomly pick 3 meals
  distinct(meal_id) %>%
  sample_n(3) %>%
  # join the transaction data
  left_join(master_data_tbl, by = "meal_id") %>%
  group_by(week_date, meal_id) %>%
  summarise(num_orders = sum(num_orders, na.rm = T)) %>%
  as_tsibble(key = meal_id, index = week_date) %>%
  fill_gaps(num_orders = 0, .full = end()) %>%
  autoplot(num_orders) +
  scale_color_viridis(discrete = T)
The above plot shows that some meals were introduced later, resulting in shorter time series. These series may need to be treated separately when setting up cross-validation.
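One way to flag such late-introduced meals is to measure the length of each series (a sketch; series_length_tbl is a hypothetical name):
series_length_tbl <- master_data_tbl %>%
  as_tibble() %>%
  group_by(meal_id, center_id) %>%
  summarise(n_weeks = n(), first_week = min(week_date), .groups = "drop") %>%
  arrange(n_weeks) # the shortest series sit at the top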