Multiple Hotel Segments Demand Forecasting

July 30, 2021 Arga Adyatama

34 minute read

Background

Hospitality industry is growing, with more and more people spending their money for vacation and leisure activities. People may only lodge into a hotel when it’s a holiday season or a special event, thus the demand for staying room is not equally distributed accross the year. To maximize the revenue gained by the hotel, the management often employed a pricing strategy, one of them being raising the room rate when the demand is high and making a promo when the demand is low. Thus, the ability to accurately forecast the future demand is very important and became a vital part on the pricing scheme. The demand for different segment of customer may differ and forecasting become harder as it may requires different model for different segment.

This post will focus on fitting and tuning different forecasting models using purrr package on a real dataset .

Library and Setup

Below is the list of required packages if you wish to reproduce the results. The full source code for this post is available at my github repository .

# Data Wrangling
library(tidyverse)
library(lubridate)

# Visualization
library(ggthemes)
library(scales)

# Time Series
library(forecast)
library(tseries)
library(padr)

# Machine Learning
library(rsample)
library(recipes)

Hotel Demand Forecasting

Import Data

Let’s import the dataset. The data is acquired from Nuno et al. (2019) . The data consists of around 119,390 booking transactions from 2 hotel: an anonymous city hotel from Lisbon and a resort hotel from Algarve. The dataset comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. There is so much to explore from this data, but we will only focus on demand forecasting.

hotel <- read.csv("data_input/hotel_bookings.csv")

head(hotel, 10)

#>           hotel is_canceled lead_time arrival_date_year arrival_date_month
#> 1  Resort Hotel           0       342              2015               July
#> 2  Resort Hotel           0       737              2015               July
#> 3  Resort Hotel           0         7              2015               July
#> 4  Resort Hotel           0        13              2015               July
#> 5  Resort Hotel           0        14              2015               July
#> 6  Resort Hotel           0        14              2015               July
#> 7  Resort Hotel           0         0              2015               July
#> 8  Resort Hotel           0         9              2015               July
#> 9  Resort Hotel           1        85              2015               July
#> 10 Resort Hotel           1        75              2015               July
#>    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
#> 1                        27                         1                       0
#> 2                        27                         1                       0
#> 3                        27                         1                       0
#> 4                        27                         1                       0
#> 5                        27                         1                       0
#> 6                        27                         1                       0
#> 7                        27                         1                       0
#> 8                        27                         1                       0
#> 9                        27                         1                       0
#> 10                       27                         1                       0
#>    stays_in_week_nights adults children babies meal country market_segment
#> 1                     0      2        0      0   BB     PRT         Direct
#> 2                     0      2        0      0   BB     PRT         Direct
#> 3                     1      1        0      0   BB     GBR         Direct
#> 4                     1      1        0      0   BB     GBR      Corporate
#> 5                     2      2        0      0   BB     GBR      Online TA
#> 6                     2      2        0      0   BB     GBR      Online TA
#> 7                     2      2        0      0   BB     PRT         Direct
#> 8                     2      2        0      0   FB     PRT         Direct
#> 9                     3      2        0      0   BB     PRT      Online TA
#> 10                    3      2        0      0   HB     PRT  Offline TA/TO
#>    distribution_channel is_repeated_guest previous_cancellations
#> 1                Direct                 0                      0
#> 2                Direct                 0                      0
#> 3                Direct                 0                      0
#> 4             Corporate                 0                      0
#> 5                 TA/TO                 0                      0
#> 6                 TA/TO                 0                      0
#> 7                Direct                 0                      0
#> 8                Direct                 0                      0
#> 9                 TA/TO                 0                      0
#> 10                TA/TO                 0                      0
#>    previous_bookings_not_canceled reserved_room_type assigned_room_type
#> 1                               0                  C                  C
#> 2                               0                  C                  C
#> 3                               0                  A                  C
#> 4                               0                  A                  A
#> 5                               0                  A                  A
#> 6                               0                  A                  A
#> 7                               0                  C                  C
#> 8                               0                  C                  C
#> 9                               0                  A                  A
#> 10                              0                  D                  D
#>    booking_changes deposit_type agent company days_in_waiting_list
#> 1                3   No Deposit  NULL    NULL                    0
#> 2                4   No Deposit  NULL    NULL                    0
#> 3                0   No Deposit  NULL    NULL                    0
#> 4                0   No Deposit   304    NULL                    0
#> 5                0   No Deposit   240    NULL                    0
#> 6                0   No Deposit   240    NULL                    0
#> 7                0   No Deposit  NULL    NULL                    0
#> 8                0   No Deposit   303    NULL                    0
#> 9                0   No Deposit   240    NULL                    0
#> 10               0   No Deposit    15    NULL                    0
#>    customer_type   adr required_car_parking_spaces total_of_special_requests
#> 1      Transient   0.0                           0                         0
#> 2      Transient   0.0                           0                         0
#> 3      Transient  75.0                           0                         0
#> 4      Transient  75.0                           0                         0
#> 5      Transient  98.0                           0                         1
#> 6      Transient  98.0                           0                         1
#> 7      Transient 107.0                           0                         0
#> 8      Transient 103.0                           0                         1
#> 9      Transient  82.0                           0                         1
#> 10     Transient 105.5                           0                         0
#>    reservation_status reservation_status_date
#> 1           Check-Out              2015-07-01
#> 2           Check-Out              2015-07-01
#> 3           Check-Out              2015-07-02
#> 4           Check-Out              2015-07-02
#> 5           Check-Out              2015-07-03
#> 6           Check-Out              2015-07-03
#> 7           Check-Out              2015-07-03
#> 8           Check-Out              2015-07-03
#> 9            Canceled              2015-05-06
#> 10           Canceled              2015-04-22

Data Description:

hotel : Hotel (Resort Hotel or City Hotel)
is_canceled : Value indicating if the booking was canceled (1) or not (0)
lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year : Year of arrival date
arrival_date_month : Month of arrival date
arrival_date_week_number : Week number of year for arrival date
arrival_date_day_of_month : Day of arrival date
stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults : Number of adults
children : Number of children
babies : Number of babies
meal : Type of meal booked. Categories are presented in standard hospitality meal packages:
- Undefined/SC – no meal package
- BB – Bed & Breakfast
- HB – Half board (breakfast and one other meal – usually dinner)
- FB – Full board (breakfast, lunch and dinner)
country : Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.
assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type : Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
- No Deposit – no deposit was made
- Non Refund * a deposit was made in the value of the total stay cost
- Refundable – a deposit was made with a value under the total cost of stay.
agent : ID of the travel agency that made the booking
company : ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type : Type of booking, assuming one of four categories:
- Contract - when the booking has an allotment or other type of contract associated to it
- Group – when the booking is associated to a group
- Transient – when the booking is not part of a group or contract, and is not associated to other transient booking
- Transient-party – when the booking is transient, but is associated to at least other transient booking
adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces : Number of car parking spaces required by the customer
total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status : Reservation last status, assuming one of three categories:
- Canceled – booking was canceled by the customer
- Check-Out – customer has checked in but already departed
- No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Data Preprocessing

Before we analyze the data, you may notice that the date is scattered in separate columns. We will unite them together to get a proper arrival date column.

hotel <- hotel %>% 
  unite("arrival_date", arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-") %>% 
  mutate(arrival_date = ymd(arrival_date))

head(hotel)

#>          hotel is_canceled lead_time arrival_date arrival_date_week_number
#> 1 Resort Hotel           0       342   2015-07-01                       27
#> 2 Resort Hotel           0       737   2015-07-01                       27
#> 3 Resort Hotel           0         7   2015-07-01                       27
#> 4 Resort Hotel           0        13   2015-07-01                       27
#> 5 Resort Hotel           0        14   2015-07-01                       27
#> 6 Resort Hotel           0        14   2015-07-01                       27
#>   stays_in_weekend_nights stays_in_week_nights adults children babies meal
#> 1                       0                    0      2        0      0   BB
#> 2                       0                    0      2        0      0   BB
#> 3                       0                    1      1        0      0   BB
#> 4                       0                    1      1        0      0   BB
#> 5                       0                    2      2        0      0   BB
#> 6                       0                    2      2        0      0   BB
#>   country market_segment distribution_channel is_repeated_guest
#> 1     PRT         Direct               Direct                 0
#> 2     PRT         Direct               Direct                 0
#> 3     GBR         Direct               Direct                 0
#> 4     GBR      Corporate            Corporate                 0
#> 5     GBR      Online TA                TA/TO                 0
#> 6     GBR      Online TA                TA/TO                 0
#>   previous_cancellations previous_bookings_not_canceled reserved_room_type
#> 1                      0                              0                  C
#> 2                      0                              0                  C
#> 3                      0                              0                  A
#> 4                      0                              0                  A
#> 5                      0                              0                  A
#> 6                      0                              0                  A
#>   assigned_room_type booking_changes deposit_type agent company
#> 1                  C               3   No Deposit  NULL    NULL
#> 2                  C               4   No Deposit  NULL    NULL
#> 3                  C               0   No Deposit  NULL    NULL
#> 4                  A               0   No Deposit   304    NULL
#> 5                  A               0   No Deposit   240    NULL
#> 6                  A               0   No Deposit   240    NULL
#>   days_in_waiting_list customer_type adr required_car_parking_spaces
#> 1                    0     Transient   0                           0
#> 2                    0     Transient   0                           0
#> 3                    0     Transient  75                           0
#> 4                    0     Transient  75                           0
#> 5                    0     Transient  98                           0
#> 6                    0     Transient  98                           0
#>   total_of_special_requests reservation_status reservation_status_date
#> 1                         0          Check-Out              2015-07-01
#> 2                         0          Check-Out              2015-07-01
#> 3                         0          Check-Out              2015-07-02
#> 4                         0          Check-Out              2015-07-02
#> 5                         1          Check-Out              2015-07-03
#> 6                         1          Check-Out              2015-07-03

Exploratory Data Analysis

Market Segments

For each hotel, we have several market segments as mentioned earlier in the data description. In order to maximize our revenue, we will forecast the most profitable market segment. First, we will look at the number of transactions for each market segment on each hotel.

hotel %>% 
  count(hotel, market_segment, is_canceled) %>% 
  group_by(hotel) %>% 
  mutate(total = sum(n),
         ratio = n/total,
         is_canceled = ifelse(is_canceled == 0, "No", "Yes")) %>% 
  ungroup() %>% 
  mutate(market_segment = tidytext::reorder_within(market_segment, ratio, hotel)) %>%  
  ggplot(aes(ratio, market_segment, fill = is_canceled)) +
  geom_col(position = "dodge") +
  labs(x = "Percentage", y = "Market Segment", 
       title = "Hotel Demand by Market Segment",
       fill = "Booking Canceled") +
  facet_wrap(~hotel, scales = "free_y") +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_fill_manual(values = c("skyblue", "firebrick")) +
  tidytext::scale_y_reordered() +
  theme_pander() +
  theme(legend.position = "top")

Based on the result, most of the booking done via travel agent, either online or offline, which combined together contributes more than 40% of the non-canceled total transactions. The other segment don’t have much transactions, but perhaps we would like to see the revenue generated by each market segment by looking at the Average Daily Rate (ADR). Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights. Therefore, the higher ADR means more revenue generated for each staying night.

hotel %>% 
  filter(is_canceled == F) %>% 
  group_by(hotel, market_segment) %>% 
  summarise(adr = sum(adr)) %>% 
  mutate(total = sum(adr),
         ratio = adr/total) %>% 
  ungroup() %>% 
  mutate(market_segment = tidytext::reorder_within(market_segment, ratio, hotel)) %>% 
  ggplot(aes(ratio, market_segment)) +
  geom_col(fill = "firebrick") +
  facet_wrap(~hotel, scales = "free_y") +
  tidytext::scale_y_reordered() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  theme_pander() +
  labs(x = NULL, y = NULL, 
       title = "ADR Contributions by Market Segments")

Total ADR generated from online travel agent is the highest both in the city hotel and the resort hotel. The segment of offline travel agent and direct has a little margin with each other, with direct segment has the higher contribution in resort hotel even though it has lower number of transactions. This give us the top 3 market segments, both in term of quantity and profitability. We will focus on this segments for the rest of the analysis.

Time Series Analysis

We want to forecast the demand of lodging for both city hotel and resort hotel. Based on the previous exploration, we will use only the data from segment online travel agent, offline travel agent and direct. We will consider both canceled and non-canceled transactions to reflect the demand.

hotel_city <- hotel %>%
  filter(hotel == "City Hotel",
         market_segment %>% str_detect("TA|Direct")) 

hotel_resort <- hotel %>%
  filter(hotel == "Resort Hotel",
         market_segment %>% str_detect("TA|Direct")) 

hotel_agg_city <- hotel_city %>% 
  group_by(arrival_date) %>% 
  summarise(demand = n())

hotel_agg_resort <- hotel_resort %>% 
  group_by(arrival_date) %>% 
  summarise(demand = n())

head(hotel_agg_city, 10)

#> # A tibble: 10 × 2
#>    arrival_date demand
#>    <date>        <int>
#>  1 2015-07-01       79
#>  2 2015-07-02        8
#>  3 2015-07-03        3
#>  4 2015-07-04        6
#>  5 2015-07-05        8
#>  6 2015-07-06       10
#>  7 2015-07-07       18
#>  8 2015-07-08        8
#>  9 2015-07-09        6
#> 10 2015-07-10        7

Let’s look at the series of demand for each hotel.

hotel_agg_city   %>%
  mutate(hotel = "City Hotel") %>% 
  bind_rows(hotel_agg_resort %>% mutate(hotel = "Resort Hotel")) %>% 
  ggplot(aes(arrival_date, demand)) +
  geom_line(color = "dodgerblue4") +
  theme_pander() +
  facet_wrap(~hotel, ncol = 1) +
  labs(x = NULL, y = "Demand", title = "Hotel Booking Demand")

The demand for city hotel has a higher fluctuation compared to the resort hotel. This may be caused by several factors, including the room capacity, since we don’t know the room capacity for each hotel. There are also some spikes in demand for city hotel during the late 2015. We will inspect further by dividing the series by market segment. We will also need to make the series have constant interval of time, which is a daily interval in our case.

# summarizing data
hotel_agg_city <- hotel_city %>% 
  group_by(arrival_date, market_segment) %>% 
  summarise(demand = n())

hotel_agg_resort <- hotel_resort %>% 
  group_by(arrival_date, market_segment) %>% 
  summarise(demand = n())

# join city hotel and resort hotel
hotel_agg <- hotel_agg_city %>%
  mutate(hotel = "City Hotel") %>% 
  bind_rows(hotel_agg_resort %>% mutate(hotel = "Resort Hotel")) %>% 
  ungroup()

# padding
start_interval <-  ymd(range(hotel$arrival_date)[1])
end_interval <- ymd(range(hotel$arrival_date)[2])

hotel_pad <- hotel_agg %>% 
  group_by(hotel, market_segment) %>% 
  pad(start_val = start_interval, end_val = end_interval) %>% 
  replace_na(list(demand = 0)) %>% 
  ungroup()

hotel_pad %>% 
  ggplot(aes(arrival_date, demand, color = market_segment)) +
  geom_line() +
  theme_pander() +
  theme(legend.position = "top") +
  facet_wrap(~hotel, ncol = 1, scales = "free_y") +
  labs(x = NULL, y = "Demand", title = "Hotel Demand", color = NULL) +
  scale_color_manual(values = c("firebrick", "skyblue", "orange"))

The behaviour for each segment is quite different, so we will forecast them separately. At this point, we will have 6 different time series to be forecasted for 3 different segments on each hotel.

Seasonality Analysis

Before we proceed to create a forecasting model. We will try to gain more insight regarding the customer behaviour by looking at the seasonality of the demand. Since we have too many series, we will only explore the most lucrative market segment, that is the segment of online TA. We will look at the behaviour for both the city hotel and the resort hotel.

Weekly Seasonality

First, we will look at the weekly seasonality. The weekly seasonality will help us to understand when does people more frequently do check in? Our common sense will tell us that perhaps weekend should be the one where people start to check in. However, weekly seasonality may have a weak strength since people are not regularly go to vacation or rent a hotel.

The following figures is the weekly seasonality of the online TA segment for the city hotel. It has a negative seasonality on the weekend (Saturday and Sunday) and has a high positive seasonality in Tuesday. Thus, the hotel is more likely to have less visitor on the weekend, perhaps because the hotel is not designed to be a vacation hotel and more of a business or transit hotel.

city_weekly <- hotel_agg_city %>% 
  filter(market_segment == "Online TA") %>% 
  ts(frequency = 7) %>% 
  decompose()

city_weekly$seasonal %>% 
  matrix(ncol = 7, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(day = wday(head(hotel_agg_city %>% filter(market_segment == "Online TA") %>% pull(arrival_date) , 7), label = T)) %>% 
  ggplot(aes(day, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Weekly Seasonality of City Hotel, Online TA Segment ") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

Next, we will look at the weekly seasonality of the Online TA segment of the Resort Hotel as well.

resort_weekly <- hotel_agg_resort %>% 
  filter(market_segment == "Online TA") %>% 
  ts(frequency = 7) %>% 
  decompose()

resort_weekly$seasonal %>% 
  matrix(ncol = 7, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(day = wday(head(hotel_agg_resort %>% filter(market_segment == "Online TA") %>% pull(arrival_date) , 7), label = T)) %>% 
  ggplot(aes(day, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Weekly Seasonality of Resort Hotel, Online TA Segment ") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

The customer behaviour is quite the same, with strong positive seasonality happened during Wednesday and negative seasonality during Sunday. It is natural that the number of arrival is dropping during Sunday, since people will go back to their and home do the ordinary activities on Monday. Another reason is perhaps because Monday is the day that most of the monuments/museusms etc are closed in Portugal so if people checked in on Sunday evening the don’t have much to visit during the next day. The peak seasonality in Wednesday may signal that people want to spent more time for the upcoming weekend or people just want to avoid the crowd during weekend.

Monthly Seasonality

We will also check the monthly seasonality and see at what month does it reach its highest and lowest point. The first one is the city hotel.

city_monthly <- hotel_agg_city %>% 
  filter(market_segment == "Online TA") %>% 
  mutate(date = floor_date(arrival_date, "month")) %>% 
  group_by(date) %>% 
  summarise(demand = sum(demand)) %>% 
  ungroup() 

monthly_ts <- ts(city_monthly,frequency = 12) %>% 
  decompose()

monthly_ts$seasonal %>% 
  matrix(ncol = 12, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(month = month(head(city_monthly$date, 12), label = T)) %>% 
  ggplot(aes(month, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Monthly Seasonality of City Hotel, Online TA Segment ") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

The next one is the online TA segment in Resort Hotel. The seasonality reach its highest point during October and same with the city hotel, it reach the lowest point on March.

resort_monthly <- hotel_agg_resort %>% 
  filter(market_segment == "Online TA") %>% 
  mutate(date = floor_date(arrival_date, "month")) %>% 
  group_by(date) %>% 
  summarise(demand = sum(demand)) %>% 
  ungroup() 

monthly_ts <- ts(resort_monthly,frequency = 12) %>% 
  decompose()

monthly_ts$seasonal %>% 
  matrix(ncol = 12, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(month = month(head(resort_monthly$date, 12), label = T)) %>% 
  ggplot(aes(month, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Monthly Seasonality of resort Hotel, Online TA Segment ") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

Based on both graphs, the high and positive seasonality happens around May-June and September-October. The highest negative seasonality happens in March. Both Lisbon and Algarve are located in Portugal. According to Audley Travels , the best time to visit Portugal is in spring (March-May), when the country is in bloom and waking after the winter. You could also go in fall (between September and October) when the sun is still shining, the weather is warm, and many of the crowds have dispersed. However, the negative seasonality in March and April perhaps tell us that the weather is still too cold to travel around and people love to spend more time to go for vacation during September and October. The summer holiday of school, which is span from late June to early September, have a good influence toward the city hotel seasonality.

September and October are two of the best months to visit Portugal. The weather is still warm and pleasant, and the temperatures are much more manageable for sightseeing or hiking. It’s also a wonderful time to visit many of Portugal’s wineries with the grape harvest in full swing. The beaches are also much quieter.
Audley Travels

For the next sections, we will focus on the forecasting of the demand using various machine learning methods.

Cross-Validation

We will split the data into training dataset and testing dataset, with testing dataset consists of the last 30 days from the full dataset.

# get total number of observations
n_data <- hotel %>% 
  count(arrival_date) %>%
  nrow()

# data train
hotel_train <- hotel_pad %>% 
  group_by(hotel, market_segment) %>% 
  slice( 1:(n_data-30) ) 

# data test
hotel_test <- hotel_pad %>% 
  group_by(hotel, market_segment) %>% 
  slice( -(n_data-30:n_data) )

tail(hotel_train, 10)

#> # A tibble: 10 × 4
#> # Groups:   hotel, market_segment [1]
#>    arrival_date market_segment demand hotel       
#>    <date>       <chr>           <dbl> <chr>       
#>  1 2017-07-23   Online TA          26 Resort Hotel
#>  2 2017-07-24   Online TA          46 Resort Hotel
#>  3 2017-07-25   Online TA          25 Resort Hotel
#>  4 2017-07-26   Online TA          32 Resort Hotel
#>  5 2017-07-27   Online TA          33 Resort Hotel
#>  6 2017-07-28   Online TA          36 Resort Hotel
#>  7 2017-07-29   Online TA          51 Resort Hotel
#>  8 2017-07-30   Online TA          23 Resort Hotel
#>  9 2017-07-31   Online TA          44 Resort Hotel
#> 10 2017-08-01   Online TA          49 Resort Hotel

Forecasting Methods

We will do forecasting for each segment of each hotel. This is done to capture the pattern of each series since they have different characteristics and doing an aggregated forecast may result in higher error. Thus, we will have 6 different series, 3 for each hotel. Since we have 6 series to forecast, manually fitting and tuning the model will be tedious and take a long time. We will use purrr to efficiently fitting and evaluating the model in order to get the best model for each series based on the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values. MAE is chosen due to it’s interpretability while RMSE is chosen because it is sensitive to large errors. We don’t use Mean Absolute Percentage Error because it did not perform really well when the actual data has or close to zero value, despite being easier to interpret.

\[MAE = \frac{\Sigma_{i = 1}^{n} (y_i- \hat{y_i})}{n}\]

\[RMSE = \sqrt\frac{\Sigma_{i = 1}^{n} (y_i-\hat{y_i} )^2}{n}\]

Before we proceed, we will do forecasting method for the aggregated time series by combining all demand into a single series. This will function as a benchmark for the subsequent forecasting.

train_agg <- hotel_train %>% 
  group_by(arrival_date) %>% 
  summarise(demand = sum(demand))

test_agg <- hotel_test %>% 
  group_by(arrival_date) %>% 
  summarise(demand = sum(demand))

head(train_agg, 10)

#> # A tibble: 10 × 2
#>    arrival_date demand
#>    <date>        <dbl>
#>  1 2015-07-01      118
#>  2 2015-07-02       52
#>  3 2015-07-03       43
#>  4 2015-07-04       55
#>  5 2015-07-05       53
#>  6 2015-07-06       55
#>  7 2015-07-07       53
#>  8 2015-07-08       34
#>  9 2015-07-09       38
#> 10 2015-07-10       49

With a weekly seasonality and using ARIMA method, here is the result of the forecast on the next 30 days. The forecast give us MAE of 21, which mean that the model will have different around 21 demands compared to the actual testing dataset in average and RMSE around 26.

train_ts <- ts(train_agg$demand, frequency = 7)

arima_agg <- auto.arima(train_ts)

forecast_agg <- forecast(arima_agg, h = 30)

accuracy(forecast_agg, test_agg$demand) %>%
  as.data.frame()

#>                      ME     RMSE      MAE       MPE     MAPE      MASE
#> Training set  0.4580128 37.73934 28.13519 -10.16052 27.86560 0.7628458
#> Test set     -8.0150470 26.31801 21.25028  -8.68376 15.88266 0.5761712
#>                    ACF1
#> Training set 0.00200618
#> Test set             NA

We will see if by separating the series will give us better forecast performance.

First, we will nest the dataset, making our data into a list of 6 separate time series.

# nesting data train
train_list <- hotel_train %>% 
  select(-arrival_date) %>% 
  unite("series", hotel, market_segment, sep = "_") %>% 
  nest(-series)

# nesting data test
test_list <- hotel_test %>% 
  select(-arrival_date) %>% 
  unite("series", hotel, market_segment, sep = "_") %>% 
  nest(-series)

head(train_list)

#> # A tibble: 6 × 2
#>   series                     data              
#>   <chr>                      <list>            
#> 1 City Hotel_Direct          <tibble [763 × 1]>
#> 2 City Hotel_Offline TA/TO   <tibble [763 × 1]>
#> 3 City Hotel_Online TA       <tibble [763 × 1]>
#> 4 Resort Hotel_Direct        <tibble [763 × 1]>
#> 5 Resort Hotel_Offline TA/TO <tibble [763 × 1]>
#> 6 Resort Hotel_Online TA     <tibble [763 × 1]>

The column data consists of a list of the demand for each series. For example, the following is the series for the City Hotel with Direct Segment.

train_list$data[[1]] %>% 
  head()

#> # A tibble: 6 × 1
#>   demand
#>    <dbl>
#> 1      0
#> 2      0
#> 3      0
#> 4      0
#> 5      0
#> 6      0

Preprocessing Specification

We will try several preprocess approach since there is a possibility that transformed data are performing better than the original scale. We will use the following treatment:

No data transformation
Squared value
Scaled value, data will be normalized using z-score
Log transformation

recipe_spec <- list(
  normal_spec = function(x) x, # no transformation
  squared_spec = function(x) sqrt(x), # square the demand value
  scale_spec = function(x) scale(x), # normalize the demand value with Z-score
  log_spec = function(x) log(x+1) # convert demand to log
) %>% 
  enframe(name = "preprocess_name", value = "preprocess_spec")

recipe_spec

#> # A tibble: 4 × 2
#>   preprocess_name preprocess_spec
#>   <chr>           <list>         
#> 1 normal_spec     <fn>           
#> 2 squared_spec    <fn>           
#> 3 scale_spec      <fn>           
#> 4 log_spec        <fn>

We also need to create a function to reverse the scaled to value into the original value for later model evaluation. For example, if the data is being square rooted, then we need to scale it back by squaring the data.

# reverse function to scale back for model evaluation
reverse_spec <- list(
  
  normal_spec = function(x, y) {
    y <- y
    return(x)
    },
  
  squared_spec = function(x, y) {
    y <- y
    return(x^2)
    },
  
  scale_spec = function(x, y) x * sd(y) + mean(y),
  
  log_spec = function(x,y) {
    y <- y
    return(exp(x)-1)
  }
) %>% 
  enframe(name = "reverse_name", value = "reverse_spec")

# joint the preprocess and the scale-back function
recipe_spec <- recipe_spec %>% 
  bind_cols(reverse_spec)

Seasonality Specification

The next step is to specify the seasonal period for the series. We will try several seasonal period, including:

weekly seasonality
monthly seasonality
annual seasonality
weekly and monthly seasonality (multi-seasonal)
weekly and annual seasonality (multi-seasonal)

seasonal_forecast <- list(
  weekly = function(x) ts(x, frequency = 7),
  monthly = function(x) ts(x, frequency = 7*4),
  weekly_monthly = function(x) msts(x, seasonal.periods = c(7, 7*4)),
  weekly_annual = function(x) msts(x, seasonal.periods = c(7, 365)),
  annual = function(x) ts(x, frequency = 365)
  ) %>% 
  enframe(name = "season_name", value = "season_spec")

Impute Outlier

We will also try to preprocess the data by whether an outlier should be replaced or not. If the outlier is replaced, we will identify the outlier and estimate the replacement using the tsoutliers function. Residuals are identified by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data.

outlier_spec <- list(
  normal_spec = function(x) x, # no transformation
  
  out_spec = function(x){
    outlier_place <- tsoutliers(x)
    x[outlier_place$index] <- outlier_place$replacement
    return(x)
  }
) %>% 
  enframe(name = "outlier_name", value = "out_spec")

Model Specification

Next, we will specify the model the data will be fit into. The model includes:

ARIMA
STL + ETS model
STL + ARIMA

method_forecast <- list(
  arima  = function(x) auto.arima(x),
  stl_ets = function(x) stlm(x, method = "ets"),
  stl_arima = function(x) stlm(x, method = "arima")
  ) %>% 
  enframe(name = "model_name", value = "model_spec")

Model Fitting

Below is the combination for each specification on each series. For 6 different series, we will have 5 different seasonality specification, 3 different models and other specifications. Therefore, we will 540 different combinations. We will choose the best model based on the RMSE and MAE value on the testing dataset. For example, the first row is the time series for City Hotel with Direct Segment. This data will be transformed using log (log_spec) with annual seasonality and fitted with time series model using ARIMA.

# combine the data with the specification
train_crossing <- crossing(train_list, recipe_spec, seasonal_forecast, outlier_spec, method_forecast )

test_crossing <- crossing(test_list, recipe_spec, seasonal_forecast, outlier_spec, method_forecast)

train_crossing %>% 
  head()

#> # A tibble: 6 × 12
#>   series     data      preprocess_name preprocess_spec reverse_name reverse_spec
#>   <chr>      <list>    <chr>           <list>          <chr>        <list>      
#> 1 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> 2 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> 3 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> 4 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> 5 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> 6 City Hote… <tibble … log_spec        <fn>            log_spec     <fn>        
#> # … with 6 more variables: season_name <chr>, season_spec <list>,
#> #   outlier_name <chr>, out_spec <list>, model_name <chr>, model_spec <list>

The following code produce the transformation process for the data before fitted into model.

# model fitting and evaluation
transformed_data <- map2(.x = train_crossing$data,
                         .y = train_crossing$preprocess_spec,
                         .f = ~exec( .y, .x) 
                         )

The following code do all process from transforming data into time series, fitting them into the model and forecast demands for the next 30 days.

# fitting and forecasting
forecast_map <- transformed_data %>% 
  
  # Convert data to time series with different seasonalities
  map2(.y = train_crossing$season_spec,
       .f = ~exec(.y, .x)) %>% 
  
  # Transform outlier
  map2(.y = train_crossing$out_spec,
       .f = ~exec(.y, .x)) %>% 
  
  # Fit data to time series model
  map2(.y = train_crossing$model_spec,
       .f =  ~exec(.y,.x)) %>% 
  
  # Forecast for the next 30 days
  map(forecast, h = 30) %>%  
  
  # Take the mean of the forecast
  map(~pluck(.x, "mean")) %>% 
  
  map(as.numeric)

Forecasting Result

Below is the result for our modeling process for the first time series.

# Result of the first time series
forecast_map[[1]]

#>  [1] 2.346707 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587
#>  [9] 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587
#> [17] 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587
#> [25] 2.338587 2.338587 2.338587 2.338587 2.338587 2.338587

We use MAE and RMSE to measures and compares the performance of each model.

# scale-back the data to original scale value
forecast_trans <- list()

for (i in 1:length(forecast_map)) {
  forecast_trans[[i]] <- train_crossing$reverse_spec[[i]]( x = forecast_map[[i]], y = transformed_data[[i]])
}

forecast_trans[[1]]

#>  [1] 9.451100 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579
#>  [9] 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579
#> [17] 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579
#> [25] 9.366579 9.366579 9.366579 9.366579 9.366579 9.366579

# calculate MAE
mae_list <- forecast_trans %>% 
  map2(.y = test_crossing$data, 
       .f = ~yardstick::mae_vec(.x %>% as.numeric(), .y$demand))

rmse_list <- forecast_trans %>% 
  map2(.y = test_crossing$data, 
       .f = ~yardstick::rmse_vec(.x %>% as.numeric(), .y$demand))

# show result
train_crossing %>% 
  separate(series, c("hotel", "market_segment"), sep = "_") %>% 
  select_if(is.character) %>% 
  bind_cols(mae = mae_list %>% as.numeric()) %>% 
  bind_cols(rmse = rmse_list %>% as.numeric()) %>% 
  select(hotel, market_segment, mae, rmse, everything()) %>% 
  head(10)

#> # A tibble: 10 × 9
#>    hotel     market_segment   mae  rmse preprocess_name reverse_name season_name
#>    <chr>     <chr>          <dbl> <dbl> <chr>           <chr>        <chr>      
#>  1 City Hot… Direct          3.79  4.63 log_spec        log_spec     annual     
#>  2 City Hot… Direct         11.7  13.9  log_spec        log_spec     annual     
#>  3 City Hot… Direct         10.6  12.8  log_spec        log_spec     annual     
#>  4 City Hot… Direct          3.79  4.63 log_spec        log_spec     annual     
#>  5 City Hot… Direct         11.7  13.9  log_spec        log_spec     annual     
#>  6 City Hot… Direct         10.6  12.8  log_spec        log_spec     annual     
#>  7 City Hot… Direct          3.79  4.63 log_spec        log_spec     monthly    
#>  8 City Hot… Direct          3.85  4.71 log_spec        log_spec     monthly    
#>  9 City Hot… Direct          3.79  4.64 log_spec        log_spec     monthly    
#> 10 City Hot… Direct          3.79  4.63 log_spec        log_spec     monthly    
#> # … with 2 more variables: outlier_name <chr>, model_name <chr>

Below is the best configuration for each series based on the lowest RMSE, since RMSE give more penality towards large error. To interpret the MAE values, we need to consider the range of the data, shown as the standar deviation of the data.

# best configuration for each series
best_adjust <- train_crossing %>% 
  separate(series, c("hotel", "market_segment"), sep = "_") %>% 
  bind_cols(mae = mae_list %>% as.numeric()) %>%
  bind_cols(rmse = rmse_list %>% as.numeric()) %>% 
  group_by(hotel, market_segment) %>% 
  arrange(rmse) %>% 
  slice(1) 

# show the result
metric_crossing <- train_list %>% crossing(list(mean = mean, std_dev = sd) %>% enframe(name = "type", value = "metric"))

metric_crossing %>% 
  bind_cols(value = map2(.x = metric_crossing$data, 
                         .y = metric_crossing$metric, 
                         .f = ~exec(.y, unlist(.x))) %>% unlist() 
            ) %>% 
  select(series, type, value) %>% 
  pivot_wider(names_from = type, values_from = value) %>% 
  separate(series, c("hotel", "market_segment"), sep = "_") %>% 
  left_join(best_adjust) %>% 
  select_if(~is.list(.) == F) %>% 
  select(-reverse_name) %>% 
  select(hotel, market_segment, mae, rmse,  mean, std_dev, everything()) %>% 
  arrange(rmse)

#> # A tibble: 6 × 10
#>   hotel     market_segment   mae  rmse  mean std_dev preprocess_name season_name
#>   <chr>     <chr>          <dbl> <dbl> <dbl>   <dbl> <chr>           <chr>      
#> 1 Resort H… Direct          2.21  2.65  8.14    4.85 normal_spec     weekly     
#> 2 City Hot… Direct          3.43  4.17  7.53    4.83 normal_spec     weekly     
#> 3 Resort H… Offline TA/TO   4.26  6.09  9.39    7.38 log_spec        weekly_ann…
#> 4 Resort H… Online TA       8.71 11.0  21.8    11.7  squared_spec    weekly     
#> 5 City Hot… Offline TA/TO   8.80 13.7  21.4    32.0  normal_spec     weekly     
#> 6 City Hot… Online TA      13.2  16.2  48.1    28.5  squared_spec    weekly     
#> # … with 2 more variables: outlier_name <chr>, model_name <chr>

Since we don’t have much data, only two years of transactions, the model performance may not perform so well. However, judging from the MAE value, the performance is quite acceptable, with most of the error values are less than the value of one standard deviation. Compared to the aggregated data on the first forecast which have MAE of 21, we have lower MAE value for each series, which give us an evidence that by making a separate forecasting models for each market segment will make the model more accurate.

# Aggregated time series performance
accuracy(forecast_agg, test_agg$demand) %>%
  as.data.frame() %>% 
  slice(2) %>% 
  select(MAE, RMSE)

#>               MAE     RMSE
#> Test set 21.25028 26.31801

Below is the forecasting result for each series. The red line indicate the actual demand value while the blue line indicate the forecast value. The blue area represent area with 80% prediction interval while the light blue are represent the 95% prediction interval. Most of the actual demand is still inside the forecasting intervals.

Resort Hotel, Online TA

model_best <- map2(.x = best_adjust$data, 
                   .y = best_adjust$season_spec,
                   .f =  ~exec(.y,.x)) %>% 
  map2(.y = best_adjust$model_spec,
       .f =  ~exec(.y,.x)
       )

data_test <- test_list$data[[6]] %>% 
  msts(start = 110, seasonal.periods =  7)

model_best[[6]] %>%
  forecast(h = 30 ) %>%  
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "Resort Hotel, Online TA",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(100, 115) ,
                     labels = as.Date.numeric(seq(100,115,5)*7-7, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[6]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

City Hotel, Online TA

data_test <- test_list$data[[3]] %>% 
  ts(start = 110, frequency = 7)

model_best[[3]] %>%
  forecast(h = 30 ) %>% 
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "City Hotel, Online TA",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(100, 115), 
                     labels = as.Date.numeric(seq(100,115,5)*7-7, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[3]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

Resort Hotel, Offline TA/TO

data_test <- test_list$data[[5]] %>% msts(start = 3.090411, seasonal.periods = c(7, 365))

model_best[[5]] %>%
  forecast(h = 30 ) %>% 
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "Resort Hotel, Offline TA/TO",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(3, 3.2), 
                     labels = as.Date.numeric(seq(3,3.2,0.05)*365-365, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[5]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

City Hotel, Offline TA/TO

data_test <- test_list$data[[2]] %>% ts(start = 110, frequency = 7)

model_best[[2]] %>%
  forecast(h = 30 ) %>% 
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "City Hotel, Offline TA/TO",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(100, 115), 
                     labels = as.Date.numeric(seq(100,115,5)*7-7, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[2]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

Resort Hotel, Direct

data_test <- test_list$data[[4]] %>% ts(start = 110, frequency = 7)

model_best[[4]] %>%
  forecast(h = 30 ) %>% 
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "Resort Hotel, Direct",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(100, 115), 
                     labels = as.Date.numeric(seq(100,115,5)*7-7, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[4]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

City Hotel, Direct

data_test <- test_list$data[[1]] %>% ts(start = 110, frequency = 7)

model_best[[1]] %>%
  forecast(h = 30 ) %>% 
  autoplot() +
  autolayer(data_test, series = "Data Test") +
  scale_color_manual(values = "firebrick") +
  labs(subtitle = "City Hotel, Direct",
       y = "Demand", x  = NULL) +
  theme_pander() +
  scale_x_continuous(limits = c(100, 115), 
                     labels = as.Date.numeric(seq(100,115,5)*7-7, origin = range(hotel$arrival_date)[1])) +
  theme(legend.position = "top")

model_best[[1]]$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(x)) +
  geom_density(fill = "skyblue", alpha = 0.7, color = "white") +
  labs(x = "Residuals", y = "Density",
       title = "Residuals Distribution") +
  theme_pander()

Model Assumption Checking

Autocorrelation

The autocorrelation can be checked using the Ljung-Box test. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.

best_adjust %>% 
  select(hotel, market_segment) %>% 
  bind_cols( 
      map(model_best, ~Box.test(.x$residuals, type = "Lj")) %>% 
  unlist() %>% 
  matrix(ncol = 5, byrow = T) %>% 
  as.data.frame() %>% 
  select(4:3) %>% 
  rename(p_value = V3, test = V4) %>% 
  mutate(p_value = p_value %>% 
           as.character() %>% 
           as.numeric() %>% 
           round(4)) 
  )

#> # A tibble: 6 × 4
#> # Groups:   hotel, market_segment [6]
#>   hotel        market_segment test           p_value
#>   <chr>        <chr>          <chr>            <dbl>
#> 1 City Hotel   Direct         Box-Ljung test   0.976
#> 2 City Hotel   Offline TA/TO  Box-Ljung test   0.971
#> 3 City Hotel   Online TA      Box-Ljung test   0.769
#> 4 Resort Hotel Direct         Box-Ljung test   0.854
#> 5 Resort Hotel Offline TA/TO  Box-Ljung test   0.959
#> 6 Resort Hotel Online TA      Box-Ljung test   0.959

The results suggests that all of our models don’t have any autocorrelation based on the non-significant p-value.

Normality

We will also check if the residuals for each model is normally distributed using Shapiro-Wilk Test. If the residuals are not normally distributed, it will lead to a biased parameter and less optimal forecast. This is also indicate that we can still tweak our model to get a better performance.

best_adjust %>% 
  select(hotel, market_segment) %>% 
  bind_cols(
    mean_error = map(model_best, ~mean(.$residuals)) %>% 
      as.numeric() %>% 
      round(5)
  ) %>%
  bind_cols(
    median_error = map(model_best, ~median(.$residuals)) %>% 
      as.numeric() %>% 
      round(5)
  ) %>% 
  bind_cols( 
      map(model_best, ~shapiro.test(.x$residuals)) %>% 
        unlist() %>% 
        matrix(ncol = 4, byrow = T) %>% 
        as.data.frame() %>% 
        select(3:2) %>% 
        rename(p_value = V2, test = V3)
  ) %>% 
  rename(segment = market_segment) %>% 
  mutate(p_value = p_value %>% 
           as.character() %>% 
           as.numeric() %>% 
           number(accuracy = 0.00001))

#> # A tibble: 6 × 6
#> # Groups:   hotel, segment [6]
#>   hotel       segment      mean_error median_error test                  p_value
#>   <chr>       <chr>             <dbl>        <dbl> <chr>                 <chr>  
#> 1 City Hotel  Direct           0.180        -0.387 Shapiro-Wilk normali… 0.00000
#> 2 City Hotel  Offline TA/…     0.104        -6.74  Shapiro-Wilk normali… 0.00000
#> 3 City Hotel  Online TA        0.0146       -0.898 Shapiro-Wilk normali… 0.00000
#> 4 Resort Hot… Direct           0.0442       -0.586 Shapiro-Wilk normali… 0.00000
#> 5 Resort Hot… Offline TA/…    -0.0813       -0.142 Shapiro-Wilk normali… 0.00000
#> 6 Resort Hot… Online TA        0.211        -0.426 Shapiro-Wilk normali… 0.00000

Based on the result, all of our model didn’t fulfill the normality assumption for the residuals. The positive mean of error signify that the model is underestimate the forecast while negative mean error means the model is overestimate. If we look at the median of error, all of our models are underestimate on the training set. They might be influenced by the presence of an outlier value such as a really high demand especially on the early part of the series. This suggest that we can improve the model further in order to get better performance.

Conclusion

This article has illustrated how R and functional programming of purrr can help us to do flexible forecasting for multiple time series models. We have tried to do hotel demand forecasting using a real-world datasets with two years worth of data. We also have tried to analyze the series pattern for each hotel and segment and fit the best model for each one of them with a satisfying results. The next step perhaps is to enhance the model further either by using another time series model, incorporate predictor by the unused variables from the original dataset or transforming the data.

Home

Machine Learning

Data Visualization

Article List