5-Step Implementation of Logistic Regression to Predict Venue Cancellations

Logistic Regression is an algorithm that predicts the probability that an observation belongs to one of two classes. If the thing being predicted is an event, the binary dependent variable is encoded as 1 if the event occurs and as 0 if it does not; the model predicts a 1 when it judges the event likely, and a 0 otherwise.
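To make the "probability, then 0 or 1" idea concrete, here is a minimal sketch of the calculation a fitted logistic regression performs. The function names, weights, bias and the 0.5 threshold are all assumptions chosen for illustration; none of them come from the model in this article.

```python
import math

def predict_probability(features, weights, bias):
    """Return the modelled probability that the observation belongs to class 1."""
    # Linear score: intercept plus the weighted sum of the feature values.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid maps any real-valued score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Encode the prediction as 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical weights for two made-up features, for illustration only.
p = predict_probability([2.0, 1.0], weights=[0.8, -0.5], bias=-1.0)
label = predict_class([2.0, 1.0], weights=[0.8, -0.5], bias=-1.0)
print(round(p, 3), label)
```

Training the model, covered in the steps that follow, amounts to finding the weights and bias that make these probabilities agree with the historical outcomes as closely as possible.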

In this example, I implement a 5-step logistic regression to predict whether a reservation is likely to be cancelled so venues are prepared to find new bookings for empty space.

The dataset comes from a travel business and contains 4238 observations. Its categorical variables include destination country, property type, and whether the customer booked with a special request. Its non-categorical features include the number of rooms booked, the number of nights booked, and the venue star rating. One out of three bookings is cancelled, making the cancellation rate roughly 33%.

Step 1: Explore your dataset! Analyze patterns, distributions, pairplots and correlations with the goal of understanding how your features interact with the target you want to predict.
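As a rough sketch of what this exploration can look like in pandas, the snippet below builds a small synthetic stand-in for the booking data and runs a few typical checks. The data is random and every column name (`property_type`, `cancelled`, and so on) is hypothetical, so the numbers it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the booking data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 300
bookings = pd.DataFrame({
    "property_type": rng.choice(["hotel", "apartment", "villa"], size=n),
    "special_request": rng.integers(0, 2, size=n),
    "rooms_booked": rng.integers(1, 5, size=n),
    "nights_booked": rng.integers(1, 15, size=n),
    "star_rating": rng.integers(1, 6, size=n),
    "cancelled": rng.integers(0, 2, size=n),  # target: 1 = cancelled, 0 = kept
})

# Class balance of the target: share of kept (0) versus cancelled (1) bookings.
class_balance = bookings["cancelled"].value_counts(normalize=True)

# Distribution summary (count, mean, quartiles, ...) of the numeric columns.
summary = bookings.describe()

# Cancellation rate within each level of a categorical feature.
rate_by_property = bookings.groupby("property_type")["cancelled"].mean()

# Linear correlation of each numeric feature with the target.
numeric = bookings.select_dtypes(include="number")
target_corr = numeric.corr()["cancelled"].drop("cancelled")

print(class_balance, rate_by_property, target_corr, sep="\n\n")
```

For the pairplots mentioned above, seaborn's `pairplot` function is a common choice if you have a plotting library installed.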

Step 2: Isolate your selected predictive features and target. Using the knowledge derived from your exploratory data analysis, establish which features you think are the most effective in predicting your binary target variable. Isolate them in a dataframe that you will later refer to when you train your Logistic Regression algorithm to learn from the historical data.
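A minimal sketch of this isolation step, assuming the data lives in a pandas dataframe. The four rows and all column names below are made up for illustration, and one-hot encoding the categorical column with `pd.get_dummies` is my assumption about how such a feature would be made numeric, not a detail taken from the original analysis.

```python
import pandas as pd

# Tiny hypothetical sample; the column names are assumptions, not the real dataset's.
bookings = pd.DataFrame({
    "property_type": ["hotel", "villa", "hotel", "apartment"],
    "special_request": [1, 0, 0, 1],
    "rooms_booked": [1, 2, 1, 3],
    "nights_booked": [3, 7, 2, 5],
    "star_rating": [4, 5, 3, 4],
    "booking_id": [101, 102, 103, 104],  # identifier with no predictive meaning
    "cancelled": [0, 1, 0, 1],           # target: 1 = cancelled, 0 = kept
})

# The predictors chosen after exploring the data (a hypothetical selection).
selected = ["property_type", "special_request", "rooms_booked",
            "nights_booked", "star_rating"]

# Feature matrix: only the chosen predictors, with the categorical column
# one-hot encoded so the model receives numeric inputs.
X = pd.get_dummies(bookings[selected], columns=["property_type"], drop_first=True)

# Target vector, kept separate from the features.
y = bookings["cancelled"]

print(X.columns.tolist())
```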

Step 2 (continued): Apply a train-test split. Divide your selected features and target into one partition for training the model and another for testing its accuracy on observations it has not seen. The majority of the data should be allocated to training so your model has as many examples to learn from as possible; in this case, 70% of our 4238 records, or 2967, are used to train the model. The remaining 1271 are used to test how well the model learned.
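Here is a sketch of the split using scikit-learn's `train_test_split` on random placeholder data with the same number of rows as the dataset. Passing `test_size=1271` as an absolute row count is my choice so the partition sizes match the figures quoted above exactly; `random_state` and `stratify` are also assumptions, added for reproducibility and to keep the cancellation rate similar in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data with the same row count as the dataset (4238).
rng = np.random.default_rng(42)
X = rng.normal(size=(4238, 3))
y = (rng.random(4238) < 1 / 3).astype(int)  # roughly one in three labelled 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=1271,    # an integer is read as an absolute number of test rows
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the share of 1s similar in both partitions
)

print(X_train.shape[0], X_test.shape[0])
```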

Step 3: Fit the Logistic Regression algorithm to the training data. Fit your model to your selected feature set observations (X_train) and target (y_train) so it learns which combinations of X_train feature values tended to result in a cancellation (when the y_train target = 1), versus a kept booking (when the y_train target = 0).
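The fitting step can be sketched with scikit-learn's `LogisticRegression`. The training data below is synthetic: I generate it from a made-up rule in which the first feature raises the odds of a cancellation and the second lowers them, so you can see the fitted coefficients recover those directions. The `max_iter=1000` setting is an assumption to give the solver room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data generated from a made-up rule (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
logits = 1.5 * X_train[:, 0] - 1.0 * X_train[:, 1] - 0.7
y_train = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

# Fit the model: it estimates one coefficient per feature plus an intercept.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

coefficients = model.coef_[0]   # sign shows the direction of each feature's effect
intercept = model.intercept_[0]

print(coefficients, intercept)
```

A positive coefficient means higher values of that feature push the predicted probability toward 1 (a cancellation); a negative one pushes it toward 0 (a kept booking).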

Step 4: Make predictions on the test dataset and measure their accuracy. Evaluate the accuracy of your Logistic Regression by comparing its predicted target values (y_pred) to the actual target values (y_test) in the test dataset.
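A minimal end-to-end sketch of this step on synthetic data: fit on the training partition, predict on the test partition, then compare y_pred to y_test with scikit-learn's `accuracy_score`. The data-generating rule and the 70/30 split below are assumptions for illustration, so the accuracy it prints is unrelated to the result reported in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first feature plus some noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 predictions for the held-out observations.
y_pred = model.predict(X_test)
# Estimated probability of class 1 for each held-out observation.
y_prob = model.predict_proba(X_test)[:, 1]

# Accuracy: the share of test observations whose predicted label matches the actual one.
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy = {accuracy:.3f}")
```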

Step 5: Use a confusion matrix to visualize how your model is making predictions compared to actuals.
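Here is a small sketch of building a confusion matrix with scikit-learn's `confusion_matrix`. The ten labels below are made up by hand so the counts are easy to verify; they are not drawn from the booking data.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 1 = cancelled, 0 = kept.
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # actual outcomes
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]   # model predictions

# Rows are actual classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# For a binary problem with labels [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[5 1]
#  [3 1]]
```

Reading the matrix: 5 kept bookings were correctly predicted as kept (true negatives), 1 kept booking was wrongly flagged as a cancellation (false positive), 3 cancellations were missed (false negatives), and 1 cancellation was caught (true positive). If matplotlib is installed, scikit-learn's `ConfusionMatrixDisplay` can draw the same matrix as a plot.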

In this case, the Logistic Regression classifies 69% of the test bookings correctly. Because a cancellation is encoded as 1, this accuracy consists of 821 true negatives (bookings correctly predicted as kept) and 57 true positives (bookings correctly predicted as cancelled) out of 1271 observations in the test dataset: (821 + 57) / 1271 ≈ 69%. The remaining 31% are misclassified.

I hope this helps you understand the mechanics behind a logistic regression better. Thanks for reading!
