

I recall, early on in my foray into data science, watching a Kaggle video on YouTube and being advised that as I progressed in Kaggle competitions I would move up the leaderboard. I have therefore tirelessly tried various algorithms to move up the leaderboard on the Titanic competition. I have previously tried different models in an attempt to increase the accuracy, but on this occasion I decided to change my algorithm to see if the accuracy for this dataset would improve. This renewed attempt at improving my accuracy rating was spurred on by the 100% accuracy I achieved when I worked on UCI’s wine recognition dataset, which can be found here:- How I scored 100% accuracy on UCI’s wine recognition dataset | by Tracyrenee | AI In Plain English | Jan, 2021 | Medium
The problem statement for the Kaggle Titanic competition reads as follows:-
“The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).”
I accepted the challenge to answer this competition question and created a notebook on the Kaggle Titanic competition page, a free online Jupyter Notebook environment that already has Python and several relevant libraries installed. I endeavored to use a new strategy to solve this question, which I will describe in this post.
When the program was created, I loaded the libraries I would need to carry out the computations: numpy, pandas, matplotlib and seaborn. Numpy is a library for numerical computation on arrays, and pandas, a library that manipulates dataframes, is built on top of it. Matplotlib is a graphical library, and seaborn, a higher-level graphical library, is built on top of matplotlib.
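The import block is standard; a minimal sketch, assuming the usual aliases:

```python
# Core libraries used throughout the notebook
import numpy as np               # numerical computation on arrays
import pandas as pd              # dataframe manipulation, built on numpy
import matplotlib.pyplot as plt  # basic plotting
import seaborn as sns            # higher-level plotting, built on matplotlib
```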
I then loaded the files that are stored on the Kaggle website:-
Once the train and test files were loaded, I read them into the program:-
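On Kaggle the competition files live under /kaggle/input/titanic, so the read step looks roughly like this (file paths assumed from the standard competition setup):

```python
# Read the competition CSVs into pandas dataframes
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

print(train.shape, test.shape)  # sanity check: (891, 12) and (418, 11)
```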
I decided to try out a new strategy in this post to see if this new way of working would improve the accuracy rating for this competition question. I therefore defined the target, or y variable, at the beginning of the program. I then dropped the target from the train dataset and called the result train_less_y:-
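In code, that step amounts to something like this ("Survived" is the competition's target column):

```python
# Define the target and remove it from the training features
y = train["Survived"]
train_less_y = train.drop(columns=["Survived"])
```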
I then appended the test dataset to train_less_y and called the newly created variable X_tot:-
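A sketch of the append step; pd.concat is used here because DataFrame.append has since been deprecated in pandas:

```python
# Stack the train rows (without the target) and the test rows into one
# dataframe so that all preprocessing is applied to both sets consistently
X_tot = pd.concat([train_less_y, test], ignore_index=True)
```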
I checked X_tot for any null values and found there were quite a few in three columns of data:-
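A simple way to run that check:

```python
# Count the missing values in each column of the combined dataframe
print(X_tot.isnull().sum())
```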
I decided to delete the “Cabin” and “Ticket” columns from X_tot because they would only create noise and would not, in my opinion, contribute anything meaningful to the prediction:-
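The drop itself is a one-liner:

```python
# Drop the columns judged to be noise
X_tot = X_tot.drop(columns=["Cabin", "Ticket"])
```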
I decided to initially replace all null values with zero for ease of computation:-
I then imputed these placeholder values in X_tot. I replaced the zeros in “Age” and “Fare” with the median values of their respective columns, and replaced the missing “Embarked” entries with the most commonly used value in that column:-
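One way to implement the two imputation steps described above; the exact order used in the notebook is not shown, and note that this approach would also overwrite any genuinely zero fares:

```python
# Keep the true medians / mode before the zero-fill so they are not skewed
age_median = X_tot["Age"].median()
fare_median = X_tot["Fare"].median()
embarked_mode = X_tot["Embarked"].mode()[0]

# First replace every null with zero for ease of computation
X_tot = X_tot.fillna(0)

# Then overwrite the placeholder zeros with the saved statistics
X_tot.loc[X_tot["Age"] == 0, "Age"] = age_median
X_tot.loc[X_tot["Fare"] == 0, "Fare"] = fare_median
X_tot.loc[X_tot["Embarked"] == 0, "Embarked"] = embarked_mode
```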
To illustrate how the two classes in the target array are distributed, I created a graphical representation of the target values. As can be seen, the 1s are interlaced with the 0s, which makes it difficult to make an accurate prediction:-
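The original chart is not reproduced here, but a simple way to visualise how the two classes are interleaved is to plot the target against its row index (a sketch, not necessarily the exact chart used in the notebook):

```python
# Scatter the target values against their row index to show how the
# survivors (1) are interleaved with the non-survivors (0)
plt.figure(figsize=(10, 2))
plt.scatter(range(len(y)), y, s=4)
plt.yticks([0, 1])
plt.xlabel("Passenger index")
plt.ylabel("Survived")
plt.show()
```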
Once the initial analysis had been conducted, I extracted the title from the name and created a new column, “Title”:-
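The usual way to pull the title out of the Name column is with a small regular expression; a sketch:

```python
# Extract the title (e.g. "Mr", "Mrs", "Miss") from each passenger's name,
# i.e. the word immediately before the full stop
X_tot["Title"] = X_tot["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
```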
I then mapped the titles and applied them to the newly created column as a type of encoding:-
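A hypothetical mapping to illustrate the idea; the exact dictionary used in the notebook is not reproduced here:

```python
# Map each title to an integer code; rare titles fall back to 0
title_map = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}
X_tot["Title"] = X_tot["Title"].map(title_map).fillna(0).astype(int)
```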
I ordinal encoded the “Sex” and “Embarked” columns in the same manner that I had encoded the “Title” column:-
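The same idea applied to the other two categorical columns (the specific integer codes are an assumption):

```python
# Ordinal-encode Sex and Embarked with simple integer mappings
X_tot["Sex"] = X_tot["Sex"].map({"male": 0, "female": 1})
X_tot["Embarked"] = X_tot["Embarked"].map({"S": 0, "C": 1, "Q": 2})
```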
In order to help the model compute the predictions, I converted the float “Age” and “Fare” columns to integers:-
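The conversion is a straightforward cast:

```python
# Convert the float columns to integers
X_tot["Age"] = X_tot["Age"].astype(int)
X_tot["Fare"] = X_tot["Fare"].astype(int)
```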
I then defined the features that would be used to calculate the predictions and applied these features to X_tot:-
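A sketch of the selection; the exact feature list is an assumption based on the columns prepared above (note that "SibSp" and "Parch" are included here, which becomes relevant at the end of the post):

```python
# Keep only the columns used as features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Title"]
X_tot = X_tot[features]
```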
I then normalised X_tot, using an algebraic equation to convert the values in the numeric columns to values between 0 and 1; putting the features on a comparable scale helps the model make predictions on the data:-
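A min-max style rescaling is one way to express that equation, assuming every remaining column is numeric:

```python
# Rescale each column to the range [0, 1]:  x' = (x - min) / (max - min)
X_tot = (X_tot - X_tot.min()) / (X_tot.max() - X_tot.min())
```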
Once the data had been preprocessed, I defined X_train and X_test, which would be input into the model. X_train is the portion of X_tot up to the length of the train dataset, and X_test is the portion of X_tot from the length of the train dataset to the end. The target, y_train, is the y variable:-
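In code, the split is positional:

```python
# The first len(train) rows of X_tot correspond to the training set,
# the remainder to the test set
X_train = X_tot.iloc[:len(train)]
X_test = X_tot.iloc[len(train):]
y_train = y
```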
I then selected the model, in this instance sklearn’s BaggingClassifier(), because it is an ensemble classifier and I had not yet used it on the Titanic dataset. I used RandomForestClassifier() as the base estimator and achieved 93.04% accuracy when I fitted the model on the training dataset:-
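A sketch of the model step; in older versions of scikit-learn the keyword is base_estimator rather than estimator:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bagging ensemble that trains several random forests on bootstrap samples
model = BaggingClassifier(estimator=RandomForestClassifier(), n_estimators=10)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # accuracy on the training data
```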
I made predictions on the test dataset:-
Once the predictions had been made, I put them in a dataframe, along with the test dataset’s “PassengerId” column.
I then submitted the predictions to Kaggle in the format they require:-
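The prediction and submission steps together look roughly like this; Kaggle expects a two-column CSV with PassengerId and Survived:

```python
# Predict on the held-out test rows and build the submission file
predictions = model.predict(X_test)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions.astype(int),
})
submission.to_csv("submission.csv", index=False)
```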
I achieved 75.36% accuracy using the algorithm that I had created for this post. This is not my best score, but it is not the worst either:-
When I was writing this post, I went back and removed the “Parch” and “SibSp” columns from the feature selection, and this had the effect of improving the accuracy by 2 points. These two variables therefore do not add to the accuracy of the model and should be removed from the feature selection:-
The code for this blog post can be found in its entirety in my personal Kaggle account, being here:- Titanic — Bagging Classifier | Kaggle