Hello World with Kaggle

“We’re making data science into a sport”

Data science and machine learning are popular subjects these days. You may already have some experience with them, but how can you take your skills to the next level? In this article, I would like to introduce you to Kaggle, a powerful platform for improving your data science and machine learning skills. The next question might be: why Kaggle? What’s so fun about it?

· It has a leaderboard that will show your ranking among your competitors

· You can cooperate with anyone to learn from each other

· You can improve or learn new skills from shared scripts in forums

· It adds recognition to your data science profile by ranking you based on your past competition performance

· It has a very friendly and helpful community

· Some competitions have big prize pools where you can earn cash prizes

As the title suggests, this article is for beginners. I would like to guide you through your first Kaggle competition, where you will learn how to create a model that predicts the answer to the given problem and submit your results to see where you stand among your competitors. The following link contains some of the best competitions for beginners.

https://www.kaggle.com/getting-started/44088

All you need to do is go to https://www.kaggle.com/ and create an account. Then sign in and click the above link, which lists some of the best competitions for beginners. Next, go to https://www.kaggle.com/c/titanic to start your first competition. Before diving into the competition, here is a brief overview. The competition is based on the Titanic dataset. All you need to do is create a model that predicts which passengers survived the Titanic shipwreck.

Here the data has been split into two groups:

· Training set (train.csv)

· Test set (test.csv)

The training set includes the answers to this problem (whether each passenger survived), which you use to train the model. The test set is like an exam: it is used to evaluate your model. Based on the score your model gets, you’ll be ranked among competitors from all over the world.

Overview Tab

This is the page you land on after clicking the above link. The Overview tab gives you an idea of what the competition is about, and the Data tab describes the data inside these datasets. Now go to the Notebooks tab and click New Notebook.

Kaggle Notebook

You’ll now land on a Kaggle notebook page, which is similar to Colab or Jupyter notebooks. The notebook title is in the upper left corner of the page. I changed it to “Titanic”, but you can name it whatever you prefer. The notebook starts with the relevant dataset already attached and with imports such as numpy and pandas.

Next, you need to load train.csv, which holds the training data for your model, using the pandas library imported earlier as “pd”. On the right side, you can see tabs like Data, Settings, and Code Help. When you register for this competition, the necessary dataset is imported automatically. The Settings tab includes a language option where you can choose either Python or R. If you don’t know the code for reading a csv file, you can always ask Code Help, which gives you sample code for that particular task.

Code Help

We have now loaded the training data into the train_data variable using the pandas library. You can view its first 5 rows with the head() method.
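As a rough sketch of this loading step, the snippet below reads train.csv with pandas and prints the first rows. The folder /kaggle/input/titanic is assumed to be where a Kaggle notebook mounts the competition files, and load_csv is a hypothetical helper name used only for illustration.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the competition files inside a Kaggle notebook.
DATA_DIR = Path("/kaggle/input/titanic")


def load_csv(name, data_dir=DATA_DIR):
    """Read one competition file (hypothetical helper) into a DataFrame."""
    return pd.read_csv(Path(data_dir) / name)


# Only runs where the competition data is actually attached.
if (DATA_DIR / "train.csv").exists():
    train_data = load_csv("train.csv")
    print(train_data.head())  # head() returns the first 5 rows by default
```

Outside Kaggle, you can point data_dir at whichever folder holds your downloaded copy of the files.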

Load Test Data

Now you need to load test.csv. If you look carefully, this dataset doesn’t contain the “Survived” column, because that is what you need to predict.
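One way to confirm which column is missing is to compare the column names of the two datasets. The sketch below uses two tiny made-up frames in place of the real files, and columns_to_predict is a hypothetical helper name; in the notebook you would pass the DataFrames loaded from train.csv and test.csv instead.

```python
import pandas as pd


def columns_to_predict(train_data, test_data):
    """List columns present in the training data but absent from the test data."""
    return [c for c in train_data.columns if c not in test_data.columns]


# Made-up rows that only mimic the shape of the real files.
train_sample = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
test_sample = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 3]})

print(columns_to_predict(train_sample, test_sample))  # ['Survived']
```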

Model fitting

For this problem, I’ll use “RandomForestClassifier” from the scikit-learn machine learning library. This is my approach; you can try different algorithms to improve your predictions. I take the target value “Survived” from the training dataset and assign it to the variable “y”. I have then chosen some columns that are likely to have an impact on the prediction to use as features.
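The target and feature selection could look roughly like the sketch below. The feature list is an assumption, since the exact columns are not named here; Pclass, Sex, SibSp and Parch are a common starting point for this dataset. The helper name split_features_target is also hypothetical.

```python
import pandas as pd

# Assumed feature choice; swap in whichever columns you want to try.
FEATURES = ["Pclass", "Sex", "SibSp", "Parch"]


def split_features_target(train_data, test_data, features=FEATURES):
    """Return (X, y, X_test) in a form a scikit-learn classifier can use."""
    y = train_data["Survived"]  # target: 1 = survived, 0 = did not survive
    # One-hot encode text columns such as Sex so the model sees only numbers.
    X = pd.get_dummies(train_data[features])
    X_test = pd.get_dummies(test_data[features])
    # Give the test features the same columns, in the same order, as X.
    X_test = X_test.reindex(columns=X.columns, fill_value=0)
    return X, y, X_test
```

The reindex step matters when a category appears in one file but not the other, because the model expects identical columns at training and prediction time.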

Data Overview

If you don’t understand any of the variables used in this dataset, you can always go to the Data tab of this competition to learn more about them.

Finally, I fit the model on the training features and the target value, so that the random forest classifier learns to predict survival. We then call the predict() method on the model to predict the values for the test data. In the end, I save the predictions and the passenger id column to the variable output and export it as a csv file named my_submission.csv.
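A minimal sketch of the fit, predict and export steps is shown below. The hyperparameters (n_estimators, max_depth, random_state) are assumptions picked for illustration, not tuned values, and the two function names are hypothetical. The inputs X, y and X_test are expected to be numeric feature tables and the “Survived” target prepared beforehand.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def fit_and_predict(X, y, X_test):
    """Fit a random forest on the training data and predict for the test data."""
    # Assumed settings; random_state makes the result reproducible.
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, y)
    return model.predict(X_test)


def save_submission(passenger_ids, predictions, path="my_submission.csv"):
    """Write the two-column file: PassengerId and Survived."""
    output = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
    output.to_csv(path, index=False)  # index=False keeps the row index out of the file
    return output
```

In the notebook this would be called along the lines of save_submission(test_data["PassengerId"], fit_and_predict(X, y, X_test)), assuming the test data was loaded into a variable named test_data.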

Save Version

After the export is done, click the Save Version button to save your work with the current changes. Give the version a name, which will help you keep track of which version you are using.

Summary of Version 1

Then click the show versions button next to Save Version (the small square showing a number) to see the summary of that version. Then click the Go to viewer button.

Submission

Then scroll down a little until you see the Submit button. Click it to submit your work. After you successfully submit, you’ll get a score for your model. You can now see your rank on the leaderboard as well.

Score & Leaderboard

Now it’s your turn to improve your score. Apply your skills to push this score further and earn a place near the top. You can make another version of your work and submit it again to see your new score. Hope you enjoyed this article. Thank you.
