This was a data science internship assignment done as a team with the following members — Olufunmilayo Aforijiku, Sooter Saalu, Amoo Eno, Micky Nnamdi, Fiyinfoba Ogunkeye, Ekemini Umanah, Sophia Jack, Adu Aanuoluwapo, Patrick Ogbonna, Oluwasayo Akinkunmi, Chukwuemeka Omeh, Echefu Charles, Toluwanimi Olorunnisola, Adeola Abiola, Samuel Adeapin, Gloria Agunanna
Link to the GitHub project
Tools Used: Python, Tableau, Git
In many countries, it is common to shop for a used car rather than buy a new one, a decision mostly informed by the financial cost of new vehicles or, sometimes, the belief that used cars work better or last longer than new ones. However, there is always ambiguity in the price of the car in question: the resale price differs from seller to seller, and some sellers expect or require long negotiation before the final sale. Beyond these subjective factors, physical features such as age, manufacturer, model, and mileage also play a part in the price of a used car.
This project looks at predicting the price of used cars based on these features.
The dataset used in this project comes from Kaggle and contains data scraped from Craigslist, an American classified advertisements website with one of the world's largest collections of used vehicles for sale.
The dataset held information on 458,213 cars across 25 columns describing each car's features, the location it was being sold from, and when it was advertised for sale.
As most of the information in the dataset was user-entered, there were a large number of missing values and erroneous inputs, requiring extensive cleaning.
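To get a sense of the gaps before cleaning, a quick pandas summary is enough. The sketch below is illustrative: the file name vehicles.csv follows the Kaggle dataset's convention but should be treated as an assumption.

```python
import pandas as pd

# Load the scraped Craigslist listings (file name assumed from the Kaggle dataset)
df = pd.read_csv("vehicles.csv")

# Count missing values per column, worst first, with the share of rows affected
missing = df.isna().sum().sort_values(ascending=False)
print(pd.DataFrame({"missing": missing, "pct": (missing / len(df) * 100).round(1)}))
```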
Our cleaning process began with an in-depth exploration of the dataset's features, which revealed that the description column contained, in an unstructured format, some of the missing information on a car's condition as well as other physical features.
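As a rough illustration of that extraction, a keyword search over the description text can recover a condition label when the structured column is empty. The keyword list and regex below are assumptions for the sketch, not the team's exact rules, and df continues from the loading snippet above.

```python
import re
import numpy as np

# Condition keywords to look for in free-text descriptions (assumed list;
# "like new" is listed before "new" so the longer phrase matches first)
CONDITIONS = ["excellent", "like new", "good", "fair", "salvage", "new"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONDITIONS)) + r")\b", re.IGNORECASE)

def condition_from_description(text):
    """Return the first condition keyword found in a listing description, else NaN."""
    if not isinstance(text, str):
        return np.nan
    match = pattern.search(text)
    return match.group(1).lower() if match else np.nan

# Fill missing 'condition' values from the description where a keyword appears
extracted = df["description"].apply(condition_from_description)
df["condition"] = df["condition"].fillna(extracted)
```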
Extractions like this made for a meticulous cleaning process: we filled missing values with information from the description column where available and dropped rows and columns where needed. Outliers and erroneous inputs were also removed using quartiles and granular data exploration. At the end of this process, we were left with a clean dataset of 186,923 rows out of the initial 458,213 entries.
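The quartile step follows the standard interquartile-range rule. Here is a minimal sketch, assuming price and odometer are the columns being filtered and the usual 1.5x multiplier; the exact columns and cutoffs in the project may have differed.

```python
def drop_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

# Filter the numeric columns most prone to error inputs (columns assumed)
for col in ["price", "odometer"]:
    df = drop_iqr_outliers(df, col)

print(f"{len(df):,} rows remain after outlier removal")
```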
Our next step was to analyze the dataset, exploring its features and how each relates to price.
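A first pass at that analysis might look at how the numeric features correlate with price, and how the median price varies across one categorical feature. The column names below again follow the Kaggle dataset and are assumptions.

```python
# Correlation of each numeric feature with price, strongest first
numeric_corr = df.select_dtypes("number").corr()["price"].sort_values(ascending=False)
print(numeric_corr)

# Median price by manufacturer, as one categorical cut of the data
print(df.groupby("manufacturer")["price"].median().sort_values(ascending=False).head(10))
```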