To simplify the manipulation of the data, we needed to combine the data provided by Gertrude into a single unified dataset (a loading sketch follows the feature list below). Since the way the dataset is modelled has a direct impact on the prediction, choosing the right structure matters. The features of the dataset are:
- date (periode) “Example: 28/01/2021”
- zone (Bordeaux Metropole is divided into 37 zones by Gertrude) “Example: 24”
- sensor_type (Gertrude uses two types of sensors: a CT sensor that counts the number of cars stopping at a red light, and an LQ sensor that measures the waiting time of each car at a red light)
- sensor (every sensor is identified by a number) “Example: CT9” (this means we’re using CT sensor number 9)
- 288 values (val_001…val_288) (each value represents a 5-minute interval)
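As a rough illustration, the files could be combined with pandas as follows; the directory name and exact column layout are assumptions, not the actual Gertrude export format.

```python
import glob
import pandas as pd

# Hypothetical layout: one CSV per Gertrude export, all sharing the
# same columns (periode, zone, sensor_type, sensor, val_001 ... val_288).
files = sorted(glob.glob("gertrude_exports/*.csv"))
frames = [pd.read_csv(path) for path in files]

# Concatenate everything into a single unified dataset.
data = pd.concat(frames, ignore_index=True)
print(data.shape)
```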
As you can see in the picture, the first line, for example, corresponds to January 18th, 2019, zone 27 of Bordeaux, and the CT sensor; its 288 values give the number of cars in zone 27 on that day, in 5-minute intervals.
After choosing the dataset, we move on to Data Preprocessing. The goal of this step is to encode the dataset in a form that can be interpreted and parsed by the algorithm. We decided to derive the following date features (see the sketch after the list):
- year
- month
- day
- dayOfWeek (an integer from 0 to 6)
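A minimal sketch of this encoding, assuming the unified DataFrame from above and a date column named periode in day/month/year format:

```python
import pandas as pd

# Parse the French-style date string (e.g. "28/01/2021").
data["periode"] = pd.to_datetime(data["periode"], format="%d/%m/%Y")

# Derive the four calendar features used by the model.
data["year"] = data["periode"].dt.year
data["month"] = data["periode"].dt.month
data["day"] = data["periode"].dt.day
data["dayOfWeek"] = data["periode"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
```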
After encoding the features, the resulting dataset looks like this:
Another part of Data Preprocessing is to split the data, for two reasons. The first reason is to differentiate between features and targets; in our case the first split is:
Using this split we can start thinking about our model, but before that there is a second reason to split the data: dividing it into training data and testing data. We decided to allocate 80% of the data for training and the remaining 20% for testing (we will explain later what training and testing data mean).
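With scikit-learn, both splits could look like the sketch below; the exact feature columns are an assumption based on the encoding above (the sensor identifier, being a string, would need its own encoding and is left out here).

```python
from sklearn.model_selection import train_test_split

# First split: features (calendar + zone) vs. targets (the 288 values).
value_cols = [f"val_{i:03d}" for i in range(1, 289)]
X = data[["year", "month", "day", "dayOfWeek", "zone"]]
y = data[value_cols]

# Second split: 80% for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```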
The Random Forest model is one of the most popular machine learning models: it is flexible, produces highly accurate results, and can be trained on large amounts of data. The Random Forest Regressor is used in popular fields such as banking and stock market prediction.
So how does this model work? There is no better way to explain it than with the example I used to understand it for the first time: the field-of-study example.
As you can see in the example, we have a decision tree where each question narrows down the range of possible values until the model is confident enough to make a single prediction.
In this example we start with the first question, about the gender of the student. If he is male, we move on to the next question, about his GPA. If his GPA is over 3, we need to know whether he is taking physics: if the answer is “Yes”, then he is surely an Engineer; if not, the model is confident enough to say that he is a Computer Science student (welcome to the team). The same goes for every path of the decision tree, until the model is confident enough to make a prediction.
The Random Forest Regressor is an ensemble of decision trees: many trees, constructed in a certain “random” way, form a Random Forest.
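In scikit-learn this model is available as RandomForestRegressor; a minimal training sketch reusing the split above (the hyperparameters shown are library defaults, not values from the project):

```python
from sklearn.ensemble import RandomForestRegressor

# 100 trees is the scikit-learn default; each tree is fit on a random
# bootstrap sample of the training data, hence the "random" forest.
model = RandomForestRegressor(n_estimators=100, random_state=42)

# y_train has 288 target columns; the regressor handles
# multi-output targets directly.
model.fit(X_train, y_train)
```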
The training phase is the phase where our model learns and assimilates information. In this project, we used supervised machine learning, which means that the data given to the model is “labelled”; in other words, the data is already tagged with the correct answer.
In our case, the input data describes the CT sensor, and the label, or correct answer, is the set of values registered by this sensor during one day. Our model was trained on data from all the city’s sensors, collected from January 2019 to October 2020. This step is very important because, by learning from a labeled training dataset, the model can predict outcomes for unseen data.
As already mentioned in the Data Preprocessing part, we reserve 20% of the data for model evaluation; this is called the test data. The test data is “new” data that the model will not come across during training: only the remaining 80% of the data is used during training.
This test dataset helps us measure the performance of the model, and the Random Forest Regressor gave us good results on it. The test precision is the evaluation computed on the test data: we obtained 85% precision, which is good for a subject as complicated as traffic flow.
In order to evaluate the model, we need to go through some steps.
- validation step: once the model is trained, predictions are made on the test dataset. The results are then compared to the real outputs of the test split.
- accuracy: the accuracy of a machine learning model can be computed from its performance on the test dataset. The accuracy of our model reaches up to 85% (one way to compute such a score is sketched after this list).
- testing the model on future data: finally, the correctness of the model can be evaluated by predicting the outputs of upcoming days and comparing the results to the data collected on those days.
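A sketch of the validation and accuracy steps, reusing the split and model from above; the R² score used here is one plausible metric, not necessarily the exact one behind the 85% figure:

```python
# Validation: predict on the held-out 20% and compare to the real outputs.
y_pred = model.predict(X_test)
print(y_pred[0][:5], y_test.iloc[0, :5].values)  # spot-check one row

# Accuracy: score() returns the R^2 coefficient of determination; a value
# around 0.85 would correspond to the ~85% figure quoted above.
print("test score:", model.score(X_test, y_test))
```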
We can also evaluate the prediction by comparing it with the average of all previous data for the same area, the same CT sensor, and the same day of the week.
For this comparison, we choose one of the rows in the data frame grouping the prediction results. Then we retrieve the day of the week from this row via the “dayofweek” column, and finally we compare the shape of our prediction with the average calculated from the input data. The last graph is a comparison example of the CT21 prediction for a Wednesday: the orange curve represents the Wednesday average for CT21 in zone 10, and the blue curve represents the prediction of CT21 in zone 10. The same applies to CT3 on a Saturday and CT2 on a Sunday in zone 10.
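A hedged sketch of that comparison, reusing the names from the earlier sketches; prediction_df is a hypothetical DataFrame holding the model’s predicted rows:

```python
import matplotlib.pyplot as plt

# Historical Wednesday average for sensor CT21 in zone 10
# (column names are assumptions based on the dataset description;
# dayOfWeek == 2 is Wednesday, since Monday is 0).
mask = (data["sensor"] == "CT21") & (data["zone"] == 10) & (data["dayOfWeek"] == 2)
average = data.loc[mask, value_cols].mean()

# One predicted day for the same sensor/zone, taken from a hypothetical
# DataFrame of predictions with the same val_001..val_288 columns.
predicted = prediction_df.loc[0, value_cols]

plt.plot(average.values, label="Wednesday average (CT21, zone 10)")
plt.plot(predicted.values, label="prediction (CT21, zone 10)")
plt.legend()
plt.show()
```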
- The difference in nature between CT data and LQ data.
- The failure of some sensors in some areas, which means data may be missing from some files.
- Unpredictable events, such as accidents or bad weather conditions.
- Data collected during lockdowns is not the same as data collected under normal circumstances (COVID-19 consequences such as curfews and lockdowns).
The objective of our project is to create a model capable of making a relevant prediction over a number of future days, based on the input data described previously. In our script we specify over how many days we wish to carry out the prediction.
The end result is finally written to a file in CSV format. Here is what the result looks like (the full file being too long, we have extracted parts of it):
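A sketch of how such a file could be produced, building calendar features for the next few days and writing the model’s output to CSV (the dates, zone, and file name below are illustrative assumptions):

```python
import pandas as pd

# Hypothetical: build the calendar features for the next n_days
# and let the trained model predict the 288 values for each day.
n_days = 7
future = pd.DataFrame({"date": pd.date_range("2020-11-01", periods=n_days)})
future["year"] = future["date"].dt.year
future["month"] = future["date"].dt.month
future["day"] = future["date"].dt.day
future["dayOfWeek"] = future["date"].dt.dayofweek
future["zone"] = 10  # one zone at a time in this sketch

preds = model.predict(future[["year", "month", "day", "dayOfWeek", "zone"]])
result = pd.concat([future, pd.DataFrame(preds, columns=value_cols)], axis=1)

# Write the final result to a CSV file.
result.to_csv("prediction.csv", index=False)
```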