Machine learning models output numerical predictions and have little to show beyond confusion matrices. The best way to evaluate our models' performance is to look at different accuracy scores.

**Classification Accuracy** is what we usually mean when we use the term accuracy. It is probably the most straightforward and intuitive metric for classifier performance: the ratio of the number of correct predictions to the total number of input samples. It works well only if each class contains a similar number of samples.
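As a minimal sketch, accuracy can be computed directly from predicted and true labels (the label lists below are made up for illustration, not my survey data):

```python
# Hypothetical true labels and classifier predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = correct predictions / total samples
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 6 correct out of 8 -> 0.75
```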

**F1 Score** is the harmonic mean of precision and recall. Its range is [0, 1]. It tells you how precise your classifier is (how many of its positive predictions are correct), as well as how robust it is (it does not miss a significant number of instances).

**Area Under Curve (AUC)** is one of the most widely used metrics for evaluation. It is used for binary classification problems. The *AUC* of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.
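That probabilistic definition can be computed directly, as a sketch, by comparing every positive score against every negative score (the scores below are invented for illustration; ties conventionally count as half a win):

```python
from itertools import product

# Made-up true labels and classifier scores
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

pos = [s for y, s in zip(y_true, scores) if y == 1]
neg = [s for y, s in zip(y_true, scores) if y == 0]

# AUC = P(random positive scored higher than random negative)
wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))
print(auc)  # 8 of 9 pairs ranked correctly -> 0.888...
```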

**Recall** is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive), which means the percentage of truly positive instances that were classified as such.

**Precision** is the number of correct positive results divided by the number of positive results predicted by the classifier, which means the percentage of positive classifications that are truly positive.
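Precision, recall, and F1 all fall out of the confusion-matrix counts. A minimal sketch with made-up labels:

```python
# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)          # of the predicted positives, how many are right
recall = tp / (tp + fn)             # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)
```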

**Mean Absolute Error (MAE)** is the average of the absolute differences between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output.

**Mean Squared Error (MSE)** is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the **square** of the difference between the original values and the predicted values. The advantage of MSE is that it makes the gradient easier to compute.

**Root Mean Squared Error (RMSE)** is more appropriate than the MAE for representing model performance when the error distribution is expected to be Gaussian. It avoids the use of the absolute value, which is highly undesirable in many mathematical computations.

**R-squared (R²)** is the percentage of the response variable's variation that is explained by a linear model. Its maximum value is 1, but it may take negative values.
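These four regression metrics can be sketched from their definitions in a few lines (the values below are toy numbers, not my model's outputs):

```python
import math

# Hypothetical original values and model predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n          # mean absolute error
mse = sum(e * e for e in errors) / n           # mean squared error
rmse = math.sqrt(mse)                          # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
print(mae, mse, rmse, r2)
```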

**Variance inflation factor (VIF)** is a measure of the amount of multicollinearity in a set of multiple regression variables. It is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
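A hedged sketch of the calculation: the VIF of predictor j is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The data below is synthetic, with one deliberately collinear pair:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1 -> high VIF
x3 = rng.normal(size=200)              # independent -> VIF near 1
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns via least squares."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)  # first two are large, third stays close to 1
```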

All in all, it looks like the Artificial Neural Network model, the Multi-Layer Perceptron, reached the best results on all accuracy scores, and it has the lowest error rates (MAE, MSE). However, the VIF for this model is above 5, which means my model also has the highest multicollinearity level.

I will keep working with this one, trying to find a way to make its predictions even better.

**Machine learning works great.** But after going through this workflow, and given that the model results look sensible, I have the feeling that something is missing to answer my research question. As you might have noticed, one severe shortcoming is the lack of any account of the model's certainty and its confidence in the output.

Inferential statistics uses data to learn about the population that the sample is thought to represent, in my case the highly engaged citizens of a Smart-City. With statistical inference, it is possible to reach conclusions that extend beyond the actual dataset.

While machine learning models are used to make predictions within the dataset, probabilistic models are used to make predictions beyond it, using the probability of an event occurring and statistical hypothesis testing.

In this third part of my study, I'll use probabilistic models and the analysis of variance (ANOVA) to frame my predictions about highly engaged citizens as accurately as possible. Whatever the level of assumption made, a correctly calibrated inference generally requires some assumptions to hold. I will present them first; then I will be able to compare my actual results with the predicted ones. If both are very close, my inferences are right and I will be able to validate my ranking model at a wider scale.

Estimating probabilities with a probabilistic model means formulating my problem like this: I am investigating the citizens of a city in search of highly engaged citizens. I know that the overall population of a city can be divided into highly engaged and non-highly engaged citizens, but I don't know how many of each there are. By conducting a survey on a random sample, I found that highly engaged citizens make up approximately 4.4% of my dataset (16 out of 366 respondents). Assuming that this class of citizen had an equal chance of appearing in my sample, I want to estimate their prevalence in the whole population of Smart-City citizens.

At least two questions come to mind:

– How can I be sure that my sample is representative of the whole population? I need to include uncertainty in my estimation, considering the limited data.

– How can I incorporate prior beliefs about highly engaged citizens into this estimation?

**The inferential statistics method called Bayesian inference allows me to express uncertainty and prior beliefs.** To solve this problem with a Bayesian model, I need to assume that:

– The chances of reaching a highly engaged citizen are independent of each other (I am not spreading my survey within some niche of engaged citizens).

– Any citizen can potentially be highly engaged and match my definition (there is no bias that would reserve this class for a niche population).
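Under these assumptions, a standard Bayesian sketch of the update is a Beta-Binomial model: with a uniform Beta(1, 1) prior on the prevalence and 16 highly engaged citizens among 366 respondents, the posterior is Beta(1 + 16, 1 + 350). The choice of a uniform prior here is my illustrative assumption, not something fixed by the data:

```python
k, n = 16, 366                 # observed highly engaged citizens / sample size

# Beta(1, 1) uniform prior updated with k successes and n - k failures
alpha_post = 1 + k
beta_post = 1 + (n - k)

posterior_mean = alpha_post / (alpha_post + beta_post)       # 17 / 368
posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)  # 16 / 366
print(round(posterior_mean, 4))  # 0.0462
print(round(posterior_mode, 4))  # 0.0437
```

The posterior mode recovers the raw sample proportion, while the posterior mean is pulled very slightly toward the prior, which is exactly the "incorporating prior beliefs" behaviour the second question asked for.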

Since my population can be divided into two classes of citizens, the highly engaged and those who are not, the probability distribution of all outcomes follows a binomial model. In probability theory, the binomial distribution with parameters **n** and **p** is the discrete probability distribution of the number of successes in a sequence of **n** independent experiments, each asking a yes-no question and each with its own Boolean-valued outcome: success (with probability **p**) or failure (with probability q = 1 − p). In my problem, **p** is the ultimate objective: I want to figure out the probability of meeting highly engaged citizens among all Smart-City citizens, from the observed sample data. In statistics, a single success/failure experiment follows a **Bernoulli distribution**; a sample of size **n** drawn with replacement from a population of size N is then described by the binomial distribution.

My sampling distribution helps to estimate the population statistic. The system of interest, where a population of citizens is divided into two discrete classes (highly engaged and non-highly engaged citizens) and 366 independent respondents are drawn, has the probability mass function of the binomial distribution shown below:
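As a numeric sketch of that PMF, with n = 366 and p set to my observed proportion (~4.4%):

```python
from math import comb

n, p = 366, 16 / 366   # sample size and observed proportion of engaged citizens

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability mass around the observed count of 16 highly engaged citizens
probs = {k: binom_pmf(k, n, p) for k in range(10, 23)}
peak = max(probs, key=probs.get)
print(peak)  # the distribution peaks at the observed count, 16
```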

The **Central Limit Theorem** states that, no matter the shape of the population distribution, the sampling distribution of the mean approaches a normal distribution as the number of samples grows. This gives us a mathematical advantage for estimating the population statistic. The number of samples has to be sufficient (generally more than 50) to satisfactorily achieve a normal curve. Also, care has to be taken to keep the sample size fixed, since any change in sample size changes the shape of the sampling distribution, and it will no longer be bell-shaped. As we increase the sample size, the sampling distribution narrows from both sides, giving us a better estimate of the population statistic, which generally lies somewhere in the middle of the sampling distribution.
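The theorem can be checked by simulation. The sketch below assumes a true prevalence of 4.4% and repeatedly draws samples of a fixed size of 366, recording the sample proportion each time; the proportions cluster in a bell shape around the true value:

```python
import random

random.seed(1)
p_true = 0.044          # assumed true prevalence of highly engaged citizens
n = 366                 # fixed sample size, as the theorem requires

# Draw many independent samples and record the sample proportion each time
means = []
for _ in range(2000):
    sample = [random.random() < p_true for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / len(means)
print(round(grand_mean, 3))  # close to the true prevalence of 0.044
```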

As for my study, the proportion of 4.4% of highly engaged citizens in a random sample of a city population is already a mean computed from my 3 case studies. The Central Limit Theorem allows me to assume that if I investigated 100 Smart-Cities around the world with the same data-collection protocol, the mean of the proportions of highly engaged citizens would stay close to 4.4%.

Since they are normally distributed, the means of the proportions of highly engaged citizens in Smart-City populations fall inside a confidence interval to be determined by our hypothesis testing.
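As a sketch of what that interval looks like under the normal approximation, a 95% confidence interval for my observed proportion (16 of 366) would be computed as follows; z = 1.96 is the standard 95% critical value, and the specific survey numbers are taken from the text above:

```python
import math

p_hat = 16 / 366                        # observed proportion of engaged citizens
n = 366
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
z = 1.96                                 # 95% confidence level
ci = (p_hat - z * se, p_hat + z * se)
print(round(ci[0], 3), round(ci[1], 3))  # roughly 2% to 6.5%
```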