How to Present Machine Learning Results to Non-Technical People

Methodology 1

This methodology applies to any model that generates probability scores between 0 and 1.

First, sort your model scores from high to low and split them into deciles (ten equal-sized groups). Decile 1 contains the highest scores and decile 10 contains the lowest.
Next, calculate the minimum, median, and maximum score for each decile.
Finally, count the true positives in each decile and divide each count by the total number of true positives in your scoring population.
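
The steps above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than the code used to build the table in this article: the function name `decile_table`, the field names, and the synthetic scores and labels are all assumptions made for the example. It expects at least ten scored rows and at least one true positive.

```python
import random
from statistics import median


def decile_table(scores, labels, n_groups=10):
    """Summarize scores and true positives by decile (decile 1 = highest scores)."""
    # Sort (score, label) pairs from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_positives = sum(label for _, label in ranked)

    rows = []
    for d in range(n_groups):
        # Slice boundaries that split the ranked list into near-equal groups.
        start, end = d * n // n_groups, (d + 1) * n // n_groups
        group = ranked[start:end]
        group_scores = [score for score, _ in group]
        positives = sum(label for _, label in group)
        rows.append({
            "decile": d + 1,
            "min": min(group_scores),
            "median": median(group_scores),
            "max": max(group_scores),
            "positives": positives,
            "pct_of_total": positives / total_positives,
        })
    return rows


# Synthetic data for illustration only: a label of 1 (purchase) is more
# likely when the score is high. These are not the article's scores.
rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]
labels = [int(rng.random() < s) for s in scores]

for row in decile_table(scores, labels):
    print(row)
```

Each printed row corresponds to one line of a decile table: the score range for the decile, the number of true positives, and that decile's share of all true positives.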

Below is an example of the output using this methodology on a sample set of purchase propensity model scores. The percentages in the “% of Total Purchases” column were calculated by dividing the purchases in each decile by the 31,197 total purchases.

Purchases by score decile — Table by author

These are the key takeaways from this table.

The first and second deciles contain 33% and 32% of the purchases, respectively, for a total of 65%. This shows the model placed a majority of the customers who purchased in the top two deciles.
Each decile contains a lower percentage of total purchases than the decile above it. This is the trend we expect from a good model: most of the purchases are in the top deciles, while the bottom deciles contain very few. Notice that decile 10 contains only 2% of the purchases, compared with 33% in decile 1.
The scores in decile 3 range from 0.239 to 0.555. Decile 3 is where the model becomes less likely to predict customer purchases accurately.

I’ve seen data science presentations with model results that were beyond the grasp of non-technical people.

However, if you present model results with a table showing that the top 20% of customers captured 65% of purchases, that’s easy for your stakeholders to understand.

Methodology 2

This second approach is similar to the first, and it also makes the model results easier to evaluate as a data scientist.

Instead of deciles, bin model scores into one-tenth increments between 0 and 1.
Calculate the minimum, median, and maximum score value for each bin.
Count the true positives in each bin and divide each count by the total number of true positives.
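
The binning steps above can be sketched the same way. Again, this is only an illustration under stated assumptions: the function name `score_bin_table`, the field names, and the six made-up scores and labels are hypothetical and are not the data behind the table in this article. The sketch assumes every score lies between 0 and 1 and that there is at least one true positive.

```python
from statistics import median


def score_bin_table(scores, labels, n_bins=10):
    """Summarize scores and true positives by fixed score range, highest range first."""
    total_positives = sum(labels)
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Scores in [0.0, 0.1) go to bin 0, [0.1, 0.2) to bin 1, and so on.
        # A score of exactly 1.0 is placed in the top bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    rows = []
    for index in range(n_bins - 1, -1, -1):
        members = bins[index]
        if not members:
            continue  # skip score ranges that contain no customers
        bin_scores = [score for score, _ in members]
        positives = sum(label for _, label in members)
        rows.append({
            "range_low": index / n_bins,
            "min": min(bin_scores),
            "median": median(bin_scores),
            "max": max(bin_scores),
            "positives": positives,
            "pct_of_total": positives / total_positives,
        })
    return rows


# Made-up scores and purchase labels, for illustration only.
example_scores = [0.95, 0.91, 0.55, 0.15, 0.12, 0.05]
example_labels = [1, 1, 0, 1, 0, 0]

for row in score_bin_table(example_scores, example_labels):
    print(row)
```

Unlike deciles, these bins can hold very different numbers of customers, so it may help to add a count per bin if you show this view to stakeholders.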

Below is an example of the output using the same set of sample purchase propensity scores.

Purchases by score range — Table by author

These are the takeaways from this output binned by score value.

Using 0.5 as the threshold for predicting a purchase, the model captured 67% of the total purchases, which is the sum of the highlighted percentages in the “% of Total Purchases” column.
However, customers with a score between 0.1 and 0.19 account for 20% of total purchases. This indicates there may be additional features that could improve the model, and it warrants further analysis. If we had seen 2% instead of 20%, we could be more confident the model was capturing the relevant features needed to predict purchases accurately.
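
Both takeaways come down to the same calculation: the share of all true positives whose score falls in a given range. A minimal sketch follows, assuming a hypothetical helper named `share_of_positives` and made-up scores and labels, so the numbers it produces are not the 67% and 20% from the table.

```python
def share_of_positives(scores, labels, low, high=None):
    """Share of all true positives with low <= score < high (no upper bound if high is None)."""
    total_positives = sum(labels)  # assumes at least one true positive
    in_range = sum(
        label
        for score, label in zip(scores, labels)
        if score >= low and (high is None or score < high)
    )
    return in_range / total_positives


# Made-up scores and purchase labels, for illustration only.
demo_scores = [0.92, 0.71, 0.55, 0.15, 0.12, 0.05]
demo_labels = [1, 1, 0, 1, 0, 0]

# Share of purchases at or above the 0.5 threshold.
captured = share_of_positives(demo_scores, demo_labels, low=0.5)

# Share of purchases sitting in a low score band, which the model missed.
missed_band = share_of_positives(demo_scores, demo_labels, low=0.1, high=0.2)

print(f"captured at threshold 0.5: {captured:.0%}")
print(f"in the 0.1 to 0.2 band: {missed_band:.0%}")
```

A large share in a low score band is the signal described above: those customers purchased even though the model scored them as unlikely to.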

This output shows stakeholders that your model captures 67% of the purchases accurately. It also helps you identify where the model is not predicting true positives and shows areas you can research further, as in the case of the customers with a score between 0.1 and 0.19.

Conclusion

As data scientists, our impulse is to show the raw model results, but often we need to transform the output into a form stakeholders can understand. Now that you’ve seen my approaches, I hope you find it easier to articulate your model results for everyone to understand.
