We have entered a new Big Data revolution, a term that refers to the abundance of accessible data: large quantities of very high-dimensional or unstructured data are constantly being generated and processed.
With all of us chasing whatever is trendy, learning Machine Learning is no different. With over a thousand machine learning courses and video lectures now available on the internet, it is easy to set out to master the field, but there is a real risk of doing it in name only. What nobody mentions in these courses is that the data has been so carefully handpicked that we can write three lines of code and get high accuracy without any understanding of the data. This leaves us with unreasonable expectations: these courses make us dream, hope, and demand more than reality will ever offer. Unfortunately, the approaches used to ensure data quality and credibility are rarely discussed in most machine learning studies.
What is data? What do I need to do with it? And why do I need to understand it?
Data is facts and figures gathered together for comparison or study, or at least that's what Google tells me. Our aim, then, is to build a model that reduces the uncertainty about how a new data point should be labeled. The true power of machine learning arises when we have problems to solve and want to understand the complex mathematical interactions that lead us to solutions, and for that you need an understanding of your data.
Artificial Intelligence and Machine Learning are not magic wands. You can't just train a model and trust that it will perform.
Through all my research projects I have observed one thing quite clearly: the minimum requirement for any machine learning algorithm is a sufficiently large data set that can be partitioned into disjoint training and test sets, or, for smaller data sets, subjected to some reasonable form of n-fold cross-validation, so that the model is trained on one portion and evaluated on another. Usually, 5- or 10-fold cross-validation is adequate to validate most learning algorithms. This form of rigorous internal validation is essential to developing a robust model.
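To make that concrete, here is a minimal sketch of the internal validation described above, using scikit-learn. The dataset and the choice of classifier are placeholders; the point is only the pattern of a disjoint held-out test set plus 5-fold cross-validation on the training portion.

```python
# A minimal sketch of internal validation: disjoint train/test split
# plus 5-fold cross-validation on the training data only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Disjoint training and test sets: the test set is never touched during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)  # placeholder model

# 5-fold cross-validation on the training portion.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final check on the held-out test set.
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```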
I feel that it is especially useful to perform a validation test using an external data source, beyond the usual procedure of internal validation. To ensure reproducibility, this external validation set must also be sufficiently large. I also strongly recommend having the integrity of the data reviewed by a knowledgeable expert; at some point, we need an expert opinion on our predictive model. Just because the data shows some statistical correlation does not mean we can solve any kind of problem with it, and I believe that, in most situations, a domain expert can lead us to a better solution and can judge whether or not our model is sustainable. Between the expectations of what ML should do and the practical reality of its implementation, there may be a small gap, or even a massive chasm.
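Continuing the sketch above, external validation only adds a few lines: score the already-fitted model on data collected from a different source. The file name and label column below are hypothetical placeholders for whatever external source is available; what matters is that the model never sees this data during training or cross-validation.

```python
# A minimal sketch of external validation, assuming `model` was fitted above.
import pandas as pd

# Hypothetical external data set, e.g. from another site, time period, or instrument.
external = pd.read_csv("external_cohort.csv")
X_ext = external.drop(columns=["label"])
y_ext = external["label"]

# Evaluate the fitted model on data from outside the original collection.
print(f"External validation accuracy: {model.score(X_ext, y_ext):.3f}")
```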
Tom Wilde, CEO of Indico, says: "The key thing to remember about AI and ML is that it's best described as a very intelligent parrot. It is very sensitive to the training inputs provided for it to learn the intended task."
Most errors in the ML phase are difficult to recognize and resolve because they do not throw errors or exceptions outright, yet they influence the final results of our project and hence our insights. Therefore, to uncover these hidden mistakes, we need either a deep comprehension of the data or the opinion of a qualified expert to point us toward a better result.
And what have we learned? Probably not a lot; I am not a great writer. Just a friendly reminder that ML is not a TikTok trend to jump on for a little bit of validation. Knowing the core of your problem, the data available, and the various ML models are the first steps toward a viable ML project.