How to forget nothing in your ML project
I was recently told a funny story. A big oil company gave data scientists a problem: monitor whether truck drivers followed safety instructions while transporting gasoline (in particular, whether they smoked in the cabin). After some time, the data scientists presented a solution, a hardware module that looked perfect: it recorded videos, evaluated them using neural nets, and triggered alarms with a very low error rate. The only problem was that the module was too big to be mounted anywhere in the truck’s cabin.
In general, hardly any machine learning model in industry adds value without being integrated into a bigger technical context. Because all decisions and work follow from the requirements for the system, getting the requirements right is absolutely critical. Every forgotten requirement increases the likelihood that you will later have to rebuild part of your solution, or in the worst case rebuild it from scratch. This means wasting effort, not delivering on time, and losing the trust of your customers.
Requirements engineering is an established discipline within the software development process. For software projects, there are also templates and checklists that help you avoid overlooking any stakeholder or requirement. However, machine learning solutions have some particularities not present in other projects:
- everything related to the dataset
- experiment reproducibility
- ethical concerns
- concept drift
- etc.
Thus, it would be nice to have a catalog of requirements for an AI solution. Everything I could find on the web contained only some of the questions you need to ask. So I created a new catalog aiming to be as complete as possible.
Important note: this catalog is complete only when used together with another catalog or template that covers the requirements for a regular software system. Such catalogs have existed for a long time, so there’s no reason to duplicate this work. I recommend the Volere template (old free version, new paid version), but you can also use another option. The catalog presented here is meant to close the gap by adding the machine learning-specific part to common software requirements.
How to use this catalog (and the Volere template) is generally up to the user. The approach that seems pragmatic to me is to go through the questions when launching your DS project (for some requirements, clarifying them in the course of the project may make more sense). Try to find the answer to every question. Answers like “doesn’t matter in our context” are explicitly allowed. For the rest of the items, search for relevant stakeholders, ask them, and document what they tell you.
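If you prefer to track the walkthrough in code rather than in a spreadsheet, the workflow above could be sketched as follows. This is only a minimal illustration: the class names, the status values, the sample questions, and the stakeholder are all hypothetical and are not taken from the catalog itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"                  # nobody has answered the question yet
    NOT_RELEVANT = "not relevant"  # "doesn't matter in our context"
    ANSWERED = "answered"          # a stakeholder's answer is documented


@dataclass
class CatalogItem:
    """One question from a requirements catalog (hypothetical structure)."""
    question: str
    status: Status = Status.OPEN
    stakeholder: Optional[str] = None
    answer: Optional[str] = None

    def mark_not_relevant(self) -> None:
        # Explicitly allowed: the question does not matter in this project.
        self.status = Status.NOT_RELEVANT

    def record_answer(self, stakeholder: str, answer: str) -> None:
        # Document who was asked and what they said.
        self.stakeholder = stakeholder
        self.answer = answer
        self.status = Status.ANSWERED


def open_items(items: List[CatalogItem]) -> List[CatalogItem]:
    """Return the questions that still need a stakeholder and an answer."""
    return [item for item in items if item.status is Status.OPEN]


# Sample questions are invented for illustration only.
items = [
    CatalogItem("Who owns the training data?"),
    CatalogItem("How often must the model be retrained?"),
    CatalogItem("Must experiments be reproducible for audits?"),
]
items[0].record_answer("Data owner (hypothetical)", "The logistics department.")
items[2].mark_not_relevant()

remaining = open_items(items)
for item in remaining:
    print("Still open:", item.question)
```

Anything left in `remaining` is a question for which you still need to find a stakeholder, or which you have consciously postponed to a later project phase.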
You can find the catalog here; it is free to use.
Requests for improvements and changes to this catalog are welcome: please either submit a pull request or open a GitHub issue.