In the famous 2015 Google paper by Sculley et al., “Hidden Technical Debt in Machine Learning Systems”, the authors point out that in Machine Learning (ML) systems, only a small fraction of the code is actual ML code.
As a Product Manager for an AI-driven underwriting product, I have figured out that machine learning code becomes stale faster than milk.
Every now and then, a client comes up with a requirement to add a new variable (data point). For them it is just one new variable, but for the team it can mean a whole new cycle of rebuilding the model, checking for dependencies, and then pushing the model to production.
Now, the main challenges of doing this are:
- There are multiple teams involved: Data Engineering, Data Science, and Machine Learning Engineering (or Development). Data Engineers might be building pipelines to make data accessible, while Data Scientists are focused on building and improving the ML model. Machine Learning Engineers or developers then have to worry about how to integrate that model and release it to production. As soon as multiple people are involved, mistakes are bound to happen.
- All these teams might be using different tools and workflows, which makes the process hard to automate and, in the end, un-auditable.
- Versioning is another challenge. Tools like Git make the job easier, but the problem persists in organizations that don't understand the importance of versioning and dive straight into building and managing machine learning models.
A simple (and arguably the only) solution to the problems above is to bring cross-functional teams together and, using their learnings from Agile and DevOps, deliver an end-to-end ML system. Also, start using tools like Git and Jira to keep track of what was delivered before and what the new upgrade is all about, so that you can roll back in case your dream machine learning model fails in production.
Now that we have understood the problem, let's break it down to bring out a solution.
- Let's start with Data
Data should be free of errors and in the shape and form the Data Scientist is expecting, right?
Let’s try to sort this out.
I always used to think: the more the data, the better the model's predictions would be. But now I know it also creates a lot of dependency on the sources of data. The more sources, the more complex this problem becomes.
So, regardless of which flavour of architecture you have, it is important that the data is easily discoverable and accessible. You can use a data lake architecture, a more traditional data warehouse, a collection of real-time data streams, or a decentralized data mesh architecture, whichever fits your company.
Transparent and accessible data will help data scientists easily figure out the best features in the data.
Now, the data can change along two different axes: structural changes to its schema, and the actual sampling of the data over time.
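The first axis, schema change, can be caught mechanically before it ever reaches the model. A minimal sketch, assuming data arrives as lists of record dicts (the function names are hypothetical, not from any library):

```python
def schema_of(rows):
    """Infer a simple {column: type-name} schema from a list of record dicts."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            schema.setdefault(col, type(val).__name__)
    return schema

def schema_diff(expected, actual):
    """Report columns that were added, removed, or changed type
    between the expected schema and a fresh batch's schema."""
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    changed = sorted(c for c in set(expected) & set(actual)
                     if expected[c] != actual[c])
    return {"added": added, "removed": removed, "changed": changed}
```

Running such a check on every incoming batch turns the client's "just one new variable" into an explicit, reviewable event instead of a silent pipeline break.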
As discussed above, the data needs proper versioning, and drift in the data needs due notice; this can be done by monitoring the model's inputs and results for signs of data drift.
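One simple way to put a number on the second axis, drift over time, is the two-sample Kolmogorov–Smirnov statistic: the largest gap between the empirical distributions of a baseline sample and a fresh sample of the same feature. A self-contained sketch (the `0.2` threshold is an arbitrary assumption you would tune per feature):

```python
import bisect

def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of two numeric samples."""
    a, b = sorted(baseline), sorted(current)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def drifted(baseline, current, threshold=0.2):
    """Flag a feature as drifted when the KS statistic exceeds a threshold."""
    return ks_statistic(baseline, current) > threshold
```

Comparing last month's feature values against this week's with a check like this gives the "due notice" an alert can be hung on, rather than waiting for the model's predictions to quietly degrade.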
We will continue to deep dive and move to Reproducible Model Training in the next article.
Till then, take care and have a great day ahead!
Would love to connect on LinkedIn.