When an error occurs within a system, how do we know what went wrong and how to fix it? The answer lies in observability: the amount of data the system itself gives us about the context of an error, where it occurred, and what happened. In the past, observability was quite simple, because systems weren't as multi-layered and complex as they are today. But as networks and infrastructure grow more complicated, it becomes ever more necessary to improve observability in data science, along with monitoring and its best practices.
Observability is a measure of how much context we have about what is happening inside a system, based on what the system reports to the outside. Imagine that you are a zoologist searching a forest for a never-before-seen species of animal. Your observability would correspond to the amount of evidence you have about where this elusive animal has been sighted: whether it has been seen near rivers or mountains, who has seen it, and what it looks like. Without a high level of observability, you would never be able to find the animal in the dense woods.
Similarly, observability is really traceability: it is what makes it possible for developers and troubleshooters to determine when something has gone wrong, where, and why. Observability in data science is what makes it possible for someone outside of a system to find out what has gone wrong inside of it.
Systems are complicated today. A company may be running private cloud and on-premises servers in a hybrid network architecture, along with dozens or even hundreds of third-party applications, APIs, and plug-ins. When something goes wrong, the infrastructure may not be able to report back where and when it went wrong. In other words, it may not have a high level of observability.
In the past, when systems were simple, it was easy enough to inspect every component of a system to determine where a fault lay. Errors still occurred, and they were often inscrutable; everyone remembers the hex codes thrown up by an old-school "Blue Screen of Death."
But today, systems are so much more complex that greater levels of observability are required. We need to know exactly what is going on within a system to be able to fix it. If we don't, it can take hours or days to determine where a fault lies. And problems such as performance degradation and memory leaks may never be fully tracked down, because it may not be apparent where they originate.
Observability is not a replacement for monitoring, nor is it the same thing, though they do have some crossover. Better observability naturally leads to better monitoring. But observability refers to how transparent the internal state of a system is to the outside, whereas monitoring refers to the act of taking information from within a system and actively analyzing it.
Systems with greater levels of observability will have better monitoring, but without monitoring observability often cannot be acted upon. Systems need to be built with solid observability from the outset, and then the appropriate monitoring tools should be used.
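The distinction between the two can be sketched in a few lines. In this hypothetical example, the system's side of the contract is emitting structured, context-rich events (observability), while a separate routine actively analyzes those events and decides whether to act (monitoring). The event fields and function names are illustrative assumptions, not a standard:

```python
import json
import time

# Observability: the system emits structured events about its
# internal state, with enough context to reconstruct what happened.
def emit_event(component, level, message, **context):
    event = {
        "ts": time.time(),
        "component": component,
        "level": level,
        "message": message,
        **context,
    }
    print(json.dumps(event))  # in practice, shipped to a log pipeline
    return event

# Monitoring: an outside process consumes what was emitted
# and actively analyzes it, e.g. to decide whether to alert.
def check_for_alerts(events, threshold=1):
    errors = [e for e in events if e["level"] == "ERROR"]
    return len(errors) >= threshold

events = [
    emit_event("ingest", "INFO", "batch loaded", rows=5000),
    emit_event("model", "ERROR", "feature column missing", column="age"),
]
alert = check_for_alerts(events)  # an alerting tool would page on True
```

Note that `check_for_alerts` is useless if the system never calls `emit_event`: without observability built in, there is nothing for monitoring to act upon.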
- Always give context to reports. Data can come from anywhere within sprawling, often organically grown systems, so every report should identify its source and surrounding state.
- Prioritize error messages and reporting. Anything that is urgent should always go to a different channel than anything that is mundane.
- Structure logs and events. Logs should be searchable and should carry all the context that is available.
- Maintain unique IDs. Unique IDs should be attached to events throughout the system so that any event can be traced back through every component it touched.
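The practices above can be combined in a minimal sketch using Python's standard `logging` and `uuid` modules: context on every record, severity levels so urgent messages can be routed separately, searchable JSON output, and a unique ID that follows an event through the system. The field names and the `pipeline`/`feature-builder` labels are illustrative assumptions:

```python
import json
import logging
import uuid

# Emit each log record as searchable JSON, carrying whatever
# context was attached to the record.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "component": getattr(record, "component", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One unique ID ties together every log line for this request,
# so the whole event can be traced back later.
trace_id = str(uuid.uuid4())
extra = {"trace_id": trace_id, "component": "feature-builder"}

logger.info("started feature build", extra=extra)
# ERROR-level records are what a router would send to an urgent channel:
logger.error("lookup table missing", extra=extra)
```

Because every record shares the same `trace_id`, a log search for that one value reconstructs the full story of the request, which is exactly the traceability described earlier.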
Through better observability, organizations can reduce the time they spend on troubleshooting, maintenance, and development. But observability needs to be baked into a system from the start, or it can be difficult to achieve.