While “observability” has become a common buzzword in the world of cloud computing, the core principles of this discipline predate the cloud by decades.
The concept, which emerged from control theory and systems engineering, has been steadily shaped by technological advances and the increasing complexity of modern systems. While observability is now most often associated with cloud computing, that is just the newest chapter in a long and interesting history.
So, how did we get from the physical control rooms of the past to the intricate monitoring and tracing systems of modern, distributed cloud-native architectures?
The Genesis of Observability: Control Theory and Beyond
The foundations of observability can be traced back to the mid-20th century and the field of control theory, which focuses on understanding and controlling the behavior of dynamic systems. Control theory examines how to build stable systems, which requires a deep understanding of their inputs and outputs. Here’s a deeper look into key concepts from this field:
State Observation:
- Technical Details: Control theorists were concerned with how to reconstruct the internal state of a system from external measurements alone, a far more challenging task than simply monitoring inputs or outputs. They used mathematical models, state-space representations, and related techniques to infer the inner workings of a system from outside observations (see the sketch after this list).
- Practical Implications: Early feedback loops relied on a human in the loop, but with a firmer grasp of the mathematics, systems could be designed to make those decisions autonomously by continuously adjusting their own inputs.
- Influence on Modern Systems: This concept evolved into modern telemetry and instrumentation, where metrics are gathered from systems to understand their performance and detect anomalies. Modern observability platforms are built on the same idea: fully understanding a system from its external outputs.
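To make this concrete, here is a minimal Python sketch (using numpy) of the classical observability test: a linear system x' = Ax with output y = Cx is observable exactly when the observability matrix has full rank, meaning the internal state can be reconstructed from the outputs alone. The matrices below are illustrative, not drawn from any real system.

```python
import numpy as np

# Observability test for x' = A x, y = C x:
# the system is observable iff the observability matrix
#   O = [C; CA; CA^2; ...; CA^(n-1)]
# has full rank n.

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])  # example dynamics (illustrative)
C = np.array([[1.0, 0.0]])    # we can only measure the first state

n = A.shape[0]
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

print("observability matrix:\n", O)
print("observable:", np.linalg.matrix_rank(O) == n)  # True here
```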
Feedback Control:
- Technical Details: Feedback control systems use measurements to adjust a system’s inputs in order to maintain a desired level of performance. Cruise control in a car works this way: it measures the current speed and adjusts the engine to hold the speed setting.
- Implementation: This involves measuring the output of a system and taking corrective action based on the delta between the desired state and the actual output. These systems are often implemented as a PID (proportional-integral-derivative) controller, whose gain parameters tune how quickly and smoothly the system reacts (a minimal sketch follows this list).
- Practical Implications: Understanding feedback loops is fundamental to building self-regulating systems, and it also provides insight into how systems can self-heal and avoid failures.
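As a concrete illustration, here is a minimal, untuned PID controller in Python. The gains and the toy "plant" model are purely illustrative; a real cruise control would be tuned against actual vehicle dynamics.

```python
class PIDController:
    """Minimal PID controller: computes a corrective input from the
    error between a setpoint and the measured output."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy cruise-control loop (gains are illustrative, not tuned)
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=100.0)  # target speed
speed = 80.0
for _ in range(50):
    throttle = pid.update(speed, dt=0.1)
    speed += throttle * 0.1  # stand-in for real vehicle dynamics
print(f"final speed: {speed:.1f}")  # converges toward 100.0
```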
System Identification:
- Technical Details: System identification focuses on developing models of dynamic systems by analyzing the relationship between inputs and outputs, using statistical and mathematical methods to accurately characterize a complex system’s behavior.
- Implementation: This is typically done with data from sensors and other system telemetry. It involves understanding how the different parts of a system are linked, and the resulting models can be used to improve designs by simulating different scenarios (see the sketch after this list).
- Practical Implications: System identification was originally used to design controllers for physical systems such as aircraft and power plants, but the principles apply to any dynamic system, including modern cloud applications, and can help predict how they will perform under different conditions.
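Here is a hedged sketch of system identification in miniature: generate input/output data from a made-up first-order model, then recover its parameters with least squares. The "true" coefficients exist only to create example data.

```python
import numpy as np

# Fit a first-order discrete-time model  y[k+1] = a*y[k] + b*u[k]
# from observed input/output data using least squares.

rng = np.random.default_rng(0)
true_a, true_b = 0.9, 0.5          # made up, used only to generate data
u = rng.uniform(-1, 1, size=200)   # input signal
y = np.zeros(201)
for k in range(200):
    y[k + 1] = true_a * y[k] + true_b * u[k] + rng.normal(0, 0.01)

# Stack regressors [y[k], u[k]] and solve for [a, b]
X = np.column_stack([y[:-1], u])
a_hat, b_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(f"estimated a={a_hat:.3f}, b={b_hat:.3f}")  # close to 0.9 and 0.5
```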
The Evolution of System Monitoring: From Mainframes to Distributed Systems
As computing systems evolved and grew more complex, so did the techniques for monitoring them:
Early System Monitoring:
- Technical Details: Initially, monitoring focused on basic system metrics such as CPU utilization, memory consumption, disk I/O, and network traffic. These metrics were often collected manually or with simple tools that offered no coherent overview of complex systems or large-scale environments (a sketch of this style follows this list).
- Implementation: Early monitoring often involved manually reviewing logs and system status information. This required specialized expertise in each application and database, along with significant manual effort to track down issues.
- Limitations: The scale and complexity of modern systems made this manual approach slow and error-prone. Data was rarely available in a central location, making it harder to spot widespread issues.
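For a feel of that era’s approach, here is a minimal sketch using the third-party psutil package (an assumed dependency; any similar library would do) that samples basic host metrics in a loop, with no history and no central store:

```python
import time

import psutil  # third-party: pip install psutil

# The "early monitoring" style: periodically sample host-level metrics
# and print them. No aggregation, no retention, no central collection;
# exactly the limitations described above.

for _ in range(3):
    cpu = psutil.cpu_percent(interval=1)         # % CPU over a 1s window
    mem = psutil.virtual_memory().percent        # % memory in use
    disk = psutil.disk_io_counters().read_bytes  # cumulative disk reads
    net = psutil.net_io_counters().bytes_sent    # cumulative bytes sent
    print(f"cpu={cpu}% mem={mem}% disk_read={disk}B net_sent={net}B")
    time.sleep(1)
```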
The Rise of Application Performance Monitoring (APM):
- Technical Details: APM tools appeared as applications grew more complex, offering deeper insight into application performance and the ability to trace requests end to end. These solutions typically relied on a proprietary agent or library that collected metrics and shipped them to a central APM system.
- Implementation: These solutions often use sampling or similar techniques to track performance without introducing excessive overhead. They also use code injection or instrumentation to automatically trace requests across application components (see the sketch after this list).
- Limitations: These tools often focused on a single application, or used techniques that could not provide a holistic view across different services.
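The sketch below imitates, in plain Python, what an APM agent does under the hood: wrap a function to time it, then sample which calls get reported so overhead stays low. The decorator, function name, and 10% sample rate are all illustrative; real agents instrument code automatically rather than through explicit decorators.

```python
import functools
import random
import time

SAMPLE_RATE = 0.1  # report roughly 10% of calls (illustrative value)

def traced(func):
    """Record a function's latency and sample which calls to report."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            if random.random() < SAMPLE_RATE:
                # A real agent would ship this to a central APM backend
                print(f"[apm] {func.__name__} took {elapsed_ms:.2f} ms")
    return wrapper

@traced
def handle_request():
    time.sleep(0.01)  # stand-in for real work

for _ in range(50):
    handle_request()
```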
The Impact of Microservices:
- Technical Details: As applications moved to a microservices-based architecture, it became increasingly difficult for a single APM system to track every request. This led to the advent of distributed tracing, developed specifically to follow requests as they move between microservices.
- Implementation: Projects such as OpenTracing, OpenCensus, and more recently OpenTelemetry were created to address the needs of distributed tracing by standardizing trace data and enabling straightforward integration with existing APM solutions (see the sketch after this list).
- Challenges: The rise of microservices also produced more complex application architectures that were harder to monitor and troubleshoot with basic tools.
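Here is a minimal OpenTelemetry example in Python (it assumes the opentelemetry-api and opentelemetry-sdk packages are installed). Nested spans sharing one trace ID are the building block distributed tracing uses to follow a request across services; for brevity, both "services" live in a single process here.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console;
# production setups would export to a collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")

with tracer.start_as_current_span("checkout"):         # entry-point "service"
    with tracer.start_as_current_span("charge-card"):  # downstream call
        pass  # both spans carry the same trace ID
```

In a real deployment, the console exporter would typically be swapped for an OTLP exporter pointed at a collector; the instrumentation code itself stays the same.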
The Advent of Cloud Computing:
- Technical Details: Cloud infrastructure introduced new challenges: systems became more dynamic and elastic, with a rapidly changing set of resources. This created the need for tools that could be deployed easily into these new, constantly shifting environments.
- Implementation: Cloud-based monitoring systems had to scale to cope with vast volumes of data and to work across a variety of providers and their specific APIs. This drove demand for tools that integrate with multiple cloud environments while processing enormous amounts of data.
- Key Technologies: Time series databases, such as Prometheus and InfluxDB, were created to solve the unique technical challenges of storing large volumes of time-based data.
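As a rough sketch of what such a database stores, each unique combination of metric name and labels identifies one series of timestamped samples. The Python classes below are a toy model of that record, not how Prometheus or InfluxDB actually implement their storage engines.

```python
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Sample:
    timestamp: float
    value: float

@dataclass
class Series:
    """One time series: a metric name plus identifying labels."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)

    def append(self, value: float) -> None:
        self.samples.append(Sample(time.time(), value))

cpu = Series("cpu_usage_percent", {"host": "web-1", "region": "us-east"})
cpu.append(42.5)
cpu.append(43.1)
print(len(cpu.samples), "samples for", cpu.name, cpu.labels)
```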
The Rise of Modern Observability: Metrics, Logs, and Traces
Modern observability encompasses metrics, logs, and traces, which, when used together, form a powerful approach to understanding distributed systems.
Metrics:
- Technical Details: Many modern metrics systems use a pull-based model, in which the collection system polls an API endpoint for metrics. This makes collection easy to scale and straightforward to run across different environments.
- Implementation: Tools such as Prometheus are designed around this pull-based approach (a minimal sketch follows). Time series databases also support complex queries and aggregations that can be used for anomaly detection and forecasting.
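Here is a minimal sketch of the pull model using the official prometheus_client package: the application exposes an HTTP endpoint, and the Prometheus server scrapes it on its own schedule. The port and metric names are arbitrary choices for this example.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
IN_FLIGHT = Gauge("app_in_flight_requests", "Requests currently in flight")

if __name__ == "__main__":
    # Exposes metrics at http://localhost:8000/metrics for scraping
    start_http_server(8000)
    while True:
        with IN_FLIGHT.track_inprogress():
            REQUESTS.inc()
            time.sleep(random.uniform(0.01, 0.1))  # simulated work
```

The scraping side is configured in the Prometheus server itself (a scrape_configs entry pointing at the endpoint), so the application never needs to know where its metrics end up.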
Logs:
- Technical Details: Modern logging favors structured logs in formats such as JSON or logfmt, which can be easily parsed and transformed. Log aggregation systems such as the ELK stack (Elasticsearch, Logstash, Kibana) can then be used to manage and analyze this data.
- Implementation: Modern logging systems use agents to capture logs and transmit them to centralized systems. These agents provide mechanisms for parsing and transforming logs before sending them across the network (a sketch of structured log output follows).
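Here is a hedged sketch of structured logging using only the Python standard library: each record is emitted as a single JSON object, which log shippers and systems like the ELK stack can parse without fragile regexes. The field names are arbitrary choices.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge succeeded for order %s", "ord-123")
# => {"ts": "...", "level": "INFO", "logger": "payments", "message": "..."}
```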
Traces:
- Technical Details: Distributed tracing systems are implemented with libraries that propagate tracing headers across service boundaries. These headers let the tracing backend stitch together a complete view of a single request, revealing how different systems and components interact.
- Implementation: Trace data is typically collected by a centralized system, which then presents a view of how services interact with one another. This kind of monitoring requires additional code in the application to generate and capture the trace data (a sketch of header propagation follows).
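The sketch below shows the shape of the W3C traceparent header (version-traceid-spanid-flags) that tracing libraries pass between services. In practice you would rely on OpenTelemetry’s propagators rather than hand-rolling this; the service names here are hypothetical.

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C traceparent header: 00-<trace-id>-<span-id>-<flags>.
    The trace ID is shared by every hop of one request; each hop mints
    its own span ID."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01", trace_id

# Hypothetical service A starts a trace and calls service B...
header_ab, trace_id = make_traceparent()
print("A -> B:", header_ab)

# ...and B forwards the same trace ID (with a new span ID) to service C,
# letting the backend stitch all three spans into one request view.
header_bc, _ = make_traceparent(trace_id)
print("B -> C:", header_bc)
```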
The Rise of DevOps and Site Reliability Engineering (SRE)
The rise of DevOps and Site Reliability Engineering (SRE) has also shaped the modern observability landscape. As Ben Treynor Sloss, the founder of SRE at Google, put it, “SRE is what happens when you ask a software engineer to design an operations team” (Google, 2016). This approach relies on data-driven operations and a strong understanding of system behavior.
Key SRE practices that require strong observability include:
- Service Level Objectives (SLOs): Defining clear goals for system performance and availability, and using observability data to ensure they are met.
- Error Budgets: Well-defined error budgets enable faster innovation by letting engineers deploy new changes safely and giving them flexibility in their deployment practices (a worked example follows this list).
- Toil Reduction: Implementing automation and process improvements driven by real data from metrics, logs, and traces.
- Continuous Improvement: Continuously measuring system performance and proactively looking for ways to improve it.
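To ground the error-budget idea, here is the arithmetic in miniature: a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of allowable downtime. The observed-downtime figure below is made up for illustration.

```python
slo = 0.999                    # availability target
window_minutes = 30 * 24 * 60  # 30-day window = 43,200 minutes

budget_minutes = window_minutes * (1 - slo)  # ~43.2 minutes
downtime_so_far = 12.0                       # illustrative observed downtime

print(f"error budget:  {budget_minutes:.1f} min")
print(f"remaining:     {budget_minutes - downtime_so_far:.1f} min")
print(f"consumed:      {downtime_so_far / budget_minutes:.0%}")
```

A slow burn rate leaves room to keep shipping; a fast one is the data-driven signal to slow releases and invest in reliability.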
The Future of Observability: What’s Next?
As systems become more complex, the need for advanced observability will continue to grow, with several key trends:
- AI-Powered Observability: Using AI and machine learning to automate data analysis, detect anomalies, and predict future issues, surfacing insights that manual analysis would miss.
- Edge Observability: Monitoring systems at the edge of the network to provide better insight into IoT and other edge devices that are critical to modern cloud deployments.
- Open Standards: The adoption of open standards like OpenTelemetry will allow organizations to avoid vendor lock-in and build more portable observability solutions.
- Democratized Observability: A continued focus on building easier-to-use tools that broaden access to observability systems.
Actionable Takeaways:
- Learn From the Past: Gain an appreciation of the challenges that have been faced and the technical advances that made current observability tooling possible.
- Embrace All Three Pillars: Implement and maintain tools for metrics, logs, and traces to get a holistic view of your infrastructure, rather than relying on a single source of data.
- Use Open Standards: Adopt open standards such as OpenTelemetry for telemetry data collection to avoid vendor lock-in and ensure portability across systems.
- Automate Your Approach: Use automation and tooling to simplify collecting and analyzing observability data, and make sure your tools integrate well with one another.
- Use Feedback Loops: Continuously use your observability data to identify issues, improve performance, and plan for future architectural changes.
By understanding the roots of observability, you will be able to more effectively leverage its modern tools and techniques.
If you are looking for a platform to streamline the collection and management of this data, and to truly realize the benefits of observability, it is worth exploring the newer solutions available in the marketplace.
Citations:
- Cloud Native Computing Foundation. (2023). State of Cloud Native Development.
- Google. (2016). Site Reliability Engineering: How Google Runs Production Systems.
- Sigelman, B. (2018, March 14). Metrics, Traces, and Logs.