Any enterprise that wants to deliver software-based services to customers isn't likely to find long-term success without comprehensive infrastructure monitoring. If you want to provide a first-class customer experience, you need to be able to identify and remediate issues before they affect your customers. One of the worst things that can happen is hearing about an outage from a customer, rather than from a monitoring platform.
However, the complexity and overhead of managing a monitoring infrastructure can be a burden to most development teams, as it often requires a dedicated ops team and careful management of resource budgets. For teams with a complex, sprawling monitoring stack, it might be time to look at managed monitoring services.
Monitoring is critical, simply because engineering teams need to be able to see how their systems perform and behave while maintaining production workloads. Modern application infrastructure is a dynamic landscape. Performance demands can suddenly multiply in a matter of minutes, and new feature releases may completely shift the expected behavior of the entire system. Good monitoring should provide accurate, up-to-date visibility into these changes and their impacts.
Google's DevOps Research and Assessment (DORA) team has spent several years researching various aspects of software development and delivery. Their conclusions show that "Monitoring and observability is one of a set of capabilities that drive higher software delivery and organizational performance" and "Good monitoring is a staple of high-performing teams."
Good monitoring can help drive lower mean time to resolution (MTTR), the ever-critical metric that grades how effective engineering operations are at identifying, triaging, and resolving issues. MTTR can be directly correlated with customer experience, which is key in driving top-line revenue and market share.
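As a rough illustration of how MTTR can be computed, the sketch below averages the time between detection and resolution across a set of incidents. The incident timestamps and the `mean_time_to_resolution` helper are hypothetical, chosen only to show the arithmetic; real teams may define the start of the clock differently (for example, at the time of the first alert or the first customer report).

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),      # 45 minutes
    (datetime(2024, 1, 10, 14, 30), datetime(2024, 1, 10, 16, 0)),  # 90 minutes
    (datetime(2024, 1, 21, 22, 15), datetime(2024, 1, 21, 22, 30)), # 15 minutes
]


def mean_time_to_resolution(records):
    """Return the average detection-to-resolution duration as a timedelta."""
    if not records:
        raise ValueError("at least one incident is required")
    # Sum the individual durations, starting from a zero-length timedelta.
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 50 minutes
```

Tracking this number over time, rather than as a one-off, is what shows whether changes to monitoring are actually shortening incidents.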
Monitoring obviously plays a key role in successful software delivery. However, there wasn't always such a broad selection of managed services and platforms. Previous generations of monitoring solutions were often a do-it-yourself (DIY) affair.
In the past, legacy monitoring solutions offered users plenty of options, but often required self-hosting and management of the underlying infrastructure, as well as a complex and inconsistent library of plugins and integrations.
For performance, network, and log monitoring, tools like Nagios, Zabbix, and Graylog were once the standard. They offered paid licensing, as well as DIY, self-hosted solutions. However, this generation of tooling was conceived and designed before cloud infrastructure was widely adopted.
Newer monitoring solutions offer a hybrid approach, with a much stronger focus on the cloud. Customers have options, including self-hosted infrastructure and managed services.
For application monitoring and observability, Datadog and New Relic are well-known platforms. They provide generalized monitoring tools, as well as application performance monitoring (APM) and tracing, allowing users to get a detailed look into the performance metrics of the individual components of an application. Sentry is another tool in the application monitoring field, providing services that cater directly to app developers.
For log aggregation and searching, Elasticsearch is a popular product that offers a hybrid approach. As the search engine at the core of the ELK Stack (Elasticsearch, Logstash, and Kibana), it provides powerful text-searching capabilities across a variety of text-based data, like log streams. Elasticsearch users can opt to deploy and manage their own clusters on-premises or in the cloud, or take advantage of the various managed Elasticsearch providers.
A popular combination for monitoring, alerting, and visualization is Prometheus and Grafana. Prometheus combines a monitoring and alerting stack with a time-series database. It focuses on modern, microservices-based architectures, both on-premises and in the cloud, and can be self-hosted or consumed as a service from providers like Google Cloud. Grafana provides a capable platform for visualization, offering similar freedom in terms of hosting infrastructure with Grafana Cloud.
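To make the Prometheus model a little more concrete: Prometheus collects metrics by scraping an HTTP endpoint that returns plain text in its exposition format. The sketch below hand-renders a single counter in that format using only the Python standard library. The metric name, labels, values, and the `render_counter` helper are all hypothetical, and the sketch skips label-value escaping; a real service would normally use an official client library rather than building this text itself.

```python
def render_counter(name, help_text, samples):
    """Render one counter in Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by label set.
requests_by_status = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "GET"), ("status", "500")): 3,
}

print(render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    requests_by_status,
))
```

Whether the Prometheus server doing the scraping is self-hosted or managed, the application side of this contract stays the same, which is part of what makes moving between the two practical.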
Amazon recently announced the launch of managed Prometheus and Grafana services, adding to its portfolio of platform-native monitoring tools like CloudWatch. This puts Amazon in the same market as dedicated monitoring providers like New Relic and Datadog. Teams that want to go all in on a single platform for their cloud infrastructure can now use the same APIs and IaC tools to manage and deploy their entire architecture.
It seems natural to assume that an organization with any kind of software infrastructure should host and run its own monitoring infrastructure as well. In the past, this approach made sense, as application architecture was relatively simple, and the administrative overhead was minimal. In modern software delivery and development, however, monitoring systems and the infrastructure they monitor are much more complex.
Organizations may bristle at the apparently higher costs of managed monitoring, but that doesn't tell the whole story. While self-hosted monitoring may appear cheaper at first glance, there are many hidden costs to consider. In general, managed service providers can offer customers the advantage of economies of scale. This holds especially true for monitoring services, which typically require non-trivial deployments at scale.
The complexity of monitoring infrastructure also brings additional costs. Even a single monitoring tool carries the same operational burdens as an application stack. To run, deploy, manage, and update highly available monitoring and logging clusters for tools like Elasticsearch, companies often need dedicated operations and engineering personnel. For leaner development teams, it may not be possible or cost-effective to bring on staff for the sole purpose of running monitoring.
This complexity also means time to value is going to be much longer. Self-hosted, self-managed monitoring solutions that are resilient enough to support production-level workloads could take weeks, or even months, to deploy. And, of course, time spent deploying and configuring monitoring tools means that engineers will be using precious development cycles on the monitoring stack itself, rather than on core, revenue-generating software products. Teams should not be trying to reinvent the wheel.
Despite the obvious advantages of managed monitoring services, larger enterprises might have different considerations for monitoring solutions—and it's not always a clear-cut choice.
From a cost perspective, cloud providers are notorious for costs that climb steeply as usage scales. Data transfer charges can quickly balloon, which could be a serious issue for organizations with large, hybrid infrastructure deployments. Larger organizations can take advantage of their own economies of scale with respect to infrastructure and engineering personnel. If running their own deployment is cheaper than a managed service, it makes sense to go the self-hosted route.
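The comparison above can be sketched as a simple monthly cost model. Every number and name below is a hypothetical placeholder rather than real vendor pricing or real salary data, and the model deliberately ignores factors like deployment time and on-call load; it only illustrates how the balance can tip as an organization grows.

```python
def monthly_self_hosted_cost(infra_cost, engineer_count, engineer_monthly_cost):
    """Infrastructure spend plus the staff needed to run the monitoring stack."""
    return infra_cost + engineer_count * engineer_monthly_cost


def monthly_managed_cost(hosts, price_per_host, transfer_gb, price_per_gb):
    """Per-host subscription fees plus data transfer charges."""
    return hosts * price_per_host + transfer_gb * price_per_gb


def cheaper_option(self_hosted, managed):
    """Name the cheaper option; ties go to managed in this sketch."""
    return "self-hosted" if self_hosted < managed else "managed"


# Hypothetical small team: 50 hosts, one engineer needed to self-host.
small_self = monthly_self_hosted_cost(2_000, 1, 12_000)
small_managed = monthly_managed_cost(50, 20, 500, 0.10)

# Hypothetical large enterprise: 5,000 hosts, four engineers to self-host.
large_self = monthly_self_hosted_cost(40_000, 4, 12_000)
large_managed = monthly_managed_cost(5_000, 20, 200_000, 0.10)

print("Small team:", cheaper_option(small_self, small_managed))
print("Large enterprise:", cheaper_option(large_self, large_managed))
```

With these made-up inputs, the small team comes out ahead on managed monitoring while the large enterprise comes out ahead self-hosting. The useful part is the shape of the model, since staffing dominates at small scale and per-host and transfer fees dominate at large scale.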
There may also be compliance and data governance regulations that mandate self-hosted infrastructure over managed. Government entities, or companies handling sensitive data such as PII or health information regulated under HIPAA, often need to retain more control and visibility over all facets of their infrastructure. Managed monitoring tools might not offer that level of control. Compliance frameworks often require some demonstrable chain of custody around sensitive data, and monitoring providers may not be able to provide viable guarantees.
For some organizations, monitoring may actually be a core part of the business as well. In this case, it makes total sense to use self-hosted monitoring solutions. For example, companies like Datadog are probably not looking outside their walls for help with monitoring their infrastructure.
Managed monitoring lets engineering teams outsource the administrative burden of running complex monitoring infrastructure at scale. Smaller teams now have access to batteries-included monitoring solutions that allow them to leverage state-of-the-art observability and monitoring capabilities without the overhead. On the flip side, larger enterprise-scale organizations may opt for in-house solutions to meet more specialized needs. In any case, you can expect the managed monitoring space to grow over time, with more advanced and comprehensive offerings.
Ultimately, picking the right solution depends on your use case. Finding the best fit isn’t always simple; there are a variety of factors to consider, and making the wrong choice can negatively impact operations teams and their ability to resolve critical, customer-facing issues. Reach out to our proven team of DevOps consultants to get your engineering teams started off on the right foot.