The Observability Inflation

Where did we go wrong? And what’s next

Observability is nothing to take lightly. It is a core competency of any team maintaining complex infrastructure or developing a modern distributed system.

It allows teams to identify and troubleshoot issues quickly, but more importantly, it lets them build new products, innovate and dare, knowing that their observability stack is a safety net they can always fall back on.

Datadog’s recent outage was a dreadful day for many R&D groups around the globe, and was a live demonstration of the major role that observability plays in our lives.

We get it, you need a good observability pill to sleep through the night, but why the heck is it SO EXPENSIVE?

One of the largest bills any SaaS company pays is its monthly cloud provider bill. That makes sense, though: in a modern company it covers the entire R&D infrastructure and the company’s whole IT environment.

Honeycomb recently offered a rule of thumb that companies spend up to 30% (!) of their IT and infrastructure costs on observability. Even in market analyses from a few very capable VCs I personally got to see, the numbers were never below 10%.

Wrap your head around that for a second. These are huge numbers.

It’s one thing for costs to be high, but why are they so unpredictable?

Almost every engineering team can tell the following tale. There is a bug in production that took a long time to investigate and figure out. As the best bugs do, it slipped past the CI stage and fell between the cracks of your monitoring pillars. Your existing logs were insufficient to shed light on the problem, and no metric indicated an issue was emerging. After spending hours troubleshooting the incident, you take the obvious step of covering this gap from now on, by adding logs and custom metrics that better capture the behavior that could indicate a problem. Your goal: to NEVER have to go through this painful process again.

A week later, someone from management notices that the expected monitoring bill for this month is much higher than usual. Deep down you know what happened. That log line you added generated far more volume than you had anticipated, or that new metric has much higher cardinality than you planned.
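To make the cardinality trap concrete, here is a minimal sketch using the Python Prometheus client. The `api_errors_total` counter and its labels are hypothetical, not taken from any real incident; the point is that every distinct combination of label values becomes its own time series that the backend must index, store and bill for.

```python
from prometheus_client import Counter

# Hypothetical counter added after a painful incident.
# The "endpoint" label is bounded; "user_id" is not.
api_errors = Counter(
    "api_errors_total",
    "API errors observed, by endpoint and user",
    ["endpoint", "user_id"],
)

# Every new (endpoint, user_id) pair creates a new time series.
# With ~50 endpoints and ~100,000 active users, this one metric
# can balloon into millions of series instead of a few dozen.
api_errors.labels(endpoint="/checkout", user_id="u-48210").inc()
```

Dropping the unbounded label, or bucketing it into a small, fixed set of values, keeps the series count (and the bill) predictable.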

This tale is nothing but reality served to us daily by legacy APM vendors like Datadog, New Relic, Dynatrace and others.

They made a business out of charging us for unpredictable data volumes. The implicit claim is that, as an engineer, you surely know your expected number of monthly log lines and your expected log volume (yeah, that’s right, they charge by both!), right?

Surely you know how many API calls flow through your 180-microservice production environment? At least as a weekly average, right?

Just like McDonald’s is a real estate business more than a restaurant chain, these APM vendors are data warehouse companies more than anything else, built to hoard our data and make us pay for it, regardless of how valuable it eventually turns out to be.

They use shady pricing mechanisms that make it harder for customers to truly control costs, and sales playbooks that push customers into wild, baseless estimates of their expected usage, then charge them for any “surprising” excess.

Sticks and stones

Nothing comes without a price. The immense costs of APM vendors and their unpredictable bills have (along with other factors) eroded the adoption of these solutions.

The three pillars of observability are a well-known concept we’ve been taught to follow: you need your logs, metrics and traces in place in order to be a happy and efficient engineer. Amazing in theory; reality is far different.

The result is clear: over 70% of teams do not have a full APM tier in place (DevOps Pulse 2022). They turn to lower tiers like logs and custom metrics, avoiding the pricier, higher-cardinality tiers.

Not having an application monitoring layer means longer troubleshooting times, a harder path to solving complex issues, and sometimes not even knowing that a logic bug is taking place.

Where did we go wrong?

Observability tools and processes were built to be adopted from the bottom up. Each engineering team could integrate an observability solution on its own (usually by wiring it into their application code or runtime), then troubleshoot better with the new data they exposed and build dashboards and alerting pipelines, all within an isolated worldview, separate from the rest of the organization.

Observability vendors turned to developers as their go-to people in the organization. Developers became the gatekeepers of new solutions, sometimes creating “pockets” inside the organization where one team ran a different solution than the rest of the company.

Observability was more a part of the development cycle than of the organization’s infrastructure.

This had great advantages for developers too. Teams could own their observability stack and even treat it as another task in the development cycle that needs attention and maintenance. But most of all, it let them execute fast in cases where the organization slowed them down, and think later.

Most developers do not have access to the organization’s cloud infrastructure, where changes are made with more care and have a wider impact.

But nothing comes without a price, and here the price turned out to be immense. Building observability to be adopted in this bottom-up fashion causes problems at scale. Engineers will always optimize for collecting more observability data to be on the safe side, and will always see the infrastructure and data costs as someone else’s problem.

The result was uncontrollable data volumes collected across the organization, with no single decision maker in charge of the simple question: “is this data really necessary?”

Studies show that less than 1% of this data is useful and ever explored by users. The other 99% is collected, stored and processed, but never used. Paying to store the 99% you care nothing about just so you can reach the 1% you do is at the heart of the problem that started to form.

Growing data volumes eventually had an impact on the end user as well. A hard-working developer logging into their observability tool would very likely be overwhelmed by huge amounts of data, where reaching the right data is nearly impossible. Data queries became a real bottleneck, manifesting in slow-loading dashboards and log management solutions that demand ever more resources.

Observability data kept growing, eventually affecting the way teams react to burning issues. Junior team members would lose themselves in this complex ocean of data, escalating issues to power users who were the only ones who seemed to know where to look.

Cloud-native made things even worse

The observability solutions that dominate today’s market were built when the world was very different. Engineering teams were mostly working on single-tiered monolith applications that combined user-facing interfaces and data access into a single platform.

But the world is moving to cloud-native for speed, scale and efficiency. Cloud-native architectures bring with them great promise for faster software development life cycles. Modern organizations build software by organizing people into small, interdependent engineering teams, each focused on the services it is responsible for. This way, value can be generated more incrementally and much more quickly, and the organization’s agility can be maintained at larger scales than before.

But cloud-native microservices architectures are built exactly opposite to the world legacy observability vendors were born into. Suddenly every user-facing request triggers numerous API calls between various microservices. Gaining visibility into these smaller, distributed, interdependent pieces becomes a much harder task at scale. There are lots of moving parts, hundreds of interactions and many places where the root cause of the next problem could originate.

Cloud-native environments emit massive amounts of data, somewhere between 10 and 100 times more than traditional virtual environments. Tracking all of that with the same old bottom-up approach is where data volumes rose fast, and companies were left to pick up the bill.

And so, observability solutions were getting more expensive, but companies were not getting more value. In fact, in many cases the value they got was actually declining. Between performance issues and simply being overwhelmed with data, troubleshooting with your observability solution may even have gotten worse.

Today’s observability vendors can’t keep up

Legacy APM vendors operating in the new cloud-native world suffer from many downsides that sit at the heart of the expectation mismatch between the “old” world and the new cloud-native one.

Among these downsides, be aware of the following:

  • Scalability of their data model - Cloud-native environments are API-driven and distributed. Naively monitoring everything developers choose to collect and then paying per volume is unrealistic in many cases. It means tons of data you don’t need, and scary, surprising bill surges.

  • Harder organizational alignment - In a cloud-native company, DevOps or SRE teams carry the torch of production health across the many different teams deploying applications to the company’s infrastructure. Bottom-up adoption through developers makes it harder for them to do their job and pick the solutions they need to succeed.

  • Vendor lock-in - A legacy observability vendor is built to be part of the development lifecycle, delivering platform-specific agents and SDKs to provide full coverage. In an ever-changing cloud-native environment, that’s a major lock-in you should be mindful of.

  • Data privacy - Most observability vendors are built on a centralized architecture where customer data is trucked out and stored in the vendor’s environment. In today’s world, storing logs and API traces there means keeping sensitive and PII data outside of your control.

It takes one to know one

A cloud-native observability solution should solve all these worrying mismatches and, most of all, take control of the immense data-growth challenge.

We built groundcover to be different all the way from how it collects data to how that data is processed, stored and priced. The architecture is based on the following concepts:

  • Data is collected out-of-band from the application - groundcover uses eBPF to collect data instead of code instrumentation. That empowers DevOps and SRE teams to cover everything, instantly, without needing to involve many R&D stakeholders. It also breaks vendor lock-in, since integrating groundcover doesn’t mean changing the development cycle (see the sketch after this list).

  • Data is processed in a distributed fashion, right where it lies - Raw data is processed on the fly inside each Kubernetes node, without trucking it outside or storing it anywhere. That makes it possible to create digested insights (like metrics) from that data without paying the price of shipping or hoarding it, dramatically reducing the volume of data being stored.

  • Keeping the data in-cloud - The collected data is stored in-cluster, inside the customer’s environment, without ever leaving the cluster to be stored anywhere else. That keeps the data private and secure, and also lets you reuse it to gain more value, like connecting the collected metrics to your own Grafana.
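As an illustration of what “out-of-band” collection means in practice, here is a minimal sketch (not groundcover’s actual implementation) that uses eBPF via the BCC Python bindings to count outbound TCP connections per process, entirely from the kernel side and without touching any application code. The aggregation happens in a kernel map, so only the digested counts are read out, echoing the idea of processing data where it lies.

```python
from bcc import BPF
import time

# Kernel-side program: count tcp_v4_connect() calls per PID in a BPF map.
# No SDK, no code change in the monitored applications.
bpf_program = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(connect_count, u32, u64);

int trace_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count;
    count = connect_count.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;
    }
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")

print("Counting outbound TCP connections per PID... Ctrl-C to stop")
try:
    while True:
        time.sleep(5)
        # Only the aggregated counts cross into user space.
        for pid, count in b["connect_count"].items():
            print(f"pid={pid.value} connects={count.value}")
except KeyboardInterrupt:
    pass
```

Running this requires root privileges and the BCC toolkit installed on the node; a production-grade agent following the same pattern would typically be packaged as a DaemonSet so every Kubernetes node observes its own workloads locally.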

The end result is the ability to break the infamous cost-to-visibility-depth tradeoff.
No more tough trade-offs about which parts of your production to observe. Teams can control their budget and avoid unexpected spikes in cost with a flat, predictable price.

But even more importantly, teams have all the data they need to work at full capacity. That means generating metrics at mind-blowing cardinality and capturing raw data (like logs and traces) at full depth (for example, storing all API payloads around incidents to allow for faster troubleshooting).

All this data needs to be kept privately in-cloud, yet still be delivered as SaaS so it can be easily distributed and shared inside the organization.

A truly cloud-native [no trade-offs] APM

Observability is a core competency in today’s complex, distributed cloud-native systems, yet gaining that observability has become a burden over time: a burden of cost, time and effort.

The world is changing faster than ever, and the observability stack should keep up. A solution that matches the speed of development is needed. A true cloud-native APM needs to make sure the stack is constantly covered, that data stays private and under your control, and that all the data you need is available without constantly fighting to stay within a reasonable budget.
