Simple/hard metrics that help reduce MTTR when looking for a root cause

Pavel Trukhanov
Published in okmeter.io blog
4 min read · Aug 21, 2018


Recently there was a mini-incident in the data center where we host our servers. In the end, it did not affect our service.

Thanks to the right operational metrics, we were able to figure out instantly what was happening. But then a thought occurred to me: without two simple metrics, we would have been racking our brains trying to understand what was going on.

The story begins with an on-call engineer spotting an anomalous increase in some service's response time. He then checks whether this is true for the service overall or just for some handler by examining the response-time percentiles of that service's /ping handler. This /ping handler doesn't call any other service or database or anything else; it just returns 200 OK and exists for the sole purpose of health checks by load balancers and Kubernetes.
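Such a handler can be as simple as a few lines. Here's an illustrative sketch (not our actual service code; the port and response body are arbitrary):

```python
# Minimal /ping health-check endpoint: no downstream calls, just 200 OK.
# Illustrative sketch only, not the actual service code.
from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()
```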

So what first comes to mind? It's probably resource starvation, specifically the CPU. Let's check it:

OK, we see a surge. Let's figure out which process on the server is responsible, to see whether it's one of the neighbors or something else:

We see that it's not one particular process misbehaving; all of them started using more CPU time simultaneously. So now there's no easy way forward: since all the services are tangled up with one another, we would need to check load profiles, understand what generated that load (users or some internal cause), and so on. Or it could be some sort of degradation of the resources themselves.
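If you don't have per-process CPU metrics at hand, a rough breakdown can be pulled straight from /proc. Here's an illustrative sketch (a real monitoring agent would also handle short-lived processes, containers, and so on):

```python
# Rough per-process CPU usage over a sampling interval, read from /proc.
# Illustrative sketch; field offsets follow proc(5).
import os, time

INTERVAL = 5  # seconds

def cpu_jiffies_by_pid():
    usage = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                # Strip "pid (comm)" first, since comm may contain spaces.
                fields = f.read().rsplit(")", 1)[1].split()
            # utime + stime are fields 14 and 15 in proc(5);
            # after stripping "pid (comm)" they sit at offsets 11 and 12.
            usage[int(pid)] = int(fields[11]) + int(fields[12])
        except OSError:
            pass  # process exited between listing and reading
    return usage

before = cpu_jiffies_by_pid()
time.sleep(INTERVAL)
after = cpu_jiffies_by_pid()

hz = os.sysconf("SC_CLK_TCK")  # jiffies per second, usually 100
deltas = {pid: (after[pid] - before[pid]) / hz / INTERVAL
          for pid in after if pid in before}
for pid, share in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"pid {pid}: {share * 100:.1f}% of one core")
```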

Though I tried to keep you intrigued, you might have already figured out that it was the CPU itself that was in a degraded state. dmesg showed:

CPU3: Core temperature above threshold, cpu clock throttled (total events = 88981)

So basically the CPU frequency was lowered. Let's check the temperature:

OK, it's clear now what was going on. As we saw this happening to six servers at the same time, we figured it was definitely a data center issue, though not a global one: only some racks were affected.
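For reference, on Linux the CPU temperature is usually exposed through the hwmon interface in sysfs. Here's a minimal sketch of reading it by hand (hwmon numbering and sensor labels vary by hardware and kernel):

```python
# Read CPU temperatures from the hwmon sysfs interface.
# Paths and labels vary by machine; this is a sketch, not a general solution.
import glob

for temp_input in sorted(glob.glob("/sys/class/hwmon/hwmon*/temp*_input")):
    try:
        with open(temp_input.replace("_input", "_label")) as f:
            label = f.read().strip()          # e.g. "Package id 0", "Core 3"
    except FileNotFoundError:
        label = temp_input                    # not every sensor has a label
    with open(temp_input) as f:
        millidegrees = int(f.read().strip())  # value is in millidegrees Celsius
    print(f"{label}: {millidegrees / 1000:.1f}°C")
```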

Let’s get back to our metrics.

For future events like this, we probably want to know ASAP that some server's CPU has overheated. On the other hand, you wouldn't want to add CPU temperature charts to your dashboard: they take up screen space and people's attention, while actual problems like this are ridiculously rare.

Usually you would use triggers to automatically monitor some parameters or metrics. But to set up a trigger, one needs to choose a proper threshold. What should we set for CPU temperature?

It's the difficulty of choosing the right threshold that pushes a lot of operations engineers to dream of universal anomaly detection, the one that will solve all of their problems :-)

Still, in the real world we need a threshold.

Keep it simple, right? What do we care about? Our service performance. So let's set the threshold at the temperature at which our service experienced issues. But what about other services running on servers that have never overheated?

OK, how about some physical intuition then? Let's check the "usual" temperature across our cluster and get a baseline:

90°C seems appropriate, right? Let's just cross-check with another cluster:

Hm… Here it's way lower on average. So should we set a different threshold?

Digging deeper, we see that it's not the temperature itself that caused the service issue.

The problem is the CPU frequency being lowered!

Let's check the number of such events. Linux sysfs exposes it at /sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count.
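These counters are cumulative since boot, so what you actually chart is their rate of change. Here's a quick way to peek at the raw values by hand (an illustrative sketch):

```python
# Per-CPU thermal throttle event counters from sysfs (cumulative since boot).
# Illustrative sketch for manual inspection; an agent would export these as counters.
import glob

pattern = "/sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count"
for path in sorted(glob.glob(pattern)):
    cpu = path.split("/")[5]  # e.g. "cpu3"
    with open(path) as f:
        print(cpu, f.read().strip())
```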

Charting it for the time of the issue shows this:

Our monitoring service collects this automatically by default, but it's not even rendered as a chart anywhere in the system. There's just an auto-trigger for all of our clients that alerts whenever there are more than a couple of such throttle events per second. It's invisible and doesn't require your attention, and one might argue that it fires once in a century, but when it does, it works like a charm, notifying you whenever this affects your server's performance. You don't need to play a guessing game. It's all served up on a platter for you!
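To give a rough idea of such a trigger (this is an illustration, not okmeter's actual implementation; the threshold and interval are assumptions), the logic boils down to sampling the counters and alerting when the rate exceeds a couple of events per second:

```python
# Alert when thermal throttle events exceed a rate threshold (events/sec).
# Rough illustration of the trigger logic, not okmeter's actual implementation.
import glob, time

PATTERN = "/sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count"
THRESHOLD_EVENTS_PER_SEC = 2   # assumed: "more than a couple per second"
INTERVAL = 30                  # assumed sampling interval, seconds

def total_throttle_events():
    total = 0
    for path in glob.glob(PATTERN):
        with open(path) as f:
            total += int(f.read().strip())
    return total

previous = total_throttle_events()
while True:
    time.sleep(INTERVAL)
    current = total_throttle_events()
    rate = (current - previous) / INTERVAL
    if rate > THRESHOLD_EVENTS_PER_SEC:
        print(f"ALERT: CPU throttling at {rate:.1f} events/sec")  # page someone here
    previous = current
```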

That's what takes most of our effort at okmeter.io: researching and developing a knowledge base of error-proof auto-triggers that save you the trouble of figuring out problems you didn't even know you had. It takes work to figure out the right metrics, the ones that are simple but effective. Simple is hard.
