Monitoring (with) Elasticsearch: Nine Circles of Hell

Pavel Trukhanov
Published in okmeter.io blog
4 min read · Feb 28, 2018


We’ve finally put the finishing touches on our Elasticsearch monitoring and officially released it. Only after three complete reworks did we achieve really good results and reliably detect the issues in any ES cluster setup.

Below we would like to describe our production cluster, reveal the issues that we’ve been experiencing with it, and also showcase our new ES monitoring functionality.

Elasticsearch in a nutshell

Basically, Elasticsearch is a distributed auto-scaling RESTful service for full-text search powered by the Apache Lucene library.

Now a few definitions for ES:

  • Node is a JVM process launched on a server.
  • Index is a set of documents to search in; an index can contain documents of multiple types.
  • Shard is an index component. Shards help to distribute index data and query calls amongst nodes.
  • Replica is a copy of a shard. Each shard is usually stored in multiple replicas on different servers for higher failure resilience.

Each shard itself is a Lucene “index” (not to be confused with the above “index” term), and thus is also split into segments.
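To make the node / index / shard / replica relationship concrete, here is a minimal sketch that counts shard copies per index from data shaped like a `GET /_cat/shards?format=json` response (the field names follow the cat shards API; the entries themselves are made up for illustration):

```python
from collections import Counter

# Made-up sample of GET /_cat/shards?format=json output:
# each entry describes one shard copy (primary "p" or replica "r") on a node.
sample_shards = [
    {"index": "monthly-metadata-2016-10", "shard": "0", "prirep": "p", "state": "STARTED", "node": "es103"},
    {"index": "monthly-metadata-2016-10", "shard": "0", "prirep": "r", "state": "STARTED", "node": "es104"},
    {"index": "monthly-metadata-2016-10", "shard": "1", "prirep": "p", "state": "STARTED", "node": "es104"},
    {"index": "monthly-metadata-2016-10", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": ""},
]

def copies_per_index(shards):
    """Return {index name: number of shard copies (primaries + replicas)}."""
    return Counter(s["index"] for s in shards)

print(copies_per_index(sample_shards))
```

A real deployment would fetch this listing from the cluster; the grouping logic stays the same.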

How we use ES

All metrics in okmeter monitoring are labeled, so each metric can be identified by a unique set of key-value pairs, i.e. a labels dictionary. For example:

{
  "project": "okmeter",
  "hostname": "es103",
  "plugin": "net",
  "metric": "net.interface.bytes.in",
  "interface": "eth1"
}
{
  "project": "okmeter",
  "hostname": "es103",
  "plugin": "process_info",
  "metric": "process.cpu.user",
  "process": "/usr/bin/java",
  "username": "elasticsearch",
  "container": "~host"
}
{
  "project": "okmeter",
  "hostname": "es103",
  "plugin": "elasticsearch",
  "metric": "elasticsearch.shards.count",
  "cluster": "okmeter-ovh",
  "index": "monthly-metadata-2016-10",
  "shard_state": "active",
  "shard_type": "primary"
}

We index such dictionaries as documents in our ES, and can then query ES to find what we need. For example, we can use this simplified request

project: okmeter AND metric: elasticsearch.index.size

to generate the following chart that shows index sizes for every index:

For such queries, ES returns the metric IDs of the matching metrics, which we then use to retrieve the actual time series from our time-series store.
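As a rough sketch (not our actual code), the simplified request above could be expressed in the Elasticsearch query DSL as two term filters combined with a bool query. The `"_source"` field name `metric_id` is hypothetical, purely for illustration:

```python
import json

# Query DSL equivalent of: project: okmeter AND metric: elasticsearch.index.size
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"project": "okmeter"}},
                {"term": {"metric": "elasticsearch.index.size"}},
            ]
        }
    },
    # Hypothetical stored field holding the metric ID for the time-series store.
    "_source": ["metric_id"],
}

# In real use this body would be POSTed to /<index>/_search; here we only build it.
print(json.dumps(query, indent=2))
```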

ES Cluster status

Basically, ES defines the following three statuses for clusters:

  • Green — all the required shard copies of every index are allocated.
  • Yellow — all primary shards are available, but some replica shards are unassigned or still relocating (e.g. from unavailable nodes to the remaining ones).
  • Red — at least one primary shard is unavailable, so parts of the index cannot be searched or written at all.
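A minimal sketch of an alerting check on top of these statuses, using data shaped like a `GET /_cluster/health` response (the field names follow the cluster health API; the sample values are made up):

```python
def health_alert(health):
    """Map a cluster-health response to an alert message, or None when green."""
    status = health["status"]
    if status == "green":
        return None
    if status == "yellow":
        return "yellow: %d unassigned replica shards" % health["unassigned_shards"]
    return "red: some primary shards are unavailable"

# Made-up sample response of GET /_cluster/health.
sample = {"status": "yellow", "unassigned_shards": 3, "active_primary_shards": 60}
print(health_alert(sample))  # yellow: 3 unassigned replica shards
```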

Thus, as the key chart for okmeter’s predefined ES dashboard, we’ve selected a chart based on these statuses:

On the day shown on the chart, we had added more powerful servers to our ES cluster and wanted to remove two legacy nodes. At around 14:00, an interesting idea came to mind: “how about running a monitoring test?” After a short discussion, we decided to give it a try and shut down one node to check whether our ES was really ‘elastic’ enough for such a task.

As soon as we did this, we received an alert from our monitoring system, saying that the metrics collector was not OK.

What the hell??

We then restarted the node and waited for a while until everything went back to normal. However, the reason why one inactive node caused the whole system to fail was still unclear. We needed to go deeper…

At 14:30, we again deactivated the same node and got the same alarms. “Hm…” Negative results can certainly be useful too, but we needed to figure out why this was happening and how to fix it.

First, we tried a sort of decommission: we told ES to move all the shards off certain nodes, one by one, which is reflected in the chart above (the blue areas around 15:10 show shards relocating to other nodes). This revealed no problems at all.
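This kind of drain is typically done through the shard allocation filtering setting, sent as a body to `PUT /_cluster/settings`. Below is a sketch that only builds that body; the node name `es-old-1` is made up:

```python
def exclude_node_settings(node_name):
    """Build a cluster-settings body that drains all shards from one node.

    ES will relocate every shard away from nodes matching the exclusion;
    other keys like ...exclude._ip work the same way.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": node_name,
        }
    }

body = exclude_node_settings("es-old-1")
print(body)
```

Once the node holds no shards, it can be stopped safely; clearing the setting (an empty string) re-enables allocation to it.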

On a hunch, we then counted the resulting shards and got 140. “Rather strange.” The number_of_shards in our setup was 20 and number_of_replicas was 2, plus one primary copy, so each index should have held 20 × (2 + 1) = 60 shards, i.e. 180 across our 3 monthly indices (1 index per month), not 140. Finally, we noticed that the index for December had been created without replicas (number_of_replicas = 0), contributing only 20 shards, and thereby leading to a complete crash of this index right after deactivation of any node!
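The shard arithmetic above can be written out as a one-liner sanity check:

```python
def total_shards(number_of_shards, number_of_replicas):
    """Total shard copies for one index: primaries plus replica copies of each."""
    return number_of_shards * (number_of_replicas + 1)

# October and November: 20 primaries with 2 replicas each -> 60 shards per index.
# December was created with number_of_replicas = 0 -> only 20 shards.
total = total_shards(20, 2) + total_shards(20, 2) + total_shards(20, 0)
print(total)  # 140 -- what we actually observed, instead of the expected 180
```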

It was very lucky that we detected the missing replicas during a controlled experiment. Otherwise, we would have had to fully re-index the raw data from the core storage to recover from such a fault.

To prevent this issue from recurring, we’ve added an automatic trigger that notifies about indices with 0 replicas. Since we are a monitoring service, we’ve enabled this trigger automatically for all customers.
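The essence of such a trigger is a simple check over per-index replica counts, sketched here against data shaped like a `GET /_cat/indices?format=json` response (the cat API reports the `rep` column as a string; the sample rows are made up):

```python
def indices_without_replicas(cat_indices):
    """Return names of indices created with zero replicas."""
    return [row["index"] for row in cat_indices if int(row["rep"]) == 0]

# Made-up sample of GET /_cat/indices?format=json output.
sample_indices = [
    {"index": "monthly-metadata-2016-10", "pri": "20", "rep": "2"},
    {"index": "monthly-metadata-2016-11", "pri": "20", "rep": "2"},
    {"index": "monthly-metadata-2016-12", "pri": "20", "rep": "0"},
]
print(indices_without_replicas(sample_indices))  # ['monthly-metadata-2016-12']
```

Any non-empty result is worth an alert: an index without replicas goes red the moment any node holding its primaries disappears.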

If you run ES, okmeter.io will help you prevent all kinds of trouble with it.

Follow us here, on Twitter, or on our Facebook page to learn more about ES and monitoring in general.

Or sign up with okmeter.io to get your ES and everything else monitored without hassle!
