Monitoring (with) Elasticsearch: A few more circles of hell

pavel trukhanov
Published in okmeter.io blog
Mar 27, 2018


This is the second part of our two-part article series devoted to Elasticsearch monitoring. The heading of this article refers to Dante Alighieri’s “Inferno”, in which Dante offers a tour through the nine increasingly terrifying circles of hell. Our journey into Elasticsearch monitoring was also filled with hardships, but we have overcome them and found solutions for each case.

In the first part we described our setup and defined terms such as Node, Index, Shard, and Replica, which we will use in this part too. If you don’t know these terms, you might want to read the first part now. This part focuses on various situations that come up while monitoring Elasticsearch. We describe common problems here, and since every case is different and the right solution depends on your goals and objectives, this reading should also be worthwhile for understanding Elasticsearch in general.

Split brain

Split brain is the most notorious issue with elastic. It is a situation when a second master node mistakenly appears in a cluster due to poor network connectivity between nodes, or high response latency of a single node (e.g. caused by long GC pauses).

When this scenario occurs, an index ends up with two diverging versions: documents get indexed in different parts of the same cluster, which leads to inconsistent search results, i.e. the same search request may return different results. Recovering the indices afterwards is a very difficult task, which basically requires fully re-indexing or restoring the whole index from a backup and then somehow (probably manually) replaying all the changes made since that backup.

ES features a mechanism to protect against ‘split brain’, and the key setting is discovery.zen.minimum_master_nodes. Keep in mind that it is set to 1 by default, so no protection is active out of the box.

We reproduced this on our test Elasticsearch cluster and decided to add two automatic triggers: one detects multiple master nodes in a cluster, while the other generates an alert when discovery.zen.minimum_master_nodes is less than recommended (a quorum of N/2+1, where N is the number of master-eligible nodes). This should be monitored too, as you might add nodes while forgetting to update minimum_master_nodes (which is actually a pretty common scenario). Go check your settings now if you’re not using okmeter =) or if you’re not 100% sure they are right.
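For illustration, here is a minimal sketch of both checks in Python, using the requests library against the standard _cat/master and _cluster/settings APIs. The node addresses are hypothetical, and note that if minimum_master_nodes lives only in elasticsearch.yml, the dynamic settings API will not show it:

```python
# A minimal sketch of the two split-brain triggers described above.
# Assumption: ES is reachable over HTTP on every node listed in NODES.
import requests

# hypothetical list of master-eligible nodes in the cluster
NODES = ["http://es101:9200", "http://es102:9200", "http://es103:9200"]

def observed_masters(nodes):
    """Ask every node who it thinks the master is (GET /_cat/master)."""
    masters = set()
    for node in nodes:
        line = requests.get(node + "/_cat/master", timeout=5).text.strip()
        if line:
            masters.add(line.split()[0])   # first column is the master node id
    return masters

def configured_minimum_master_nodes(node):
    """Read discovery.zen.minimum_master_nodes from the dynamic cluster settings."""
    settings = requests.get(node + "/_cluster/settings",
                            params={"flat_settings": "true"}, timeout=5).json()
    for scope in ("transient", "persistent"):
        value = settings.get(scope, {}).get("discovery.zen.minimum_master_nodes")
        if value is not None:
            return int(value)
    return 1   # the ES default when nothing is set dynamically

masters = observed_masters(NODES)
if len(masters) > 1:
    print("ALERT: split brain, multiple masters seen:", masters)

quorum = len(NODES) // 2 + 1
if configured_minimum_master_nodes(NODES[0]) < quorum:
    print("ALERT: minimum_master_nodes is below the quorum of", quorum)
```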

Requests monitoring

We have covered cluster state monitoring in our previous post about ES, now let’s see how many requests are served by a cluster, what these requests are, and how quickly they are served. For presentation purposes, we will use our own setup as an example.

Most of the search requests in our system are performed against three indices, with each index split into 20 shards. As a result, the initial ~250 search queries per second sent from our application code to ES turn into ~250 x 3 x 20 = 15,000 shard-level requests per second.

Also, we issue about ~200 indexing requests per second, but ES performs ~600 indexing operations per second because each shard is stored in three copies (the primary + 2 replicas).

ES provides very little statistical data on request timings: for each request type, only a counter of the total number of processed requests and a cumulative sum of the time spent processing them. Thus, we can calculate only an average response time per request type and generate the corresponding chart (we compute the average directly during chart rendering and store the raw counters, so we can chart them as well):
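For reference, this is roughly how the average is derived from the two counters. The host is an assumption, and a real agent would keep the previous sample between collection intervals instead of sleeping:

```python
# A rough sketch: average search latency from query_total and
# query_time_in_millis exposed by the indices stats API.
import time
import requests

ES = "http://localhost:9200"   # hypothetical node address

def search_counters():
    stats = requests.get(ES + "/_stats/search", timeout=5).json()
    search = stats["_all"]["total"]["search"]
    return search["query_total"], search["query_time_in_millis"]

count_1, millis_1 = search_counters()
time.sleep(60)                         # one collection interval
count_2, millis_2 = search_counters()

delta_count = count_2 - count_1
if delta_count > 0:
    print("queries/sec:", delta_count / 60.0)
    print("avg latency, ms:", (millis_2 - millis_1) / delta_count)
```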

The search queries are shown in orange. As you can see, at 16:00 the latency dropped by a factor of ~3.

This was the result of a force merge that we applied to the segments. Our indices were split by month, i.e. indexing went into the current month’s index, while searches covered an interval of three months. Since only ‘read’ operations were performed against the indices of previous months, we could ‘force merge’ their segments directly under workload:

Finally, when only one segment remained in each shard, search requests against these indices sped up dramatically, as you can see in the picture above showing client request latencies.

Since then we have regularly performed a ‘force merge’ at the end of each month on the ‘closed’ monthly indices (i.e. those that we will not change or re-index anymore). We highly recommend this procedure to minimize request latencies.
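A minimal sketch of such a monthly merge is shown below. The index name and host are hypothetical; on elastic 1.x/2.0 the endpoint is _optimize, from 2.1 onwards it is _forcemerge:

```python
# Force merge a 'closed' monthly index down to one segment per shard.
import requests

ES = "http://localhost:9200"          # hypothetical node address
index = "metrics-2018.02"             # hypothetical monthly index we no longer write to

resp = requests.post(
    ES + "/" + index + "/_forcemerge",   # use /_optimize on older versions
    params={"max_num_segments": 1},      # merge every shard down to a single segment
    timeout=3600,                        # merging can take a long time
)
print(resp.status_code, resp.text)
```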

Charts of background operations

In addition, okmeter shows charts for background operations, whether performed by elastic on its own or triggered by a ‘force merge’. We have separated these operations into additional charts in order to distinguish user requests from ‘system’ requests. The two usually differ significantly both in latency (seconds vs. milliseconds) and in request count (user requests being far more numerous).

The following chart represents the ‘merge’ operation described above, which improved the request latencies seen on our charts. To understand how the computational resources of our cluster were consumed, it seemed more reasonable to look at the total time ES spent on these operations rather than at their latencies, which matter less for background work:

In terms of Linux system metrics, this ‘merge’ looked like resource-intensive ‘write’ activity on the disk drive produced by the ES process:

Caches

To ensure quick request processing, elastic uses the following caches:

  • query_cache (formerly filter_cache) — bitsets of documents matching a certain filter in a request
  • fielddata_cache — field values used for aggregations
  • request_cache — a per-shard cache of whole responses to a request

Our monitoring agent registers the following parameters for each of these caches: size, hits, misses, and evictions.
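These statistics come from the node stats API. Below is a minimal sketch of pulling them; the host is an assumption, and since the exact field names vary between elastic versions (e.g. fielddata has no hit/miss counters on some of them), only the keys that are actually present are printed:

```python
# Per-node cache statistics from GET /_nodes/stats/indices.
import requests

ES = "http://localhost:9200"    # hypothetical node address
CACHES = ("query_cache", "fielddata", "request_cache")
FIELDS = ("memory_size_in_bytes", "hit_count", "miss_count", "evictions")

stats = requests.get(ES + "/_nodes/stats/indices", timeout=5).json()
for node_id, node in stats["nodes"].items():
    for cache in CACHES:
        data = node["indices"].get(cache)
        if not data:
            continue                      # e.g. it is filter_cache on very old versions
        print(node.get("name", node_id), cache,
              {k: v for k, v in data.items() if k in FIELDS})
```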

For example, one day our elastic crashed with an OutOfMemory error. It was difficult to figure out the reason from the logs, but we noticed a sharp increase in memory usage by the fielddata cache:

The most interesting thing was that we used only basic functionality, without any Elasticsearch aggregations or scoring, so the fielddata cache should not have been used here at all. We only searched for documents by exact values of certain fields. So why was the fielddata cache utilized so heavily?

We also reproduced this in a controlled experiment: we used ‘curl’ to make heavy aggregation requests and eventually got multiple crashes. Basically, such crashes can be prevented by properly setting memory limits for fielddata, but that simply did not work, probably due to bugs in our legacy elastic 1.7.
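For reference, these are the kinds of limits involved. The 40% value is only an example, and as noted above, on our old 1.7 cluster these limits did not save us from OOM:

```python
# A sketch of capping fielddata memory usage.
import requests

ES = "http://localhost:9200"    # hypothetical node address

# the fielddata circuit breaker is a dynamic cluster setting
requests.put(
    ES + "/_cluster/settings",
    json={"persistent": {"indices.breaker.fielddata.limit": "40%"}},
    timeout=5,
)

# indices.fielddata.cache.size, by contrast, is static and has to go into
# elasticsearch.yml on every node, e.g.:
#   indices.fielddata.cache.size: 30%
```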

System metrics

In addition to the internal metrics of elastic, it is reasonable to look at it as a regular operating system process, characterized by CPU time, memory usage, I/O load on the drives, etc.
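As a rough illustration, a minimal sketch of that OS-level view using the psutil library is shown below. The PID is passed in by hand here; in practice it comes from the auto-discovery described further down, and the io_counters fields are Linux-specific:

```python
# The ES node seen as an ordinary OS process.
import psutil   # third-party: pip install psutil

def es_process_metrics(pid):
    proc = psutil.Process(pid)
    cpu = proc.cpu_times()       # user/system CPU seconds
    mem = proc.memory_info()     # resident set size, etc.
    io = proc.io_counters()      # read/write bytes (Linux)
    return {
        "cpu_user_s": cpu.user,
        "cpu_system_s": cpu.system,
        "rss_bytes": mem.rss,
        "disk_read_bytes": io.read_bytes,
        "disk_write_bytes": io.write_bytes,
    }

print(es_process_metrics(12345))   # hypothetical ES pid
```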

A very long time ago, when we decided to update ES from version 1.7.5, our choice was to install 2.4 (version 2.5 might still be unstable). Since a standard in-place major update of elastic is a bit risky, we recommend deploying a second cluster and synchronously building a copy of the data there; our code can index into multiple clusters at once, which is exactly what we did.

After adding the new cluster to indexing, we discovered that the new ES performed ~350 ‘write’ operations per second while the legacy cluster performed only ~25 operations:

As shown above, es101 is a node from the legacy cluster, and es106 is a node from the new one. Since we had decided not to install SSDs in the new nodes (as they had plenty of RAM), the multitude of I/O operations eventually caused very poor performance.

Then we went through all the release notes for elastic 2.x and finally discovered that index.translog.durability is set to request by default, i.e. the translog must be fsynced to disk after every indexing request. When we changed it to async with the default sync_interval of 5 seconds, the overall performance improved significantly.
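A minimal sketch of that change is shown below. The index name and host are assumptions, and the default sync_interval of 5 seconds is left as-is:

```python
# Switch an index to asynchronous translog fsync.
import requests

ES = "http://localhost:9200"       # hypothetical node address
index = "metrics-2018.03"          # hypothetical index receiving writes

resp = requests.put(
    ES + "/" + index + "/_settings",
    # fsync every sync_interval (5s by default) instead of on every request
    json={"index.translog.durability": "async"},
    timeout=5,
)
print(resp.status_code, resp.text)
```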

In addition to the system metrics of the ES process, it is also useful to track JVM metrics, such as GC activity, memory pools, etc. Our agent detects and collects these metrics automatically through JMX and displays the corresponding charts.

Automatic discovery of ES

In our recent post we described how much effort we spend ensuring that all of the services on our customers’ servers are automatically monitored. Such an approach ensures comprehensive monitoring along with much faster deployment.

In general, automatic detection for ES works this way:

  • check the list of processes for one that looks like ES, i.e. a JVM launched with the org.elasticsearch.bootstrap class
  • find the ES configuration file based on the launch command syntax
  • read the configuration file to identify the listening IP address and port

Once this process is complete, the remaining operations are simply to read the metrics via the standard API and send them to the cloud.
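A much-simplified sketch of the discovery steps above is given below. The real agent handles many more corner cases (environment variables, defaults, multiple instances); the fallback config path and the naive YAML parsing here are assumptions:

```python
# Find a JVM running org.elasticsearch.bootstrap, extract the config path
# from its command line, and read the HTTP address from elasticsearch.yml.
import os
import psutil   # third-party: pip install psutil

def find_elasticsearch_processes():
    for proc in psutil.process_iter():
        try:
            cmdline = proc.cmdline()
        except psutil.Error:
            continue
        if any("org.elasticsearch.bootstrap" in arg for arg in cmdline):
            yield proc, cmdline

def config_dir_from_cmdline(cmdline):
    # 1.x/2.x pass the config dir as a -Des.path.conf=... system property
    for arg in cmdline:
        if arg.startswith("-Des.path.conf="):
            return arg.split("=", 1)[1]
    return "/etc/elasticsearch"        # assumed fallback location

def http_endpoint(config_dir):
    host, port = "127.0.0.1", "9200"   # reasonable defaults
    path = os.path.join(config_dir, "elasticsearch.yml")
    if os.path.exists(path):
        for line in open(path):
            line = line.split("#")[0].strip()
            if line.startswith("network.host:"):
                host = line.split(":", 1)[1].strip()
            elif line.startswith("http.port:"):
                port = line.split(":", 1)[1].strip()
    return "http://%s:%s" % (host, port)

for proc, cmdline in find_elasticsearch_processes():
    endpoint = http_endpoint(config_dir_from_cmdline(cmdline))
    print("found ES pid=%d, metrics endpoint: %s" % (proc.pid, endpoint))
```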

Takeaways

When organizing the monitoring of a service, we always build our solutions around real use cases. It is essential to fully understand a service in order to determine its potential weak spots. This is why we first implemented support for the technical services that are familiar and widely used in our company.

In addition, feedback from our customers is greatly appreciated. We are constantly improving the dashboards and automatic triggers, so that users can see the root causes of issues directly instead of digging through complicated charts.

If you have an ES cluster which is still not being monitored, try our free 2-week trial; it might be the very thing you need.
