Monitoring Agent: should be easy… Right?

pavel trukhanov
okmeter.io blog
Dec 5, 2017 · 7 min read

These days there are quite a few systems for storing and processing metrics (time series databases), but the situation with monitoring agents (the software that collects metrics) is not so rosy: there are still not that many options available.

Almost all cloud monitoring services on the market develop their own agents, and we at okmeter.io are no exception. The reason for that is simple: there are many specific requirements that don’t fit well with the architecture of existing solutions.

Here are our main specific requirements:

  • We have to be sure that metrics are delivered into the cloud no matter what.
  • We have complex plugin logic: our plugins depend on and interact with each other.
  • Diagnostics: we should be able to understand why the agent can’t collect certain metrics etc.
  • The agent must use as little of the client server’s resources as possible.

Delivery of metrics into the cloud at any cost

The monitoring system should always be up and running; this is especially critical when there are already issues in the client’s infrastructure. We started by writing all collected metrics to the disk of the server where they were collected (we call this the ‘spool’). The agent then immediately attempts to send the metrics to the collector as a batch and, if successful, deletes that batch from the disk. It’s important not to fill up the client server’s disk, so the spool size is limited to 500 megabytes. If it gets full, we start deleting the oldest metrics first.
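
To make this concrete, here is a minimal sketch of such a spool in Go (the agent is written in Go, as mentioned below). The directory, file naming and size constant are illustrative, not our actual implementation:

// spool.go: a minimal sketch of an on-disk metrics spool with a size cap.
// Batch files are written to spoolDir; when the total size exceeds maxSpoolBytes,
// the oldest batches are deleted first. Names and constants are illustrative.
package spool

import (
    "fmt"
    "io/ioutil"
    "os"
    "path/filepath"
    "sort"
    "time"
)

const (
    spoolDir      = "/var/lib/agent/spool" // illustrative path
    maxSpoolBytes = 500 << 20              // 500 MB cap on the spool
)

// WriteBatch persists a serialized metrics batch and then trims the spool.
func WriteBatch(batch []byte) error {
    name := filepath.Join(spoolDir, fmt.Sprintf("%d.batch", time.Now().UnixNano()))
    if err := ioutil.WriteFile(name, batch, 0600); err != nil {
        return err // e.g. disk full: the caller falls back to sending from memory
    }
    return trimSpool()
}

// trimSpool deletes the oldest batch files until the total size fits the cap.
func trimSpool() error {
    files, err := ioutil.ReadDir(spoolDir)
    if err != nil {
        return err
    }
    sort.Slice(files, func(i, j int) bool { return files[i].ModTime().Before(files[j].ModTime()) })
    var total int64
    for _, f := range files {
        total += f.Size()
    }
    for _, f := range files {
        if total <= maxSpoolBytes {
            break
        }
        if err := os.Remove(filepath.Join(spoolDir, f.Name())); err == nil {
            total -= f.Size()
        }
    }
    return nil
}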

Very soon we found a problem with this approach: the agent can’t work if the server’s hard drive is full. So in that situation, if we can’t write to the disk, we try to send the metrics right away. We decided not to completely change the send logic, because we still want to minimize the time metrics reside solely in the agent’s memory without being persisted.

If a metrics batch fails to send, the agent tries again. The problem is that metrics batches can be rather large, and choosing a single timeout value is quite difficult. If we make it high, a single “stuck” request delays delivery of all metrics. If we make it low, large batches might never get through at all. We cannot break large batches into smaller ones for a number of reasons.

We decided not to use a common timeout for requests to the collector. Instead, we set a timeout on establishing the TCP connection and on the TLS handshake, while the “aliveness” of the connection is checked by TCP keepalive [wiki]. This mechanism is available on just about every modern OS, but where it isn’t (we have clients running FreeBSD 8.x, for example) we have to fall back to a large timeout for the entire request.

This mechanism has three settings:

  • Keepalive time: how long to wait after the last packet of data has been received in the connection before beginning to send keepalive probes
  • Keepalive probes: the number of unsuccessful probes before we consider the connection dead
  • Keepalive interval: the interval between probes

All time intervals are set in seconds, so this mechanism probably won’t be suitable for services that are sensitive to sub-second delays.

The default values are of little use in practice:

$ sysctl -a |grep tcp_keepalive
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

tcp_keepalive_time = 7200 means that Linux will start checking a connection only after 2 hours of inactivity.

tcp_keepalive_probes = 9 and tcp_keepalive_intvl = 75 together mean that a connection will be considered dead only after 9 * 75 seconds ≈ 11 minutes of silence from the other end.

These parameters can be redefined for any specific connection; our goal is to detect a problematic connection and close it as quickly as possible. For example, the settings time = 1, intvl = 1, probes = 3 let us catch a failed connection in just four seconds, while keeping some reliability in failure detection.
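
On Linux these values can be set per socket via setsockopt. Here is a rough sketch in Go (the agent is written in Go, as mentioned below); the connect timeout and function name are illustrative:

// keepalive_linux.go: a rough sketch of per-connection TCP keepalive tuning on Linux.
// The values match the example above: start probing after 1s of silence, probe
// every second, declare the connection dead after 3 failed probes.
package agent

import (
    "net"
    "syscall"
    "time"
)

func dialCollector(addr string) (net.Conn, error) {
    d := net.Dialer{Timeout: 10 * time.Second} // timeout only for connect + handshake
    conn, err := d.Dial("tcp", addr)
    if err != nil {
        return nil, err
    }
    tcpConn := conn.(*net.TCPConn)
    if err := tcpConn.SetKeepAlive(true); err != nil {
        return nil, err
    }
    raw, err := tcpConn.SyscallConn()
    if err != nil {
        return nil, err
    }
    var sockErr error
    err = raw.Control(func(fd uintptr) {
        for opt, val := range map[int]int{
            syscall.TCP_KEEPIDLE:  1, // seconds of idle before the first probe
            syscall.TCP_KEEPINTVL: 1, // seconds between probes
            syscall.TCP_KEEPCNT:   3, // failed probes before the connection is dropped
        } {
            if sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, opt, val); sockErr != nil {
                return
            }
        }
    })
    if err != nil {
        return nil, err
    }
    if sockErr != nil {
        return nil, sockErr
    }
    return tcpConn, nil
}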

If the agent’s connectivity to the collector vanishes completely, there is nothing we can do. However, quite frequently we encounter a situation where a connection is possible, but the DNS server used by the machine the agent runs on isn’t working. In the event of such a DNS error, we decided to try resolving the collector’s domain through one of the external public DNS servers.
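
Here is a sketch of such a fallback in Go, assuming a public resolver like 8.8.8.8 (the address and names are illustrative):

// resolve.go: a sketch of falling back to a public DNS server when the
// system resolver fails. The fallback address is an illustrative choice.
package agent

import (
    "context"
    "net"
    "time"
)

var fallbackResolver = &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
        d := net.Dialer{Timeout: 2 * time.Second}
        // ignore the broken local resolver address and talk to a public one
        return d.DialContext(ctx, network, "8.8.8.8:53")
    },
}

// resolveCollector tries the system resolver first, then the public fallback.
func resolveCollector(ctx context.Context, host string) ([]string, error) {
    addrs, err := net.DefaultResolver.LookupHost(ctx, host)
    if err == nil {
        return addrs, nil
    }
    return fallbackResolver.LookupHost(ctx, host)
}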

Plugins

We at okmeter pay a lot of attention to auto-discovery of services on client servers. This allows our clients to set up monitoring faster and ensures that nothing is left unmonitored. For auto-discovery to work, we need to provide a list of processes to almost every plugin. It is required not only when the agent starts, but constantly, in order to detect newly started services. We get the list of processes once per interval and use it both for calculating process metrics and for passing to the plugins that need it for auto-discovery.
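
A simplified sketch of gathering such a process list on Linux by scanning /proc once per interval; the struct and fields are illustrative, and the real collection gathers far more per process:

// proclist.go: a simplified sketch of building a process list from /proc.
package agent

import (
    "io/ioutil"
    "path/filepath"
    "strconv"
    "strings"
)

type Process struct {
    Pid     int
    Cmdline string
}

func listProcesses() ([]Process, error) {
    dirs, err := ioutil.ReadDir("/proc")
    if err != nil {
        return nil, err
    }
    var procs []Process
    for _, d := range dirs {
        pid, err := strconv.Atoi(d.Name())
        if err != nil {
            continue // not a /proc/<pid> directory
        }
        raw, err := ioutil.ReadFile(filepath.Join("/proc", d.Name(), "cmdline"))
        if err != nil {
            continue // the process may have exited already
        }
        // cmdline arguments are separated by NUL bytes
        cmdline := strings.TrimRight(strings.Replace(string(raw), "\x00", " ", -1), " ")
        procs = append(procs, Process{Pid: pid, Cmdline: cmdline})
    }
    return procs, nil
}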

Any plugin can run other plugins or additional instances of itself. For example, we have an nginx plugin, which once a minute receives the list of processes, and for each nginx it comes across, it:

  • Locates its config.
  • Reads the config and all nested/included configs.
  • Finds all “log_format” and “access_log” directives.
  • Uses the found “log_format”s to generate regular expressions for parsing the corresponding logs.
  • For each access_log file found, runs an instance of the logparser plugin that will parse that file.

If a log is added or a format changes, the logparser settings are updated on the fly: additional plugin instances are launched, and unneeded ones are stopped.
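
A deliberately simplified sketch of the directive-extraction step, assuming the config has already been read and its includes resolved into one string; real nginx configs also require handling quoting and multi-line log_format values, which is omitted here:

// nginxconf.go: a simplified sketch of pulling access_log and log_format
// directives out of an already flattened nginx config. Regexes are illustrative.
package agent

import "regexp"

var (
    accessLogRe = regexp.MustCompile(`(?m)^\s*access_log\s+(\S+)(?:\s+(\S+))?\s*;`)
    logFormatRe = regexp.MustCompile(`(?m)^\s*log_format\s+(\S+)\s+(.+?);`)
)

type accessLog struct {
    Path   string
    Format string // empty means the default "combined" format
}

func parseNginxConf(conf string) (logs []accessLog, formats map[string]string) {
    formats = map[string]string{}
    for _, m := range logFormatRe.FindAllStringSubmatch(conf, -1) {
        formats[m[1]] = m[2]
    }
    for _, m := range accessLogRe.FindAllStringSubmatch(conf, -1) {
        logs = append(logs, accessLog{Path: m[1], Format: m[2]})
    }
    return logs, formats
}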

Logparser can parse not just a single file, but a set of files specified by a glob pattern. However, since we want to parse logs in parallel, the glob pattern is regularly expanded, and the agent launches the necessary number of logparser plugin instances.
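
A sketch of that expansion step, keeping one logparser instance per matched file; startLogparser and stopLogparser are illustrative stubs for the real plugin lifecycle calls:

// logglob.go: a sketch of expanding a log glob once per interval and keeping
// exactly one logparser instance per matched file.
package agent

import "path/filepath"

func startLogparser(path string) {} // stub: would launch a logparser plugin instance
func stopLogparser(path string)  {} // stub: would stop the instance for this file

func syncLogparsers(pattern string, running map[string]bool) error {
    matches, err := filepath.Glob(pattern)
    if err != nil {
        return err
    }
    current := map[string]bool{}
    for _, path := range matches {
        current[path] = true
        if !running[path] {
            startLogparser(path) // a new log file appeared: launch a parser for it
            running[path] = true
        }
    }
    for path := range running {
        if !current[path] {
            stopLogparser(path) // the file no longer matches: stop its parser
            delete(running, path)
        }
    }
    return nil
}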

Sniffing

Another rather clever thing was added recently: a traffic sniffer. Here’s how it’s used by the mongodb plugin:

  • Using the process list, the mongodb plugin finds a running mongo on the server.
  • It tells the sniffer that it wants to receive packets on a specific TCP port.
  • It receives packets from the sniffer, performs additional parsing of the TCP payload, and calculates various metrics.
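
The actual interface isn’t shown in this post, but conceptually the interaction looks roughly like this; all types and method names below are hypothetical:

// sniffer.go: a hypothetical sketch of the sniffer/plugin interaction,
// invented here purely to illustrate the idea.
package agent

// Packet is a captured TCP segment belonging to a subscribed port.
type Packet struct {
    SrcPort, DstPort uint16
    Payload          []byte
}

// Sniffer lets plugins subscribe to traffic on specific TCP ports.
type Sniffer interface {
    Subscribe(port uint16) <-chan Packet
}

// mongoPlugin consumes sniffed packets and turns them into metrics.
type mongoPlugin struct {
    sniffer Sniffer
}

func (p *mongoPlugin) watch(port uint16) {
    for pkt := range p.sniffer.Subscribe(port) {
        p.handleWireMessage(pkt.Payload) // parse the MongoDB wire protocol, update metrics
    }
}

func (p *mongoPlugin) handleWireMessage(payload []byte) {
    // protocol parsing and metric calculation would go here
}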

As you can see, what we end up with are not quite plugins in the usual sense, but full-fledged modules that can interact with each other. Scenarios like these would be very difficult to fit into any existing agent/plugin architecture.

Agent Diagnostics

Supporting our first clients was hell for us. We had to spend a lot of time corresponding with them, asking them to run various diagnostic commands on their servers and send us the output. In order to save face with our clients and reduce the time to repair, we got the agent’s log in order and started delivering it to our cloud in real time.

The log helps us catch most problems, but it can’t completely replace communicating with the client. The most typical problems used to be with service auto-discovery. The majority of clients use relatively standard configs, log formats, etc., and everything runs like clockwork. However, in companies with experienced system administrators the configurability of any software is exploited heavily, and cases arise that we didn’t even know existed.

For example, we recently learned that you can configure PostgreSQL via ALTER SYSTEM SET, which generates a separate config, postgresql.auto.conf, that overrides various values from the primary config.

We’ve started getting the feeling that over time our agent is transforming into a storehouse of knowledge about how various projects tend to be constructed :)

Performance Optimization

We are constantly tracking our agent’s resource consumption. We have several plugins that can generate a significant load on the server: logparser, statsd, and the packet sniffer. In these cases we try to set up various benchmarks, and we frequently profile the code in our staging environment.

We used golang for the latest version of our agent, and it ships with a profiler that can be activated under load. We decided to take advantage of this and taught the agent to regularly dump a CPU profile. This allows us to understand how the agent behaves under real load on clients’ servers.
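
A minimal sketch of such periodic profiling with runtime/pprof; the period, sampling duration and output path are illustrative:

// profiling.go: a minimal sketch of periodically dumping a CPU profile,
// so it can later be shipped to the cloud together with the agent's log.
package agent

import (
    "os"
    "runtime/pprof"
    "time"
)

// profileLoop is meant to run in its own goroutine.
func profileLoop() {
    for {
        time.Sleep(10 * time.Minute) // how often to capture a profile
        // a real agent would log the error; profiling must never break metric collection
        _ = captureCPUProfile("/tmp/agent-cpu.pprof", 30*time.Second)
    }
}

func captureCPUProfile(path string, d time.Duration) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    if err := pprof.StartCPUProfile(f); err != nil {
        return err
    }
    time.Sleep(d) // sample the CPU for the given duration
    pprof.StopCPUProfile()
    return nil
}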

Because the okmeter agent monitors resource usage of absolutely every process running on the server, including itself, our clients can always see how much our agent consumes. For example, on one client’s front-end server the agent parses a log at about 3.5K RPS, yet it consumes only a small amount of CPU time, as you can see here:

Incidentally, there was an nginx-amplify agent running alongside ours on that server at the time. You can see the resources it used while parsing the same log :)

Conclusion

  • It turns out that monitoring agents aren’t quite as simple as they might seem. We spend about half of our time developing and supporting our agent.
  • This is made even harder by the fact that all our clients have different environments and setups, and they all configure their infrastructure differently.
  • Looking back, we realize that we could not have used any existing solution without rewriting it almost completely.

We’ll keep moving forward. Come with us and stay tuned!
