USE, RED and real-world PgBouncer monitoring

Pavel Trukhanov · okmeter.io blog · Sep 25, 2018

Brendan Gregg’s USE (Utilization, Saturation, Errors) method for monitoring is quite well known. There are even some monitoring dashboard templates for it shared on the Internet. There’s also Tom Wilkie’s RED (Rate, Errors, Duration) method, which is suggested to be better suited for monitoring microservices than USE.

We at okmeter.io recently updated our PgBouncer monitoring plugin, and while doing that we tried to comb through everything, using USE and RED as frameworks.

Why did we need both, rather than just sticking with USE, which is more commonly known? To answer that, we first need to understand where each applies. While both methods are known, I don’t think they are widely and systematically applied in the practice of covering IT systems with monitoring.

USE

Using Brendan Gregg’s own words:

For every resource, check utilization, saturation, and errors.

Here a resource is any physical server functional component (CPUs, disks, buses, …), but also some software resources, or software-imposed limits/resource controls (containers, cgroups, etc.).

Utilization: the average time that the resource was busy servicing work. So a CPU or disk IO utilization of 90% means it is idle, not doing work, only 10% of the time and busy 90% of the time. For resources such as memory, where one can’t apply the idea of a “non-idle percentage of time”, one can instead measure the proportion of the resource that is used.

Either way, 100% utilization means no more “work” can be accepted. Either at all (when memory is full, it is full, and you can’t do anything about it), or just at the moment (as with CPU), in which case new work can be put into a waiting list or queue. These two scenarios are covered by the remaining two USE metrics:

Saturation: the degree to which the resource has extra work which it can’t service, often queued.

Errors: the count of error events, such as “resource is busy”, “Cannot allocate memory”, “Not enough space”. While these usually don’t impact performance directly, they either lead to client errors or, through retries, failover to redundant devices and so on, lead to a performance impact from the client’s point of view anyway.

RED

Tom Wilkie, a former Google engineer and now Grafana Labs’ VP of Product, was frustrated with the USE performance monitoring methodology. For instance, how would you measure the saturation of memory? Error counts can also be problematic to obtain, especially for things like IO errors or memory bandwidth.

The nice thing about this [USE] kind of pattern is that it turns the guesswork of figuring out why things are slow into much more of a methodological approach.

So, as an alternative, Wilkie suggests another easy-to-remember acronym, RED:

  • (Request) Rate: The number of requests per second.
  • (Request) Errors: The number of failed requests.
  • (Request) Duration: The amount of time it takes to process a request, a.k.a. service latency.

RED, though, is designed only for request-driven services, as opposed to, for instance, batch-oriented or streaming services.

So how is it better?

RED offers a way of looking at service functioning and performance that is consistent across different services, thus reducing the cognitive load on on-call engineers, which is crucial in times of outages.

PgBouncer

PgBouncer is, after all, a connection pooling service, and as such it can be monitored with RED. But it also has all kinds of internal limits and limited resources, so we concluded we were going to need USE as well :)

Applying these methods to PgBouncer should be done with regard to its main purpose, and one should also know the specifics of its internal structure — all the software resources and limits to cover with USE.

It’s not enough to monitor PgBouncer as a black-box network service — to know whether a Linux process is alive and a TCP port is open. You actually need to know whether it’s working properly from the client’s point of view — proxying SQL transactions and queries in a timely manner.

So here’s how it looks from the point of view of a client, say, some web application:

  1. The client connects to PgBouncer.
  2. The client makes an SQL request / query / transaction.
  3. It gets a response.
  4. Steps 2–3 repeat as many times as needed.

Here’s the client connection state diagram:

During LOGIN (CL_ stands for client) PgBouncer might authorize a client based on some local info (such as an auth_file, certificates, PAM or hba files), or remotely, with an auth_query in a database. So a client connection, while logging in, might need to run a query and actually be executing one. We show that as the Executing substate:

But CL_ACTIVE clients might also actually be executing queries, and thus be linked by PgBouncer to actual database server connections, or be idling, doing nothing. This linking/matching of client and server connections is the whole raison d’être of PgBouncer. PgBouncer links a client with a server only for some time, depending on pool_mode — either for a session, a transaction, or just one statement.

As transaction pooling is the most common mode, we’ll assume it for the rest of this post.

So a client in the cl_active state might or might not actually be linked to a server connection. To account for that we split this state in two: active and active-linked/executing. So here’s a new diagram:

These server connections, that clients get linked to, are “pooled” — limited in number and reused. Because of that, it might happen that while a client sends a request (beginning a transaction or performing a query), the corresponding server connection pool is exhausted, i.e. PgBouncer has already opened as many connections as it was allowed and all of them are occupied by (linked to) other clients. In this scenario PgBouncer puts the client into a queue, and this client’s connection goes into the CL_WAITING state. This can happen while a client is only logging in as well, so there’s CL_WAITING_LOGIN for that too:

On the other end there are server connections — from PgBouncer to the actual database. Those have the respective states: SV_LOGIN while authorizing, SV_ACTIVE while linked with (and used or not by) a client connection, and SV_IDLE if free.

USE and PgBouncer

Thus we can formulate a (naive) version of a specific PgBouncer pool utilization:

pool_u = #_server_connections_utilized_by_clients / pool_size

PgBouncer has an administration interface available through a connection to a special ‘virtual’ database named pgbouncer. There are a number of SHOW commands in it; one of them — SHOW POOLS — shows the number of connections in each state for each pool:

Here we see 4 client connections opened, all of them cl_active, and 5 server connections: 4 sv_active and one in the sv_used state.
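
To make this concrete, here’s a minimal sketch of reading SHOW POOLS in Python with psycopg2. The host, port and user are placeholder assumptions, and note that the admin console can’t run inside a transaction, so autocommit is required:

import psycopg2

# placeholder connection settings for the pgbouncer admin console
conn = psycopg2.connect(host="127.0.0.1", port=6432,
                        dbname="pgbouncer", user="stats_user")
conn.autocommit = True  # the admin console doesn't support transactions

with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    columns = [desc[0] for desc in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(columns, row))
        # cl_* are client connections, sv_* are server connections
        print(pool["database"], pool["user"],
              "cl_active:", pool["cl_active"],
              "cl_waiting:", pool["cl_waiting"],
              "sv_active:", pool["sv_active"],
              "sv_idle:", pool["sv_idle"])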

One can collect this SHOW POOLS output, using, for example, some Prometheus exporter, and chart it to get something like this:

But how do we get utilization from that? We need to answer these first:

  • What is the pool size?
  • How do we count utilized connections? As a current number or as a percentage of time? On average or at peak usage?

Pool size

It’s not that simple: PgBouncer has 5 different settings related to limiting connection counts!

  • You can specify pool_size for each proxied database. This will create a separate pool of that size for every user connecting to that database. If not set, it defaults to the default_pool_size setting, which in turn has a default value of 20. So if you have multiple users in your database (and you probably should), it might by default create 20 Postgres connections per user, which seems alright. But each Postgres connection is served by a separate postgres process, and if you have many users, and thus many pools, you might end up with a pretty high total number of Postgres server processes (aka backends).

Suggestion: if you don’t have a limit on the number of different users in your database, your PgBouncer probably also has automatically created database pools. Set a pretty low default_pool_size, just in case.

  • max_db_connections is exactly suited to cover this problem — it limits the total number of connections to any database, so badly behaving clients won’t be able to create too many Postgres backends. But max_db_connections is not set by default, so it’s unlimited ¯\_(ツ)_/¯

Suggestion: limit it! As a baseline you can use, for example, Postgres’s max_connections setting, which is 100 by default. But don’t forget to adjust it if you have multiple PgBouncer instances going directly to one DB server.

  • reserve_pool_size is a limit on an additional, reserve pool, which kicks in if a regular pool is exhausted, i.e. there are already pool_size open server connections. In that case PgBouncer may open these additional connections. As I understand it, this was designed to help serve bursts of clients, but it’s not very useful for that: at a moment of peak load, when the pool is exhausted and the DB is already having a hard time serving all that load, opening more connections to it won’t do any good. But this reserve pool is handy for watching for pool saturation, as we’ll discuss later.
  • max_user_connections — this limits the total number of connections to any database from one user. From my point of view, it’s a rather strange limit; it makes sense only in the case of multiple databases with the same users.
  • max_client_conn — limits the total number of incoming client connections. It differs from max_user_connections in that it includes connections from any user. If you see errors like no more connections allowed in the pgbouncer log, it means max_client_conn has been reached. By default it is set to 100, and once that is reached pgbouncer will just reset any new incoming TCP connection.

Suggestion: you might want to set max_client_conn much greater than the sum of all pool_size and reserve_pool_size values, like 10 times greater, maybe.
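
Putting these suggestions together, the relevant part of a pgbouncer.ini might look something like the following sketch. The numbers are purely illustrative and need tuning for your own workload and Postgres limits:

[pgbouncer]
pool_mode = transaction
; keep per-(database, user) pools modest
default_pool_size = 10
; cap total connections per database, e.g. in line with Postgres's max_connections
max_db_connections = 100
; a small reserve pool, also handy for saturation detection (see below)
reserve_pool_size = 5
; much greater than the sum of all pool_size + reserve_pool_size values
max_client_conn = 1000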

Besides SHOW POOLS, PgBouncer’s administration interface also has a SHOW DATABASES command, which shows the actually applied limits and all configured and currently present pools:

Server connection monitoring

So let’s return to the question — how do we count utilized connections? Do we just take the current number, or should we measure it as a percentage of time? Should that be average or peak usage?

In practice it’s not so easy to properly track pool utilization, because pgbouncer reports many indicators only in the form of current values. That allows only a sampling mode of metrics collection, with a chance of artifacts. Here’s a real-life example where, depending on when pgbouncer metrics collection happened, at the start of a minute or at the end of it, one sees quite a different picture of pool utilization:

There were no changes in the load profile during the charted period. Check out these Postgres connections and PgBouncer file usage charts for the same period — no changes at all:

So, implementing our pgbouncer monitoring, we decided to provide a combined picture to our clients: our monitoring agent samples SHOW POOLS every second and once a minute reports the average as well as the peak count of connections in each state:

Dividing this by pool_size gives you average and peak pool utilization as a percentage, so you can trigger an alert if it gets anywhere close to 100%.
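
Here’s a rough sketch of that kind of sampling, assuming a hypothetical fetch_pools() helper that returns SHOW POOLS rows as dicts (like in the earlier snippet):

import time
from collections import defaultdict

def sample_minute(fetch_pools, interval=1.0, samples=60):
    # (database, user, state) -> list of sampled counts
    counts = defaultdict(list)
    for _ in range(samples):
        for pool in fetch_pools():
            for state in ("cl_active", "cl_waiting", "sv_active", "sv_idle"):
                counts[(pool["database"], pool["user"], state)].append(pool[state])
        time.sleep(interval)
    # report both the average and the peak for each state
    return {key: {"avg": sum(vals) / len(vals), "peak": max(vals)}
            for key, vals in counts.items()}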

PgBouncer also provides a SHOW STATS command, which provides stats (not a surprise, I know) on requests and traffic for every proxied database:

Here, for the purpose of measuring pool utilization, we are mostly interested in total_query_time — the total number of microseconds spent by pgbouncer actively connected to PostgreSQL, executing queries. Dividing its increase by the respective pool size (considering pool size to be the number of query-seconds that all the server connections together can spend serving queries within one wall-clock second) we get another measure/estimate of pool utilization; let’s call it “query time utilization”. Unlike a utilization estimate calculated from server connection counts, this one is not prone to sampling problems: total_query_time is a cumulative sum, so it won’t miss a thing.
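
As a sketch, this “query time utilization” over a collection interval could be computed like so (the function and argument names are mine, not PgBouncer’s):

def query_time_utilization(prev_total_query_time_us, curr_total_query_time_us,
                           interval_seconds, pool_size):
    # a pool of N connections can spend at most N * interval seconds
    # of query time during the interval
    capacity_us = pool_size * interval_seconds * 1_000_000
    return (curr_total_query_time_us - prev_total_query_time_us) / capacity_us

# e.g. 48 seconds of query time over a 60 second interval with pool_size = 1
# gives 0.8, i.e. 80% utilization
print(query_time_utilization(0, 48_000_000, 60, pool_size=1))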

Compare this:

To the one we saw before:

You can see that the latter doesn’t really show all the moments when utilization was ~100%, while the first chart, with “query time utilization”, does.

Monitoring of PgBouncer pools saturation

Let’s discuss for a moment why we need a Saturation metric at all, when we can tell whether everything is overloaded just by looking at whether Utilization is high or not.

The problem is that even with cumulative stats like total_query_time, one can’t tell whether there were short periods of high utilization between the two moments when we looked at the stats. For example, you might have some cron jobs configured to start simultaneously and make queries to a database. If these queries are short enough, i.e. shorter than the stats collection period, the measured utilization might still be low, while at the moment the crons started they might have exhausted a resource (be that a connection pool or something else). In that case, as we discussed, they probably waited in a queue of some sort. But they might also have affected queries coming from other clients, leading to a local performance degradation from those clients’ point of view. Looking at the Utilization metric alone, you won’t be able to diagnose that.

How can we track that in PgBouncer? A straightforward (and naive) approach is to count clients in the cl_waiting state, which we discussed, in the SHOW POOLS output. Under normal circumstances you won’t see any, and a number of waiting clients greater than 0 means pool saturation, as here:

But as you know, you can only sample SHOW POOLS, and that leaves the possibility of missing such waits.

Here’s where we can use PgBouncer’s built-in saturation handling — as I wrote before, you can configure it to open additional connections when a pool fills up, just by setting a non-zero reserve_pool_size. Then we can detect pool saturation by comparing the number of open server connections to the respective pool_size: if it exceeds the limit, the pool was saturated at some point:

Here we can clearly see a picture of some sort of cron/periodic jobs that kick in at the start of each hour and saturate this pool. And even though at no sampled moment do we see the number of active connections exceeding the pool_size limit, we know for sure that pgbouncer detected saturation and opened reserve connections.
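
In code, both saturation signals could be combined into one check. This is only a sketch: pool is a SHOW POOLS row as before, and pool_sizes is an assumed mapping from database to its configured pool_size (e.g. taken from SHOW DATABASES):

def pool_saturated(pool, pool_sizes):
    # clients already queued: a sure sign, but easy to miss when sampling
    waiting_clients = pool["cl_waiting"] > 0
    # more server connections opened than pool_size means the reserve
    # pool kicked in at some point
    open_server_conns = (pool["sv_active"] + pool["sv_idle"] + pool["sv_used"]
                         + pool["sv_tested"] + pool["sv_login"])
    reserve_used = open_server_conns > pool_sizes[pool["database"]]
    return waiting_clients or reserve_used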

There’s another related setting — reserve_pool_timeout — which defines how long a client has to wait before pgbouncer starts using the reserve pool. It defaults to 5 seconds, so if you’re going to use the reserve pool for pool saturation detection, you should probably set it quite low.

While I showed the problems caused by SHOW POOLS data allowing only sampled metrics collection, it is still very useful to monitor client connection states. Thanks to distinct pools for different users, one can see which users are actively using actual server connections (i.e. linked with them). At okmeter.io you can chart it like this:

sum_by(user, database, 
metric(name="pgbouncer.clients.count", state="active-link")
)

And here’s an example chart:

At okmeter.io we provide even deeper detail on that usage: you can see the distribution of client IP addresses. This allows you to distinguish not only the most active DB users, but also the most active application instances:

In this example, you can see the IP addresses of specific Kubernetes pods with web application instances running in them.

Errors

For server connection pool exhaustion there are the saturation metrics above. But for client connections there are some limits too, as we discussed. Reaching those produces not queueing and waiting, but blunt denial-of-service errors, which you should monitor for in the pgbouncer logs, where you might find some of these:

launch_new_connection: user full 
launch_new_connection: database full
no more connections allowed
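
Here’s a crude sketch of counting those messages in the log. The log path is a placeholder; in a real setup you’d tail the file or plug into an existing log pipeline:

ERROR_PATTERNS = (
    "no more connections allowed",
    "launch_new_connection: user full",
    "launch_new_connection: database full",
)

def count_limit_errors(log_path="/var/log/pgbouncer/pgbouncer.log"):
    counts = {pattern: 0 for pattern in ERROR_PATTERNS}
    with open(log_path) as log:
        for line in log:
            for pattern in ERROR_PATTERNS:
                if pattern in line:
                    counts[pattern] += 1
    return counts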

RED monitoring and PgBouncer

While USE is designed to find performance issues and bottlenecks, RED is more about characterizing the workload, i.e. incoming and outgoing traffic. So RED will tell you whether everything works as intended (or at least as before), and if something’s not right, USE will help find the cause.

Request Rate

This one is, at first glance, pretty straightforward for an SQL proxy/pooler. Clients send requests (transactions and queries) in the form of SQL statements. You will find total_requests in the SHOW STATS output. Let’s chart its rate, which in okmeter.io is simply rate(metric(name="pgbouncer.total_requests"))

We can clearly see daily changes and short, anomalous spikes in usage.

Request Errors

Well, RED and USE have the “E” in common, which is Errors. The way I see it, USE’s errors are more about the case when the service/resource we’re monitoring is unable to handle more load, while RED’s errors are more about errors from the client’s point of view: statement timeouts (“canceling statement due to statement timeout”), rollbacks, etc.

Request Durations

Here again we can use SHOW STATS with its cumulative total_query_time and total_requests. Dividing one by the other we get the average query time, and if you track that over time, you get an average query time chart:

We clearly see that most of the time it is pretty stable, while there were some anomalous spikes at 19:30 and later. Having that, we could dig deeper using more detailed PgBouncer metrics, or we might need to look further into Postgres metrics.
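
For reference, the same division can be sketched over deltas of those two cumulative counters (the names are mine; total_query_time is in microseconds, the result is in milliseconds):

def avg_request_duration_ms(delta_total_query_time_us, delta_total_requests):
    if delta_total_requests == 0:
        return 0.0
    return delta_total_query_time_us / delta_total_requests / 1000.0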

I hope you find this write-up useful. I’ve tried to cover all the bases, but if you feel you have something to add — please tell me, I’ll be glad to discuss it.

I’m preparing the next article on PgBouncer and Postgres metrics and monitoring, so if you’re interested, follow us here, on Facebook or Twitter to stay tuned!

Our monitoring service, okmeter.io, will help you stay on top of everything happening with your PostgreSQL, RDS and other infrastructure services.
