PostgreSQL: Exploring how SELECT Queries can produce disk writes

Nikolay Sivko
okmeter.io blog
Published in
4 min readMar 4, 2018

--

We already wrote about monitoring posgresql queries, at the time we thought that we completely understood how PostgreSQL works with various server resources.

Working regularly with the statistics of PostgreSQL queries, we noticed some anomalies and decided to dig a bit deeper for better understanding. Through this process, we found that while the behavior of postreSQL is kind of strange at first glance (or at least very peculiar), the clarity of its source code is quite admirable.

A SELECT query can “dirty” some pages

It turns out that in PostgreSQL, SELECTs (which usually believed to be read-only)may cause modifications to some database records, which postgres will then write to disk.

Wait, what?

PostgreSQL uses MVCC (MultiVersion Concurrency Control) technology for transactional integrity. All changes to the database records only happen in transactions. Each transaction is assigned an identification number. Postgresql refers to this transaction ID as txid (int32). Table data — records — is represented in tuples. A tuple contains data from one particular row, as well as metadata associated with this row:

Picture: www.interdb.jp

Postgresql refers to the id of a transaction that created this particular tuple as the xmin (or t_xmin) of a tuple. And the id of a transaction that marked this tuple as deleted (if any) is referred to as the xmax (or t_xmin).

Here’s how they are calculated and changed:

  • An INSERT to a table creates a new tuple with xmin = txid of that INSERT transaction.
  • DELETE marks the tuples as deleted, setting its xmax = txid of that DELETE
  • UPDATE works as a combination of a DELETE and an INSERT.

The SELECT statement queries the database and retrieves selected data from a specified table, while also performing a visibility check, which goes as follow:

At a high level, a transaction with the txid1 identifier will ”see” a specific tuple, only if the following conditions are met:

xmin <= txid1 <= xmax

Although tuple changes occur immediately, transactions may take a long time to complete. This is why during the visibility check it is necessary to also check whether transactions with the identifiers xmin and xmax have been completed or not, and what was the status of each completed transaction.

Because, for example, while for a particular tuple xmin might be < txid1, the transaction with txid = xmin, that created this tuple, might still be in-flight, and still might fail later, leading to deletion of this tuple. So this tuple should nevertheless be “invisible” for txid1.

PostgreSQL stores information about the current state of each transaction in a commit log (CLOG). Checking the states of a large number of transactions in CLOG is resource-intensive so Postgres therefore caches the information about transaction states directly in the header of the tuple. For example, if during a SELECT it is recognised that an xmin transaction is completed, Postgresql saves this knowledge into so called hint bits of the tuple. Both xmin and xmax statuses are recorded in these hint bits, which are placed in an infomask part of the tuple header.

We previously described the process of changing tuples, however, to complete our investigation , we need to clarify the meaning of “dirty pages” in Postgresql. PostgreSQL works with the information stored on the disk, as well as in memory, and organizes data by blocks, or “pages”. This is done to boost efficiency. Each page contains a number of tuples and their associated metadata. If a single tuple is modified, an entire page is marked as “dirty”. This implies that there is a difference between this data in memory and the corresponding data saved to disk. This modified page must therefore be synchronized with that on the disk. Additionally, these modifications are recorded into the WAL (write-ahead log). This is done for the purpose of restoring data integrity (in the event that the database process terminates abnormally).

SELECT can cause synchronous writes on the disk:

As you might know, PostgreSQL works with data using a buffer cache. If needed data is not in the buffer cache, PostgreSQL reads it from the disk and puts into the buffer cache . If there is not enough space in this cache, then the least requested page is pushed out, evicted. If this page turns out to be in a “dirty” condition at the time of an eviction, then it must be written to disk in this exact moment.

Conclusion:

Most cases of “strange” PostgreSQL behaviour are caused by built in functionality, which is intended to optimize the efficiency and performance of a database.

Subscribe to us here, on Twitter or on our Facebook page to receive okmeter updates, or sign up with okmeter.io directly.

--

--