TLDR: Run pglocklogger on your prod databases. Whenever there's an outage due to blocking in
PostgreSQL, it may give you essential observability.
Recently, I experienced a prod outage caused by a blocked migration.
The migration itself was almost instantaneous, but it had to acquire a lock before it could
actually run. Acquiring that lock may take a while, but that's no problem, right? The migration
just runs a bit later, no worries.
That's when I learned that an ALTER TABLE waiting on a lock itself blocks benign work that
would otherwise run concurrently. So a blocked ALTER TABLE can easily spiral into an
application-wide outage.
For example, a SELECT ... FOR UPDATE can run simultaneously with a plain SELECT, but not if an
ALTER TABLE is queued in between: new lock requests wait behind the ALTER TABLE so that it isn't
starved forever. A simple solution is to set a lock_timeout in your migrations, which causes the
ALTER to fail early and allows all other work to continue. Then, retry the migration later.
I figured this out during the post-mortem analysis, but during the outage itself it was hard to
infer, even from PostgreSQL's own slow query log. There are a number of resources on how to
discover this during an outage, but afterwards, all that information is gone.
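For reference, the live view those resources typically describe boils down to a query like this one, built on pg_stat_activity and pg_blocking_pids() (available since PostgreSQL 9.6). It only helps while the pile-up is actually happening:

```sql
-- Who is blocked, and who is blocking them, right now.
SELECT
  blocked.pid    AS blocked_pid,
  blocked.query  AS blocked_query,
  blocking.pid   AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
WHERE blocked.wait_event_type = 'Lock';
```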
Not with pglocklogger. It inspects PostgreSQL's in-memory state (via the system views intended
for exactly this) and logs what it sees whenever particular thresholds are crossed. In an outage,
or during post-mortem investigations, this may turn out to be an essential observability tool.
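I won't claim this is how the tool is implemented, but conceptually the state it watches looks something like this, sampled on a schedule and only logged once a backend has been stuck for longer than some threshold (the five-second cutoff is illustrative):

```sql
-- Illustrative approximation, not pglocklogger's actual implementation:
-- backends that have been waiting on a lock for "too long".
SELECT
  pid,
  now() - query_start AS running_for,  -- upper bound on the wait time
  wait_event_type,
  wait_event,
  query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND now() - query_start > interval '5 seconds';
```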