We hit our maximum connection limit on some PostgreSQL servers — which made me wonder why the hot standby servers weren’t being used … well, at all. They’re equally big, expensive servers with loads of disk space. But they’re just sitting there “in case”.
So we directed some traffic over to the standby server. I’m also going to tweak a few settings related to user limits — increase the max connections since these are dedicated hosts and have plenty of available I/O, memory, CPU, etc resources; increase the number of reserved connections since replication filled up all of the reserved slots; implement a per-user connection limit on one account that runs a lot of threads — but directing some people who were only trying to look at data over to the standby server seemed like a quick fix.
Now, we discovered something interesting about how queries against the standby interact with replication. It makes a lot of sense when you start thinking about it — if you query against the writable replica, there’s some blocking that goes on. The system isn’t going to vacuum data that you’re currently trying to use. The standby, however, doesn’t have any way to clue the writable replica in to the fact you are trying to use some data. So the writable replica gets a delete, does its thing to hide those rows from future queries, and eventually auto-vacuum comes through and cleans up those rows. All of this gets pushed over to the standby … and there goes the data you were trying to read.
Odds of this happening on a query that takes eight seconds? Incredibly low! Odds increase, however, the longer a query runs. So some of our super massive reports started seeing an error indicating that their query was cancelled “due to a conflict with recovery”
There are two solutions in the PostgreSQL documentation — one is to increase the max_standby_streaming_delay value (there’s also an archive delay, but we aren’t particularly concerned about clients querying the server during recovery operations) the other is to avoid vacuuming data too quickly — either by setting hot_standby_feedback on the standby or increasing vacuum_defer_cleanup_age on the primary.
There’s a third option too — don’t use the standby for long-running queries. That’s easily done in our case … and doesn’t require tweaking any PostgreSQL settings. Ad hoc reporting and direct user access really shouldn’t be implementing such substantial queries (it’s always good to have a SQL expert plan out and optimize complex queries if that’s an option).