Replication involves sending records from the master, receiving the record on the remote replica, writing the record on the remote replica, and flushing the record to persistent storage, and finally replaying the record. Replication lag can occur when a large amount of data hasn’t been fully replayed into the remote replica. Identifying where the lag occurs can help in rectifying the underlying problem.
select client_addr, usename, application_name, state, sync_state, (pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) / 1024)::bigint as PendingLag, (pg_wal_lsn_diff(sent_lsn,write_lsn) / 1024)::bigint as WriteLag, (pg_wal_lsn_diff(write_lsn,flush_lsn) / 1024)::bigint as FlushLag, (pg_wal_lsn_diff(flush_lsn,replay_lsn) / 1024)::bigint as ReplayLag, (pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::bigint / 1024 as TotalLag FROM pg_stat_replication;
Commented to explain what each column means:
select client_addr , usename , application_name, state, sync_state, (pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) / 1024)::bigint as PendingLag, -- The amount of WAL data that hasn't been sent ... check network stuff if lag persists (pg_wal_lsn_diff(sent_lsn,write_lsn) / 1024)::bigint as WriteLag, -- The amount of replayed log data that isn't applied ... check iostat stuff if lag persists (pg_wal_lsn_diff(write_lsn,flush_lsn) / 1024)::bigint as FlushLag, -- similar to write lag, and often these two numbers are high in conjunction (pg_wal_lsn_diff(flush_lsn,replay_lsn) / 1024)::bigint as ReplayLag, -- The amount of log data that is waiting to be replayed ... check iostat stuff but could also be high CPU or memory utilization (pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::bigint / 1024 as TotalLag -- Basically a sum of the previous values FROM pg_stat_replication;