Monitoring areas deep dive

Postgres copilot monitors your Postgres database across multiple areas, each focused on a specific aspect of database health and performance. This guide explains what each area covers and the kinds of Issues it can detect.

Overview

Postgres copilot's monitoring follows a top-down troubleshooting methodology based on the Four Golden Signals:

  1. Latency — How long queries take
  2. Traffic — Query volume and patterns
  3. Errors — Failed queries and operations
  4. Saturation — Resource utilization

Each monitoring area maps to one or more of these signals.

Query performance

What it monitors:

  • Query execution times
  • Query plans and plan changes
  • pg_stat_statements metrics
  • Wait events during query execution

Common Issues detected:

  • Slow queries exceeding thresholds
  • Sequential scans on large tables
  • Suboptimal query plans
  • Excessive temp file usage

Example Issue:

Query averaging 5.2s execution time
SELECT * FROM orders WHERE customer_id = $1 AND status = $2

Recommendation: Add composite index
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
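
If you want to inspect the same signal manually, a query along these lines surfaces the slowest statements. This is a minimal sketch, assuming the pg_stat_statements extension is installed (column names are those of Postgres 13 and later):

-- Top statements by mean execution time, with temp-file writes as a spill indicator
SELECT queryid,
       left(query, 60) AS query_sample,
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       temp_blks_written
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;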

Indexing

What it monitors:

  • Index usage statistics
  • Missing indexes (detected via query patterns)
  • Unused indexes (wasting storage)
  • Duplicate/redundant indexes
  • Index bloat levels

Common Issues detected:

  • Missing indexes causing slow queries
  • Indexes never used (candidates for removal)
  • Overlapping indexes wasting space
  • Invalid indexes needing rebuild

Example Issue:

Unused index detected: idx_orders_legacy_status

This index hasn't been used in 30 days.
Size: 2.3 GiB

Recommendation:
DROP INDEX CONCURRENTLY idx_orders_legacy_status;
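
For a manual check of the same signal, the cumulative statistics views show which indexes have never been scanned. A sketch, keeping in mind that idx_scan counts only cover the period since the last statistics reset:

-- Indexes never scanned since the last statistics reset, largest first
SELECT s.schemaname,
       s.relname AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       s.idx_scan
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- unique indexes may enforce constraints even if never scanned
ORDER BY pg_relation_size(s.indexrelid) DESC;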

Configuration

What it monitors:

  • Postgres configuration parameters
  • Settings deviating from best practices
  • Memory allocation
  • Connection limits
  • Checkpoint configuration

Common Issues detected:

  • work_mem too low for complex queries
  • shared_buffers misconfigured
  • max_connections too high
  • autovacuum parameters suboptimal

Example Issue:

work_mem is set to the default 4MB

Based on query patterns, some queries would benefit from higher work_mem.
Complex sorts are currently spilling to disk 12,000 times/day.

Recommendation:
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
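
To confirm the disk-spill symptom before changing work_mem, you can check the cumulative temp-file counters yourself. A sketch using pg_stat_database (counters accumulate since the last statistics reset):

-- How often queries have spilled to temporary files, per database
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_bytes_total
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;

Keep in mind that work_mem is allocated per sort or hash operation, not per connection, so a large cluster-wide value should be sized against available memory (or applied per role or per session instead).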

Vacuum & bloat

What it monitors:

  • Autovacuum activity and queue
  • Table bloat levels
  • Dead tuple accumulation
  • Transaction ID wraparound risk
  • VACUUM and ANALYZE frequency

Common Issues detected:

  • Tables with high bloat percentage
  • Autovacuum falling behind
  • Transaction ID wraparound approaching
  • Tables needing ANALYZE

Example Issue:

Table orders has 45% bloat (12 GiB wasted)

Autovacuum is not keeping up with the table's update rate.
Current dead tuples: 2.3 million

Recommendation:
1. Run VACUUM (VERBOSE) orders; during a low-traffic window
2. Consider adjusting autovacuum parameters:
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
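
You can inspect dead-tuple accumulation and vacuum recency directly with a query like the following sketch against pg_stat_user_tables:

-- Tables with the most dead tuples and their most recent (auto)vacuum runs
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;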

Connections

What it monitors:

  • Active connections
  • Connection pool efficiency
  • Idle connections
  • Connection wait times
  • Per-application connection usage

Common Issues detected:

  • Connection limit approaching maximum
  • Too many idle connections
  • Connection leaks
  • Missing connection pooling

Example Issue:

Connection utilization at 85% (170/200)

Trend shows increasing connection count.
Many connections are idle in transaction.

Recommendation:
1. Implement connection pooling (PgBouncer)
2. Review application for connection leaks
3. Consider increasing max_connections cautiously
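
To see the same signal yourself, group sessions in pg_stat_activity by state; 'idle in transaction' sessions that stay in that state for a long time are the ones most often tied to connection leaks. A sketch:

-- Session counts by state, with the age of the oldest session in each state
SELECT state,
       count(*) AS sessions,
       max(now() - state_change) AS oldest_in_state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY sessions DESC;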

Replication

What it monitors:

  • Replication lag
  • WAL generation rate
  • Streaming replication status
  • Logical replication health
  • Standby connection status

Common Issues detected:

  • Replication lag exceeding threshold
  • WAL accumulation on primary
  • Replication slot inactive
  • Standby falling behind

Example Issue:

Replication lag is 45 seconds

Standby replica-1 is falling behind the primary.
WAL accumulation: 2.1 GiB

Recommendation:
1. Check network connectivity to standby
2. Verify standby has sufficient resources
3. Review wal_keep_size configuration
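
Lag can also be checked manually from the primary. A sketch using pg_stat_replication (Postgres 10+ column names):

-- Byte and time lag per standby, as seen from the primary
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;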

Locks

What it monitors:

  • Lock wait times
  • Lock chains and blocking
  • Deadlock frequency
  • Long-held locks
  • DDL lock conflicts

Common Issues detected:

  • Queries blocked by long transactions
  • Frequent lock conflicts
  • DDL operations blocking production
  • Deadlock patterns

Example Issue:

DDL blocked for 45 minutes

ALTER TABLE users ADD COLUMN... is waiting for AccessExclusiveLock
Blocked by: transaction holding AccessShareLock for 2 hours

Recommendation:
1. Identify and terminate blocking transaction
2. Consider setting lock_timeout for DDL operations so they fail fast instead of queuing behind long transactions
3. Use pg_blocking_pids() to identify blocking sessions
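
A sketch of how blocked sessions and their blockers can be listed with pg_blocking_pids():

-- Sessions currently waiting on a lock, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start AS xact_age,
       wait_event_type,
       left(query, 60) AS query_sample
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;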

Storage

What it monitors:

  • Disk space usage
  • Table and index sizes
  • Growth trends
  • WAL directory size
  • Temporary file usage

Common Issues detected:

  • Disk space running low
  • Rapid growth patterns
  • Large temporary files
  • WAL accumulation

Example Issue:

Disk utilization at 78%

Current growth rate: 2.3 GiB/day
Estimated time to 90%: 12 days

Recommendation:
1. Review large tables for archival candidates
2. Check for bloated tables
3. Plan disk expansion
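
When looking for archival or bloat candidates, the largest relations can be listed with a query along these lines:

-- Largest tables in the current database, including their indexes and TOAST data
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;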

Security

What it monitors:

  • Authentication failures
  • Privilege escalation
  • Exposed configuration
  • SSL/TLS usage
  • Password policy compliance

Common Issues detected:

  • Repeated authentication failures
  • Superuser usage in applications
  • Unencrypted connections
  • Weak password patterns

Example Issue:

Application using superuser credentials

Application 'web-backend' connects as the postgres superuser.
This violates the principle of least privilege.

Recommendation:
1. Create dedicated application user
2. Grant only required permissions
3. Update application connection string
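
A sketch of the least-privilege setup described above; the role, database, and schema names are illustrative:

-- Create a dedicated application role with only the access it needs
CREATE ROLE web_backend LOGIN PASSWORD 'change-me';  -- store the real password in a secret manager
GRANT CONNECT ON DATABASE app TO web_backend;
GRANT USAGE ON SCHEMA public TO web_backend;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO web_backend;
-- Cover tables created later by the role running this statement
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO web_backend;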

Checkpoints

What it monitors:

  • Checkpoint frequency
  • Checkpoint duration
  • WAL generation patterns
  • Checkpoint-related I/O
  • checkpoint_warning triggers

Common Issues detected:

  • Too frequent checkpoints
  • Long checkpoint durations
  • Checkpoint-related I/O spikes
  • Suboptimal checkpoint configuration

Example Issue:

Checkpoints occurring every 2 minutes

Configured checkpoint_timeout: 5 minutes
Actual average interval: 2.1 minutes
Cause: High WAL generation rate

Recommendation:
1. Increase max_wal_size to reduce checkpoint frequency
2. Review write-heavy queries for optimization
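
The balance of requested vs. timed checkpoints tells you whether max_wal_size is being reached before checkpoint_timeout expires. A sketch; in Postgres 16 and earlier these counters live in pg_stat_bgwriter (they moved to pg_stat_checkpointer in 17):

-- A high checkpoints_req count means checkpoints are triggered by WAL volume, not the timer
SELECT checkpoints_timed,
       checkpoints_req,
       round(checkpoint_write_time / 1000.0) AS write_time_s,
       round(checkpoint_sync_time / 1000.0) AS sync_time_s
FROM pg_stat_bgwriter;

-- Compare against the current checkpoint configuration
SHOW checkpoint_timeout;
SHOW max_wal_size;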

How Issues are prioritized

Each monitoring area contributes to Issue severity based on:

Factor            Impact on severity
Service impact    Direct effect on users raises severity
Data risk         Risk of data loss is always Critical
Urgency           Time-sensitive issues (disk space, wraparound) raise severity
Effort            Quick wins may be prioritized for momentum

Customizing monitoring

You can adjust thresholds and focus areas:

# View current configuration
./postgres_ai config show

# Adjust threshold for slow query detection
./postgres_ai config set query.slow_threshold 1000ms

# Disable specific checks
./postgres_ai config set checks.unused_indexes false

Next steps