Monitoring areas deep dive
Postgres copilot monitors your Postgres database across multiple areas, each focused on specific aspects of database health and performance. This guide explains what each area covers and what kinds of Issues it can detect.
Overview
Postgres copilot's monitoring follows a top-down troubleshooting methodology based on the Four Golden Signals:
- Latency — How long queries take
- Traffic — Query volume and patterns
- Errors — Failed queries and operations
- Saturation — Resource utilization
Each monitoring area maps to one or more of these signals.
Query performance
What it monitors:
- Query execution times
- Query plans and plan changes
- pg_stat_statements metrics
- Wait events during query execution
Common Issues detected:
- Slow queries exceeding thresholds
- Sequential scans on large tables
- Suboptimal query plans
- Excessive temp file usage
Example Issue:
Query averaging 5.2s execution time
SELECT * FROM orders WHERE customer_id = $1 AND status = $2
Recommendation: Add composite index
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
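If you want to reproduce this kind of check yourself, a minimal sketch against pg_stat_statements (assuming the extension is enabled; column names below are the Postgres 13+ ones) looks like this:

```sql
-- Top statements by mean execution time, from pg_stat_statements.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       left(query, 80)                    AS query_sample
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```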
Indexing
What it monitors:
- Index usage statistics
- Missing indexes (detected via query patterns)
- Unused indexes (wasting storage and adding write overhead)
- Duplicate/redundant indexes
- Index bloat levels
Common Issues detected:
- Missing indexes causing slow queries
- Indexes never used (candidates for removal)
- Overlapping indexes wasting space
- Invalid indexes needing rebuild
Example Issue:
Unused index detected: idx_orders_legacy_status
This index hasn't been used in 30 days.
Size: 2.3 GiB
Recommendation:
DROP INDEX CONCURRENTLY idx_orders_legacy_status;
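As a manual cross-check before dropping anything, a sketch of an unused-index query against pg_stat_user_indexes (index-scan statistics are per instance, so verify on every replica as well):

```sql
-- Indexes with zero scans since the last stats reset: candidates for removal.
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- keep unique/PK indexes: they enforce constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;
```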
Configuration
What it monitors:
- Postgres configuration parameters
- Settings deviating from best practices
- Memory allocation
- Connection limits
- Checkpoint configuration
Common Issues detected:
- work_mem too low for complex queries
- shared_buffers misconfigured
- max_connections too high
- autovacuum parameters suboptimal
Example Issue:
work_mem is set to the default of 4MB
Based on query patterns, some queries would benefit from a higher work_mem.
Complex sorts are currently spilling to disk 12,000 times/day.
Recommendation:
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
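To audit configuration drift yourself, the pg_settings and pg_stat_database views cover most of what this area watches; a rough check looks like:

```sql
-- Settings changed from their defaults, and where they were set.
SELECT name, setting, unit, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;

-- Per-database temp file spills: a signal that work_mem may be too low.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;
```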
Vacuum & bloat
What it monitors:
- Autovacuum activity and queue
- Table bloat levels
- Dead tuple accumulation
- Transaction ID wraparound risk
- VACUUM and ANALYZE frequency
Common Issues detected:
- Tables with high bloat percentage
- Autovacuum falling behind
- Transaction ID wraparound approaching
- Tables needing ANALYZE
Example Issue:
Table orders has 45% bloat (12 GiB wasted)
Autovacuum is not keeping up with the update rate.
Current dead tuples: 2.3 million
Recommendation:
1. Run VACUUM (VERBOSE) orders; during a low-traffic window
2. Consider adjusting autovacuum parameters:
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
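For a quick manual look at dead-tuple buildup and wraparound headroom, a sketch using pg_stat_user_tables and pg_database:

```sql
-- Tables with the most dead tuples and their last (auto)vacuum times.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum,
       last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Databases closest to transaction ID wraparound.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```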
Connections
What it monitors:
- Active connections
- Connection pool efficiency
- Idle connections
- Connection wait times
- Per-application connection usage
Common Issues detected:
- Connection limit approaching maximum
- Too many idle connections
- Connection leaks
- Missing connection pooling
Example Issue:
Connection utilization at 85% (170/200)
Trend shows increasing connection count.
Many connections are idle in transaction.
Recommendation:
1. Implement connection pooling (e.g., PgBouncer)
2. Review application for connection leaks
3. Consider increasing max_connections cautiously
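A rough way to see the same picture from psql is to group pg_stat_activity by state and application; "idle in transaction" sessions are the usual suspects for leaks:

```sql
-- Connection counts by state and application.
SELECT state,
       application_name,
       count(*)                  AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state, application_name
ORDER BY connections DESC;
```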
Replication
What it monitors:
- Replication lag
- WAL generation rate
- Streaming replication status
- Logical replication health
- Standby connection status
Common Issues detected:
- Replication lag exceeding threshold
- WAL accumulation on primary
- Replication slot inactive
- Standby falling behind
Example Issue:
Replication lag is 45 seconds
Standby replica-1 is falling behind primary.
WAL accumulation: 2.1 GiB
Recommendation:
1. Check network connectivity to standby
2. Verify standby has sufficient resources
3. Review wal_keep_size configuration
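To verify lag from the primary yourself, pg_stat_replication and pg_replication_slots (Postgres 10+ names) give the raw numbers; for example:

```sql
-- Per-standby lag as seen from the primary.
SELECT application_name,
       client_addr,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

-- Inactive replication slots retain WAL and can fill the disk.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```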
Locks
What it monitors:
- Lock wait times
- Lock chains and blocking
- Deadlock frequency
- Long-held locks
- DDL lock conflicts
Common Issues detected:
- Queries blocked by long transactions
- Frequent lock conflicts
- DDL operations blocking production
- Deadlock patterns
Example Issue:
DDL blocked for 45 minutes
ALTER TABLE users ADD COLUMN... is waiting for AccessExclusiveLock
Blocked by: transaction holding AccessShareLock for 2 hours
Recommendation:
1. Identify and terminate blocking transaction
2. Consider setting lock_timeout for DDL operations
3. Use pg_blocking_pids() to monitor blocking chains
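A minimal sketch for finding blocked sessions and their blockers with pg_blocking_pids(), plus the lock_timeout pattern mentioned above (the column name in the DDL is purely illustrative):

```sql
-- Sessions currently waiting on a lock, and the PIDs blocking them.
SELECT pid,
       now() - query_start   AS waiting_for,
       pg_blocking_pids(pid) AS blocking_pids,
       left(query, 80)       AS blocked_query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Safer DDL pattern: fail fast instead of queueing behind long transactions.
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN signup_source text;  -- hypothetical column
```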
Storage
What it monitors:
- Disk space usage
- Table and index sizes
- Growth trends
- WAL directory size
- Temporary file usage
Common Issues detected:
- Disk space running low
- Rapid growth patterns
- Large temporary files
- WAL accumulation
Example Issue:
Disk utilization at 78%
Current growth rate: 2.3 GiB/day
Estimated time to 90%: 12 days
Recommendation:
1. Review large tables for archival candidates
2. Check for bloated tables
3. Plan disk expansion
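For an ad-hoc view of where the space is going, a sketch using pg_total_relation_size and pg_ls_waldir() (Postgres 10+; the WAL query typically needs the pg_monitor role or superuser):

```sql
-- Largest relations by total size (table + indexes + TOAST).
SELECT c.oid::regclass AS relation,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;

-- Current size of the WAL directory.
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```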
Security
What it monitors:
- Authentication failures
- Privilege escalation
- Exposed configuration
- SSL/TLS usage
- Password policy compliance
Common Issues detected:
- Repeated authentication failures
- Superuser usage in applications
- Unencrypted connections
- Weak password patterns
Example Issue:
Application using superuser credentials
Application 'web-backend' connects as the postgres superuser.
This violates the principle of least privilege.
Recommendation:
1. Create dedicated application user
2. Grant only required permissions
3. Update application connection string
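A sketch of the remediation, using hypothetical role and database names (web_backend, app) and grants you would adapt to your schema, plus a query to spot unencrypted sessions:

```sql
-- Dedicated least-privilege role for the application (names are illustrative).
CREATE ROLE web_backend LOGIN PASSWORD 'use-a-managed-secret';
GRANT CONNECT ON DATABASE app TO web_backend;
GRANT USAGE ON SCHEMA public TO web_backend;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO web_backend;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO web_backend;

-- Client sessions and whether they use SSL/TLS.
SELECT a.usename, a.client_addr, s.ssl
FROM pg_stat_activity a
JOIN pg_stat_ssl s ON s.pid = a.pid
WHERE a.backend_type = 'client backend';
```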
Checkpoints
What it monitors:
- Checkpoint frequency
- Checkpoint duration
- WAL generation patterns
- Checkpoint-related I/O
- checkpoint_warning triggers
Common Issues detected:
- Too frequent checkpoints
- Long checkpoint durations
- Checkpoint-related I/O spikes
- Suboptimal checkpoint configuration
Example Issue:
Checkpoints occurring every 2 minutes
Configured checkpoint_timeout: 5 minutes
Actual average interval: 2.1 minutes
Cause: High WAL generation rate
Recommendation:
1. Increase max_wal_size to reduce checkpoint frequency
2. Review write-heavy queries for optimization
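To confirm the cause yourself, the checkpoint counters show how many checkpoints were requested by WAL volume rather than triggered by the timer; a sketch for Postgres 16 and earlier (in 17+ these counters moved to pg_stat_checkpointer):

```sql
-- A high requested share means checkpoints are driven by WAL volume,
-- i.e. max_wal_size is too small for the current write rate.
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req /
             nullif(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;
```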
How Issues are prioritized
Each monitoring area contributes to Issue severity based on:
| Factor | Impact on severity |
|---|---|
| Service impact | Direct effect on users raises severity |
| Data risk | Risk of data loss is always Critical |
| Urgency | Time-sensitive issues (disk space, wraparound) raise severity |
| Effort | Quick wins may be prioritized for momentum |
Customizing monitoring
You can adjust thresholds and focus areas:
# View current configuration
./postgres_ai config show
# Adjust threshold for slow query detection
./postgres_ai config set query.slow_threshold 1000ms
# Disable specific checks
./postgres_ai config set checks.unused_indexes false