Hello! This page is a semi-organized mess of notes while learning PostgreSQL.
Below are PostgreSQL-tagged blog posts:
PostgreSQL (10)
- PostgreSQL 17: JSON_TABLE(), MERGE with RETURNING, and Updatable Views
- Rails World 2024 Conference Recap
- Solid Cache for Rails and PostgreSQL
- Trying out Solid Queue and Mission Control with PostgreSQL
- Shrinking Big PostgreSQL tables: Copy-Swap-Drop
- Wait a minute! — PostgreSQL extension pg_wait_sampling
- You make a good point! — PostgreSQL Savepoints
- SaaS on Rails on PostgreSQL — POSETTE 2024
- Mastering PostgreSQL for Rails: An Interview with Andy Atkinson
- 🎙️ IndieRails Podcast — Andrew Atkinson - The Postgres Specialist
- Top Five PostgreSQL Surprises from Rails Devs
- 🎙️ Ship It! Podcast — PostgreSQL with Andrew Atkinson
- ✂️ Use Cases for Merging and Splitting Partitions With Minimal Locking in PostgreSQL 17
- 🎙️ Hacking Postgres 🐘 Podcast — Season 2, Ep. 1 Andrew Atkinson
- Compiling PostgreSQL on macOS To Test Documentation and Patches
- Rails + Postgres Postgres.FM 086 — Extended blog post edition! 🎙️
- 'Rows Removed By Filter', Inspecting Pages, Buffer Cache — Part Two
- Maintainable Podcast — Maintainable…Databases? 🎙️
- What's 'Rows Removed By Filter' in PostgreSQL Query Plans all about?
- Remote Ruby — Unleashing the Power of Postgres with Andrew Atkinson 🎙️
- The Rails Changelog — #014: PostgreSQL for Rails Developers with Andrew Atkinson 🎙️
- Teach Kelvin Your Thing (TKYT) — High Performance PostgreSQL for Rails 🖥️
- PostgreSQL IO Visibility: #wehack PostgreSQL Internals and pg_stat_io
- PGConf NYC 2023 Recap 🐘
- SaaS for Developers with Gwen Shapira — Postgres, Performance and Rails with Andrew Atkinson 🎙️
- Podcast: Code and the Coding Coders who Code it! Episode 27 Andrew Atkinson 🎙️
- Presenting 'Partitioning Billions of Rows' at SFPUG August 2023
- PGSQL Phriday #011 — Sharding and Partitioning
- Slow & Steady — Database Constraints with Andrew Atkinson 🎙️
- PostgreSQL Table Partitioning Primary Keys — The Reckoning — Part 2 of 2
- Code With Jason 190 — PostgreSQL and Sin City Ruby 🎙️
- PostgreSQL Table Partitioning — Growing the Practice — Part 1 of 2
- PGSQL Phriday #009 — Database Change Management
- PGDay Chicago 2023 Conference
- PGSQL Phriday #008 — pg_stat_statements, PgHero, Query ID
- Ruby For All Podcast: My Guest Experience 🎙️
- Upgrading to PostgreSQL 15 on Mac OS
- PostgreSQL, Ruby on Rails, Rails Guides
- AWS re:Invent Day 1
- PGSQL Phriday #001 — Query Stats, Log Tags, and N+1s
- PgHero 3 Released
- RailsConf 2022 Conference
- PGConf NYC 2021 Conference
- Find Duplicate Records Using ROW_NUMBER Window Function
- Bulk Inserts, RETURNING, and UNLOGGED Tables
- Using pg_repack to Rebuild Indexes
- Manually Fixing a Rails Migration
- PostgreSQL pgbench Workload Simulation
- PostgreSQL Indexes: Prune and Tune
- Views, Stored Procedures, and Check Constraints
- Building a Web App with Boring Technologies (2017 Edition)
- A Look at PostgreSQL Foreign Key Constraints
- Intro to PostgreSQL 'generate_series'
- Automated Daily Backup Solution for PostgreSQL
- PostgreSQL for the Busy MySQL Developer
Recommended Learning Resources
- https://postgres.fm
- https://www.pgcasts.com
- https://www.youtube.com/c/ScalingPostgres
- https://sqlfordevs.io
- The
EXPLAIN
Glossary from PgMustard https://www.pgmustard.com/docs/explain - Optimizing Postgres for write heavy workloads
- Book: PostgreSQL Query Optimization
- Book: PostgreSQL 9.0 High Performance
- Book: PostgreSQL 14 Internals
Queries
I keep queries in a GitHub repository: pg_scripts
Query: Approximate Count
Since a COUNT(*)
query can be slow, try an approximate count:
SELECT reltuples::numeric AS estimate
FROM pg_class WHERE relname = 'table_name';
Query: Get Table Stats
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'table';
Consider cardinality for columns, and selectivity for indexes and queries, when designing indexes.
Cancel or Kill by Process ID
Get a PID with SELECT * FROM pg_stat_activity;
Try to cancel the query first, otherwise terminate the backend:
SELECT pg_cancel_backend(pid);
SELECT pg_terminate_backend(pid);
Tuning Autovacuum
PostgreSQL runs a scheduler Autovacuum process when PostgreSQL starts up. This process looks at configurable thresholds for all tables and determines whether a VACUUM
worker should run, per table.
Thresholds can be configured per table. A good starting place for tables that have a large amount of UPDATE and DELETE queries, is to perform that per-table tuning.
The goal is to make VACUUM run more regularly and for a longer period of time, to stay on top of the accumulation of bloat from dead tuples, so that operations are reliable and predictable.
In Routine Vacuuming, the two options are listed:
autovacuum_vacuum_scale_factor
autovacuum_vacuum_threshold
The scale factor defaults to 20% (0.20
). For large tables with a high amount of updates and deletes, we lowered the value to 1% (0.01
). With the lowered threshold, vacuum will run more frequently in proportion to how much it’s needed.
Set the value for a table:
ALTER TABLE bigtable SET (autovacuum_vacuum_scale_factor = 0.1);
Can be reset:
ALTER TABLE bigtable RESET (autovacuum_vacuum_threshold);
Autovacuum Tuning
- Set
log_autovacuum_min_duration
to0
to log all Autovacuum. A logged AV run includes a lot of information. - pganalyze: Visualizing & Tuning Postgres Autovacuum
autovacuum_max_workers
autovacuum_max_freeze_age
maintenance_work_memory
Specialized Index Types
The most common type is B-Tree. Specialized index types are:
- Multicolumn
- Covering (Multicolumn or newer
INCLUDES
style) - Partial
- GIN
- GiST
- BRIN
- Expression
- Unique
- Hash
Removing Unused Indexes
Ensure these are set to on
SHOW track_activities;
SHOW track_counts;
Cybertec blog post with SQL query to discover unused indexes: Get Rid of Your Unused Indexes!
Remove Duplicate and Overlapping Indexes
https://wiki.postgresql.org/wiki/Index_Maintenance
Remove Seldom Used Indexes on High Write Tables
New Finding Unused Indexes Query
This is a great guideline.
As a general rule, if you’re not using an index twice as often as it’s written to, you should probably drop it.
In our system on our highest write table we had 10 total indexes defined and 6 were classified as Low Scans, High Writes. These indexes may not be worth keeping.
Partial Indexes
How Partial Indexes Affect UPDATE Performance in PostgreSQL
Timeout Tuning
statement_timeout
: The maximum time a statement is allowed to run before being canceledlock_timeout
Connections Management
A connection forks the OS process (creates a new process) and is thus expensive.
Use a connection pooler to reduce overhead from connection establishment
- PgBouncer (see below). Running PgBouncer on ECS
- RDS Proxy. AWS RDS Proxy
- Managing Connections with RDS Proxy
Connection issues could benefit from changing:
connect_timeout
read_timeout
checkout_timeout
(Rails, default5s
): maximum time Rails will spend trying to check out a connection from the pool before raising an error. checkout_timeout API documentationstatement_timeout
. In Rails/Active Record, set inconfig/database.yml
under avariables
section with a value in milliseconds. This becomes a session variable which is set like this:
SET statement_timeout = 5000
(in milliseconds) and be displayed like this: SHOW statement_timeout
production:
variables:
statement_timeout: 5000
For Rails with Puma and Sidekiq, carefully manage the connection pool size and total connections.
The Ruby on Rails database connection pool. We also use a proxy in between the application and PG.
PgBouncer
Install PgBouncer on macOS with brew install pgbouncer
. Create the .ini
config file as the article mentions, point it to a database, accept connections, and track the connection count.
Heap Only Tuple (HOT) Updates
HOT (“heap only tuple”) updates, are updates to tuples not referenced from outside the table block.
HOT updates in PostgreSQL for better performance
2 requirements:
- there must be enough space in the block containing the updated row
- there is no index defined on any column whose value is modified (big one)
fillfactor
What is fillfactor and how does it affect PostgreSQL performance?
- Percentage between 10 and 100, default is 100 (“fully packed”)
- Reducing it leaves room for “HOT” updates when they’re possible. Set to 90 to leave 10% space available for HOT updates.
- “good starting value for it is 70 or 80” Deep Dive
- For tables with heavy updates a smaller fillfactor may yield better write performance
- Set per table or per index (B-Tree defaults to 90 fillfactor)
- Trade-off: “Faster UPDATE vs Slower Sequential Scan and wasted space (partially filled blocks)” from Fillfactor Deep Dive
- No index defined on any column whose value is modified
Limitations: Requires a VACUUM FULL
after modifying (or pg_repack)
ALTER TABLE foo SET ( fillfactor = 90 );
VACUUM FULL foo;
Or use pg_repack
pg_repack --no-order --table foo
Installing pg_repack on EC2 for RDS
Note: use -k, --no-superuser-check
Locks Management
log_lock_waits
deadlock_timeout
“Then slow lock acquisition will appear in the database logs for later analysis.”
Lock Types
Exclusive locks, and shared locks.
AccessExclusiveLock
- Locks the table, queries are not allowed.
Table locks and row locks.
Tools
Tools: Query Planning
EXPLAIN (ANALYZE, BUFFERS)
This article 5 Things I wish my grandfather told me about ActiveRecord and PostgreSQL has a nice translation of EXPLAIN ANLAYZE output written more in plain English.
pgMustard
Format EXPLAIN
output with JSON, and specify additional options.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT text) --<sql-query>
Using pgbench
Repeatable method of determining a transactions per second (TPS) rate.
Useful for determining impact of parameter tuning. Configurable with custom SQL queries.
Could be used to test the impact of increasing concurrent connections.
- Initialize database example with scaling option of 50 times the default size:
pgbench -i -s 50 example`
- Benchmark with 10 clients, 2 worker threads, and 10,000 transactions per client:
pgbench -c 10 -j 2 -t 10000 example`
I created PR #5388 adding pgbench to tldr!
pgtune
PGTune is a website that tries to suggest values for PG parameters.
https://pgtune.leopard.in.ua/#/
PgHero
PgHero brings operational concerns into a web dashboard. Built as a Rails engine.
We’re running it in production and saw immediate value in identifying unused and duplicate indexes to remove.
https://github.com/ankane/pghero
pgmonitor
https://github.com/CrunchyData/pgmonitor
postgresqltuner
Perl script to analyze a database. Has some insights like the shared buffer hit rate, index analysis, configuration advice, and extension recommendations.
https://github.com/jfcoz/postgresqltuner
pg_test_fsync
pgmetrics
pgcli
brew install pgcli
An alternative to psql
with syntax highlighting, autocomplete and more.
Write Ahead Log (WAL) Tuning
Can cause a significant I/O load
checkpoint_timeout
- in seconds, default checkpointing every 5 minutesmax_wal_size
- if max wal size is about to be exceeded, default 1 GB
Reducing the value causes more frequent checkpoint operations.
checkpoint_warning
parameter
checkpoint_completion_target
General Recommendation (not mine)
“On a system that’s very close to maximum I/O throughput during normal operation, you might want to increase
checkpoint_completion_target
to reduce the I/O load from checkpoints.”
Parameters
commit_delay
(0 by default)wal_sync_method
wal_debug
Extensions and Modules
Foreign Data Wrapper (FDW)
Native Foreign Data Wrapper (FDW) functionality in PostgreSQL allows connecting to remote sources
The table structure may be specified when establishing the foreign table
We were able to avoid the need for any intermediary data dump files.
We used a temp
schema to isolate temporary tables away from the main schema (public
).
The process is:
- Create a server
- Create a user mapping
- Create a foreign table (optionally importing the schema)
Let’s say we had 2 services, one for managing inventory items for sale, and one for managing authentication.
Connect as a superuser
create EXTENSION postgres_fdw;
CREATE SCHEMA temp;
CREATE SERVER temp_authentication;
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'authentication-db-host', dbname 'authentication-db-name', port '5432'); -- set the host, name and port
CREATE USER MAPPING FOR postgres
SERVER temp_authentication
OPTIONS (user 'authentication-db-user', password 'authentication-db-password'); -- map the local postgres user to a user on the remote DB
IMPORT FOREIGN SCHEMA public LIMIT TO (customers)
FROM SERVER temp_authentication INTO temp; -- this will make a table called temp.customers
uuid-ossp
Generate universally unique identifiers (UUIDs) in PostgreSQL.
pg_stat_statements
Tracks execution stats for all statements. Stats made available from view. Requires reboot (static param) on RDS on PG 10 although pg_stat_statements
is available by default in shared_preload_libraries
in PG 12.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
https://www.virtual-dba.com/blog/postgresql-performance-enabling-pg-stat-statements/
pgstattuple
The pgstattuple module provides various functions to obtain tuple-level statistics.
https://www.postgresql.org/docs/current/pgstattuple.html
citext
Case insensitive column type
pg_cron
Available on PG 12.5+ on RDS, pg_cron is an extension that can be useful to schedule maintenance tasks, like manual vacuum jobs.
See: Scheduling maintenance with the PostgreSQL pg_cron extension
pg_timetable
pg_timetable: Advanced scheduling for PostgreSQL
pg_squeeze
Replacement for pg_repack, automated, without needing to run a CLI tool.
auto_explain
Adds explain plans to the PostgreSQL log.
Percona pg_stat_monitor
pg_stat_monitor: A cool extension for better monitoring using PMM - Percona Live Online 2020
pganalyze Index Advisor
A better way to index your PostgreSQL database: pganalyze Index Advisor
pgbadger
brew install pgbadger
Bloat
Overview
How does bloat (table bloat, index bloat) affect performance?
- “When a table is bloated, PostgreSQL’s
ANALYZE
tool calculates poor or inaccurate information that the query planner uses.”. Example of 7:1 bloated/active tuples ratio causing query planner to skip. - Queries on tables with high bloat will require additional IO, navigating through more pages of data.
- Bloated indexes, such as indexes that reference tuples that have been vacuumed, add IO. Rebuild the index
REINDEX ... CONCURRENTLY
- Index only scans slow down with outdated statistics. Autovacuum updates table statistics. Minimize table bloat to improve performance of index only scans. PG Routing vacuuming docs.
- Cybertec: Detecting Table Bloat
- Dealing with significant PostgreSQL database bloat — what are your options?
Upgrades
This is also a really cool Version Upgrade Comparison Tool: 10 to 12
PG 11
October 2018
- Improves parallel query performance and parallelism of B-tree index creation
- Adds partitioning by hash key
- Significant partitioning improvements
- Adds “covering” indexes with
INCLUDE
keyword. Adds “payload columns” to indexes. Docs: Index only scans and Covering indexes
PG 12
Release announcement. Released October 2019.
- Partitioning performance improvements
REINDEX
gains support forCONCURRENTLY
. This is very helpful for online index rebuilds, to remove “bloat” from index entries.
PG 13
Released September 2020
- Parallel vacuum
PG 14
- More support for
query_id
- Multi-range types
PG 15
- SQL
MERGE
command for Upserts!
PG 16
Released September 2023
pg_stat_io
System View- Replication running from a standby as a source
- Bi-directional or “active-active” Logical Replication. This is cool! https://www.crunchydata.com/blog/active-active-postgres-16
RDS
Amazon RDS hosts PostgreSQL (with customizations). RDS is a regular single-writer primary instance model for PostgreSQL.
Aurora PostgreSQL
- Separates storage and compute
- “Fast clones”
- Regional offering (single region) and Global offering (multi-region)
- Can create multi-region replication natively, or without the need for VPC Peering
AWS RDS Parameter Groups
Working with RDS Parameter Groups
- Try out parameter changes on a test database prior to making the change. Create a snapshot before making the change.
- Parameter groups can be restored to their defaults (or they can be copied to create an experimental group). Groups can be compared with each other to determine differences.
- Parameter values can process a formula. RDS provides some formulas that use the instance class CPU or memory available to calculate a value.
Database Constraints
Blog: A Look at PostgreSQL Foreign Key Constraints
CHECK
NOT NULL
UNIQUE
PRIMARY KEY
FOREIGN KEY
EXCLUSION
Native Replication
- Physical Replication
- Logical Replication using PUBLICATION and SUBSCRIPTION objects.
- PostgreSQL 16 gained bi-directional logical replication for the same table, on two separate instances!
PostgreSQL Logical Replication
Crunchydata Logical Replication in PostgreSQL
- Create a
PUBLICATION
, counterpartSUBSCRIPTION
. - All operations like
INSERT
andUPDATE
are enabled by default, fewer can be configured - Logical replication available since PG 10.
max_replication_slots
should be set higher than number of replicas- A role must exist for replication
- Replication slot is a replication object that keeps track of where the subscriber is in the WAL stream
- Unlike normal replication, writes are still possible to the subscriber. Conflicts can occur if data is written that would conflict with logical replication.
Declarative Partitioning
RANGE
(time-based)LIST
-
HASH
- Crunchydata Native Partitioning Tutorial
- pgslice
- pg_partman
- pg_party
Partition Pruning
Default is on
or SET enable_partition_pruning = off;
to turn it off.
https://www.postgresql.org/docs/13/ddl-partitioning.html#DDL-PARTITION-PRUNING
Uncategorized Content
- Use
NULL
s instead of default values when possible, cheaper to store and query. Source: Timescale DB blog
Stored Procedures
Stored Procedures are User Defined Functions (UDF).
Using PL/pgSQL, functions can be added to the database directly. Procedures and functions can be written in other languages.
To manage these functions in a Ruby app, use the fx gem (versioned database functions)!
PostgreSQL Monitoring
pg_top
On Mac OSbrew install pg_top
and run itpg_top
Uncategorized Resources
This is an amazing article full of nuggets.
- The idea of an “Application DBA”
- Things I liked: Using an intermediate table for de-duplication. Column structure is elegant, clearly broken out destination ID and nested duplicate IDs.
- Working with arrays
ANY()
for an array of items to compare againstarray_remove(anyarray, anyelement)
to build an array but remove an elementarray_agg(expression)
to build up list of IDs andunnest(anyarray)
to expand it
- Avoidance of indexes for low selectivity, and value of partial indexes in those cases (activated 90% v. unactivated users 10%)
-
Tip on confirming index usage by removing index in a transaction with
BEGIN
and rolling it back withROLLBACK
. - Generalists/specialists: Application DBA and Performance Analyst
- PostgreSQL Connection Pooling: Part 1 – Pros & Cons
PostgreSQL Presentations
PostgreSQL Tuning
shared_buffers
. RDS default is around 25% of system memory. Recommendations say up to 40% of system memory could be allocated, at which point there may be diminishing returns beyond that.
The unit is 8kb chunks, and requires some math to change the value for. Here is a formula:
https://stackoverflow.com/a/42483002/126688
Parameter | Unit | Default RDS | Tuned | Link |
---|---|---|---|---|
shared_buffers |
8kb | 25% mem | ||
autovacuum_cost_delay |
ms | 20 | 2 | |
autovacuum_vaccum_cost_limit |
200 | 2000 | Docs | |
effective_cache_size |
8kb | |||
work_mem |
MB | 4 | 250 | |
maintenance_work_memory |
||||
checkpoint_timeout |
||||
min_wal_size |
MB | 80 | 4000 | High write log blog |
max_wal_size |
MB | 4000 | 16000 | |
max_worker_processes |
8 | 1x/cpu | ||
max_parallel_workers |
8 | 1x/cpu | ||
max_parallel_workers_per_gather |
2 | 4 |
PostgreSQL Backups
Sequences
TRUNCATE
and reset:TRUNCATE <table name> RESTART IDENTITY
https://brianchildress.co/reset-auto-increment-in-postgres/ALTER SEQUENCE <seq-name> RESTART WITH 1;
(e.g.users_id_seq
)- Serial and BigSerial are special types that use Sequences
- Logical Replication restrictions: “Sequence data is not replicated” (as of PG 16) https://www.postgresql.org/docs/current/logical-replication-restrictions.html
Identity Columns
- Identity columns are recommended for primary keys, over using Sequences (with Serial or BigSerial)
Scaling Web Applications
Full Text Search (FTS)
- Postgres full-text search is Good Enough!
- Postgres Full Text Search vs the rest
- Full Text Search in Milliseconds with Rails and PostgreSQL
Engineering blog posts
Big companies use PostgreSQL and Ruby on Rails.
Sharding and Scaling
- The growing pains of database architecture
- Herding elephants: Lessons learned from sharding Postgres at Notion
- The Great Re-shard: adding Postgres capacity (again) with zero downtime
- Partitioning GitHub’s relational databases to handle scale
- How GitHub Partitioned Its Relational Database to Improve Reliability at Scale
- Scaling at Instacart: Distributing Data Across Multiple Postgres Databases with Rails
Operational Excellence