- Introduction
IN
clauses with a big list of values- Parsing, planning, and executing
- Creating this pattern directly
- Active Record ORM creating this pattern implicitly
- Fixing N+1s
- Eager loading with includes or preload
- Eager loading using eager_load
- Alternative approaches using
ANY
orSOME
- A VALUES clause
- A temporary table of ids
- Using
ANY
and anARRAY
of values - Finding
IN
clause queries in pg_stat_statements - Improvements in Postgres 17
- Grouping similar query groups in pg_stat_statements
- Improvements in PostgreSQL 18
- Conclusion
Introduction
If you’ve created web applications with relational databases before, using ORMs like Active Record, part of Ruby on Rails, you’ve likely dealt with performance problems in your database layer once you’ve reached a certain size of data and query volume.
In this post, we’re going to look at a specific type of problematic query pattern.
For queries with “IN clauses”, as data sizes grow, and their list of values gets big, these queries tend to perform very poorly, causing user experience problems or even partial outages in extreme cases.
We’ll dig into the specific details of how these are constructed and why, how you might be creating them now, and how you can fix them.
IN
clauses with a big list of values
The technical term for values are a parenthesized list of scalar expressions.
For example in the SQL query below, the IN
clause portion is WHERE author_id IN (1,2,3)
and the list of scalar expressions is (1,2,3)
.
SELECT * FROM books
WHERE author_id IN (1, 2, 3);
The purpose of this clause is to perform filtering. Looking at a query execution plan in Postgres, we’ll see something like this fragment below:
Filter: (author_id = ANY ('{1,2,3}'::integer[]))
Why are these slow?
Parsing, planning, and executing
Remember that our queries are parsed, planned, and executed. A big list of values are treated like constants, and don’t have associated statistics.
Queries with big lists of values take more time to parse and use more memory.
Without pre-collected table statistics for planning decisions, PostgreSQL is more likely to mis-estimate cardinality and row selectivity.
This can mean the planner chooses a sequential scan over an index scan, causing a big slowdown.
How do we create this pattern?
Creating this pattern directly
In Active Record, a developer might create this query pattern by using pluck()
to collect some ids in a list, then passing that list as an argument to another query.
Here’s an example of that getting ids and assigning them to author_ids
:
author_ids = Author.
where("created_at >= ?", 1.year.ago).
pluck(:id)
The author_ids
are supplied as the argument querying books
by author_id
foreign key:
Book.where(author_id: author_ids)
Another scenario is when this pattern is created implicitly by ORM methods. What does that look like?
Active Record ORM creating this pattern implicitly
This query pattern can happen when using eager loading methods like includes
or preload
.
This Crunchy Data post mentions how eager loading methods produce IN
clause SQL queries.
The post links to the Eager Loading Associations documentation which has examples in Active Record and the resulting SQL that we’ll use here.
Let’s first discuss N+1 with these examples.
Fixing N+1s
Let’s study the examples here. Here’s some Active Record for books and authors:
# N+1
books = Book.limit(10)
books.each do |book|
puts book.author.last_name
end
The issue above is the undesirable N+1 query pattern, where a table is repeatedly queried in a loop, instead of bulk loading a set of rows.
To fix the N+1, we’ll add the includes(:author)
eager loading method to the code above.
That looks like this:
books = Book.includes(:author).limit(10) 👈
books.each do |book|
puts book.author.last_name
end
We’ve now eliminated the N+1 queries, but we’ve opened ourselves up to a new possible problem.
Eager loading with includes or preload
While the includes(:author)
fixed the N+1 queries, Active Record is now creating two queries, with the second one having an IN
clause.
Here’s the example from above as SQL:
SELECT books.* FROM books LIMIT 10;
SELECT authors.* FROM authors
WHERE authors.id IN (1,2,3,4,5,6,7,8,9,10);
Here we only have 10 values for the IN
clause, so performance will be fine. However, once we’ve got hundreds or thousands of values, we will run into the problems described above.
Performance will tank if the authors.id
primary key index isn’t used for this filtering operation.
Where else do these IN
clause queries come from?
Eager loading using eager_load
Besides includes()
and preload()
which create two queries with the second having an IN
clause, there’s another way to do eager loading in Active Record.
An alternative method eager_load
works a little bit differently. It produces a single SQL query that uses a LEFT OUTER JOIN
.
Here’s an example of eager_load
from the Active Record documentation:
books = Book.eager_load(:author).limit(10)
books.each do |book|
puts book.author.last_name
end
The following single SQL query is produced. Note that it has no IN
clause.
SELECT
"books"."id" AS t0_r0,
"books"."title" AS t0_r1
FROM
"books" LEFT OUTER JOIN "authors"
ON "authors"."id" = "books"."author_id"
LIMIT 10;
Since we’re now using a join operation, we’ve got statistics available from both tables. This makes it much more likely PostgreSQL can correctly estimate selectivity and cardinality.
The planner also isn’t needing to parse and store a large list of constant values.
While IN
clauses might perform fine with smaller inputs, e.g. 100 values or fewer,1, for large lists we should try and restructure the query to use a join operation instead.
Besides restructuring the queries into joins, are there other alternatives?
Alternative approaches using ANY
or SOME
Crunchy Data’s post Postgres Query Boost: Using ANY Instead of IN describes how IN
is more restrictive on the input.
A more usable to IN
can be using ANY
or SOME
, which has more flexibility in handling the list of values.
Here’s A CTE example using ANY
:
WITH author_ids AS (
SELECT id FROM authors
)
SELECT title
FROM books
WHERE author_id = ANY (
SELECT id
FROM author_ids);
However, ANY
is not generated by Active Record. What if we want to generate these queries using Active Record?
One option is to use the any
method provided by the ActiveRecordExtended gem.
Let’s talk at another alternative approach using a VALUES
clause.
A VALUES clause
In the comments in the PR above, Vlad and Sean discussed an alternative for IN
using a VALUES clause.
Let’s look at an example with a CTE and VALUES clause:
WITH ids(author_id) AS (
VALUES(1),(2),(3)
)
SELECT title
FROM books
JOIN ids USING (author_id);
Or we can write this as a subquery:
SELECT title
FROM books
WHERE author_id IN (
SELECT id
FROM (VALUES(1),(2),(3)) AS v(id)
);
This is better because the IN
list is a big list of scalar expressions, where the VALUES
clause is treated like a relation (or table). This can help with join strategy selection.
A temporary table of ids
Yet another option for big lists of values is to put these into a temporary table for the session. The temporary table can even index the ids.
CREATE TEMP TABLE temp_ids (author_id int);
INSERT INTO temp_ids(author_id) VALUES (1),(2),(3);
CREATE INDEX ON temp_ids(author_id);
SELECT title
FROM books b
JOIN temp_ids t ON t.author_id = b.author_id;
Using ANY
and an ARRAY
of values
Another form is using ANY
with an ARRAY:
SELECT title
FROM books
WHERE author_id = ANY (ARRAY[1, 2, 3]);
The ANY
form can perform better. With an IN
list, the values are parsed like a chain of OR operations, with the planner handling one branch at a time.
ANY
is treated like a single functional expression.
This form also supports prepared statements. With prepared statements, the statement is parsed and planned once and then can be reused.
Here’s an example of fetching books by author:
PREPARE get_books_by_author(int[]) AS
SELECT title
FROM books
WHERE author_id = ANY ($1);
EXECUTE get_books_by_author(ARRAY[1,2,3,4,5]);
How do we find whether our system has these problematic IN
queries?
Finding IN
clause queries in pg_stat_statements
To find out if your query stats include the problematic IN
queries, let’s search the results of pg_stat_statements
by querying the query
field.
Unfortunately these don’t always group up well, so there can be duplicates or near-duplicates. You may have lots of PGSS results to sift through.
Here’s a basic query to filter on query
for '%IN \(%'
:
SELECT
query
FROM
pg_stat_statements
WHERE
query LIKE '%IN \(%';
See the linked PR for a reproduction set of commands to create these tables, queries, and then inspect the query statistics using PGSS.
While you can find and restructure your queries towards more efficient patterns, are there any changes coming to Postgres itself to better handle these?
Improvements in Postgres 17
As part of the PostgreSQL 17 release in 2024, the developers made improvements to more efficiently work with scalar expressions and indexes, resulting in fewer repeated scans, and thus faster execution.
This reduced latency by reducing IO, and the benefits are available to all Postgres users without the need to change their SQL queries or ORM code!
Grouping similar query groups in pg_stat_statements
There are more usability improvements coming for Postgres users, pg_stat_statements, an IN
clause queries.
One problem with these has been that similar entries aren’t collapsed together when they have a different numbers of scalar array expressions.
For example IN ('1')
was not grouped with IN ('1','2')
. Having the statistics for nearly identical entries split across multiple results makes them less useful.
Fortunately, fixes are coming. On the Ruby on Rails side, Sean Linsley is working on a fix by replacing the use of IN
with ANY
which solves the grouping problem.
Here’s the PR: https://github.com/rails/rails/pull/49388#issuecomment-2680362607
On the PostgreSQL side, there are fixes coming for PostgreSQL 18.
Improvements in PostgreSQL 18
Related improvements are coming to PostgreSQL 18 for 2025.
This commit1 implements the automatic conversion of x IN (VALUES ...)
into ScalarArrayOpExpr.
Another noteworthy commit is: “Squash query list jumbling” from Álvaro Herrera.2
pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...)
depending on the number of parameters, because every element of ArrayExpr is individually jumbled.
Most of the time that’s undesirable, especially if the list becomes too large.
This commit3 mentions the original design was for a GUC query_id_squash_values, but that was removed in favor of making this the default behavior.
Conclusion
We looked at a problematic query pattern, giant IN
lists. We looked at why these are slow: they take more resources to parse and plan. There are fewer indexing options compared with joins. Compared with an equivalent join operation which provides two sets of table statistics, the big list of constants can’t be compared with pre-collected data distribution statistics.
We looked at how these queries can be created directly using multiple queries in a common application code pattern, or indirectly by using eager loading methods in the Active Record ORM.
Fortunately improvements are coming in newer versions of both Ruby on Rails and PostgreSQL.
We learned how to find these using pg_stat_statements for PostgreSQL. Once we’ve found these problematic queries, we looked at several possible alternative solutions that can be parsed, planned, and executed more efficiently.
Our main tactics are to convert these to joins when possible. Outside of that, we could consider using the ANY
operator and a prepared statement.
Hopefully the next time you see giant IN lists causing database performance problems, you’re more prepared to tackle them!
Please share any tips or tracks you have by reaching out through my contact form.
-
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c0962a113d1f2f94cb7222a7ca025a67e9ce3860 ↩ ↩2
-
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=62d712ecfd940f60e68bde5b6972b6859937c412 ↩
-
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fbd53dea5d513a78ca04834101ca1aa73b63e59 ↩