Recently, I was showing the PostgreSQL query planner output in a training session. We discussed the “Rows Removed by Filter” count as we prepared to add an index based on a high proportion of filtered rows.
The count was one less than the total row count, which made sense since the query had a `LIMIT` of 1 and was finding a unique value.
However, as I tried different `WHERE` clause values, the “Rows Removed by Filter” count started to confuse me. I realized I didn’t fully understand the behavior and would need to dig in.
Along the way, I asked Michael Christofides (of PgMustard) for help. Michael was gracious enough to read this post, and even paired up with me to talk through the findings.
What did we find?
Digging Into “Rows Removed by Filter”
Let’s discuss the table and query details, and other circumstances.
- We queried a “users” table with 20,210 total rows. The table uses an integer sequence starting from 1.
- New rows are not being inserted. We’re working with a static set of rows in this post.
- `VACUUM` has not run for this table.
- We used the same `WHERE` clause in the query, changing only the value of the `name_code` string column, which holds a mostly-unique (but not guaranteed or enforced with a constraint) “code” value for each row.
- We set a `LIMIT` of 1 on the queries, but no `ORDER BY` and no `OFFSET`. In practice, a `LIMIT` would often be accompanied by an `ORDER BY`, but that wasn’t the case here.
- A Sequential Scan was used to access the row data because there was no index. In a production system, an index on the `name_code` column would be a good idea, but in this post we’re aiming to understand the behavior without that index.
- All data was accessed from shared buffers, confirmed by adding `EXPLAIN (ANALYZE, BUFFERS)` and seeing `Buffers: shared hit` in the execution plan.
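As a concrete sketch, the setup might look like the following. The table definition and the `code_…` values are assumptions for illustration; the post doesn’t include the original schema.

```sql
-- Hypothetical reconstruction of the test table: an integer sequence id
-- and a mostly-unique name_code column, with no index on name_code.
CREATE TABLE users (
    id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    name_code text
);

-- Populate 20,210 rows in insertion order.
INSERT INTO users (name_code)
SELECT 'code_' || n
FROM generate_series(1, 20210) AS n;

-- The query shape used throughout the post:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM users
WHERE name_code = 'code_12345'
LIMIT 1;
```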
Analyzing “Rows Removed By Filter”
Expectation versus reality
Although I expected “Rows Removed by Filter” to be one less than the total row count, I learned there are many reasons why that wouldn’t be the case.
What was happening?

- With `LIMIT 1` in the query, as soon as PostgreSQL found any match, there was an “early return.” There was no need to access additional rows in pages.
- When we searched for `name_code` values from earlier inserted rows, stored in earlier pages, we saw smaller values for “Rows Removed by Filter.”
- When we searched for `name_code` values “late” in the insertion order, many more rows were removed before finding a match. We could confirm this in the planner output by observing greater numbers of buffers accessed.

As stated, when prepending `EXPLAIN (ANALYZE, BUFFERS)` to the query, we saw fewer buffers accessed for the “early matches.”
With fewer buffers accessed, there was less latency and lower execution times. However, performance wasn’t the goal of this exercise. If it was, it would make more sense to add an index covering the `name_code` column.
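To make the difference concrete, here’s what comparing an “early” and a “late” value might look like. The `name_code` values and the plan figures are illustrative assumptions, not output from the post:

```sql
-- An "early" value: the match is found in the first page,
-- so the scan stops almost immediately.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name_code = 'code_2' LIMIT 1;
--   Filter: (name_code = 'code_2'::text)
--   Rows Removed by Filter: 1
--   Buffers: shared hit=1

-- A "late" value: nearly every row is read and filtered
-- before the match is found, touching many more buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name_code = 'code_20000' LIMIT 1;
--   Filter: (name_code = 'code_20000'::text)
--   Rows Removed by Filter: 19999
--   Buffers: shared hit=[many more]
```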
Additional information about “Rows Removed by Filter”
One question I had was whether “Rows Removed by Filter” was an actual figure or an estimated figure.
The PgMustard Glossary explains it as:

> This is a per-loop average, rounded to the nearest integer.
What does this mean? When more than one loop is used for the plan node (e.g. `loops=2` or greater), “Rows Removed by Filter” is the average number of rows removed per loop.
For example, if one loop removed 10 rows, and another removed 30 rows, we’d expect to see a value of 20 as the average of the two.
When there’s one loop (`loops=1`), the figure is the actual number of rows processed and removed.
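As a sketch, imagine a plan where an inner Seq Scan node executed twice; the node shape and figures below are invented for illustration:

```sql
-- Hypothetical plan fragment where the scan ran in two loops.
-- If loop 1 removed 10 rows and loop 2 removed 30 rows, the
-- reported figure is the per-loop average: (10 + 30) / 2 = 20.
--
--   ->  Seq Scan on users (actual time=... rows=1 loops=2)
--         Filter: (name_code = ...)
--         Rows Removed by Filter: 20
```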
Michael pointed out how using `OFFSET` without ordering would be correlated with “Rows Removed by Filter.”
To start, when we grabbed the `name_code` from the first row with `LIMIT 1`, we didn’t see “Rows Removed by Filter” at all. In the default planner output format, when zero rows are filtered, the line is not displayed.

When we chose the second row (based on the default ordering), we saw a single shared hit, and one row removed.
Next, we tried going to the “middle” of the 20 thousand rows, using an `OFFSET` of 10000 and supplying to the query the `name_code` value from the first row at that offset, again without specifying an ordering. When searching for that `name_code`, we saw “Rows Removed by Filter: 10000”, which exactly matched the offset.
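The experiment can be sketched in two steps; the literal `name_code` value below is a placeholder assumption:

```sql
-- Step 1: without an ORDER BY, grab the name_code of the row at the
-- "middle" offset. The row returned is not guaranteed, but in this
-- static table it tracked insertion order.
SELECT name_code FROM users OFFSET 10000 LIMIT 1;

-- Step 2: search for that value (placeholder literal shown here).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name_code = 'code_10001' LIMIT 1;
--   Rows Removed by Filter: 10000
```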
- The PostgreSQL EXPLAIN documentation describes how “Rows Removed by Filter” appears only when `ANALYZE` is added to `EXPLAIN`.
- “Rows Removed by Filter” applies to filter conditions like a `WHERE` clause, but also to filter conditions on a join node.
- “Rows Removed by Filter” appears when at least one scanned row, or a “potential join pair” (for join nodes), was discarded by the filter condition.
- Michael noted that using `ORDER BY`, which is commonly paired with `LIMIT`, can produce more predictable planner results. See this thread: https://twitter.com/Xof/status/1413542818673577987
- Without an explicit ordering, and when using a `LIMIT`, the “Rows Removed by Filter” results may be surprising. When `LIMIT 1` is used, PostgreSQL returns as soon as it finds the first match. The default ordering is likely the insertion order of the rows, but it isn’t guaranteed.
- When analyzing “Rows Removed by Filter” figures, check whether the plan node had more than one loop. In that case, the figure is an average across all loops, rounded to the nearest integer.
- For performance work, a high proportion of rows filtered out indicates an optimization opportunity. Adding an index may greatly reduce the number of rows filtered, cut down on storage access, and speed up your query.
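For example, a plain B-tree index on the column would let the planner skip the sequential scan entirely (the index name here is an arbitrary choice):

```sql
CREATE INDEX users_name_code_idx ON users (name_code);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name_code = 'code_12345' LIMIT 1;
-- We'd expect an Index Scan touching only a few buffers, with no
-- large "Rows Removed by Filter" figure.
```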
Thanks for reading!