Kagamino's Blog

PostgreSQL aggregations: subqueries are not so bad

PostgreSQL

While recently sharing my Postgres experience with a coworker, I noticed something odd in a query plan from my first post.

| Query | Time |
| --- | --- |
| Subquery | 500ms |
| Merged CTEs | 100ms |
Subquery — original
SELECT
  products.id,
  products.name,
  (SELECT COUNT(id) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(id) FROM orders  WHERE product_id = products.id) AS num_orders
FROM products;

Original subqueries query plan

Merged CTEs
WITH product_reviews AS (
  SELECT product_id, COUNT(id) AS num_reviews
  FROM reviews
  GROUP BY product_id
),
product_orders AS (
  SELECT product_id, COUNT(id) AS num_orders
  FROM orders
  GROUP BY product_id
)
SELECT
  products.id,
  products.name,
  COALESCE(product_reviews.num_reviews, 0) AS num_reviews,
  COALESCE(product_orders.num_orders,   0) AS num_orders
FROM products
LEFT JOIN product_reviews ON products.id = product_reviews.product_id
LEFT JOIN product_orders  ON products.id = product_orders.product_id;

Merged CTEs query plan

In the subqueries plan, each branch had a Bitmap Index Scan followed by a Bitmap Heap Scan. This is a common pattern that we could call “good enough”. But when it is the bottleneck of the query (as it is here, at 360ms), a common move is to try to replace these two nodes with an Index Only Scan.

A simple fix

The idea behind a Bitmap Index Scan + Bitmap Heap Scan is to use the index to filter (or prefilter) the rows, which is really fast, and then fetch the actual data from the heap (the table itself). For instance, instead of scanning 10GB of data and discarding 90% of it, you use the index to find the 10% of interesting rows, then fetch only the heap blocks where that data lives. Note that you will actually fetch more than 10% of the disk data, because you are forced to load whole pages (8KB by default). The Bitmap Heap Scan shines when the chosen rows live close together, because many rows can be fetched in a single page read.
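You can check the page size on your own instance; 8192 bytes is the compiled-in default, but a custom build may differ:

```sql
-- Page (block) size of this PostgreSQL build:
SHOW block_size;  -- typically 8192
```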

However, with subqueries, each loop runs independently, which leads to poor heap-page locality: the query may access the same heap page several times, and it pays the cost each time (within a single Bitmap Heap Scan, PostgreSQL is smarter and groups accesses to the same page).

The other odd thing is that the query shouldn't need a heap fetch at all: the index alone can answer it, since we only need to know how many rows matched for each product.

The original query used count(id), and replacing it with count(*) indeed improved the plan: count(id) has to read the id column (to skip NULL values), which is not stored in the product_id index, while count(*) needs no columns at all. We now get an Index Only Scan, and the query drops from 500ms to 100ms, matching the previous best. Also note that the subquery version gains even more once we add root table filters.

Subquery — fixed
SELECT
  products.id,
  products.name,
  (SELECT COUNT(*) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(*) FROM orders  WHERE product_id = products.id) AS num_orders
FROM products;
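As an aside, if you genuinely need the COUNT(id) semantics (skipping NULL ids), a covering index can keep the scan index-only. This is a sketch I haven't benchmarked here, and the index name is my own invention:

```sql
-- Hypothetical covering index: product_id drives the lookup, and id is
-- stored in the index leaf pages via INCLUDE (PostgreSQL 11+), so
-- COUNT(id) no longer needs a heap fetch.
CREATE INDEX orders_product_id_incl_id
  ON orders (product_id) INCLUDE (id);
```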

Fixed subqueries query plan

| Query | Scan type | Time |
| --- | --- | --- |
| Subquery count(id) | Bitmap Index + Heap Scan | 500ms |
| Subquery count(*) | Index Only Scan | 100ms |
| Merged CTEs | Parallel Seq Scan | 100ms |

But there are still major differences between the subqueries approach and the CTEs, notably when we apply filters to these queries.

Leaf filtering

The subqueries approach is better for root table filtering. If you want the same statistics, but only for products with a price over 900 (prices are random between 0 and 1000), the subquery approach filters before running the subqueries, which in this case reduces the number of subquery loops by ~10x.

Subquery — root filter
SELECT
  products.id,
  products.name,
  (SELECT COUNT(*) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(*) FROM orders  WHERE product_id = products.id) AS num_orders
FROM products
WHERE products.price > 900;

Subqueries root filter

Merged CTEs — root filter
WITH product_reviews AS (
  SELECT product_id, COUNT(id) AS num_reviews
  FROM reviews
  GROUP BY product_id
),
product_orders AS (
  SELECT product_id, COUNT(id) AS num_orders
  FROM orders
  GROUP BY product_id
)
SELECT
  products.id,
  products.name,
  COALESCE(product_reviews.num_reviews, 0) AS num_reviews,
  COALESCE(product_orders.num_orders,   0) AS num_orders
FROM products
LEFT JOIN product_reviews ON products.id = product_reviews.product_id
LEFT JOIN product_orders  ON products.id = product_orders.product_id
WHERE products.price > 900;

Merged CTEs root filter

| Query | No filter | Root filter (price > 900) |
| --- | --- | --- |
| Subquery count(*) | 100ms | 15ms |
| Merged CTEs | 100ms | 75ms |

However, when the filter is on a leaf table (customer_name ILIKE '%0'), the Index Only Scan can no longer be used, since customer_name is not part of the index. As expected, the subquery version falls back to a Bitmap Heap Scan and its duration goes up. What I can't explain is why the CTE slows down too: its Parallel Seq Scan goes from 10ms to 100ms, which doesn't make sense to me.

Subquery — leaf filter
SELECT
  products.id,
  products.name,
  (SELECT COUNT(*) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(*) FROM orders  WHERE product_id = products.id AND customer_name ILIKE '%0') AS num_orders
FROM products;

Subqueries leaf filter query plan

Merged CTEs — leaf filter
WITH product_reviews AS (
  SELECT product_id, COUNT(id) AS num_reviews
  FROM reviews
  GROUP BY product_id
),
product_orders AS (
  SELECT product_id, COUNT(id) AS num_orders
  FROM orders
  WHERE customer_name ILIKE '%0'
  GROUP BY product_id
)
SELECT
  products.id,
  products.name,
  COALESCE(product_reviews.num_reviews, 0) AS num_reviews,
  COALESCE(product_orders.num_orders,   0) AS num_orders
FROM products
LEFT JOIN product_reviews ON products.id = product_reviews.product_id
LEFT JOIN product_orders  ON products.id = product_orders.product_id;

Merged CTEs leaf filter query plan

| Query | No filter | Leaf filter (customer_name ILIKE '%0') |
| --- | --- | --- |
| Subquery count(*) | 100ms | 110ms |
| Merged CTEs | 100ms | 140ms |
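One avenue I haven't benchmarked: a leading-wildcard pattern like ILIKE '%0' can't use a btree index at all, but a trigram index can serve it. This is only a sketch; pg_trgm ships with PostgreSQL contrib, the index name is my own, and a pattern this short extracts too few trigrams to be very selective, so a longer pattern would benefit far more:

```sql
-- pg_trgm indexes trigrams, so GIN indexes built with it can serve ILIKE
-- patterns even with leading wildcards (index name is an assumption).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX orders_customer_name_trgm
  ON orders USING gin (customer_name gin_trgm_ops);
```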

Disk usage

Another important aspect is disk usage. Here's a summary of disk reads for each query.

| Query | Orders index | Orders | Reviews index | Reviews | Total |
| --- | --- | --- | --- | --- | --- |
| Merged CTEs | — | 57MB | — | 7MB | 64MB |
| Subquery count(*) | 230MB | — | 160MB | — | 390MB |
| Subquery count(id) | 230MB | 7600MB | 160MB | 780MB | 8770MB |

The last query's very high disk usage is explained by the huge number of uncorrelated loops: you may fetch the same page 10 times to read 10 rows, instead of reading all 10 rows in one page fetch.
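Numbers like these come from EXPLAIN with the BUFFERS option, which reports per-node shared buffer hits and reads (multiply the block counts by the 8KB page size to get bytes). A minimal example, reusing one of the subqueries above:

```sql
-- track_io_timing adds I/O timing to the BUFFERS output:
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT products.id,
       (SELECT COUNT(*) FROM reviews WHERE product_id = products.id) AS num_reviews
FROM products;
-- Look for lines like: Buffers: shared hit=1234 read=567
```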

In this setup it's not much of an issue: the data is small (57MB to read all orders), I have plenty of RAM, and I'm the only user. Production systems are different. Tables are much larger (easily tens of GBs), and pages read from disk are often evicted so other queries can cache theirs. The global setting that controls how much memory PostgreSQL allocates to caching disk pages is called shared_buffers (128MB on my machine).
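You can inspect the current value directly:

```sql
-- Current buffer cache size; 128MB is the stock default in postgresql.conf:
SHOW shared_buffers;
```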

In a simple AWS RDS setup, you can expect roughly ~100MB/s of read throughput from the disk, while your local setup may be 20x faster. I can use docker-compose blkio_config settings to throttle the disk, and I also set shared_buffers much lower (16MB) so it can't cache all my data.

docker-compose.yml
services:
  postgres:
    image: postgres:18
    restart: unless-stopped
    command: postgres -c shared_buffers=16MB -c track_io_timing=on
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
    ports:
      - "5432:5432"
    blkio_config:
      device_read_bps:
        - path: /dev/nvme0n1  # your disk device
          rate: '125mb'       # gp3 baseline throughput
      device_read_iops:
        - path: /dev/nvme0n1
          rate: 3000          # gp3 baseline IOPS
      device_write_bps:
        - path: /dev/nvme0n1
          rate: '125mb'
      device_write_iops:
        - path: /dev/nvme0n1
          rate: 3000
Subquery — count(id), Bitmap Heap Scan
SELECT
  products.id,
  products.name,
  (SELECT COUNT(id) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(id) FROM orders  WHERE product_id = products.id) AS num_orders
FROM products;

Original subqueries with limits

Subquery — count(*), Index Only Scan
SELECT
  products.id,
  products.name,
  (SELECT COUNT(*) FROM reviews WHERE product_id = products.id) AS num_reviews,
  (SELECT COUNT(*) FROM orders  WHERE product_id = products.id) AS num_orders
FROM products;

Subqueries fixed with limits

Merged CTEs
WITH product_reviews AS (
  SELECT product_id, COUNT(id) AS num_reviews
  FROM reviews
  GROUP BY product_id
),
product_orders AS (
  SELECT product_id, COUNT(id) AS num_orders
  FROM orders
  GROUP BY product_id
)
SELECT
  products.id,
  products.name,
  COALESCE(product_reviews.num_reviews, 0) AS num_reviews,
  COALESCE(product_orders.num_orders,   0) AS num_orders
FROM products
LEFT JOIN product_reviews ON products.id = product_reviews.product_id
LEFT JOIN product_orders  ON products.id = product_orders.product_id;

Merged CTEs with limits

| Query | Scan type | Time (throttled disk) |
| --- | --- | --- |
| Subquery count(id) | Bitmap Heap Scan | 4200ms |
| Subquery count(*) | Index Only Scan | 100ms |
| Merged CTEs | Parallel Seq Scan | 90ms |

I think I can explain the numbers. The CTEs, as before, only read the data once, so they read just the right amount from disk. The Index Only Scan subquery reads less data (only the index) but many times; the index fits in the cache, so its Shared Hit/Read ratio is very high (~10x). The Bitmap Heap Scan subquery gets ugly: the table data can't fit in the cache, so each independent loop has a high chance of having to read a page from disk. Its Shared Hit/Read ratio is ~0.5, and the total data read from disk goes from 7.7MB in the previous case to 5.5GB.
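The hit/read ratio is also visible at the database level through the built-in statistics views, if you want a running total rather than a per-query figure:

```sql
-- Cumulative buffer cache hit ratio for the current database:
SELECT blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 3) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```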

Wrapping up

In my benchmarks, it's not clear-cut which query is best. It depends on the specific use case, and it requires benchmarking on your own infrastructure.

Subqueries are easier to write. They excel at independence: you are less at risk of matching unrelated rows and skewing your statistics. They are also better at filtering overall: every filter that reduces the amount of data yields a performance gain, and even more so with good covering indexes.

CTEs are a bit harder to write and read. They do not benefit from root filtering, since the CTEs are computed first. However, they are the best for disk affinity: they read data in a cache-friendly way. They also benefit from indexes and partitions, which further reduce their disk footprint. In production systems, this can provide the greater performance win.