BigQuery Scalar Subqueries

A scalar subquery returns one value and works anywhere a value is expected. Learn BigQuery scalar subqueries in SELECT, WHERE, HAVING, and expressions, plus the correlated form and JOIN trade-offs.

Of all the subquery forms, the scalar subquery is the most quietly useful, because it behaves like an ordinary value. It returns exactly one thing, one row and one column, which means you can drop it anywhere SQL expects a single value: in the SELECT list as a column, in a WHERE comparison, in a HAVING filter, even in the middle of an arithmetic expression. That one constraint, exactly one value, is what gives it its flexibility and also its single failure mode. This article covers the scalar form, following on from BigQuery Subqueries and Nested Queries and BigQuery Correlated Subqueries, using a schema of customersproducts, and purchases.

The Defining Rule

A scalar subquery must return one row and one column. That is the whole contract. If it returns more than one row, BigQuery stops with an error along the lines of Scalar subquery produced more than one element, because there is no single value to slot into the surrounding expression. Aggregate functions like MAXMINAVGSUM, and COUNT naturally collapse to a single value, which is why most scalar subqueries are aggregates. The basic shape looks like this:

SELECT
product_id,
(SELECT MAX(purchase_total) FROM `analytics-warehouse.shop.purchases`) AS max_purchase
FROM `analytics-warehouse.shop.products`

That single MAX value is computed once and attached to every row of the result.

In the SELECT Clause

The most common use is comparing each row against an overall metric by adding that metric as a column. Here every purchase is shown next to the global average purchase value.

SELECT
purchase_id,
purchase_total,
(SELECT AVG(purchase_total) FROM `analytics-warehouse.shop.purchases`) AS average_total
FROM `analytics-warehouse.shop.purchases`

Now each row carries both its own amount and the dataset-wide average, which makes “above or below average” trivial to read or compute downstream. The subquery runs once, not per row, because it does not depend on the outer row at all.

In the WHERE Clause

A scalar subquery is just as natural in a filter, where it supplies the value on one side of a comparison. This returns every purchase above the overall average.

SELECT *
FROM `analytics-warehouse.shop.purchases`
WHERE purchase_total > (
SELECT AVG(purchase_total)
FROM `analytics-warehouse.shop.purchases`
)

The inner query produces the single average, and the outer query keeps rows that exceed it. This is the kind of comparison a join handles awkwardly, so the scalar subquery is genuinely the cleanest tool for it.

In the HAVING Clause

You can filter groups against a computed value too, by combining a GROUP BY with a scalar subquery in HAVING. This returns categories whose total sales beat the average category total.

SELECT category, SUM(purchase_total) AS total_sales
FROM `analytics-warehouse.shop.purchases`
GROUP BY category
HAVING SUM(purchase_total) > (
SELECT AVG(total_sales)
FROM (
SELECT category, SUM(purchase_total) AS total_sales
FROM `analytics-warehouse.shop.purchases`
GROUP BY category
)
)

There are two layers of aggregation here, which is worth reading slowly. The inner subquery computes each category’s total, then averages those totals into a single number, and the outer HAVING keeps only the categories whose total exceeds that average. It is a clean way to express “above average among the groups.”

Inside an Expression

Because a scalar subquery is just a value, you can do arithmetic with it. This builds a price index, each product’s price relative to the average.

SELECT
product_id,
price,
price / (
SELECT AVG(price)
FROM `analytics-warehouse.shop.products`
) AS price_index
FROM `analytics-warehouse.shop.products`

price_index of 1 means a product sits exactly at the average, above 1 means pricier than average, below 1 means cheaper. Normalising a column against a dataset-wide statistic this way is a common reporting move, and the scalar subquery makes it a one-liner.

Correlated Scalar Subqueries

So far the subqueries have been independent, computed once for the whole query. A scalar subquery becomes correlated when it references the outer row, which turns it into a per-row calculation. This counts each customer’s purchases.

SELECT
c.customer_id,
c.name,
(
SELECT COUNT(*)
FROM `analytics-warehouse.shop.purchases` p
WHERE p.customer_id = c.customer_id
) AS purchase_count
FROM `analytics-warehouse.shop.customers` c

It is still scalar, because COUNT(*) returns a single number, but now that number is computed afresh for each customer, since the inner query filters on c.customer_id from the outer row. This is the bridge between scalar and correlated subqueries: the same single-value rule, but evaluated per row instead of once.

Scalar Subquery or a Join?

Knowing when to use a scalar subquery versus a join keeps your queries both readable and cheap. For a single value applied to every row, like a global average or maximum, a scalar subquery is the only clean option; a join does not naturally express it. For conditional filtering, such as keeping rows above a global metric, the scalar subquery in WHERE is again the better fit, since joins handle that comparison clumsily. But for per-row counts and sums, the correlated scalar subquery competes directly with a join plus GROUP BY, and on large datasets the join usually wins on performance because the engine can compute all the aggregates in one pass rather than re-running the subquery per row.

Keeping It Fast

A few habits keep scalar subqueries efficient. Use them freely for genuine constants, a total count, a max, a min, an average over the whole table, since these run once and cost little. For per-row metrics on large data, prefer a join with GROUP BY over a correlated scalar subquery, because recomputing an aggregate for every outer row gets expensive. And filter the inner query as tightly as possible, since in BigQuery every subquery scans data and you pay by the bytes scanned, so the less the inner query reads, the cheaper the whole statement.

ClauseExampleUse case
SELECT(SELECT AVG(price) FROM products)Add a global metric beside each row
WHEREWHERE total > (SELECT AVG(total) FROM ...)Filter against a global value
HAVINGHAVING SUM(x) > (SELECT AVG(sum_x) FROM ...)Compare a group to an external value
Expressionprice / (SELECT AVG(price) ...)Normalise a value against the dataset
Correlated(SELECT COUNT(*) FROM ... WHERE related = outer.key)A per-row calculated value

Conclusion

A scalar subquery returns exactly one value and can be used anywhere a value is expected: as a column, in a filter, inside a HAVING clause, or in an expression. Keep it to one row and one column or BigQuery will reject it, which is why it almost always wraps an aggregate. Use it freely for global constants and for filtering against them, where a join would be awkward, but when it becomes a correlated per-row aggregate over large data, weigh it against a join with GROUP BY, which usually scales better. And filter the inner query hard, because in BigQuery the cleanest scalar subquery and the cheapest one are usually the same query.

View Comments (2)

Leave a Reply

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Discuss Data Science, Machine Learning and Analytics

Subscribe now to keep reading and get access to the full archive.

Continue reading