Scalar Subqueries in BigQuery, Explained Simply

A one-page BigQuery scalar subquery reference, every snippet explained line by line: SELECT, WHERE, HAVING and expressions, the one-row rule, aggregates, and correlated subqueries. That completes the scalar subqueries cluster end to end. Want me to build the same four-part set for the next topic, or move to something else?

Imagine you are filling in a form and one box says “amount, as a percentage of the team average.” You cannot fill that box until you know one number: the team average. So you go and work it out, get a single figure like 4,200, and then use it in your calculation. That little side-trip to fetch one number is exactly what a scalar subquery is.

A scalar subquery is a small question tucked inside a bigger query, and its answer is always a single value. One number, one piece of text, one date. Because the answer is just one value, you can drop it anywhere a value would normally go: in the list of columns you are selecting, in a filter, or right in the middle of a sum.

Here it is fetching one number, the average price, and showing it next to every product.

SELECT
name,
price,
(SELECT AVG(price) FROM products) AS avg_price
FROM products;

Read the part in brackets on its own: “the average price of all products.” That returns a single number. The outer query then prints it on every row, so you can eyeball each price against the average. The inner question runs once, gets its one answer, and that answer is reused everywhere.

The same single value works as a yardstick in a filter. If you only want products that cost more than average, you ask for the average once and compare against it.

SELECT name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);

Download here a cheatsheet:

There is one rule, and it is the whole personality of a scalar subquery: it must return exactly one row and one column. If you write a subquery that hands back a whole list of values, BigQuery cannot squeeze a list into a box that expects one number, so it stops and complains that the subquery “produced more than one element.” The fix is almost always to wrap the inner query in something that collapses many rows into one, like MAXMINAVGSUM, or COUNT. Those functions are a scalar subquery’s best friends because they are guaranteed to give back a single value.

There is also a slightly cleverer version called a correlated scalar subquery. This one peeks at the row you are currently looking at and answers a question about that specific row, like “how many purchases has this exact customer made.” It still returns one value, but it works it out fresh for every row rather than once for the whole table.

That is really all there is to it. A scalar subquery is a quick errand to fetch one value, run mid-query, so you can use that value right where you need it. If you have ever paused a calculation to look up a single number and then carried on, you already understand exactly what it does.

See you soon.

View Comments (1)

Leave a Reply

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Discuss Data Science, Machine Learning and Analytics

Subscribe now to keep reading and get access to the full archive.

Continue reading