The correlated subqueries guide explains the idea: a subquery that cannot run on its own because it reaches back into each outer row, so it executes once per row rather than once for the whole query. This workbook makes that concrete with a tiny customers, purchases and products dataset. You will write correlated subqueries in SELECT, in WHERE, with EXISTSand NOT EXISTS, with aggregates, and finally rewrite one as a join to see the performance trade-off the article warns about. Run Example 1 first; the rest depend on it.
1. Create the sample tables
Run this once. It builds three small tables so every later query has something to read. The data is deliberately tiny so you can verify each result by eye: four customers (one of whom never buys anything), four products (one never sold), and six purchases.
-- Run once. Change `scratch` to any dataset you can write to.CREATE OR REPLACE TABLE scratch.customers ASSELECT * FROM UNNEST([ STRUCT(1 AS customer_id, 'Ava' AS name, 'UK' AS country, DATE '2024-01-10' AS signup_date), STRUCT(2, 'Ben', 'UK', DATE '2024-02-15'), STRUCT(3, 'Chloe', 'US', DATE '2024-03-20'), STRUCT(4, 'Dan', 'US', DATE '2024-04-05') -- never makes a purchase]);CREATE OR REPLACE TABLE scratch.products ASSELECT * FROM UNNEST([ STRUCT(101 AS product_id, 'Keyboard' AS name, 80 AS price), STRUCT(102, 'Monitor', 300), STRUCT(103, 'Mouse', 25), STRUCT(104, 'Webcam', 60) -- never purchased]);CREATE OR REPLACE TABLE scratch.purchases ASSELECT * FROM UNNEST([ STRUCT(1 AS purchase_id, 1 AS customer_id, 102 AS product_id, 1 AS quantity, DATE '2024-05-01' AS purchase_date, 300 AS amount), STRUCT(2, 1, 101, 1, DATE '2024-05-03', 80), STRUCT(3, 1, 103, 2, DATE '2024-06-01', 50), STRUCT(4, 2, 103, 1, DATE '2024-05-10', 25), STRUCT(5, 3, 102, 4, DATE '2024-05-12', 1200), STRUCT(6, 3, 101, 1, DATE '2024-06-15', 80)]);-- Peek at what you just builtSELECT * FROM scratch.purchases ORDER BY purchase_id;
For reference as you work through the rest: Ava has spent 430, Ben 25, Chloe 1280, and Dan nothing. The Webcam has never sold. Keep those numbers in mind so you can sanity-check every query below.
2. A regular subquery first, for contrast
Before the correlated kind, here is an ordinary uncorrelated subquery. It runs once for the whole query, produces a single value, and the outer query uses it. There is no reference back to the outer row.
-- The inner query runs ONCE and returns one number (the average price)SELECT product_id, name, priceFROM scratch.productsWHERE price > (SELECT AVG(price) FROM scratch.products);
The average price is 116.25, so only the Monitor comes back. The crucial point is independence: you could run that inner SELECT AVG(price) on its own and it would work. That is exactly what a correlated subquery cannot do.
3. A correlated subquery in SELECT
Now the real thing. This inner query references c.customer_id from the outer row, so it runs once per customer to count that customer’s purchases. It adds an inline metric without a join or a GROUP BY.
SELECT c.customer_id, c.name, (SELECT COUNT(*) FROM scratch.purchases AS p WHERE p.customer_id = c.customer_id) AS purchase_count -- runs per outer rowFROM scratch.customers AS c;
You get Ava 3, Ben 1, Chloe 2, Dan 0. Try to run the inner query alone and BigQuery will error on the unknown alias c, which is the defining feature: the subquery only has meaning in the context of an outer row.
4. Filtering with EXISTS
EXISTS asks a yes-or-no question: does at least one matching row exist? It is the natural correlated filter for presence. As soon as one match is found BigQuery stops looking, which is why you select a literal 1 rather than real columns.
SELECT c.customer_id, c.nameFROM scratch.customers AS cWHERE EXISTS ( SELECT 1 FROM scratch.purchases AS p WHERE p.customer_id = c.customer_id -- correlates to the outer customer);
This returns Ava, Ben and Chloe, the customers who have bought something. Dan is excluded because no purchase row matches his id. The SELECT 1 is a convention meaning “I only care whether a row exists, not what is in it.”
5. Excluding with NOT EXISTS: products never purchased
Flip EXISTS to NOT EXISTS to keep rows that have no match at all. This is the cleanest way to answer “which products have never sold.”
SELECT pr.product_id, pr.nameFROM scratch.products AS prWHERE NOT EXISTS ( SELECT 1 FROM scratch.purchases AS p WHERE p.product_id = pr.product_id);
Only the Webcam comes back. NOT EXISTS is the recommended way to express this in BigQuery because it handles the absence case cleanly, unlike NOT IN, which can behave surprisingly when the inner list contains a NULL.
6. NOT EXISTS: customers with no purchases
The same pattern from the other table answers a churn-style question: who has never bought anything? Dan is the only one.
SELECT c.customer_id, c.nameFROM scratch.customers AS cWHERE NOT EXISTS ( SELECT 1 FROM scratch.purchases AS p WHERE p.customer_id = c.customer_id);
Presence and absence checks like examples 4 to 6 are where correlated EXISTS and NOT EXISTS are at their best. They read like the question you are asking and they stop as soon as they have their answer.
7. A correlated aggregate in WHERE
The inner query can aggregate. Here it sums each customer’s spending and the outer WHERE keeps only those above a threshold, which is how you express “customers whose lifetime value exceeds 1000.”
SELECT c.customer_id, c.nameFROM scratch.customers AS cWHERE ( SELECT SUM(p.amount) FROM scratch.purchases AS p WHERE p.customer_id = c.customer_id) > 1000;
Only Chloe, at 1280, clears the bar. The subquery computes a per-customer total on the fly and compares it in the filter, all without grouping the outer query.
8. Comparing a row to its own group
This is the pattern correlated subqueries handle elegantly: compare each row to an aggregate of the group it belongs to. Here we keep each purchase that is above that same customer’s average purchase amount.
SELECT p.purchase_id, p.customer_id, p.amountFROM scratch.purchases AS pWHERE p.amount > ( SELECT AVG(p2.amount) FROM scratch.purchases AS p2 WHERE p2.customer_id = p.customer_id -- average for THIS purchase's customer);
The result is purchase 1 (Ava’s 300, above her average of about 143) and purchase 5 (Chloe’s 1200, above her average of 640). The same table appears twice under two aliases, p and p2, so the inner average is scoped to the outer row’s customer.
9. A correlated scalar as a calculated column
Combine the ideas from examples 3 and 8 to add a derived percentage: what share of each customer’s total spend does a single purchase represent? The inner query supplies the per-customer total for the division.
SELECT p.purchase_id, p.customer_id, p.amount, ROUND(p.amount / ( SELECT SUM(p2.amount) FROM scratch.purchases AS p2 WHERE p2.customer_id = p.customer_id ) * 100, 1) AS pct_of_customer_totalFROM scratch.purchases AS pORDER BY p.customer_id, p.amount DESC;
Chloe’s Monitor shows about 93.8 percent of her spend, her Keyboard about 6.3. A single inline metric like this is a textbook good use of a correlated subquery: one clear, per-row number that would be awkward to express any other way.
10. Rewriting as a JOIN for performance
The article’s main caution is that a correlated subquery runs once per outer row, which scales badly on large tables. When you need the same metric for many rows, a join with GROUP BY usually computes it in one pass instead. Here is the per-customer count from Example 3, rewritten.
-- Same answer as Example 3, computed in a single passSELECT c.customer_id, c.name, COUNT(p.purchase_id) AS purchase_countFROM scratch.customers AS cLEFT JOIN scratch.purchases AS p ON p.customer_id = c.customer_idGROUP BY c.customer_id, c.name;
The LEFT JOIN keeps Dan with a count of zero, matching the correlated version exactly. On the four rows here the difference is invisible, but on millions of rows the join is typically far cheaper because BigQuery aggregates once rather than re-running a subquery per customer.
A few rules of thumb from the guide to take away. Prefer EXISTS and NOT EXISTS over IN and NOT IN for presence and absence checks, since they are clearer and avoid the NULL pitfalls of NOT IN. Filter early, so the outer query has fewer rows before any per-row subquery fires. Reach for a correlated subquery when you want a single inline metric or a presence check and the outer row count is modest. When you need the same aggregate across a large table, rewrite it as a join with GROUP BY. Used with that judgement, correlated subqueries stay readable where they shine and you sidestep them where they would be slow.
See you soon.
[…] BigQuery Correlated Subqueries: 10 Code-Along Examples […]