BigQuery Correlated Subqueries: 10 Code-Along Examples

The correlated subqueries guide explains the idea: a subquery that cannot run on its own because it reaches back into each outer row, so it executes once per row rather than once for the whole query. This workbook makes that concrete with a tiny customers, purchases and products dataset. You will write correlated subqueries in SELECT, in WHERE, with EXISTSand NOT EXISTS, with aggregates, and finally rewrite one as a join to see the performance trade-off the article warns about. Run Example 1 first; the rest depend on it.

1. Create the sample tables

Run this once. It builds three small tables so every later query has something to read. The data is deliberately tiny so you can verify each result by eye: four customers (one of whom never buys anything), four products (one never sold), and six purchases.

			
-- Run once. Change `scratch` to any dataset you can write to.
CREATE OR REPLACE TABLE scratch.customers AS
SELECT * FROM UNNEST([
  STRUCT(1 AS customer_id, 'Ava'   AS name, 'UK' AS country, DATE '2024-01-10' AS signup_date),
  STRUCT(2, 'Ben',   'UK', DATE '2024-02-15'),
  STRUCT(3, 'Chloe', 'US', DATE '2024-03-20'),
  STRUCT(4, 'Dan',   'US', DATE '2024-04-05')   -- never makes a purchase
]);
CREATE OR REPLACE TABLE scratch.products AS
SELECT * FROM UNNEST([
  STRUCT(101 AS product_id, 'Keyboard' AS name, 80 AS price),
  STRUCT(102, 'Monitor', 300),
  STRUCT(103, 'Mouse', 25),
  STRUCT(104, 'Webcam', 60)                      -- never purchased
]);
CREATE OR REPLACE TABLE scratch.purchases AS
SELECT * FROM UNNEST([
  STRUCT(1 AS purchase_id, 1 AS customer_id, 102 AS product_id, 1 AS quantity, DATE '2024-05-01' AS purchase_date, 300 AS amount),
  STRUCT(2, 1, 101, 1, DATE '2024-05-03', 80),
  STRUCT(3, 1, 103, 2, DATE '2024-06-01', 50),
  STRUCT(4, 2, 103, 1, DATE '2024-05-10', 25),
  STRUCT(5, 3, 102, 4, DATE '2024-05-12', 1200),
  STRUCT(6, 3, 101, 1, DATE '2024-06-15', 80)
]);
-- Peek at what you just built
SELECT * FROM scratch.purchases ORDER BY purchase_id;

		

For reference as you work through the rest: Ava has spent 430, Ben 25, Chloe 1280, and Dan nothing. The Webcam has never sold. Keep those numbers in mind so you can sanity-check every query below.

2. A regular subquery first, for contrast

Before the correlated kind, here is an ordinary uncorrelated subquery. It runs once for the whole query, produces a single value, and the outer query uses it. There is no reference back to the outer row.

			
-- The inner query runs ONCE and returns one number (the average price)
SELECT product_id, name, price
FROM scratch.products
WHERE price > (SELECT AVG(price) FROM scratch.products);

The average price is 116.25, so only the Monitor comes back. The crucial point is independence: you could run that inner SELECT AVG(price) on its own and it would work. That is exactly what a correlated subquery cannot do.

3. A correlated subquery in SELECT

Now the real thing. This inner query references c.customer_id from the outer row, so it runs once per customer to count that customer’s purchases. It adds an inline metric without a join or a GROUP BY.

			
SELECT
  c.customer_id,
  c.name,
  (SELECT COUNT(*)
   FROM scratch.purchases AS p
   WHERE p.customer_id = c.customer_id) AS purchase_count   -- runs per outer row
FROM scratch.customers AS c;

		

You get Ava 3, Ben 1, Chloe 2, Dan 0. Try to run the inner query alone and BigQuery will error on the unknown alias c, which is the defining feature: the subquery only has meaning in the context of an outer row.

4. Filtering with EXISTS

EXISTS asks a yes-or-no question: does at least one matching row exist? It is the natural correlated filter for presence. As soon as one match is found BigQuery stops looking, which is why you select a literal 1 rather than real columns.

			
SELECT c.customer_id, c.name
FROM scratch.customers AS c
WHERE EXISTS (
  SELECT 1
  FROM scratch.purchases AS p
  WHERE p.customer_id = c.customer_id   -- correlates to the outer customer
);

		

This returns Ava, Ben and Chloe, the customers who have bought something. Dan is excluded because no purchase row matches his id. The SELECT 1 is a convention meaning “I only care whether a row exists, not what is in it.”

5. Excluding with NOT EXISTS: products never purchased

Flip EXISTS to NOT EXISTS to keep rows that have no match at all. This is the cleanest way to answer “which products have never sold.”

			
SELECT pr.product_id, pr.name
FROM scratch.products AS pr
WHERE NOT EXISTS (
  SELECT 1
  FROM scratch.purchases AS p
  WHERE p.product_id = pr.product_id
);

		

Only the Webcam comes back. NOT EXISTS is the recommended way to express this in BigQuery because it handles the absence case cleanly, unlike NOT IN, which can behave surprisingly when the inner list contains a NULL.

6. NOT EXISTS: customers with no purchases

The same pattern from the other table answers a churn-style question: who has never bought anything? Dan is the only one.

			
SELECT c.customer_id, c.name
FROM scratch.customers AS c
WHERE NOT EXISTS (
  SELECT 1
  FROM scratch.purchases AS p
  WHERE p.customer_id = c.customer_id
);

		

Presence and absence checks like examples 4 to 6 are where correlated EXISTS and NOT EXISTS are at their best. They read like the question you are asking and they stop as soon as they have their answer.

7. A correlated aggregate in WHERE

The inner query can aggregate. Here it sums each customer’s spending and the outer WHERE keeps only those above a threshold, which is how you express “customers whose lifetime value exceeds 1000.”

			
SELECT c.customer_id, c.name
FROM scratch.customers AS c
WHERE (
  SELECT SUM(p.amount)
  FROM scratch.purchases AS p
  WHERE p.customer_id = c.customer_id
) > 1000;

		

Only Chloe, at 1280, clears the bar. The subquery computes a per-customer total on the fly and compares it in the filter, all without grouping the outer query.

8. Comparing a row to its own group

This is the pattern correlated subqueries handle elegantly: compare each row to an aggregate of the group it belongs to. Here we keep each purchase that is above that same customer’s average purchase amount.

			
SELECT p.purchase_id, p.customer_id, p.amount
FROM scratch.purchases AS p
WHERE p.amount > (
  SELECT AVG(p2.amount)
  FROM scratch.purchases AS p2
  WHERE p2.customer_id = p.customer_id   -- average for THIS purchase's customer
);

		

The result is purchase 1 (Ava’s 300, above her average of about 143) and purchase 5 (Chloe’s 1200, above her average of 640). The same table appears twice under two aliases, p and p2, so the inner average is scoped to the outer row’s customer.

9. A correlated scalar as a calculated column

Combine the ideas from examples 3 and 8 to add a derived percentage: what share of each customer’s total spend does a single purchase represent? The inner query supplies the per-customer total for the division.

			
SELECT
  p.purchase_id,
  p.customer_id,
  p.amount,
  ROUND(p.amount / (
    SELECT SUM(p2.amount)
    FROM scratch.purchases AS p2
    WHERE p2.customer_id = p.customer_id
  ) * 100, 1) AS pct_of_customer_total
FROM scratch.purchases AS p
ORDER BY p.customer_id, p.amount DESC;

		

Chloe’s Monitor shows about 93.8 percent of her spend, her Keyboard about 6.3. A single inline metric like this is a textbook good use of a correlated subquery: one clear, per-row number that would be awkward to express any other way.

10. Rewriting as a JOIN for performance

The article’s main caution is that a correlated subquery runs once per outer row, which scales badly on large tables. When you need the same metric for many rows, a join with GROUP BY usually computes it in one pass instead. Here is the per-customer count from Example 3, rewritten.

			
-- Same answer as Example 3, computed in a single pass
SELECT
  c.customer_id,
  c.name,
  COUNT(p.purchase_id) AS purchase_count
FROM scratch.customers AS c
LEFT JOIN scratch.purchases AS p
  ON p.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

		

The LEFT JOIN keeps Dan with a count of zero, matching the correlated version exactly. On the four rows here the difference is invisible, but on millions of rows the join is typically far cheaper because BigQuery aggregates once rather than re-running a subquery per customer.

A few rules of thumb from the guide to take away. Prefer EXISTS and NOT EXISTS over IN and NOT IN for presence and absence checks, since they are clearer and avoid the NULL pitfalls of NOT IN. Filter early, so the outer query has fewer rows before any per-row subquery fires. Reach for a correlated subquery when you want a single inline metric or a presence check and the outer row count is modest. When you need the same aggregate across a large table, rewrite it as a join with GROUP BY. Used with that judgement, correlated subqueries stay readable where they shine and you sidestep them where they would be slow.

See you soon.

User-Defined Functions in BigQuery, Feynman Technique

BigQuery UDFs Cheatsheet

User-Defined Functions in BigQuery: 10 Code-Along Examples

CSS Font Properties: 10 Code-Along Examples

BigQuery Correlated Subqueries: 10 Code-Along Examples

1. Create the sample tables

2. A regular subquery first, for contrast

3. A correlated subquery in SELECT

4. Filtering with EXISTS

5. Excluding with NOT EXISTS: products never purchased

6. NOT EXISTS: customers with no purchases

7. A correlated aggregate in WHERE

8. Comparing a row to its own group

9. A correlated scalar as a calculated column

10. Rewriting as a JOIN for performance

Related

Leave a ReplyCancel reply

Recommended for You

Python Iterators, Comprehensions and Generators: 10 Code-Along Examples

Responsive Design with SCSS Media Queries: 10 Code-Along Examples

User-Defined Functions in BigQuery, Feynman Technique

BigQuery UDFs Cheatsheet

User-Defined Functions in BigQuery: 10 Code-Along Examples

CSS Font Properties: 10 Code-Along Examples

BigQuery Correlated Subqueries: 10 Code-Along Examples

1. Create the sample tables

2. A regular subquery first, for contrast

3. A correlated subquery in SELECT

4. Filtering with EXISTS

5. Excluding with NOT EXISTS: products never purchased

6. NOT EXISTS: customers with no purchases

7. A correlated aggregate in WHERE

8. Comparing a row to its own group

9. A correlated scalar as a calculated column

10. Rewriting as a JOIN for performance

Related

Leave a ReplyCancel reply

Subscribe to My Newsletter

Recommended for You

Python Iterators, Comprehensions and Generators: 10 Code-Along Examples

Responsive Design with SCSS Media Queries: 10 Code-Along Examples

Discover more from Discuss Data Science, Machine Learning and Analytics