BigQuery Correlated Subqueries

Learn BigQuery correlated subqueries: per-row filtering with EXISTS and NOT EXISTS, inline aggregates in SELECT, and when to use a JOIN with GROUP BY instead.

A correlated subquery is the one that cannot stand on its own. Where an ordinary subquery runs once and hands its result up to the outer query, a correlated subquery runs once for every row of the outer query and reaches back into that row for a value. That row-by-row dependency is exactly what makes it powerful: it lets you test, filter, or calculate against related data on a per-row basis, asking a fresh question of each row rather than one question of the whole table. This article builds on the broader tour in BigQuery Subqueries and Nested Queries and focuses on the correlated form, using a schema of customerspurchases, and products.

What Makes a Subquery Correlated

Three things define it. It runs once per row in the outer query, it references columns from that outer row, and it usually lives in the WHERE, SELECT, or HAVING clause. The reference back to the outer query is the giveaway. The general shape looks like this:

SELECT outer_col1, outer_col2
FROM outer_table o
WHERE EXISTS (
SELECT 1
FROM inner_table i
WHERE i.foreign_key = o.primary_key
)

The inner query depends on o.primary_key, a value that only exists once you are looking at a specific outer row, which is why it cannot be run in isolation.

A First Example: Customers With a High-Value Purchase

The most common use is filtering with EXISTS. This returns every customer who has made at least one purchase over 1,000.

SELECT c.customer_id, c.name
FROM `analytics-warehouse.shop.customers` c
WHERE EXISTS (
SELECT 1
FROM `analytics-warehouse.shop.purchases` p
WHERE p.customer_id = c.customer_id
AND p.purchase_total > 1000
)

For each customer, the inner query looks only at that customer’s purchases and checks whether any exceeded the threshold. The SELECT 1 is conventional: EXISTS only cares whether a matching row exists, not what it contains, so there is nothing to gain from selecting real columns.

A Correlated Subquery in the SELECT Clause

Beyond filtering, a correlated subquery can compute a value for each row as an inline column. Here every customer gets a count of their purchases.

SELECT
c.customer_id,
c.name,
(
SELECT COUNT(*)
FROM `analytics-warehouse.shop.purchases` p
WHERE p.customer_id = c.customer_id
) AS total_purchases
FROM `analytics-warehouse.shop.customers` c

This reads cleanly and is perfect for a report, but it is worth knowing what it costs. The same result can be expressed as a join with a group-by:

SELECT c.customer_id, c.name, COUNT(p.purchase_id) AS total_purchases
FROM `analytics-warehouse.shop.customers` c
LEFT JOIN `analytics-warehouse.shop.purchases` p
ON c.customer_id = p.customer_id
GROUP BY c.customer_id, c.name

The two produce the same numbers, and the choice between them is a genuine trade-off. The join with GROUP BYgenerally scales better and is more flexible when you need several aggregates at once, so it is the better default on large data. The correlated subquery is cleaner and more readable for a single per-row value, especially in reporting where clarity matters more than raw throughput. Reach for the join when performance leads, and the subquery when readability does.

Excluding Rows With NOT EXISTS

Flip EXISTS to NOT EXISTS and you get the opposite question: which rows have no match at all. This returns products that have never been purchased.

SELECT p.product_id, p.name
FROM `analytics-warehouse.shop.products` p
WHERE NOT EXISTS (
SELECT 1
FROM `analytics-warehouse.shop.purchases` s
WHERE s.product_id = p.product_id
)

NOT EXISTS is the natural way to find the absence of related data, things like products never ordered, customers who never logged in, or accounts with no activity, and it is usually clearer and faster than the NOT IN alternative, which also behaves awkwardly when nulls are involved.

Correlated Subqueries With Aggregation

The inner query can aggregate as well as check existence, which lets you filter the outer rows by a computed total. This returns customers whose lifetime spend exceeds 1,000.

SELECT c.customer_id, c.name
FROM `analytics-warehouse.shop.customers` c
WHERE (
SELECT SUM(p.purchase_total)
FROM `analytics-warehouse.shop.purchases` p
WHERE p.customer_id = c.customer_id
) > 1000

For each customer, the inner query sums only their purchases and the outer WHERE keeps those above the threshold. This is expressive, but it is also the pattern most likely to be slow on big tables, since the sum is recomputed per customer. On large data, computing the totals once with a grouped subquery or CTE and then filtering is often the faster route.

Best Practices and Pitfalls

A few habits keep correlated subqueries efficient. Filter as early as possible inside the inner query, so it touches the least data per outer row. Prefer EXISTS and NOT EXISTS over IN and NOT IN on large datasets, because they can stop at the first match and handle nulls more predictably. When the data is large and performance leads, rewrite the logic as a join with GROUP BY, which the engine can optimise more aggressively. And always test the inner query’s logic on its own first, since a correlated subquery is harder to debug once it is entangled with the outer row.

The pitfalls are the mirror image of those practices. A subquery used with a plain comparison like = will fail if it can return more than one row, so use EXISTSIN, or an aggregate that collapses to a single value. Poor performance on large datasets is the classic symptom of an over-used correlated subquery, and the cure is usually the join-and-group rewrite. And wrong results almost always trace back to faulty inner logic, which is exactly why isolating and testing the subquery first pays off.

When to Reach for One

Correlated subqueries shine for per-row questions about related data. Use a WHERE form with EXISTS to keep rows that have a matching record, such as customers with recent activity. Use NOT EXISTS to find rows with no match, such as customers with no recent orders. Use a SELECT form to add a per-row calculated column like a count or sum of related rows. And use the aggregating WHERE form to filter rows by a related total, such as customers above a spending threshold.

ClauseStructureWhat it does
WHEREWHERE EXISTS (SELECT ... WHERE related = outer.key)Keeps rows with a matching inner record
WHEREWHERE NOT EXISTS (...)Excludes rows that have a match
SELECT(SELECT COUNT(*) FROM ... WHERE related = outer.key)Adds an inline calculated column
WHERE / HAVINGWHERE (SELECT SUM(...) FROM ...) > XFilters rows by a related aggregate

Conclusion

A correlated subquery runs once per outer row and reaches back into that row for a value, which makes it the right tool for per-row checks and calculations against related data. Use EXISTS and NOT EXISTS to keep or exclude rows by the presence or absence of a match, put a subquery in SELECT for an inline per-row metric, and aggregate inside the subquery to filter by a related total. Just keep performance in mind: the same logic almost always has a join-and-group-by equivalent, and on large data that rewrite is usually faster, so save the correlated form for the cases where its readability genuinely earns its place.

View Comments (1)

Leave a Reply

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Discuss Data Science, Machine Learning and Analytics

Subscribe now to keep reading and get access to the full archive.

Continue reading