A correlated subquery is the one that cannot stand on its own. Where an ordinary subquery runs once and hands its result up to the outer query, a correlated subquery runs once for every row of the outer query and reaches back into that row for a value. That row-by-row dependency is exactly what makes it powerful: it lets you test, filter, or calculate against related data on a per-row basis, asking a fresh question of each row rather than one question of the whole table. This article builds on the broader tour in BigQuery Subqueries and Nested Queries and focuses on the correlated form, using a schema of customers, purchases, and products.
What Makes a Subquery Correlated
Three things define it. It runs once per row in the outer query, it references columns from that outer row, and it usually lives in the WHERE, SELECT, or HAVING clause. The reference back to the outer query is the giveaway. The general shape looks like this:
SELECT outer_col1, outer_col2FROM outer_table oWHERE EXISTS ( SELECT 1 FROM inner_table i WHERE i.foreign_key = o.primary_key)
The inner query depends on o.primary_key, a value that only exists once you are looking at a specific outer row, which is why it cannot be run in isolation.
A First Example: Customers With a High-Value Purchase
The most common use is filtering with EXISTS. This returns every customer who has made at least one purchase over 1,000.
SELECT c.customer_id, c.nameFROM `analytics-warehouse.shop.customers` cWHERE EXISTS ( SELECT 1 FROM `analytics-warehouse.shop.purchases` p WHERE p.customer_id = c.customer_id AND p.purchase_total > 1000)
For each customer, the inner query looks only at that customer’s purchases and checks whether any exceeded the threshold. The SELECT 1 is conventional: EXISTS only cares whether a matching row exists, not what it contains, so there is nothing to gain from selecting real columns.
A Correlated Subquery in the SELECT Clause
Beyond filtering, a correlated subquery can compute a value for each row as an inline column. Here every customer gets a count of their purchases.
SELECT c.customer_id, c.name, ( SELECT COUNT(*) FROM `analytics-warehouse.shop.purchases` p WHERE p.customer_id = c.customer_id ) AS total_purchasesFROM `analytics-warehouse.shop.customers` c
This reads cleanly and is perfect for a report, but it is worth knowing what it costs. The same result can be expressed as a join with a group-by:
SELECT c.customer_id, c.name, COUNT(p.purchase_id) AS total_purchasesFROM `analytics-warehouse.shop.customers` cLEFT JOIN `analytics-warehouse.shop.purchases` p ON c.customer_id = p.customer_idGROUP BY c.customer_id, c.name
The two produce the same numbers, and the choice between them is a genuine trade-off. The join with GROUP BYgenerally scales better and is more flexible when you need several aggregates at once, so it is the better default on large data. The correlated subquery is cleaner and more readable for a single per-row value, especially in reporting where clarity matters more than raw throughput. Reach for the join when performance leads, and the subquery when readability does.
Excluding Rows With NOT EXISTS
Flip EXISTS to NOT EXISTS and you get the opposite question: which rows have no match at all. This returns products that have never been purchased.
SELECT p.product_id, p.nameFROM `analytics-warehouse.shop.products` pWHERE NOT EXISTS ( SELECT 1 FROM `analytics-warehouse.shop.purchases` s WHERE s.product_id = p.product_id)
NOT EXISTS is the natural way to find the absence of related data, things like products never ordered, customers who never logged in, or accounts with no activity, and it is usually clearer and faster than the NOT IN alternative, which also behaves awkwardly when nulls are involved.
Correlated Subqueries With Aggregation
The inner query can aggregate as well as check existence, which lets you filter the outer rows by a computed total. This returns customers whose lifetime spend exceeds 1,000.
SELECT c.customer_id, c.nameFROM `analytics-warehouse.shop.customers` cWHERE ( SELECT SUM(p.purchase_total) FROM `analytics-warehouse.shop.purchases` p WHERE p.customer_id = c.customer_id) > 1000
For each customer, the inner query sums only their purchases and the outer WHERE keeps those above the threshold. This is expressive, but it is also the pattern most likely to be slow on big tables, since the sum is recomputed per customer. On large data, computing the totals once with a grouped subquery or CTE and then filtering is often the faster route.
Best Practices and Pitfalls
A few habits keep correlated subqueries efficient. Filter as early as possible inside the inner query, so it touches the least data per outer row. Prefer EXISTS and NOT EXISTS over IN and NOT IN on large datasets, because they can stop at the first match and handle nulls more predictably. When the data is large and performance leads, rewrite the logic as a join with GROUP BY, which the engine can optimise more aggressively. And always test the inner query’s logic on its own first, since a correlated subquery is harder to debug once it is entangled with the outer row.
The pitfalls are the mirror image of those practices. A subquery used with a plain comparison like = will fail if it can return more than one row, so use EXISTS, IN, or an aggregate that collapses to a single value. Poor performance on large datasets is the classic symptom of an over-used correlated subquery, and the cure is usually the join-and-group rewrite. And wrong results almost always trace back to faulty inner logic, which is exactly why isolating and testing the subquery first pays off.
When to Reach for One
Correlated subqueries shine for per-row questions about related data. Use a WHERE form with EXISTS to keep rows that have a matching record, such as customers with recent activity. Use NOT EXISTS to find rows with no match, such as customers with no recent orders. Use a SELECT form to add a per-row calculated column like a count or sum of related rows. And use the aggregating WHERE form to filter rows by a related total, such as customers above a spending threshold.
| Clause | Structure | What it does |
|---|---|---|
| WHERE | WHERE EXISTS (SELECT ... WHERE related = outer.key) | Keeps rows with a matching inner record |
| WHERE | WHERE NOT EXISTS (...) | Excludes rows that have a match |
| SELECT | (SELECT COUNT(*) FROM ... WHERE related = outer.key) | Adds an inline calculated column |
| WHERE / HAVING | WHERE (SELECT SUM(...) FROM ...) > X | Filters rows by a related aggregate |
Conclusion
A correlated subquery runs once per outer row and reaches back into that row for a value, which makes it the right tool for per-row checks and calculations against related data. Use EXISTS and NOT EXISTS to keep or exclude rows by the presence or absence of a match, put a subquery in SELECT for an inline per-row metric, and aggregate inside the subquery to filter by a related total. Just keep performance in mind: the same logic almost always has a join-and-group-by equivalent, and on large data that rewrite is usually faster, so save the correlated form for the cases where its readability genuinely earns its place.
[…] BigQuery Correlated Subqueries […]