Aggregates and Joins in BigQuery: COUNT, SUM, AVG, and JOINs

Summarise rows and combine tables, the two core moves of SQL analysis. Master COUNT, SUM, AVG, GROUP BY, and the four join types in BigQuery, plus avoiding join fan-out.

Most real analysis in BigQuery comes down to two moves. You summarise many rows into a few meaningful numbers, and you combine information that lives in separate tables. The first is the job of aggregate functions like COUNT, SUM, and AVG, and the second is the job of joins. Master both, and most questions you will be asked of a dataset become answerable. This guide covers each in turn, then shows how they work together, which is where the real power, and a few common traps, live.

Counting rows with COUNT

COUNT returns how many rows match. The detail that trips people up is the difference between counting all rows and counting a specific column. Counting a column ignores nulls in that column, which is sometimes exactly what you want, for instance to count how many users actually supplied an email address.

SELECT COUNT(email) AS total_emails
FROM `myproject.crm.users`
WHERE email IS NOT NULL

To count every row regardless of nulls, use the wildcard form, COUNT(*), which is also the fastest way to get a simple row total. COUNT becomes far more useful paired with GROUP BY, which splits the table into groups and counts within each.

SELECT country, COUNT(*) AS users_per_country
FROM `myproject.crm.users`
GROUP BY country

That pattern, group by a category and count, answers a huge share of everyday questions: users per country, orders per day, errors per service.

Adding values with SUM

SUM adds up the numbers in a column, and on its own it gives a single grand total.

SELECT SUM(amount) AS total_sales
FROM `myproject.sales.orders`

Far more often you want a total per group, which again is GROUP BY doing the work, here giving each customer’s total spend.

SELECT customer_id, SUM(order_total) AS customer_spend
FROM `myproject.sales.orders`
GROUP BY customer_id

One thing to know is that SUM quietly skips nulls. That is usually what you want, but if a null should really count as zero in your logic, wrap the column in IFNULL to make that explicit.

Averaging with AVG

AVG returns the mean of a numeric column, and like SUM it skips nulls, which matters more here than for a sum.

SELECT course, AVG(score) AS avg_score
FROM `myproject.assessments.exams`
GROUP BY course

The subtlety is that because AVG ignores nulls, a missing score is left out of both the total and the count, so it does not drag the average down. If your interpretation is that a missing score should count as a zero, you must say so with IFNULL, because doing it changes the answer.

SELECT course, AVG(IFNULL(score, 0)) AS avg_score
FROM `myproject.assessments.exams`
GROUP BY course

Deciding whether a missing value means “unknown, leave it out” or “zero, count it in” is a real analytical choice, not a technicality, and the two give different numbers.

The GROUP BY rule

There is one rule that governs all of these. Whenever you mix aggregated and non-aggregated columns in a SELECT, every non-aggregated column must appear in the GROUP BY. If you select a category alongside a COUNT but forget to group by that category, BigQuery will reject the query. The fix is simply to add the plain columns to the GROUP BY, which is the single most common aggregate error and the easiest to correct.

Combining tables with joins

Joins stitch rows from two tables together on a shared key, and BigQuery offers the four standard types, which differ only in what they do with rows that have no match on the other side.

An inner join is the strictest. It returns only rows that match in both tables, so anything without a partner is dropped.

SELECT a.user_id, a.name, b.total_orders
FROM `myproject.crm.users` AS a
JOIN `myproject.sales.orders` AS b
ON a.user_id = b.user_id

A left join keeps every row from the left table whether or not it has a match, filling the right-side columns with nulls when there is none. This is the right choice when you want to retain all of one side, such as listing every user even those who have never ordered.

SELECT u.user_id, u.name, o.order_total
FROM `myproject.crm.users` AS u
LEFT JOIN `myproject.sales.orders` AS o
ON u.user_id = o.user_id

A right join is the mirror image, keeping every row from the right table instead, and in practice you can always rewrite it as a left join by swapping the table order, which most people find easier to read.

SELECT u.name, o.order_id
FROM `myproject.sales.orders` AS o
RIGHT JOIN `myproject.crm.users` AS u
ON o.user_id = u.user_id

A full join keeps everything from both tables, matching where it can and filling nulls on either side where it cannot, which is useful when you want a complete picture of two sets and the gaps between them.

SELECT a.name, b.order_total
FROM `myproject.crm.users` AS a
FULL JOIN `myproject.sales.orders` AS b
ON a.user_id = b.user_id

Across all of these, the short table aliases, a and b or u and o, are not just brevity. They make every column reference unambiguous about which table it came from, which becomes essential the moment both tables share a column name.

Bringing aggregates and joins together

The two skills combine into the query you will write most often: join the tables to bring related data side by side, then aggregate to summarise it. Total spend per customer is the classic example, joining users to their orders and summing within each customer.

SELECT u.user_id, u.name, SUM(o.order_total) AS total_spent
FROM `myproject.crm.users` AS u
JOIN `myproject.sales.orders` AS o
ON u.user_id = o.user_id
GROUP BY u.user_id, u.name
ORDER BY total_spent DESC

Notice that both plain columns, the user id and the name, appear in the GROUP BY, satisfying the rule from earlier, while the sum does the aggregating and the order by ranks customers by spend.

The trap to watch: join fan-out

The most insidious mistake when combining joins and aggregates is duplicated rows. If the join key is not unique on the side you are joining to, a single row on one side matches many on the other, and your sums and counts silently inflate. The defence is to understand the relationship between your tables before joining, knowing which side has unique keys and which can have many matching rows. If a total looks suspiciously high after a join, fan-out is the first thing to check, and the cure is usually to aggregate one table down to one row per key before joining, or to count distinct rather than count all.

The habits that keep this clean

A few small practices prevent most of the errors. Use COUNT(*) for a true row count and a specific column only when you mean to ignore its nulls. Decide deliberately whether nulls in a sum or average should be skipped or treated as zero, and reach for IFNULL when you want the latter. Always put every non-aggregated column in the GROUP BY. Give your tables short aliases so references stay readable. And before trusting an aggregate that sits on top of a join, confirm the join key is unique on at least one side so fan-out is not quietly doubling your numbers. Get those right and aggregation across joined tables, the backbone of analytical SQL, becomes both reliable and routine.

See you soon.

View Comments (1)

Leave a Reply

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Discuss Data Science, Machine Learning and Analytics

Subscribe now to keep reading and get access to the full archive.

Continue reading