SQL Joins: A Comprehensive Guide to Data Combinations

Real databases rarely store everything in one giant table. That would be wasteful and a maintenance nightmare. Instead, they split data into related tables, one for countries, one for cities, one for languages, each holding one kind of thing and referencing the others through shared key columns. The skill that brings it all back together is knowing how to combine these tables to answer questions that span them. This guide covers the three main ways to do it: joins, which place tables side by side; set operations, which stack them on top of each other; and subqueries, which feed the result of one query into another.

INNER JOIN: Keeping Only the Matches

The most common join is the inner join, which returns only rows that have a match in both tables.

			
SELECT cities.name AS city, countries.name AS country, countries.region
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code;

Imagine the cities and countries tables as two stacks of index cards. Each city card has a country_code written on it, and each country card has a code. The join condition, ON cities.country_code = countries.code, says to pair up cards where these two values match and merge each pair into one wider row. Cards with no matching partner on the other side are discarded, and that discarding is the “inner” part: you only see rows where a match exists in both tables.

Notice the dotted notation cities.name. When both tables have a column called name, you must specify which table’s nameyou mean, or SQL throws an “ambiguous column” error.

Writing full table names repeatedly gets tedious, so in practice almost everyone uses aliases:

			
SELECT c.code AS country_code, c.name, e.year, e.inflation_rate
FROM countries AS c
INNER JOIN economies AS e
ON c.code = e.code;

The AS c gives the table a short nickname for the rest of the query, so c.name now means countries.name. Aliases cut the visual noise roughly in half and are standard in real code. The AS keyword is optional in many databases, but writing it out keeps things unambiguous.

When the matching column has the same name in both tables, there is an even shorter form using USING:

			
SELECT c.name AS country, l.name AS language, official
FROM countries AS c
INNER JOIN languages AS l
USING (code);

USING (code) is shorthand for ON c.code = l.code. As a bonus, it merges the two code columns into one in the output. The rule is simple: use USING when the column names match, and fall back to ON when they differ.

Joining Three or More Tables, and a Hidden Trap

You can chain joins as many times as you need, each one stitching another table onto the growing result. But this is where one of SQL’s quietest bugs lives:

			
SELECT name, e.year, fertility_rate, e.unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
    ON c.code = p.country_code
INNER JOIN economies AS e
    ON c.code = e.code
    AND p.year = e.year;

		

The crucial line is AND p.year = e.year. Without it, SQL would happily pair one country’s 2010 fertility data with the same country’s 2015 unemployment data in a single row, mixing values from different years that have no business sitting together. SQL does not catch this; it silently produces nonsense with no error message. Whenever you join tables that contain a time or snapshot dimension, you must explicitly match on it too, or you get silent data corruption. This is a real-world bug that ships to production constantly, so it is worth burning into memory.

LEFT, RIGHT, and FULL JOIN: Keeping the Non-Matches

Inner joins drop anything without a match. The outer joins, left, right, and full, let you keep rows that have no partner, which is often exactly what you want.

A left join keeps every row from the left table no matter what. If a matching row exists on the right, its values are attached; if not, the right-side columns are filled with NULL.

			
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop
FROM cities AS c1
LEFT JOIN countries AS c2
    ON c1.country_code = c2.code
ORDER BY code DESC;

		

This matters because it prevents silent data loss. With an inner join, a city whose country_code matches no country would simply vanish from the results, with no warning. With a left join, it stays in the output but with NULLs in the country columns, visible and accounted for. That visibility is frequently the whole point, answering questions like “which cities have no country data on file?”

Left joins combine naturally with the full aggregation pipeline:

			
SELECT region, AVG(gdp_percapita) AS avg_gdp
FROM countries AS c
LEFT JOIN economies AS e
    USING(code)
WHERE year = 2010
GROUP BY region
ORDER BY avg_gdp DESC
LIMIT 10;

		

SQL processes this in order: stitch all countries to their economies while keeping countries even without economy data, drop everything except 2010, bucket by region, average the GDP per bucket, sort, and take the top ten. Reading these longer queries becomes easy once you mentally walk the pipeline top to bottom.

A right join is the mirror image, keeping all rows from the right table and dropping unmatched rows from the left:

			
SELECT countries.name AS country, languages.name AS language, percent
FROM languages
RIGHT JOIN countries
    USING(code)
ORDER BY language;

		

In practice, almost nobody uses right joins, because every right join can be rewritten as a left join simply by swapping the table order, and the left join reads more naturally in English. The equivalent and preferred form is:

			
SELECT countries.name AS country, languages.name AS language, percent
FROM countries
LEFT JOIN languages
    USING(code)
ORDER BY language;

		

Same result, but it reads as “start with countries, attach languages where they match.” Convention beats raw equivalence here, so reach for the left join form.

A full join keeps everything from both tables, NULL-filling whichever side has no match:

			
SELECT name AS country, code, region, basic_unit
FROM countries
FULL JOIN currencies
    USING (code)
WHERE region LIKE 'North America'
   OR region IS NULL
ORDER BY region;

		

This shows you countries with currencies, countries without currencies, and currencies belonging to no country in your table. The OR region IS NULL clause is a neat trick that specifically surfaces the unmatched currencies, the ones with no country and therefore no region. Full joins are ideal for data-audit queries where you want to see what is missing on either side.

The Four Joins at a Glance

It helps to picture two small tables, A with values 1, 2, 3 and B with values 1, 2, 4. Values 1 and 2 match on both sides, 3 exists only in A, and 4 only in B. An inner join returns rows 1 and 2, the matches only. A left join returns 1, 2, and 3, keeping all of A plus its matches from B. A right join returns 1, 2, and 4. A full join returns all of 1, 2, 3, and 4. The difference between them is entirely about which non-matching rows survive.

CROSS JOIN: Every Possible Combination

A cross join has no match condition at all. It pairs every row from the first table with every row from the second, like multiplication.

			
SELECT c.name AS country, l.name AS language
FROM countries AS c
CROSS JOIN languages AS l
WHERE c.code IN ('PAK', 'IND')
  AND l.code IN ('PAK', 'IND');

		

If one table has five rows and the other has ten, you get fifty output rows. This is the Cartesian product, the mathematical term for every possible pairing. It sounds wasteful, but it is exactly right when you genuinely need every combination: tournament schedules pairing every team against every other, pricing matrices of every product across every region, or shift plans of every employee across every day.

One warning: forgetting the ON clause on a regular join causes many databases to silently treat it as a cross join, exploding your row count. Always double-check that your joins have their conditions.

SELF JOIN: Comparing a Table to Itself

The self join is the one that bends people’s brains at first. You have a single table and want to compare rows within it, placing two related rows side by side in one output row.

			
SELECT
    p1.country_code,
    p1.size AS size2010,
    p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
    USING(country_code)
WHERE p1.year = 2010
  AND p2.year = 2015;

		

The populations table has one row per country-and-year pair, and you want each country’s 2010 population next to its 2015 population. The trick is to pretend it is two tables by giving the same table two different aliases. Here p1 is the “2010 view” and p2 is the “2015 view.” You join them on country code, pin p1 to 2010 and p2 to 2015, and each output row now carries both years’ data together. Without aliasing the same table twice, SQL has no way to distinguish the 2010 row from the 2015 row, since they live in the same table. This single pattern unlocks year-over-year comparisons, employee-and-manager hierarchies where managers are themselves employees in the same table, and any “compare one row to another within a table” question.

Set Operations: Stacking Results

Joins widen your result by adding columns. Set operations do the opposite: they stack the results of two queries on top of each other, adding rows. The one requirement is that both queries produce the same number of columns with compatible types.

UNION stacks two result sets and removes any duplicate rows:

			
SELECT * FROM economies2015
UNION
SELECT * FROM economies2019
ORDER BY code, year;

This is useful when data is spread across multiple historical tables and you want them combined into one. If the same row appears in both, you get it once.

UNION ALL does the same thing but skips the deduplication, keeping every row including duplicates:

			
SELECT code, year FROM economies
UNION ALL
SELECT country_code, year FROM populations
ORDER BY code, year;

Because it does no extra work, UNION ALL is faster, and it preserves duplicates where they are meaningful. A good rule of thumb is to default to UNION ALL and only reach for UNION when you specifically need duplicates removed. Many people use UNION out of habit and pay a performance cost they did not need.

INTERSECT returns only rows that appear in both queries, the set-theory “AND”:

			
SELECT name FROM cities
INTERSECT
SELECT name FROM countries;

This finds names that exist as both a city and a country, like Mexico, Luxembourg, or Singapore. It is the tool for finding overlap between two lists.

EXCEPT returns rows in the first query that do not appear in the second, the equivalent of set subtraction:

			
SELECT name FROM cities
EXCEPT
SELECT name FROM countries
ORDER BY name;

This finds city names that are not also country names. Order matters here: A EXCEPT B is not the same as B EXCEPT A.

Subqueries: Queries Inside Queries

A subquery is one query used as input to another, much like nested function calls in code where you write outer(inner()). The inner query runs first, produces a result, whether a single value, a list, or a whole table, and the outer query uses that result. Subqueries let you express “find X based on the answer to Y” in a single statement, without manually running Y, copying its answer, and pasting it into a second query.

Subqueries in WHERE: Filtering by Another Query

The most common use is filtering. Here the subquery generates a list of country codes, and the outer query keeps only languages whose code is in that list:

			
SELECT DISTINCT name
FROM languages
WHERE code IN
    (SELECT code
     FROM countries
     WHERE region = 'Middle East')
ORDER BY name;

		

Think of it in two steps. First the subquery lists every country code in the Middle East. Then the outer query returns all distinct languages whose code appears in that list. You could have typed the codes by hand, but the query would break the moment a country was added or removed from the database. The subquery makes it self-updating. This pattern is often called a semi join, because you do not actually pull any columns from countries; you only use it to filter.

The inverse is the anti join, using NOT IN:

			
SELECT code, name
FROM countries
WHERE continent = 'Oceania'
  AND code NOT IN
    (SELECT code
     FROM currencies);

		

This finds Oceania countries whose code is not in the list of codes that have currencies, in other words Oceania countries with no currency on file. Watch the NULL gotcha here: NOT IN returns zero rows if the subquery’s result contains any NULLs, so if you are unsure, add WHERE code IS NOT NULL inside the subquery.

Subqueries can also return a single number for comparison:

			
SELECT *
FROM populations
WHERE year = 2015
  AND life_expectancy > 1.15 *
    (SELECT AVG(life_expectancy)
     FROM populations
     WHERE year = 2015);

		

The subquery computes the global average life expectancy in 2015, and the outer query keeps only countries above 115% of that average. This is the classic “find outliers above the mean” pattern, and it is impossible without a subquery, because you do not know the average until you compute it on the fly.

Subqueries in SELECT: A Value per Row

You can place a subquery in the SELECT list to compute a value for each row of the outer query:

			
SELECT countries.name AS country,
    (SELECT COUNT(*)
     FROM cities
     WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;

		

This is a correlated subquery, the most powerful and the slowest kind. For each row of the outer query, each country, SQL re-runs the inner query, but the inner query references countries.code from the current outer row. So for the USA the inner query effectively becomes a count of cities where the country code is the USA, and for Spain it becomes the same for Spain. The result is one city count per country, attached to that country’s row. Correlated subqueries can almost always be rewritten as a LEFT JOIN with GROUP BY, often faster, so watch performance: if the outer table has a million rows, the inner query runs a million times.

Subqueries in FROM: A Temporary Table

A subquery in the FROM clause creates a temporary table on the fly that you can then use like any other table:

			
SELECT local_name, sub.lang_num
FROM countries,
    (SELECT code, COUNT(*) AS lang_num
     FROM languages
     GROUP BY code) AS sub
WHERE countries.code = sub.code
ORDER BY lang_num DESC;

		

Here the subquery, aliased sub, computes how many languages each country code has, and the outer query joins it back to countries to attach the country names. This derived-table pattern is the workhorse of multi-step analysis: SQL computes the counts, produces a temporary table, then joins and displays. A subquery in FROM must always have an alias.

Two modern notes on this example. First, the comma between countries and the subquery is old-style implicit join syntax; today you would write it as an explicit INNER JOIN ... ON countries.code = sub.code. Second, this whole nested approach has a cleaner modern replacement in Common Table Expressions, the WITH ... AS (...) syntax, which does the same job far more readably and is well worth learning next.

Reading a Real Query as a Sentence

Putting the pieces together, a query like this answers a genuine business question:

			
SELECT name, country_code, city_proper_pop, metroarea_pop,
       city_proper_pop / metroarea_pop * 100 AS city_perc
FROM cities
WHERE name IN
    (SELECT capital
     FROM countries
     WHERE continent = 'Europe'
        OR continent LIKE '%America')
  AND metroarea_pop IS NOT NULL
ORDER BY city_perc DESC
LIMIT 10;

		

Read as one English sentence, this asks: of the capital cities in Europe or the Americas, which have the city proper making up the largest share of their metro area, and give me the top ten. It computes a per-row percentage, filters to capitals via the subquery, restricts to Europe or any “-America” continent using the LIKE '%America' pattern, drops rows with a missing metro population to avoid NULLs in the calculation, then sorts and limits. When you can read a multi-line query as a single sentence describing a real-world question, you have internalised the language.

One thing to watch in that percentage: city_proper_pop / metroarea_pop * 100 can produce zero in databases that do integer division when both columns are integers. Force decimals by writing city_proper_pop * 100.0 / metroarea_pop, the same .0 trick from working with aggregates.

Semi Join Versus Anti Join

Two terms from relational algebra come up often enough to be worth naming. A semi join includes rows that have a partner in another query, expressed as WHERE code IN (subquery). An anti join includes rows that do not have a partner, expressed as WHERE code NOT IN (subquery). Same shape, opposite meaning: one keeps the rows with a match, the other keeps the rows without. You will hear these terms in database theory and query optimisation.

Quick Reference

			
-- INNER JOIN
FROM table_a AS a
INNER JOIN table_b AS b ON a.id = b.id;
-- USING (when the column name is the same)
INNER JOIN table_b AS b USING (id);
-- LEFT and FULL JOIN
LEFT JOIN table_b AS b USING (id);
FULL JOIN table_b AS b USING (id);
-- CROSS JOIN (no condition)
CROSS JOIN table_b AS b;
-- SELF JOIN
FROM populations AS p1
INNER JOIN populations AS p2 USING(country_code)
WHERE p1.year = p2.year - 5;
-- Set operations
SELECT col FROM table_a
UNION          -- or UNION ALL / INTERSECT / EXCEPT
SELECT col FROM table_b;
-- Subqueries
WHERE code IN (SELECT code FROM countries WHERE ...);
WHERE value > (SELECT AVG(value) FROM table);
SELECT name, (SELECT COUNT(*) FROM cities WHERE ...) AS num
FROM (SELECT code, COUNT(*) AS n FROM languages GROUP BY code) AS sub;

		

Conclusion

Combining data across tables comes down to three ideas. Joins place tables side by side: inner joins keep only matches, left joins keep everything from one side and fill the gaps with NULL, full joins keep everything from both, cross joins produce every combination, and a self join compares a table to itself by aliasing it twice. Set operations stack results instead, with UNION and UNION ALL to combine, INTERSECT for overlap, and EXCEPT for difference. Subqueries feed one query’s result into another, whether to filter rows, compute a value per row, or build a temporary table to query from. Keep an eye on the recurring traps along the way: matching on a time dimension when joining snapshot tables, the NOT IN and NULL interaction, and integer division needing a .0. With these tools, no question that spans your tables is out of reach.

See you soon.

Testing Strategies for CRO: A Practitioner’s Guide

Unsupervised Machine Learning: Clustering, Dimensionality Reduction, and Topic Modeling

Supervised Machine Learning with scikit-learn

Python Testing with pytest and unittest

SQL Joins, Set Operations, and Subqueries: Combining Data Across Tables

INNER JOIN: Keeping Only the Matches

Joining Three or More Tables, and a Hidden Trap

LEFT, RIGHT, and FULL JOIN: Keeping the Non-Matches

The Four Joins at a Glance

CROSS JOIN: Every Possible Combination

SELF JOIN: Comparing a Table to Itself

Set Operations: Stacking Results

Subqueries: Queries Inside Queries

Subqueries in WHERE: Filtering by Another Query

Subqueries in SELECT: A Value per Row

Subqueries in FROM: A Temporary Table

Reading a Real Query as a Sentence

Semi Join Versus Anti Join

Quick Reference

Conclusion

Related

Leave a ReplyCancel reply

Recommended for You

Working with GitHub: Collaboration, Workflow, and the Wider Ecosystem

Feature Engineering and Preprocessing: Getting Data Ready for a Model

Testing Strategies for CRO: A Practitioner’s Guide

Unsupervised Machine Learning: Clustering, Dimensionality Reduction, and Topic Modeling

Supervised Machine Learning with scikit-learn

Python Testing with pytest and unittest

SQL Joins, Set Operations, and Subqueries: Combining Data Across Tables

INNER JOIN: Keeping Only the Matches

Joining Three or More Tables, and a Hidden Trap

LEFT, RIGHT, and FULL JOIN: Keeping the Non-Matches

The Four Joins at a Glance

CROSS JOIN: Every Possible Combination

SELF JOIN: Comparing a Table to Itself

Set Operations: Stacking Results

Subqueries: Queries Inside Queries

Subqueries in WHERE: Filtering by Another Query

Subqueries in SELECT: A Value per Row

Subqueries in FROM: A Temporary Table

Reading a Real Query as a Sentence

Semi Join Versus Anti Join

Quick Reference

Conclusion

Related

Leave a ReplyCancel reply

Subscribe to My Newsletter

Recommended for You

Working with GitHub: Collaboration, Workflow, and the Wider Ecosystem

Feature Engineering and Preprocessing: Getting Data Ready for a Model

Discover more from Datalad - Data Science and ML